oh-my-agent v2: Nine New Skills, First-Class Cursor, and an 80/100 Benchmark
If you have watched an AI coding agent install a package version that does not exist in your lockfile, or ship a function that fails your own lint config on the first commit, you already understand the gap oh-my-agent v2 is built to close. The framework’s second major release adds nine new skills, promotes Cursor to a first-class vendor, and ships a benchmark that scores the toolkit 80 out of 100.
Here is what v2 changes, and how to decide whether the additions target real failure modes or just expand the surface area.
What oh-my-agent does
oh-my-agent is a skill layer that sits between you and whatever AI coding agent you run. The name borrows from oh-my-zsh, and the analogy holds: instead of configuring shell behavior, you configure agent behavior with reusable, composable instruction modules the project calls skills.
The problem it targets is consistency. A raw coding agent keeps no durable memory of your project’s conventions. Ask it to add a dependency and it may guess a version that is not in your lockfile. Ask it to write a component and it may ignore the lint config sitting in your repo root. These are not edge cases. They are the default behavior of an agent that treats every request as a fresh context.
A skill in oh-my-agent is a packaged set of instructions and checks the agent loads when a task matches. One skill might force the agent to read your package.json and lockfile before proposing a version. Another might surface your linter rules before any code is written. The pitch is that you stop re-explaining the same constraints in every prompt.
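To make the lockfile idea concrete, here is a minimal sketch of the kind of check such a skill could run before the agent proposes a version. It is not oh-my-agent's code: the function names are hypothetical, and it assumes an npm package-lock.json in the v2 or v3 layout, where installed packages sit under a top-level "packages" map keyed by paths such as "node_modules/react".

```python
import json


def locked_versions(lockfile_text: str) -> dict[str, str]:
    """Map package name -> pinned version from package-lock.json text.

    Assumes the npm lockfile v2/v3 layout with a top-level "packages" map.
    """
    lock = json.loads(lockfile_text)
    versions: dict[str, str] = {}
    for path, meta in lock.get("packages", {}).items():
        # Keep only top-level installs such as "node_modules/react";
        # this skips the root project entry ("") and nested copies.
        if not path.startswith("node_modules/") or path.count("node_modules/") != 1:
            continue
        name = path[len("node_modules/"):]
        if "version" in meta:
            versions[name] = meta["version"]
    return versions


def check_proposal(name: str, version: str, locked: dict[str, str]) -> str:
    """Compare a version the agent wants to use against the lockfile (hypothetical helper)."""
    if name not in locked:
        return "not-locked"
    if locked[name] == version:
        return "ok"
    return f"mismatch: lockfile pins {locked[name]}"
```

Anything other than "ok" is a signal to stop and re-read the project files instead of guessing, which is the behavior the skill is meant to enforce.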
The nine new skills in v2
The v2 release adds nine skills. Three are worth calling out, because they map to problems most teams hit within a week of adopting an agent.
deepsec handles security review. Instead of trusting the agent to remember secure patterns, the skill runs a structured pass over generated code, checking for the injection, secret-handling, and trust-boundary mistakes agents introduce when they optimize for making something work.
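A structured pass of this kind can be pictured as a rule table applied line by line to generated code. The sketch below is illustrative only: the rule names and regular expressions are assumptions chosen for the example, not deepsec's actual checks, and a real review would need far more than pattern matching.

```python
import re

# Hypothetical rules: each maps a rule name to a pattern that flags a risky line.
RULES = {
    # A credential-like name assigned a string literal.
    "hardcoded-secret": re.compile(
        r"(?i)\b(password|secret|api_key|token)\b\s*=\s*['\"][^'\"]+['\"]"
    ),
    # Text shaped like an AWS access key ID.
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # SQL built with an f-string and passed straight to execute().
    "sql-string-format": re.compile(r"(?i)execute\(\s*f['\"]"),
}


def scan(source: str) -> list[tuple[int, str]]:
    """Return (line number, rule name) for every rule that matches a line."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                findings.append((lineno, rule))
    return findings
```

The point of the design is that the checks run every time, whether or not the agent "remembered" the secure pattern while writing the code.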
observability pushes the agent to add logging, metrics, and tracing as it writes code, rather than leaving instrumentation as a follow-up task that never happens.
docs drift detection is the one most teams underrate. When an agent changes a function signature or a config option, the matching documentation usually goes stale without anyone noticing. This skill flags the gap so docs and code stay in sync.
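One way to picture drift detection is to compare the signatures in the code with the signatures quoted in the docs. The sketch below assumes Python source and docs that mention calls in backticks, such as `connect(host, port)`. Both the doc convention and the function names are assumptions for illustration, not how the skill is actually implemented.

```python
import ast
import re

# Matches backticked calls in prose, e.g. `connect(host, port)`.
DOC_SIGNATURE = re.compile(r"`(\w+)\(([^)]*)\)`")


def code_signatures(source: str) -> dict[str, list[str]]:
    """Map each function name in the source to its positional parameter names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }


def documented_signatures(doc: str) -> dict[str, list[str]]:
    """Map each documented function name to the parameter names the docs list."""
    return {
        name: [p.strip() for p in params.split(",") if p.strip()]
        for name, params in DOC_SIGNATURE.findall(doc)
    }


def find_drift(source: str, doc: str) -> list[str]:
    """Report documented functions whose parameters no longer match the code."""
    code = code_signatures(source)
    drift = []
    for name, doc_params in documented_signatures(doc).items():
        if name not in code:
            drift.append(f"{name}: documented but not found in code")
        elif code[name] != doc_params:
            drift.append(
                f"{name}: docs say ({', '.join(doc_params)}), "
                f"code has ({', '.join(code[name])})"
            )
    return drift
```

If an agent adds a `timeout` parameter to `connect` and leaves the README alone, `find_drift` reports the mismatch rather than letting it surface weeks later.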
The remaining six skills round out areas like testing and project conventions. The pattern across all nine is the same: take a step a developer is supposed to do, and make it a non-optional part of the agent’s workflow instead of a hope.
Cursor becomes a first-class vendor
Earlier oh-my-agent releases were built around one agent and treated the rest as second-class. v2 changes the model. A vendor is the underlying agent that executes skills, and Cursor is now a first-class vendor, which means skills are tested against it and ship with Cursor-specific wiring rather than a generic fallback.
In practice, you can keep oh-my-agent’s skill definitions in one place and run them through Cursor’s agent without rewriting instructions per tool. For teams that have standardized on Cursor as their editor, that removes the main reason to maintain a separate, hand-rolled set of project rules.
First-class status is a maintenance commitment, not a one-time feature. The thing to watch over the next few releases is whether Cursor support keeps pace with the primary vendor or quietly drifts behind it, which is the usual failure pattern for multi-vendor tools.
What the 80/100 benchmark does and doesn’t tell you
v2 ships with a benchmark that scores the toolkit 80 out of 100. A published, repeatable number is useful on its own: it gives you a baseline to compare future releases against, and it signals the project is willing to measure itself instead of leaning on adjectives.
Treat the number as a starting point, not a verdict. A benchmark reflects the tasks its authors chose. An 80 on the project’s own suite tells you the skills behave as designed on that suite. It does not tell you how they perform on your codebase, your stack, or your conventions.
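If you want a number that does describe your codebase, the simplest move is to score the skills against tasks you pick yourself. The sketch below assumes a plain pass rate scaled to 100. The release does not say how its 80 is computed, so this formula and the `suite_score` name are assumptions, not a reproduction of the project's benchmark.

```python
from typing import Callable


def suite_score(checks: dict[str, Callable[[], bool]]) -> int:
    """Score a suite as the percentage of checks that pass, rounded to an integer.

    Each check is a zero-argument callable that returns True when the agent's
    output for that task met your own acceptance criteria.
    """
    if not checks:
        raise ValueError("the suite needs at least one task")
    passed = sum(1 for check in checks.values() if check())
    return round(100 * passed / len(checks))
```

Each check would wrap something you already trust, such as your lint run or your test suite, so the score reflects your conventions and not the benchmark authors' choices.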
The honest read on v2: the release aims squarely at the most common, least glamorous agent failures (wrong versions, ignored configs, stale docs) rather than chasing a flashier capability. That is the right target. The open question is operational. Nine new skills is a lot of surface to keep working across two first-class vendors, and the real proof will be whether release three holds the line.