OpenAI Codex vs Claude Code: Hands-On Python Benchmark for Devs
We pointed Codex and Claude Code at the same Python codebase across refactoring, debugging, and agentic tasks. Here is what each tool shipped, where each one wins, and what the speed-vs-cost tradeoff actually looks like in practice.
OpenAI relaunched Codex this year as a full agentic CLI that lives in your terminal and talks to GPT-5 class models. Claude Code did the same thing for Anthropic, six months earlier. Both want to be the assistant you actually merge code from. We pointed both at the same Python project and tracked what each one shipped.
The codebase under test: a mid-sized Flask + SQLAlchemy service with a real pytest suite and a handful of slow, gnarly modules begging to be refactored. We ran identical prompts through both tools, on the same hardware, against the same git SHA, and rewound the worktree between runs so neither tool saw the other’s edits.
How we structured the test
We ran three kinds of tasks against each assistant, three trials per task per tool. Not enough trials for statistical certainty, but enough to catch behavior patterns that held across attempts.
Task A: refactor a roughly 400-line module that mixed request handling, DB access, and template rendering into a service layer plus thin route handlers. Success criteria: tests still green, no regressions in a smoke flow we recorded with httpx, and a resulting file structure that passes ruff and mypy --strict cleanly.
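For context, a smoke check in this style can run entirely in-process through httpx's WSGI transport. The sketch below is illustrative only: the create_app factory, the /reports routes, and the payloads are hypothetical stand-ins, not our actual suite.

# Minimal sketch of an httpx-driven smoke flow against a Flask app.
# create_app() and the /reports routes are hypothetical names.
import httpx
import pytest

from app import create_app  # hypothetical app factory


@pytest.fixture()
def client():
    app = create_app(testing=True)
    transport = httpx.WSGITransport(app=app)
    with httpx.Client(transport=transport, base_url="http://testserver") as c:
        yield c


def test_report_smoke_flow(client):
    # The recorded flow: create a record, then read it back through the API.
    created = client.post("/reports", json={"title": "smoke"})
    assert created.status_code == 201
    report_id = created.json()["id"]

    fetched = client.get(f"/reports/{report_id}")
    assert fetched.status_code == 200
    assert fetched.json()["title"] == "smoke"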
Task B: fix three known bugs. One off-by-one in a pagination helper. One race condition in a background worker that only surfaced under concurrent load. One Unicode normalization bug in a search endpoint. We handed each assistant only the failing pytest output and the file path, with no hints about the fix.
Task C: an agentic workflow. “Add OpenTelemetry tracing across the request lifecycle, including DB spans, then write tests proving spans are emitted.” Open-ended, multi-file, requires reading the codebase before doing anything.
We tracked wall-clock time, total tokens consumed, whether the diff merged cleanly, and whether the test suite stayed green at the end.
Where each tool diverged
Claude Code finished Task A in roughly four minutes per trial. The service-layer extraction was clean: it picked up the project’s existing repository pattern from a sibling module without prompting and matched the naming convention. Two of three trials passed the smoke test on first run. The third introduced a circular import that Claude caught on its own follow-up turn and fixed without us asking.
Codex took longer on Task A, closer to seven minutes per trial, but produced a more aggressive refactor. It split logic into more files, added type hints throughout, and rewrote one helper function that wasn’t part of the brief. The diff was larger and the tests still passed, but the review surface went up. One trial dropped a transactional boundary we wanted preserved; the test suite caught it, and Codex fixed it on the next iteration.
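To make the target shape concrete, here is roughly what both refactors converged on: a thin route handler delegating to a service function that owns the transaction. Every name below is hypothetical and this is not either tool's actual diff; the inline comment marks the kind of transactional boundary the flagged Codex trial dropped.

# services/orders_service.py — hypothetical names throughout.
from app.extensions import db                    # hypothetical db = SQLAlchemy()
from app.repositories import OrderRepository     # the existing repository pattern


def close_order(order_id: int) -> dict:
    repo = OrderRepository(db.session)
    # Keep the whole mutation inside one transaction; this is the boundary
    # we wanted preserved and the test suite flagged when it went missing.
    with db.session.begin():
        order = repo.get(order_id)
        order.status = "closed"
        repo.add_audit_entry(order, action="close")
    return {"id": order.id, "status": order.status}


# routes/orders.py — the route handler stays thin and only shapes the response.
from flask import Blueprint, jsonify

from app.services import orders_service

bp = Blueprint("orders", __name__)


@bp.post("/orders/<int:order_id>/close")
def close_order(order_id: int):
    return jsonify(orders_service.close_order(order_id))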
Task B was the more revealing split. Claude found the off-by-one in under two minutes with a one-line fix and an added test. Codex took longer on the same bug, wrote a longer explanation, and added two tests where one would have done — the second was redundant with the first.
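As an illustration only, a pagination off-by-one of this kind and its one-line fix usually look like the hypothetical sketch below; the name and signature here are invented, not the repo's.

# Hypothetical reconstruction of the off-by-one; the real helper differs.
def page_bounds(page: int, per_page: int) -> tuple[int, int]:
    # Buggy version computed offset = page * per_page, silently skipping
    # the first page's worth of rows when pages are 1-indexed.
    offset = (page - 1) * per_page   # the one-line fix
    return offset, offset + per_page


def test_first_page_starts_at_zero():
    assert page_bounds(1, 25) == (0, 25)
    assert page_bounds(2, 25) == (25, 50)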
On the race condition, Claude wrote a regression test using threading.Barrier to reliably reproduce the bug, then patched it with a context manager around the critical section. Codex initially proposed a time.sleep-based test that we rejected. On retry it produced a cleaner fix using an asyncio lock. Both eventually solved it. Claude shipped a clean version the first time.
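The Barrier technique is worth spelling out. Below is a minimal sketch of that style of regression test against a hypothetical counter worker, not the real module: the barrier releases all threads at once to force interleaving, and the fix is a lock used as a context manager around the critical section.

# Sketch of a Barrier-style regression test; Counter is a hypothetical worker.
import threading


class Counter:
    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # The fix: a lock (context manager) around the read-modify-write.
        with self._lock:
            current = self.value
            self.value = current + 1


def test_concurrent_increments_do_not_drop_updates():
    counter = Counter()
    n_threads = 8
    barrier = threading.Barrier(n_threads)  # release all threads at once

    def work():
        barrier.wait()              # maximize the chance of interleaving
        for _ in range(1_000):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    assert counter.value == n_threads * 1_000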
The Unicode bug was effectively a tie. Both correctly identified that unicodedata.normalize("NFKC", ...) was the right answer and produced near-identical diffs.
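For reference, the shared fix boils down to a one-liner. The function name below is hypothetical, and the casefold() is our addition for case-insensitive matching, not something either diff necessarily included.

# Minimal sketch of the shared fix; normalize_for_search is a hypothetical name.
import unicodedata


def normalize_for_search(text: str) -> str:
    # NFKC folds compatibility forms (e.g. "ﬁ" -> "fi", full-width digits ->
    # ASCII digits) so the query and the indexed text compare equal.
    return unicodedata.normalize("NFKC", text).casefold()


assert normalize_for_search("ﬁle ２０２４") == normalize_for_search("file 2024")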
Agentic workflows, pricing, and the speed-vs-cost tradeoff
Task C was where the agentic loops stretched their legs. Claude averaged a little over ten minutes wall-clock per run and burned through hundreds of thousands of tokens. Codex was meaningfully slower and more token-hungry on the same task — call it about a third more on both axes. Both produced working tracing setups with DB spans and tests that checked emitted span names against a recording exporter.
Codex’s solution was more thorough. It wired up OTLPSpanExporter with environment-variable config, added a pyproject.toml extra so the dependency was opt-in, and dropped a fresh docs/observability.md into the repo. Claude’s solution was tighter: it hooked into the existing Flask middleware, added one fixture, and stopped. If you want a starting point you will extend yourself, Claude got there faster. If you want a near-complete drop-in, Codex did more of the work — at higher cost.
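The test pattern both tools landed on, asserting emitted span names against a recording exporter, maps onto OpenTelemetry's in-memory exporter. The sketch below is a rough outline rather than either tool's actual test: the client fixture, the route, and the expected span names are hypothetical.

# Sketch of the span-assertion pattern using OpenTelemetry's in-memory exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter


def test_request_emits_http_and_db_spans(client):
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    # Assumes the app's instrumentation picks up the global provider;
    # a real suite would isolate this per run.
    trace.set_tracer_provider(provider)

    client.get("/reports/1")   # hypothetical route, hypothetical client fixture

    span_names = {span.name for span in exporter.get_finished_spans()}
    assert any(name.startswith("GET /reports") for name in span_names)
    assert any(name.startswith("SELECT") for name in span_names)  # DB span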
Pricing during our test window: Claude Code running on Sonnet was the cheaper option per task by a clear margin. Codex on GPT-5 was higher both per call and in total tokens consumed. Both Anthropic and OpenAI shifted prices during our window, so check current rates before extrapolating: the directional conclusion is stable, but the gap may narrow or widen between when we tested and when you read this.
The speed difference was consistent across trials: Claude was faster on most tasks we threw at it, sometimes by a wide margin on small fixes. Codex was more methodical, which costs you wall-clock time and tokens but occasionally catches things Claude skips.
Cursor
Want to drive Claude or GPT-5 models from a full IDE instead of the terminal? Cursor wraps both with shared context, agent mode, and codebase indexing.
Free tier; Pro $20/mo
Affiliate link · We earn a commission at no cost to you.
When to pick which
Pick Claude Code when you’re doing focused work — a single bug, a contained refactor, a feature that touches three files. The speed advantage compounds when you’re iterating, and the cost difference adds up across a workday.
Pick Codex when you want broader autonomy and don’t mind a longer wall-clock loop. Big migrations, codebase-wide instrumentation, tasks where you would rather review a thorough proposal than steer one. Codex is also the better pick if you already pay for a ChatGPT Team or Enterprise seat that bundles Codex usage.
Both tools changed our review workflow more than they changed our writing workflow. We spent less time typing and more time reading diffs. That is the benchmark that matters more than tokens or seconds: not who produces code faster, but who produces code you trust enough to merge without re-reading every line.