OpenAI GPT-Realtime-2: What GPT-5-Class Reasoning Actually Changes for Voice Agents

OpenAI shipped three speech-focused models in one release, and the one drawing attention is GPT-Realtime-2 — the first voice model OpenAI describes as carrying GPT-5-class reasoning. If you build voice agents, that claim is worth more scrutiny than a launch post invites. We looked at what genuinely changes when a real-time speech model can reason, and what stays exactly as hard as it was last week.

Why reasoning inside a voice model is a real shift

For most of the last two years, a voice agent meant one of two architectures, and both carried a known weakness.

The pipeline approach chains three services: speech-to-text transcribes the user, a text LLM decides what to say, and text-to-speech voices the reply. You get a capable reasoning model in the middle, but every hop adds latency, and the transcription step discards tone, hesitation, and overlap — the things that make a conversation feel like one exchange instead of three.
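
To make the "every hop adds latency" point concrete, here is a minimal sketch of a latency budget. It assumes a strictly sequential pipeline in which each stage waits for the previous one; real deployments often stream between stages, which shrinks the total. The `Hop` type and every number below are hypothetical placeholders, not measurements of any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One stage between the user finishing speaking and hearing a reply."""
    name: str
    latency_ms: float  # placeholder value, not a measurement


def time_to_first_audio_ms(hops: list[Hop]) -> float:
    """With strictly sequential hops, the perceived delay is the sum of all hops."""
    return sum(hop.latency_ms for hop in hops)


# Hypothetical numbers, for illustration only.
pipeline = [
    Hop("speech-to-text", 300.0),
    Hop("text LLM (first token)", 400.0),
    Hop("text-to-speech (first audio)", 200.0),
]
native = [Hop("speech-to-speech (first audio)", 500.0)]

print(time_to_first_audio_ms(pipeline))  # 900.0
print(time_to_first_audio_ms(native))    # 500.0
```

Swap in timings from your own stack; the structure of the sum matters more than the example values.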

The native speech model approach skips transcription entirely. The model takes audio in and produces audio out, which keeps latency low and preserves how something was said. The tradeoff has been reasoning depth. Earlier real-time speech models were fast and natural but thin on inference. You felt it in specific ways: the agent dropped the second half of a two-part instruction, lost the thread after an interruption, or confidently answered a question that required a step of logic it never took.

GPT-Realtime-2’s pitch is that the model doing the talking is now also the model doing the thinking, at a tier OpenAI labels GPT-5-class. The bar to watch is whether the agent can hold a multi-step task across interruptions — “book the 9am, no, the slot after that, and put it on my work calendar” — without a separate orchestration layer patching the gaps. That is the failure mode native speech models have owned, and it is the one this release is aimed at.

Speech-to-speech still forces an architecture decision

A reasoning-capable real-time model does not retire the pipeline-versus-native decision. It changes the inputs.

Native speech-to-speech wins on latency and on everything non-verbal — emotion, pacing, the cue that a user is about to interrupt. With reasoning folded in, you give up less by going native than you used to. But you also lose what a pipeline handed you for free: a text transcript you can log and audit, deterministic tool-calling you wrote yourself, and the freedom to swap the language model without re-architecting the audio path.

The honest read for most teams: if you already shipped a pipeline that works, a reasoning-capable native model is a reason to re-evaluate, not a reason to rip it out this quarter. If you are starting fresh, native speech-to-speech with reasoning built in is a stronger default than it was even six months ago.

What to test before you migrate

Treat the GPT-5-class label as a hypothesis to falsify, not a spec sheet. A short, structured eval will tell you more than any launch benchmark.

Multi-step retention: give the agent a three-part request, interrupt it halfway, and check that it still completes all three parts.
Interruption handling: talk over the agent mid-sentence and confirm it stops, listens, and folds in the new input instead of finishing its scripted reply.
Latency under load: measure time-to-first-audio with your actual system prompt and tool definitions, not a bare prompt.
Tool-call accuracy: voice agents fail loudly when they call the wrong function. Verify the model picks the right tool from a realistic set, not a toy set of two.
Graceful uncertainty: ask something the agent cannot know and confirm it says so, instead of inventing an answer in a confident voice.

Building that eval harness is itself a coding task — wiring the speech API, capturing audio timings, scoring transcripts — and it is the kind of glue code an AI-assisted editor speeds up.
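
As a rough starting point, the scoring half of such a harness might look like the sketch below. It assumes you have already captured, for each scripted conversation, the ordered list of tools the agent called. The `EvalCase` type and the tool names (`find_slot`, `book_slot`, `add_to_calendar`) are hypothetical and not part of any OpenAI API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One scripted conversation and what the agent did during it."""
    name: str
    expected_tools: tuple[str, ...]  # calls that must happen, in order
    observed_tools: tuple[str, ...]  # calls the agent actually made


def all_steps_completed(case: EvalCase) -> bool:
    """Multi-step retention: every expected call appears, in order,
    somewhere in the observed calls (extra calls are tolerated)."""
    remaining = iter(case.observed_tools)
    # `tool in remaining` advances the iterator, so order is enforced.
    return all(tool in remaining for tool in case.expected_tools)


def exact_tool_match(case: EvalCase) -> bool:
    """Tool-call accuracy: observed calls equal the expected calls exactly."""
    return case.observed_tools == case.expected_tools


def pass_rate(cases: list[EvalCase], check: Callable[[EvalCase], bool]) -> float:
    """Fraction of cases passing `check`; 0.0 for an empty suite."""
    if not cases:
        return 0.0
    return sum(check(c) for c in cases) / len(cases)


# Hypothetical cases modeled on the interrupted-booking example.
steps = ("find_slot", "book_slot", "add_to_calendar")
cases = [
    EvalCase("interrupted booking", steps,
             ("find_slot", "find_slot", "book_slot", "add_to_calendar")),
    EvalCase("dropped last step", steps, ("find_slot", "book_slot")),
]

print(pass_rate(cases, all_steps_completed))  # 0.5
print(pass_rate(cases, exact_tool_match))     # 0.0
```

Keeping the lenient check separate from the strict one is a deliberate choice: an agent that re-queries a slot after an interruption still completed the task, even though its call sequence is not an exact match.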

None of this argues for waiting. Voice agents have been bottlenecked on reasoning for two years, and a real-time model that closes that gap is a genuine unlock. It argues for migrating on evidence — your audio, your prompts, your latency budget — rather than on a launch headline.

FAQ

Is GPT-Realtime-2 a drop-in replacement for the existing Realtime API?

Treat it as a new model option rather than an automatic upgrade. Even when an API surface stays compatible, a model with deeper reasoning can shift response latency, verbosity, and tool-calling behavior. Re-run your evals against it before assuming your current integration behaves the same way.

Does GPT-5-class reasoning make voice agents slower?

It can. Reasoning takes compute, and compute takes time that is audible in a live conversation. How much depends on how the model is tuned for real-time use. Measure time-to-first-audio with your own prompts rather than assuming the latency you saw from a text model carries over.
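
One way to measure this, sketched under the assumption that your client exposes the model's reply as an iterable of audio byte chunks and that starting iteration is what sends the request. The function and the `fake_stream` stand-in are hypothetical helpers, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[float]:
    """Seconds from the start of iteration to the first non-empty chunk.

    Returns None if the stream ends without producing any audio.
    """
    start = clock()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return clock() - start
    return None


def fake_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a real audio stream: waits, then yields two chunks."""
    time.sleep(delay_s)
    yield b"\x00\x01"
    yield b"\x02\x03"


if __name__ == "__main__":
    ttfa = time_to_first_audio(fake_stream())
    print(f"time to first audio: {ttfa * 1000:.0f} ms")
```

The injectable `clock` keeps the function testable without real delays. In a real harness you would replace `fake_stream()` with your client's stream, configured with your production system prompt and tool definitions.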

Should I abandon my speech-to-text, LLM, and text-to-speech pipeline?

Not reflexively. Pipelines still give you auditable transcripts, deterministic tool-calling, and model portability. A reasoning-capable native model narrows that advantage but does not erase it. Run both in parallel, compare on your own metrics, and migrate when the native path wins.
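
What "wins" means is yours to define. The sketch below shows one possible rule, assuming each path has been run over the same scripted conversations and that each run records a time-to-first-audio and a pass/fail result. The `RunResult` type, the rule itself, and the sample numbers are all illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunResult:
    latency_ms: float  # time-to-first-audio for one conversation
    task_passed: bool  # did the agent complete the scripted task?


def summarize(results: list[RunResult]) -> tuple[float, float]:
    """Return (median latency in ms, task pass rate) for a non-empty list."""
    latencies = [r.latency_ms for r in results]
    passed = sum(r.task_passed for r in results)
    return median(latencies), passed / len(results)


def native_wins(
    pipeline: list[RunResult],
    native: list[RunResult],
    max_pass_rate_drop: float = 0.0,
) -> bool:
    """Assumed rule: native must be faster at the median and must not lose
    more than `max_pass_rate_drop` in pass rate against the pipeline."""
    p_latency, p_rate = summarize(pipeline)
    n_latency, n_rate = summarize(native)
    return n_latency < p_latency and n_rate >= p_rate - max_pass_rate_drop


# Placeholder results, not measurements.
pipeline_runs = [RunResult(900, True), RunResult(1000, True), RunResult(950, False)]
native_runs = [RunResult(500, True), RunResult(600, True), RunResult(550, False)]

print(native_wins(pipeline_runs, native_runs))  # True
```

A team that cares more about auditability or tail latency would encode a different rule. The point is to write the rule down before looking at the results.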
