Claude as a User-Space IP Stack: What an ICMP Ping Benchmark Reveals About LLM Latency
Adam Dunkels wired Claude into a user-space TCP/IP stack and benchmarked it against ICMP ping. The latency floor it reveals is the most honest stress test we have for agentic Claude API workflows.
Adam Dunkels — the engineer behind uIP and lwIP, the embedded TCP/IP stacks that ship in millions of devices — recently asked a deliberately absurd question: what if the IP stack itself were a language model? His experiment wires Claude into user space, hands it raw packets, and asks it to respond to ICMP echo requests like any other host on the network.
The setup is whimsical. The latency numbers are not. Once you stop laughing at the idea of pinging an LLM, the benchmark becomes one of the more honest stress tests we have for agentic Claude API workflows.
The Experiment: Routing ICMP Through a Language Model
Dunkels’ rig hands Claude the bytes of an inbound ICMP echo request and asks it to produce the bytes of the correct ICMP echo reply. There is no clever pre-processing. The model has to understand the IP header, swap source and destination addresses, recalculate the checksum, and emit a well-formed response packet.
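For reference, the deterministic version of that work is a few dozen lines. Here is a minimal Python sketch (not Dunkels' code, just the textbook recipe): read the header length, swap source and destination addresses, flip the ICMP type from 8 to 0, and recompute both RFC 1071 checksums.

```python
import struct

def inet_checksum(data: bytes) -> int:
    """RFC 1071 Internet checksum: one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF

def echo_reply(request: bytes) -> bytes:
    """Build an ICMP echo reply from a raw echo request (IP header + ICMP)."""
    ihl = (request[0] & 0x0F) * 4                 # IP header length in bytes
    ip, icmp = bytearray(request[:ihl]), bytearray(request[ihl:])
    ip[12:16], ip[16:20] = request[16:20], request[12:16]  # swap src and dst
    ip[10:12] = b"\x00\x00"                       # zero the IP checksum field...
    ip[10:12] = struct.pack("!H", inet_checksum(bytes(ip)))    # ...then recompute
    icmp[0] = 0                                   # type 8 (request) -> 0 (reply)
    icmp[2:4] = b"\x00\x00"                       # same dance for the ICMP checksum
    icmp[2:4] = struct.pack("!H", inet_checksum(bytes(icmp)))
    return bytes(ip + icmp)
```

A kernel does this work in nanoseconds. The experiment asks Claude to reproduce the same bytes through token generation, which is both the joke and the benchmark.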
The reason this works at all is that the protocol is small, deterministic, and famously well documented. The reason it is slow is that every hop through the stack now includes a Claude API roundtrip — a TLS handshake (or pooled connection), token generation, and a response back to user space.
A kernel-resident IP stack answers a ping in tens to hundreds of microseconds. A round trip on a residential network is typically 10–40 milliseconds. Claude, as a user-space IP stack, lives several orders of magnitude further out. That gap is the entire point.
Why Latency Matters: Where Agentic Loops Actually Break
If you build with the Claude API, you already know the model is not instant. But the ping benchmark is useful because it strips the workload down to almost nothing — a few dozen bytes in, a few dozen bytes out — and the latency is still dominated by inference, not network or compute.
That has practical consequences for how you design agents:
- Tool-use loops compound. An agent that takes ten round trips to plan, call a tool, observe, and replan is multiplying a per-call latency that already starts in the hundreds of milliseconds. The ping floor tells you what the cheapest possible step costs.
- Streaming hides nothing on the first token. Time-to-first-token still gates any interaction that needs a complete response before the next step. Ping responses are short enough that TTFT and full-response latency converge — exactly the regime most tool calls live in.
- Per-request variance is real. Anyone who has run a Claude API workload at scale has seen p50 and p99 diverge sharply under load. A ping benchmark surfaces that variance honestly, because the workload is otherwise constant.
We ran our own back-of-the-envelope estimate of what this means for agent design: if a thinking step in a multi-step agent costs roughly one Claude-ping worth of latency, then a ten-step plan is already in the multi-second range before you account for tool execution, retries, or rate limits. That is fine for an editor companion. It is painful for anything in front of a user clicking a button.
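That arithmetic is easy to check against your own deployment rather than taking the article's numbers on faith. A minimal sketch, assuming the official `anthropic` Python SDK; the model name and prompt are placeholders:

```python
import time
import anthropic  # assumes the official anthropic Python SDK

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"     # placeholder: substitute whichever model you deploy

def time_one_step(prompt: str) -> tuple[float, float]:
    """Measure time-to-first-token and full-response latency for one call."""
    start = time.perf_counter()
    ttft = None
    with client.messages.stream(
        model=MODEL,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for _ in stream.text_stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # first token arrives
    return ttft, time.perf_counter() - start

ttft, total = time_one_step("Reply with the single word: pong")
steps = 10  # a ten-step agent plan
print(f"TTFT {ttft:.3f}s, full response {total:.3f}s, "
      f"{steps}-step floor ≈ {steps * total:.1f}s before tools or retries")
```

On a short, ping-sized response, TTFT and total latency converge, which is the regime the bullet points above describe.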
Practical Lessons for Building With the Claude API
The Dunkels experiment is fun. The lessons are boring, and that is the point. If you read the benchmark and walk away with three rules, you have extracted most of the value:
- Use the LLM at the right altitude. Do not ask Claude to do what `memcpy` and a checksum routine already do. Ask it to do what a deterministic function cannot: interpret intent, summarize, decide between options, or write code that runs later.
- Budget latency before you build the agent. Multiply your worst-case step latency by your expected step count. If the product is more than your user will tolerate, redesign before you write the prompts.
- Cache aggressively at the prompt boundary. Prompt caching is the single biggest lever for cutting per-step latency on repeated workloads — and the ping benchmark is implicitly an uncached workload, which is why the floor looks the way it does.
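On that third rule, the caching lever looks roughly like this in practice. A minimal sketch, again assuming the official `anthropic` Python SDK and its `cache_control` content-block form for prompt caching; the system-prompt file and model name are placeholders:

```python
import anthropic  # assumes the official anthropic Python SDK

client = anthropic.Anthropic()
LONG_SYSTEM_PROMPT = open("agent_system_prompt.txt").read()  # placeholder file

# Mark the stable prefix as cacheable: repeated calls that share this
# prefix skip re-processing it, cutting per-step latency and cost.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: substitute your model
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Plan the next agent step."}],
)
print(response.content[0].text)
```

Every step of an agent loop reuses the same system prompt and tool definitions, so the cached prefix pays off on the second call and every call after it.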
The takeaway is not that Claude is slow. It is that Claude is a particular shape of fast — fast at language, slow at bytes — and the systems you build need to respect that shape.
When LLM-in-the-Loop Networking Actually Makes Sense
There is a serious version of this experiment buried inside the joke. LLMs in the network stack are absurd at the ICMP layer. They are interesting at the policy layer — deciding what to do with a flagged packet, summarizing a flow record, judging whether a request looks like abuse. Anywhere the work is "read this, decide that," the latency cost of a Claude call competes against the human or the rule engine you would otherwise reach for, not against a kernel routine.
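Concretely, the policy-layer shape is one model call per flagged flow, not one per packet. A hedged sketch, with an illustrative flow record and placeholder model name, again assuming the official `anthropic` Python SDK:

```python
import anthropic  # assumes the official anthropic Python SDK

client = anthropic.Anthropic()

# Illustrative flow record; in practice this comes from your flow collector.
flow = "src=203.0.113.7 dst=198.51.100.2 proto=TCP dport=22 pkts=4812 dur=2s"

# One Claude call per flagged flow: the latency competes with a human
# analyst or a rule engine, not with a kernel fast path.
verdict = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: substitute your model
    max_tokens=50,
    system="You are a network-abuse triage assistant. "
           "Answer with exactly one word: benign, suspicious, or abusive.",
    messages=[{"role": "user", "content": f"Classify this flow record: {flow}"}],
)
print(verdict.content[0].text)
```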
The ping benchmark sets the lower bound. Your job, as a developer building on the Claude API, is to keep the work above that bound — and to make sure the latency you pay buys you something a regex could not.