The Circuit Breaker Pattern, Explained for Resilient Systems

One dependency starts answering slowly. Your service keeps calling it, every request piling up behind a 30-second timeout. Threads block. The connection pool drains. Now requests that never touched the slow dependency are timing out too, because there are no threads left to serve them. A single degraded downstream has taken down a service that didn’t need it.

The circuit breaker pattern exists to cut that chain. It wraps a call to a remote dependency and watches the failure rate. When failures cross a threshold, it stops calling — it “trips” — and fails fast instead of waiting on timeouts. The name comes straight from the electrical panel: when current spikes, the breaker opens so the wiring behind it doesn’t melt.

The three states

A circuit breaker is a small state machine with three states, and understanding them is most of understanding the pattern.

Closed is the normal state. Calls pass through to the dependency. The breaker counts outcomes — successes, failures, and (in better implementations) slow calls that succeed but took too long. As long as the failure rate stays under the threshold, nothing changes.

Open is the tripped state. The breaker has seen too many failures and stops forwarding calls entirely. Every request returns immediately with an error or a fallback value — no network call, no timeout wait. This is the whole point: a call that fails in microseconds instead of blocking a thread for 30 seconds. The breaker stays open for a fixed wait duration.

Half-open is the probe. After the wait duration elapses, the breaker lets a small number of trial calls through. If they succeed, it assumes the dependency recovered and returns to closed. If they fail, it snaps back to open and waits again. Half-open is what lets the system heal on its own without a human restarting anything.
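The three states can be sketched as a small class. This is a minimal, single-threaded illustration and not any library's implementation: the names (CircuitBreaker, CircuitOpenError, failure_threshold, trial_calls) are invented for this example, and it trips on consecutive failures instead of a failure rate to keep the state logic easy to follow. The clock is injectable so the wait duration can be tested without sleeping.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    # Hypothetical sketch: trips on consecutive failures, not a failure rate.
    def __init__(self, failure_threshold=3, wait_seconds=60.0,
                 trial_calls=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_seconds = wait_seconds
        self.trial_calls = trial_calls
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.trial_successes = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.wait_seconds:
                # Fail fast: no network call, no timeout wait.
                raise CircuitOpenError("circuit is open")
            # Wait duration elapsed: start probing.
            self.state = State.HALF_OPEN
            self.trial_successes = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe snaps straight back to open
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _on_success(self):
        if self.state is State.HALF_OPEN:
            self.trial_successes += 1
            if self.trial_successes >= self.trial_calls:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive count

    def _trip(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

A caller wraps each remote call as breaker.call(fetch_profile, user_id), where fetch_profile is whatever function talks to the dependency, and treats CircuitOpenError as the signal to use a fallback.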

What the thresholds actually are

The pattern is simple. Tuning it is where teams get it wrong, usually by guessing at numbers. It helps to look at what a mature library ships as defaults. Resilience4j, the widely used JVM implementation, defaults to:

Failure rate threshold: 50%. The breaker trips when half the calls in its window fail.
Sliding window size: 100 calls, count-based. Failure rate is computed over the last 100 calls, not all-time.
Minimum number of calls: 100. The breaker won’t trip until it has seen at least 100 calls, so a single early failure can’t open it on a cold start.
Wait duration in open state: 60 seconds. How long it stays open before probing.
Permitted calls in half-open: 10. The number of trial calls it allows before deciding recovered-or-not.
Slow call duration threshold: 60 seconds, with a slow-call rate threshold of 100%.

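The two slow-call settings matter more than people expect. A dependency that returns HTTP 200 in 25 seconds is technically succeeding while it quietly destroys your latency budget. Counting slow-but-successful calls against the breaker is what separates a useful breaker from one that only reacts to hard errors. With the defaults above, though, slow-call detection is effectively switched off: a call counts as slow only past 60 seconds, so that 25-second response never registers until you lower the threshold.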

The right values are workload-specific. A high-traffic service might use a sliding window of seconds rather than a call count, because 100 calls can arrive in a blink. A low-traffic internal endpoint needs a small minimum-calls value or the breaker will never accumulate enough samples to trip at all.
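To make the threshold arithmetic concrete, here is a rough sketch of a count-based window. The class and parameter names are made up for this example; only the default values are copied from the list above, and the code is not meant to mirror Resilience4j's internals.

```python
from collections import deque


class CountBasedWindow:
    """Tracks the most recent call outcomes and decides whether to trip."""

    def __init__(self, size=100, minimum_calls=100,
                 failure_rate_threshold=50.0,
                 slow_call_seconds=60.0, slow_call_rate_threshold=100.0):
        # deque(maxlen=...) drops the oldest outcome once the window is full.
        self.outcomes = deque(maxlen=size)
        self.minimum_calls = minimum_calls
        self.failure_rate_threshold = failure_rate_threshold
        self.slow_call_seconds = slow_call_seconds
        self.slow_call_rate_threshold = slow_call_rate_threshold

    def record(self, failed, duration_seconds):
        slow = duration_seconds > self.slow_call_seconds
        self.outcomes.append((failed, slow))

    def _rate(self, index):
        if not self.outcomes:
            return 0.0
        hits = sum(1 for outcome in self.outcomes if outcome[index])
        return 100.0 * hits / len(self.outcomes)

    def failure_rate(self):
        return self._rate(0)

    def slow_call_rate(self):
        return self._rate(1)

    def should_trip(self):
        # Too few samples: stay closed no matter how bad they look.
        if len(self.outcomes) < self.minimum_calls:
            return False
        return (self.failure_rate() >= self.failure_rate_threshold
                or self.slow_call_rate() >= self.slow_call_rate_threshold)
```

With the defaults, a window full of successful 25-second calls never trips, because none of them crosses the 60-second slow-call line. Constructing the window with slow_call_seconds=2.0 and slow_call_rate_threshold=50.0 (values picked purely for illustration) makes the same traffic trip it.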

Fallbacks, and the parts people skip

A tripped breaker has to return something. The fallback is what your code does when the circuit is open, and a good one is the difference between graceful degradation and a stack trace in the user’s face.

Good fallbacks are honest about being degraded: a cached value from the last successful call, an empty list with a “results may be incomplete” flag, a default recommendation set instead of personalized ones. Bad fallbacks hide the problem: silently swallowing the error and returning success makes the outage invisible until bad data corrupts something downstream.
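One way to keep a fallback honest is to return the value together with a flag saying where it came from. The sketch below assumes a hypothetical fetch callable that raises when the dependency fails or the breaker is open; Result and CachedFallback are names made up for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Result:
    value: Any
    degraded: bool  # True when the value did not come from a live call


class CachedFallback:
    """Serves the last good value, clearly marked, when the live call fails."""

    def __init__(self, fetch: Callable[[], Any], default: Any):
        self.fetch = fetch
        self.default = default
        self.has_cached = False
        self.last_good = None

    def get(self) -> Result:
        try:
            value = self.fetch()
        except Exception:
            # Covers both a real failure and a breaker's fail-fast rejection.
            if self.has_cached:
                return Result(self.last_good, degraded=True)
            return Result(self.default, degraded=True)
        self.last_good = value
        self.has_cached = True
        return Result(value, degraded=False)
```

The caller can then surface the degraded flag to the user or to the next service instead of passing stale data off as fresh.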

Three details routinely get missed:

Per-dependency breakers, not one global breaker. Each remote dependency needs its own breaker instance. A shared breaker means a flaky recommendations service can trip the circuit for your healthy payments service. Isolate them.

Breakers pair with timeouts; they don’t replace them. A breaker counts failures, but a call has to fail before it counts. Without an aggressive timeout, your first wave of calls still blocks for the full 30 seconds before the breaker has anything to react to. The timeout produces the failures; the breaker reacts to the pattern of them.

Emit metrics on state transitions. Every open/half-open/closed transition should be a logged event and a metric. A breaker that trips and recovers silently robs you of the single clearest signal that a dependency is struggling. The transition log is often the first thing that tells you which downstream broke.
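The three details above can be sketched together: a registry that hands out one breaker per dependency, a transition hook that logs and counts every state change, and a back-of-envelope function for the timeout point. Everything here is illustrative. SimpleBreaker is a deliberately tiny stand-in and not a full breaker, the Counter stands in for a real metrics client, and all names are invented for this example.

```python
import logging
from collections import Counter

logger = logging.getLogger("circuit_breaker")
transition_counts = Counter()  # stand-in for a real metrics client


class SimpleBreaker:
    """Tiny stand-in breaker: opens after `threshold` consecutive failures."""

    def __init__(self, name, threshold=3):
        self.name = name
        self.threshold = threshold
        self.failures = 0
        self.state = "closed"

    def _transition(self, new_state):
        old, self.state = self.state, new_state
        # Every state change becomes both a metric and a log line.
        transition_counts[(self.name, old, new_state)] += 1
        logger.warning("breaker=%s %s -> %s", self.name, old, new_state)

    def record_failure(self):
        self.failures += 1
        if self.state == "closed" and self.failures >= self.threshold:
            self._transition("open")

    def record_success(self):
        self.failures = 0


class BreakerRegistry:
    """Hands out exactly one breaker per dependency name."""

    def __init__(self):
        self._breakers = {}

    def for_dependency(self, name):
        if name not in self._breakers:
            self._breakers[name] = SimpleBreaker(name)
        return self._breakers[name]


def seconds_until_pool_exhausted(pool_size, calls_per_second, timeout_seconds):
    """Time until every thread is stuck on a hung dependency, or None if the
    timeout frees threads fast enough that the pool never fills."""
    in_flight_at_steady_state = calls_per_second * timeout_seconds
    if in_flight_at_steady_state < pool_size:
        return None
    return pool_size / calls_per_second
```

With hypothetical numbers of 200 threads and 50 calls per second to a hung dependency, a 30-second timeout lets the pool fill in 4 seconds, long before the first call fails and the breaker has anything to count. A 2-second timeout caps the calls in flight at about 100, so the pool never fills.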

If you implement this yourself, you’ll reinvent these one painful incident at a time. Reach for an established library — Resilience4j on the JVM, Polly in .NET, or the breaker built into your service mesh (Istio and Envoy both do this at the proxy layer, no code change) — before you hand-roll a state machine.

FAQ

How is a circuit breaker different from a retry?

They solve opposite halves of the same problem. A retry assumes the failure is transient and tries again — useful for a dropped packet, harmful for an overloaded server, since retries add load to something already drowning. A circuit breaker assumes repeated failure means the dependency is genuinely down and stops calling entirely. Production systems use both: retry with backoff for individual transient errors, a breaker to give up once failures become a pattern.
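A rough sketch of how the two fit together: retry transient errors with exponential backoff and jitter, but stop immediately when the breaker says the circuit is open. CircuitOpenError is a hypothetical exception standing in for whatever your breaker raises when it rejects a call, and the delay values are arbitrary.

```python
import random
import time


class CircuitOpenError(Exception):
    """Hypothetical: raised by a breaker that is rejecting calls."""


def retry_with_backoff(fn, attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            # The breaker has already given up; retrying only adds load.
            raise
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff, capped, with full jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Here fn would be the breaker-wrapped call, so each retry attempt is also counted by the breaker. The sleep parameter exists only so the delay can be swapped out in tests.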

What is the half-open state for?

It's the self-healing probe. After the breaker has been open for its wait duration, half-open lets a small number of trial calls through to test whether the dependency recovered. Success closes the circuit and resumes normal traffic; failure reopens it. Without half-open, a tripped breaker would either stay open forever or need a human to reset it.

Should every external call have a circuit breaker?

Wrap calls to dependencies whose failure could exhaust your resources or cascade — databases, third-party APIs, internal microservices reached over the network. In-process calls and trivially cheap operations don't need one; the breaker's bookkeeping isn't free. The test is whether a slow or failing response could block threads or connections that other requests need.

The pattern is small, but it changes the failure mode of a distributed system from “one slow dependency takes everything down” to “one dependency degrades and the rest keeps serving.” That’s the entire reason it’s worth the bookkeeping.
