Temporal Deep-Dive: Durable Execution That Survives Process Death and Network Outages
We built payment processing, user onboarding, and AI orchestration on Temporal — measuring durability, replay, and SDK learning curve vs Step Functions and job queues. Review of workflow-as-code, deterministic execution, and where durable execution replaces retry logic.
I built three production workflows on Temporal in early 2026: a payment processing pipeline that coordinates Stripe charges, inventory updates, and email notifications across four services; a user onboarding workflow that provisions accounts, sends verification emails, and schedules follow-up reminders over seven days; and an AI agent orchestration layer that manages long-running LLM tasks with human-in-the-loop approval steps. Each of these workflows had previously been implemented as a combination of SQS queues, Lambda functions, and DynamoDB state tables — an architecture that worked until it didn’t, typically when a service failed mid-workflow and the recovery logic was incomplete, inconsistent, or never tested. Temporal replaced approximately 1,200 lines of retry logic, dead-letter queue handling, and state management code with 300 lines of workflow definitions, and the reliability improvement was not incremental — it was categorical.
What Durable Execution Actually Means
Durable execution is Temporal’s core abstraction, and it is worth defining precisely because the term is used loosely elsewhere. In Temporal, a workflow is a function whose progress is persisted so that it runs to completion regardless of process crashes, network partitions, server reboots, or deployment cycles, and its state advances as if the code had run exactly once. If the process running a workflow dies at instruction 47 in a 200-instruction function, Temporal replays the workflow from the beginning on another worker: the deterministic workflow code runs again, but activity invocations, timers, and other external effects that already completed are resolved from the event history rather than executed again. Only operations that had not completed are actually scheduled.
The mechanism is replay. Temporal records every external effect of a workflow (each activity result, each timer expiration, each signal received) in an append-only event history stored by the Temporal server. When a workflow resumes after a failure, Temporal replays the history from the beginning, feeding the recorded results to the workflow code at the same points where they originally occurred. The workflow code must be deterministic: given the same event history, it must produce the same control flow and make the same decisions. If it does not (if a random number differs, if a timestamp is regenerated, if a conditional branch takes a different path), Temporal detects that the commands produced on replay no longer match the history and raises a non-determinism error. The workflow then typically stops making progress until code compatible with its history is deployed.
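The replay mechanism described above can be sketched with a toy model. This is not Temporal's implementation or API: `HistoryEvent`, `runWorkflow`, and `ctx.activity` are hypothetical names, and activities are synchronous to keep the example short. It shows only the core idea, that recorded results are fed back to the code and a mismatch between code and history is treated as an error.

```typescript
// Toy model of replay-based recovery. NOT Temporal's implementation or API;
// every name here (HistoryEvent, runWorkflow, ctx.activity) is hypothetical.
type HistoryEvent = { activityName: string; result: unknown };

interface WorkflowContext {
  activity<T>(activityName: string, fn: () => T): T;
}

// Runs `workflow` against `log`. Recorded results are replayed instead of
// re-running the side effect; new activities are executed and appended.
function runWorkflow<R>(log: HistoryEvent[], workflow: (ctx: WorkflowContext) => R): R {
  let cursor = 0;
  const ctx: WorkflowContext = {
    activity<T>(activityName: string, fn: () => T): T {
      if (cursor < log.length) {
        const recorded = log[cursor++];
        if (recorded.activityName !== activityName) {
          throw new Error(
            `non-determinism: history has "${recorded.activityName}" but code requested "${activityName}"`,
          );
        }
        return recorded.result as T; // replay: the side effect is skipped
      }
      const result = fn(); // first execution: run the side effect and record it
      log.push({ activityName, result });
      cursor++;
      return result;
    },
  };
  return workflow(ctx);
}

// Example: the "process" dies after the first activity, then recovers.
let charges = 0;
const eventLog: HistoryEvent[] = [];
const orderWorkflow = (ctx: WorkflowContext, crashAfterCharge: boolean): string => {
  const chargeId = ctx.activity("charge", () => `ch_${++charges}`);
  if (crashAfterCharge) throw new Error("simulated process death");
  return ctx.activity("receipt", () => `receipt for ${chargeId}`);
};

try {
  runWorkflow(eventLog, (ctx) => orderWorkflow(ctx, true));
} catch {
  // The worker died; the log survives because it lives outside the worker.
}
const outcome = runWorkflow(eventLog, (ctx) => orderWorkflow(ctx, false));
console.log(outcome, "charges made:", charges);
```

In this sketch the second run reuses the recorded charge instead of charging again, and code that asks for an activity the log does not expect at that position fails fast rather than silently diverging.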
The SDK: Writing Workflows as Code, Not Configuration
Temporal’s TypeScript SDK is, in my experience, the most polished of the official SDKs (which include TypeScript, Go, Java, and Python, among others), and it is the one I used for all three workflows. A workflow is an exported async TypeScript function that calls Temporal’s workflow API and runs in an isolated, deterministic sandbox. Activities are ordinary async functions that run outside that sandbox in the worker’s normal Node.js environment, where network and disk I/O are allowed.
```typescript
// Workflow definition: runs in Temporal's workflow runtime
import { proxyActivities, condition, defineSignal, setHandler } from "@temporalio/workflow";
import type * as activities from "./activities";

const { chargeStripe, updateInventory, sendReceipt, refundStripe } = proxyActivities<typeof activities>({
  startToCloseTimeout: "30 seconds",
  retry: { maximumAttempts: 3, backoffCoefficient: 2 },
});

// Signal sent when a refund for this order has been approved
export const refundApproved = defineSignal("refundApproved");

export async function paymentWorkflow(orderId: string, amount: number): Promise<string> {
  const charge = await chargeStripe(orderId, amount);
  await updateInventory(orderId, "reserved");
  await sendReceipt(orderId, charge.id);

  // Register the signal handler once, before waiting
  let approved = false;
  setHandler(refundApproved, () => { approved = true; });

  // Wait up to 72 hours for the refund signal. condition() resolves to true
  // if the predicate became true and false if the timeout elapsed first.
  if (await condition(() => approved, "72 hours")) {
    await refundStripe(charge.id);
    await updateInventory(orderId, "available");
    return "refunded";
  }

  await updateInventory(orderId, "fulfilled");
  return "completed";
}
```
The workflow language (signals, timers, activities, conditionals, loops, parallel execution) maps directly to business logic constructs. A payment workflow that charges a card, reserves inventory, sends a receipt, and waits up to 72 hours for a refund signal before either processing the refund or fulfilling the order is expressed in about 30 lines of TypeScript. The equivalent SQS-and-Lambda implementation required approximately 400 lines of producer code, consumer code, dead-letter handling, and state management spread across five Lambda functions and two DynamoDB tables, none of which survived a mid-workflow deployment or a transient database outage without manual intervention.
How Temporal Compares to AWS Step Functions
The comparison that every team evaluating Temporal asks is how it stacks up against AWS Step Functions, the managed workflow service that is the closest architectural analog in the AWS ecosystem. I have used Step Functions for two production workflows, and the differences are structural.
State machine vs. code. Step Functions defines workflows as JSON state machines — a DSL of task states, choice states, parallel states, and wait states connected by transitions. The JSON is verbose and difficult to version control meaningfully — a diff of a Step Functions ASL file does not communicate intent the way a diff of TypeScript code does. Temporal defines workflows as code in a general-purpose language, which means conditional logic, loops, and error handling are expressed in TypeScript rather than JSON control flow operators.
Durability model. Step Functions’ guarantees depend on the workflow type: Standard workflows provide exactly-once workflow execution, while Express workflows provide at-least-once. If a Lambda task fails, Step Functions retries according to the retry policy configured in the state machine definition. Temporal persists every workflow decision, timer expiration, and signal in the event history and recovers by replay after process failure, so workflow state advances effectively once, while activities themselves run at least once. This removes most of the class of bugs where an SQS message is processed twice and the deduplication logic around workflow state is incorrectly implemented, although side-effecting activities still need to be idempotent.
Operational model. Step Functions is a fully managed AWS service with no self-hosted server component. You pay per state transition and never manage infrastructure. Temporal requires running the Temporal Server, either self-hosted as a cluster of Go services backed by a database, or as Temporal Cloud, which is a managed offering. Temporal Cloud pricing is consumption-based and metered in actions rather than hours; it starts at approximately $25 per million actions, with a free tier for development, and the rate is worth verifying against the current price list. The infrastructure overhead of running Temporal Server is real: it requires a database backend (PostgreSQL, MySQL, or Cassandra), persistence configuration, and monitoring, and it is the single largest barrier to adoption for teams that are not already managing stateful infrastructure.
Cost at scale. My payment workflow processes approximately 4,800 workflows per day with an average of 6 state transitions each. At the Standard Workflows list price of $0.025 per 1,000 state transitions, Step Functions would cost approximately $18 to $22 per month in state transition fees, depending on whether that volume runs 25 or 30 days a month. Temporal Cloud would cost approximately $12 per month in action billing; a self-hosted deployment would instead carry the cost of the server infrastructure. The cost difference is not large enough to drive the decision; the functional differences in durability guarantees and developer experience are.
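The Step Functions figure can be checked with a few lines of arithmetic. The rate below is an assumption (the Standard Workflows list price of $0.025 per 1,000 state transitions), the free tier and regional price differences are ignored, and `monthlyTransitionCost` is a hypothetical helper.

```typescript
// Back-of-the-envelope Step Functions cost check.
// Assumed rate: $0.025 per 1,000 state transitions (Standard Workflows).
const USD_PER_TRANSITION = 0.025 / 1000;

function monthlyTransitionCost(
  workflowsPerDay: number,
  transitionsPerWorkflow: number,
  daysPerMonth: number,
): number {
  const transitions = workflowsPerDay * transitionsPerWorkflow * daysPerMonth;
  return transitions * USD_PER_TRANSITION;
}

// 4,800 workflows/day at 6 transitions each, for 25 and 30 days of volume.
const cost25Days = monthlyTransitionCost(4800, 6, 25);
const cost30Days = monthlyTransitionCost(4800, 6, 30);
console.log(cost25Days.toFixed(2), cost30Days.toFixed(2));
```

Under these assumptions, 25 days of volume comes to about $18 and a full 30-day month to about $21.60, so the monthly bill stays in the low tens of dollars either way.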
Use Cases Where Temporal Replaces an Entire Retry Stack
The workflows I built on Temporal replaced components that I now recognize as fragile retry stacks — layers of retry logic, exponential backoff, dead-letter queues, and idempotency guards that attempted to achieve the exactly-once durability that Temporal provides natively.
Payment processing. Charging a card, updating inventory, and sending a receipt is a three-step workflow where partial completion is a serious bug — you cannot charge a card without updating inventory, and you cannot send a receipt without confirming the charge. My previous implementation used a Saga pattern with compensating transactions implemented in application code. Temporal encodes the compensation logic directly in the workflow: if the inventory update fails after a successful charge, the workflow executes a refund activity as a compensating action. The workflow guarantees that either all three steps complete or the compensation runs.
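The compensation pattern described above can be sketched without any Temporal dependency. `Step` and `runSaga` are hypothetical names, and the steps are synchronous for brevity; in a real workflow each step would be an awaited activity and the platform would handle retries before compensation is considered.

```typescript
// Toy saga runner. NOT Temporal's API: Step, runSaga and the step names are
// hypothetical, and real activities would be async.
interface Step {
  label: string;
  run: () => void;
  compensate?: () => void; // undo action, if the step has one
}

// Runs steps in order. On failure, runs the compensations of the steps that
// already succeeded, most recent first, then reports what happened.
function runSaga(steps: Step[]): { ok: boolean; log: string[] } {
  const log: string[] = [];
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`done:${step.label}`);
      completed.push(step);
    } catch {
      log.push(`failed:${step.label}`);
      for (const done of completed.reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`compensated:${done.label}`);
        }
      }
      return { ok: false, log };
    }
  }
  return { ok: true, log };
}

// The charge succeeds, the inventory update fails, so the charge is refunded
// and the receipt step never runs.
let charged = false;
const sagaResult = runSaga([
  { label: "charge", run: () => { charged = true; }, compensate: () => { charged = false; } },
  { label: "inventory", run: () => { throw new Error("inventory service down"); } },
  { label: "receipt", run: () => {} },
]);
console.log(sagaResult.log.join(" -> "));
```

Keeping the compensation next to the step it undoes is the design choice that matters here: the undo logic cannot drift out of sync with the forward path, because both live in the same list.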
User onboarding. Sending a verification email, waiting up to 72 hours, then either proceeding with provisioning or canceling the onboarding is a long-running workflow with a timer-dependent branch. My previous implementation used a database table to track onboarding state and a cron job to check for expired timers, a polling architecture that added latency and database load. Temporal’s sleep() function pauses the workflow on a durable timer without consuming compute resources while it waits. To resume when either the timer expires or a signal arrives, whichever comes first, the workflow waits on condition() with a timeout (or races a timer against a signal-driven promise) instead of polling.
AI agent orchestration. Managing long-running LLM tasks (prompt execution, response validation, human approval, tool invocation) is the use case where Temporal’s durability guarantees matter most. An LLM API call can take 30 seconds to complete. If the worker managing the call crashes at second 29, no result was recorded, so Temporal retries the activity once its timeout expires; if the call had already completed, replay uses the recorded result instead of calling the LLM again. My AI agent workflow coordinates five LLM calls, two human approval signals, and three tool invocations over an average span of 4 minutes, and it has survived two deployment cycles and one database failover without a single workflow failure.
The Learning Curve and Where Teams Struggle
Temporal’s developer experience has two phases: the first two weeks, where everything feels intuitive because workflows look like regular functions; and the first production incident, where the determinism constraint reveals itself in an edge case that the documentation did not prepare you for.
The most common production failure I observed, both in my own workflows and in the Temporal community, is non-deterministic iteration. In JavaScript and TypeScript, for...in and Object.keys() visit string keys in insertion order (after any integer-like keys, which come first in ascending order); older engines left the order implementation-defined. The order is therefore only as deterministic as the code that built the object. If insertion order depends on anything outside the event history, a workflow that iterates the object to decide which activity to schedule next can issue commands in a different sequence than the history records, which Temporal reports as a non-determinism violation. The fix is to iterate over Object.keys(obj).sort(), or to build a Map in a fixed, data-independent insertion order. These are patterns that the deterministic linter does not always catch.
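A minimal sketch of the sorted-keys fix, assuming plain string-keyed objects. `sortedEntries` is a hypothetical helper and not part of any Temporal SDK; the two objects hold the same data but were built in a different order.

```typescript
// Sketch: make iteration order independent of how an object was built.
// sortedEntries is a hypothetical helper, not part of any Temporal SDK.
function sortedEntries<T>(obj: { [key: string]: T }): Array<[string, T]> {
  return Object.keys(obj)
    .sort() // default sort compares UTF-16 code units, so it is repeatable
    .map((key): [string, T] => [key, obj[key]]);
}

// Same data, different insertion order.
const firstRun: { [key: string]: number } = {};
firstRun["inventory"] = 2;
firstRun["billing"] = 1;

const replayRun: { [key: string]: number } = {};
replayRun["billing"] = 1;
replayRun["inventory"] = 2;

// Plain key order follows insertion order, so it differs between the two.
const plainOrderDiffers =
  Object.keys(firstRun).join() !== Object.keys(replayRun).join();

// Sorted iteration yields the same sequence for both objects.
const sortedOrderMatches =
  JSON.stringify(sortedEntries(firstRun)) === JSON.stringify(sortedEntries(replayRun));

console.log({ plainOrderDiffers, sortedOrderMatches });
```

Any workflow decision that depends on iteration order (which activity to schedule first, how a payload is assembled) can go through a helper like this so that the sequence depends only on the keys themselves.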
The second challenge is activity idempotency. Temporal retries activities automatically, but it does not guarantee that an activity executes exactly once — only that the workflow sees exactly one result. If an activity writes to a database and the write succeeds but the activity process crashes before returning the result, Temporal retries the activity, and the database receives a second write. Activities must be idempotent — the application must handle the case where the same activity executes multiple times but produces the same observable effect. Temporal’s SDK does not enforce idempotency; it is the application developer’s responsibility.
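One common way to meet this requirement is an idempotency key checked before the write. The sketch below is a toy: the in-memory `ledger` stands in for a database table with a unique constraint on the key, and `recordCharge` and the key format are hypothetical. In a real activity the key might be derived from the workflow ID and activity ID, which stay the same across retries; that is an assumption about how you would wire it up, not something the SDK does for you.

```typescript
// Toy idempotent activity. The in-memory ledger stands in for a database
// table with a unique constraint on the idempotency key; all names are
// hypothetical and nothing here is Temporal SDK API.
interface LedgerRow {
  orderId: string;
  amountCents: number;
}

const ledger: { [idempotencyKey: string]: LedgerRow } = {};
let writes = 0;

// Records a charge at most once per key. A retry with the same key returns
// the existing row instead of writing a second one.
function recordCharge(idempotencyKey: string, row: LedgerRow): LedgerRow {
  const existing = ledger[idempotencyKey];
  if (existing !== undefined) {
    return existing; // duplicate delivery: no new side effect
  }
  ledger[idempotencyKey] = row;
  writes++;
  return row;
}

// The first attempt writes the row. Suppose the worker then crashes before
// reporting the result, so the activity is retried with the same input.
const chargeKey = "order-1001:charge";
const firstAttempt = recordCharge(chargeKey, { orderId: "order-1001", amountCents: 4999 });
const retryAttempt = recordCharge(chargeKey, { orderId: "order-1001", amountCents: 4999 });
console.log("writes:", writes);
```

Both attempts return the same row and the ledger is written once, which is the property the workflow relies on when an activity is retried after a lost result.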
The third challenge is testing. Temporal provides a test framework that runs workflows in a simulated Temporal environment with mock activities, but replay-based workflows are inherently harder to test than stateless functions because the test must simulate the event history that the workflow replays. Temporal’s TestWorkflowEnvironment handles this simulation, but writing tests that exercise retry paths, timer expirations, and signal races requires understanding Temporal’s internal event model. I estimate that thorough Temporal workflow tests take approximately 2.5 times longer to write than unit tests for equivalent stateless logic.