Temporal Deep-Dive: Durable Execution That Survives Process Death and Network Outages

I built three production workflows on Temporal in early 2026: a payment processing pipeline that coordinates Stripe charges, inventory updates, and email notifications across four services; a user onboarding workflow that provisions accounts, sends verification emails, and schedules follow-up reminders over seven days; and an AI agent orchestration layer that manages long-running LLM tasks with human-in-the-loop approval steps. Each of these workflows had previously been implemented as a combination of SQS queues, Lambda functions, and DynamoDB state tables — an architecture that worked until it didn’t, typically when a service failed mid-workflow and the recovery logic was incomplete, inconsistent, or never tested. Temporal replaced approximately 1,200 lines of retry logic, dead-letter queue handling, and state management code with 300 lines of workflow definitions, and the reliability improvement was not incremental — it was categorical.

What Durable Execution Actually Means

Durable execution is Temporal’s core abstraction, and it is worth defining precisely because the term is used loosely elsewhere. In Temporal, a workflow is a function whose progress is persisted so that it runs to completion as if it had executed exactly once, regardless of process crashes, network partitions, server reboots, or deployment cycles. If the process running a workflow dies at instruction 47 in a 200-instruction function, Temporal restarts the function on an available worker and replays it from the beginning: steps that already completed (activity invocations and the external API calls inside them, timers, signals) are resolved from the recorded event history instead of being run again, and only the steps that had not completed are actually executed.

The mechanism is replay. Temporal records every external effect of a workflow (each activity result, each timer expiration, each signal received) in an append-only event log stored in the Temporal server. When a workflow resumes after a failure, Temporal replays the log from the beginning, feeding the recorded results to the workflow code at the same points where they originally occurred. The workflow code must be deterministic: given the same event history, it must produce the same control flow and make the same decisions. If it does not (a random number differs, a timestamp is regenerated, a conditional branch takes a different path), Temporal detects the mismatch between the code and the recorded history, raises a non-determinism error, and the workflow stops making progress until the code is fixed.
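The replay idea can be illustrated with a toy model. This is a minimal sketch under loud assumptions: `ReplayContext`, `RecordedEvent`, and `orderWorkflow` are hypothetical names invented for illustration, the log lives in memory, and everything is synchronous. It is not Temporal's implementation or API; it only shows why a completed step is not executed a second time after a restart.

```typescript
// Toy model of replay; all names are illustrative, not Temporal's API.
type RecordedEvent = { step: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  private readonly log: RecordedEvent[];

  constructor(log: RecordedEvent[]) {
    this.log = log;
  }

  // Replaying: hand back the recorded result. Otherwise run the effect and record it.
  activity<T>(step: string, effect: () => T): T {
    const recorded = this.log[this.cursor];
    this.cursor += 1;
    if (recorded !== undefined) {
      if (recorded.step !== step) {
        // The code asked for a different step than the one in the history.
        throw new Error(`non-determinism: history has "${recorded.step}", code asked for "${step}"`);
      }
      return recorded.result as T;
    }
    const result = effect();
    this.log.push({ step, result });
    return result;
  }
}

const eventLog: RecordedEvent[] = [];
let chargeExecutions = 0;
let crashPending = true;

function orderWorkflow(ctx: ReplayContext): string {
  const chargeId = ctx.activity("charge", () => {
    chargeExecutions += 1;
    return `ch_${chargeExecutions}`;
  });
  ctx.activity("reserve", () => {
    if (crashPending) {
      crashPending = false;
      throw new Error("simulated worker crash");
    }
    return true;
  });
  return `fulfilled:${chargeId}`;
}

// First attempt crashes after the charge; the restart replays against the same log.
function runWithOneRestart(): string {
  try {
    return orderWorkflow(new ReplayContext(eventLog));
  } catch {
    return orderWorkflow(new ReplayContext(eventLog));
  }
}

const replayOutcome = runWithOneRestart();
```

In this sketch the charge effect runs once even though the workflow function runs twice, because the second run reads the charge result from the log. Asking for a step that does not match the log is reported as an error, which is the toy analogue of a non-determinism failure.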

Heads up

The determinism constraint is the hardest part of Temporal and the most common source of production failures for new users. Workflow code must not depend on wall-clock time, randomness, or generated IDs unless the value is recorded in the event history so that replay produces the same value. How you satisfy that depends on the SDK. In Go you call workflow.Now() instead of time.Now() and wrap other non-deterministic values in workflow.SideEffect(); you also cannot range over a map without sorting the keys, because Go map iteration order is randomized. The TypeScript SDK runs workflow code in a sandbox that swaps Date and Math.random() for deterministic versions and exports a uuid4() helper from @temporalio/workflow, so those calls are replay-safe inside a workflow, but Node.js APIs such as process.env and fs.readFileSync are not available there and belong in activities. Temporal’s TypeScript SDK includes a deterministic linter that catches most violations at compile time, but runtime non-determinism errors are cryptic and difficult to debug without the replay viewer in Temporal’s web UI.

The SDK: Writing Workflows as Code, Not Configuration

Temporal’s TypeScript SDK is, in my experience, the most polished of its SDKs (Go, Java, Python, and TypeScript among them) and the one I used for all three workflows. A workflow is an exported async TypeScript function that calls into the @temporalio/workflow API, and activities are ordinary functions that run outside the workflow sandbox, where network and file-system access are allowed.

// Workflow definition: runs inside Temporal's deterministic workflow sandbox
import { proxyActivities, condition, defineSignal, setHandler } from "@temporalio/workflow";
import type * as activities from "./activities";

// Activity stubs; each call and its result are recorded in the event history
const { chargeStripe, updateInventory, sendReceipt, refundStripe } =
  proxyActivities<typeof activities>({
    startToCloseTimeout: "30 seconds",
    retry: { maximumAttempts: 3, backoffCoefficient: 2 },
  });

// Signal a human reviewer sends to approve a refund (carries no arguments)
export const refundApproved = defineSignal("refundApproved");

export async function paymentWorkflow(orderId: string, amount: number): Promise<string> {
  // Register the signal handler once, before the first await
  let approved = false;
  setHandler(refundApproved, () => { approved = true; });

  const charge = await chargeStripe(orderId, amount);
  await updateInventory(orderId, "reserved");
  await sendReceipt(orderId, charge.id);

  // Wait up to 72 hours for approval. condition() resolves to true if the
  // predicate became true and to false if the timeout elapsed first, so there
  // is no polling loop adding a timer event to the history every minute.
  if (await condition(() => approved, "72 hours")) {
    await refundStripe(charge.id);
    await updateInventory(orderId, "available");
    return "refunded";
  }

  await updateInventory(orderId, "fulfilled");
  return "completed";
}

The workflow language — signals, timers, activities, conditionals, loops, parallel execution — maps directly to business logic constructs. A payment workflow that charges a card, reserves inventory, sends a receipt, and waits up to 72 hours for a refund signal before either processing the refund or fulfilling the order is expressed in 28 lines of TypeScript. The equivalent SQS-and-Lambda implementation required approximately 400 lines of producer code, consumer code, dead-letter handling, and state management spread across five Lambda functions and two DynamoDB tables — none of which survived a mid-workflow deployment or a transient database outage without manual intervention.

How Temporal Compares to AWS Step Functions

The question every team evaluating Temporal asks is how it stacks up against AWS Step Functions, the managed workflow service that is the closest architectural analog in the AWS ecosystem. I have used Step Functions for two production workflows, and the differences are structural.

State machine vs. code. Step Functions defines workflows as JSON state machines — a DSL of task states, choice states, parallel states, and wait states connected by transitions. The JSON is verbose and difficult to version control meaningfully — a diff of a Step Functions ASL file does not communicate intent the way a diff of TypeScript code does. Temporal defines workflows as code in a general-purpose language, which means conditional logic, loops, and error handling are expressed in TypeScript rather than JSON control flow operators.

Durability model. Step Functions Standard Workflows provide exactly-once workflow execution, while Express Workflows are at-least-once; in both cases a failed Lambda task is retried according to the retry policy configured in the state machine definition. Temporal gives the workflow function itself effectively exactly-once semantics, including timer expirations and signal handling, with replay-based recovery after process failure; activities are executed at least once and retried under the configured policy. Because each completed activity result is recorded once in the history, the orchestration layer no longer needs the hand-rolled message deduplication that goes wrong when an SQS message is processed twice and the idempotency key is incorrectly implemented.

Operational model. Step Functions is a fully managed AWS service with no self-hosted server component. You pay per state transition and never manage infrastructure. Temporal requires running the Temporal Server, either self-hosted as a cluster of Go services backed by a database or through Temporal Cloud, the managed offering. Temporal Cloud pricing is based on actions rather than state transitions, starting at approximately $25 per million actions, with a free tier for development. The infrastructure overhead of running Temporal Server yourself is real: it requires a database backend (PostgreSQL, MySQL, or Cassandra), persistence configuration, and monitoring. It is the single largest barrier to adoption for teams that are not already managing stateful infrastructure.

Cost at scale. My payment workflow processes approximately 4,800 workflows per day with an average of 6 state transitions each, or roughly 864,000 transitions in a 30-day month. At the Step Functions standard-workflow list price of $0.025 per 1,000 transitions, that comes to approximately $22 per month in state transition fees. Temporal Cloud would cost approximately $12 per month in action billing; a self-hosted deployment swaps that fee for the cost of running the server infrastructure. The cost difference is not large enough to drive the decision. The functional differences in durability guarantees and developer experience are.
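The Step Functions side of that estimate is simple arithmetic, sketched below. The rate of $0.025 per 1,000 state transitions and the 30-day month are assumptions for illustration (check current AWS pricing, regional differences, and free-tier allowances), so the result will not necessarily match a figure computed under other assumptions. The function and constant names are hypothetical.

```typescript
// Assumed list price for standard workflows; verify against current AWS pricing.
const ASSUMED_USD_PER_1000_TRANSITIONS = 0.025;

function monthlyTransitionCostUsd(
  workflowsPerDay: number,
  transitionsPerWorkflow: number,
  daysPerMonth: number,
  usdPer1000Transitions: number,
): number {
  const transitionsPerMonth = workflowsPerDay * transitionsPerWorkflow * daysPerMonth;
  return (transitionsPerMonth / 1000) * usdPer1000Transitions;
}

// 4,800 workflows/day x 6 transitions x 30 days = 864,000 transitions per month
const stepFunctionsEstimateUsd = monthlyTransitionCostUsd(
  4800,
  6,
  30,
  ASSUMED_USD_PER_1000_TRANSITIONS,
);
```

Under these assumptions the estimate lands at about $21.60 per month, which supports the point that cost is not the deciding factor at this volume.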

Use Cases Where Temporal Replaces an Entire Retry Stack

The workflows I built on Temporal replaced components that I now recognize as fragile retry stacks — layers of retry logic, exponential backoff, dead-letter queues, and idempotency guards that attempted to achieve the exactly-once durability that Temporal provides natively.

Payment processing. Charging a card, updating inventory, and sending a receipt is a three-step workflow where partial completion is a serious bug — you cannot charge a card without updating inventory, and you cannot send a receipt without confirming the charge. My previous implementation used a Saga pattern with compensating transactions implemented in application code. Temporal encodes the compensation logic directly in the workflow: if the inventory update fails after a successful charge, the workflow executes a refund activity as a compensating action. The workflow guarantees that either all three steps complete or the compensation runs.
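The compensation pattern described above can be sketched without any Temporal dependency. This is a simplified illustration, not the author's workflow: `SagaStep` and `runSaga` are hypothetical names, and the step functions are stand-ins for activities. In a real Temporal workflow the same shape is usually written as try/catch around activity calls, with durability supplied by the platform rather than by this loop.

```typescript
// Hypothetical saga runner: each step may carry an undo action.
type SagaStep = {
  run: () => Promise<void>;
  compensate?: () => Promise<void>;
};

async function runSaga(steps: SagaStep[]): Promise<"completed" | "compensated"> {
  const finished: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.run();
      finished.push(step);
    } catch {
      // Undo the steps that already succeeded, most recent first.
      for (const earlier of [...finished].reverse()) {
        if (earlier.compensate) {
          await earlier.compensate();
        }
      }
      return "compensated";
    }
  }
  return "completed";
}

// Stand-in activities that only record what happened.
const sagaCalls: string[] = [];
const paymentSteps: SagaStep[] = [
  {
    run: async () => { sagaCalls.push("charge"); },
    compensate: async () => { sagaCalls.push("refund"); },
  },
  {
    // Simulated failure of the inventory update after a successful charge.
    run: async () => { throw new Error("inventory service unavailable"); },
  },
  {
    run: async () => { sagaCalls.push("receipt"); },
  },
];

const sagaOutcome: Promise<"completed" | "compensated"> = runSaga(paymentSteps);
```

With the simulated inventory failure, the charge is followed by a refund and the receipt step never runs, which is the "all steps or compensation" behavior the paragraph describes.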

User onboarding. Sending a verification email, waiting up to 72 hours, then either proceeding with provisioning or canceling the onboarding is a long-running workflow with a timer-dependent branch. My previous implementation used a database table to track onboarding state and a cron job to check for expired timers, a polling architecture that added latency and database load. Temporal’s durable timers pause the workflow at the point of the wait without consuming compute resources: sleep() resumes when the timer expires, and condition() with a timeout resumes when the timer expires or a signal arrives, whichever comes first.

AI agent orchestration. Managing long-running LLM tasks (prompt execution, response validation, human approval, tool invocation) is the use case where Temporal’s durability guarantees matter most. An LLM API call can take 30 seconds to complete. If the process managing the call crashes at second 29, Temporal replays the workflow, using the recorded result if the activity had completed and retrying the activity if it had not. My AI agent workflow coordinates five LLM calls, two human approval signals, and three tool invocations over an average span of 4 minutes, and it has survived two deployment cycles and one database failover without a single workflow failure.

The Learning Curve and Where Teams Struggle

Temporal’s developer experience has two phases: the first two weeks, where everything feels intuitive because workflows look like regular functions; and the first production incident, where the determinism constraint reveals itself in an edge case that the documentation did not prepare you for.

The most common production failure I observed, both in my own workflows and in the Temporal community, is iteration whose order the code silently depends on. In Go, map iteration order is deliberately randomized, so a workflow that schedules activities while ranging over a map can issue them in a different order on replay. In JavaScript and TypeScript the risk is subtler: for...in and Object.keys() visit integer-like keys in ascending order and then string keys in insertion order, so the result depends on how the object was built. A workflow that builds a response object or schedules work by iterating such an object can take a different path on replay if the construction order changed between the code that recorded the history and the code replaying it, which Temporal reports as a non-determinism violation. The fix is to iterate over Object.keys(obj).sort(), or to use a Map whose insertion order you control; these are patterns that the deterministic linter does not always catch.
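A small sketch of the sorted-keys fix, in plain TypeScript with no Temporal dependency. The helper name `sortedEntries` and the sample data are hypothetical; the point is only that two objects with the same contents but different construction order serialize differently unless the keys are ordered explicitly.

```typescript
// Return entries ordered by key so iteration no longer depends on insertion order.
function sortedEntries(record: Record<string, number>): Array<[string, number]> {
  return Object.entries(record).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Same contents, built in a different order.
const builtOneWay: Record<string, number> = { widget: 2, gadget: 1 };
const builtAnotherWay: Record<string, number> = { gadget: 1, widget: 2 };

// Insertion order leaks into the serialized form...
const rawDiffers = JSON.stringify(builtOneWay) !== JSON.stringify(builtAnotherWay);

// ...but the sorted entries are identical for both objects.
const sortedMatches =
  JSON.stringify(sortedEntries(builtOneWay)) === JSON.stringify(sortedEntries(builtAnotherWay));
```

Any loop that schedules activities per key can iterate over `sortedEntries(...)` in the same way, so the order of scheduled work is fixed by the data rather than by how the object happened to be assembled.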

The second challenge is activity idempotency. Temporal retries activities automatically, but it does not guarantee that an activity executes exactly once, only that the workflow sees exactly one result. If an activity writes to a database and the write succeeds but the activity process crashes before returning the result, Temporal retries the activity, and the database receives a second write. Activities must therefore be idempotent: executing the same activity several times must produce the same observable effect as executing it once. Temporal’s SDK does not enforce idempotency; it is the application developer’s responsibility.
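One common way to get there is an idempotency key, sketched below. This is an assumption-laden illustration: `PaymentLedger` is a hypothetical in-memory stand-in for a database table with a unique constraint on the key, and the key format is invented. The idea is that a retried attempt presents the same key and receives the first attempt's outcome instead of applying the write again.

```typescript
// Hypothetical in-memory store standing in for a table with a unique key column.
class PaymentLedger {
  private readonly outcomes = new Map<string, number>();
  private balance = 0;

  // Apply the credit at most once per idempotency key.
  credit(idempotencyKey: string, amount: number): number {
    const earlier = this.outcomes.get(idempotencyKey);
    if (earlier !== undefined) {
      // A retry of work that already succeeded: return the first outcome.
      return earlier;
    }
    this.balance += amount;
    this.outcomes.set(idempotencyKey, this.balance);
    return this.balance;
  }

  total(): number {
    return this.balance;
  }
}

const ledger = new PaymentLedger();
// Key built from stable identifiers (illustrative format), never from a timestamp.
const creditKey = "order-42/credit";

const firstAttempt = ledger.credit(creditKey, 100);
// Simulates a retry after a crash that happened once the write had succeeded.
const retriedAttempt = ledger.credit(creditKey, 100);
```

Both attempts return the same balance and the ledger is credited once. In a real system the check and the write would need to be atomic, for example through a unique constraint or a conditional write.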

The third challenge is testing. Temporal provides a test framework that runs workflows in a simulated Temporal environment with mock activities, but replay-based workflows are inherently harder to test than stateless functions because the test must simulate the event history that the workflow replays. Temporal’s TestWorkflowEnvironment handles this simulation, but writing tests that exercise retry paths, timer expirations, and signal races requires understanding Temporal’s internal event model. I estimate that thorough Temporal workflow tests take approximately 2.5 times longer to write than unit tests for equivalent stateless logic.

FAQ

How does Temporal handle database connection pooling for activities that query Postgres or MySQL?

Activities are ordinary functions that run in your worker process, outside the workflow sandbox, and your activity code manages its own database connections using whatever pooling library you would use in a standard Node.js, Go, or Java application. Temporal does not inject a connection pool or manage database access. This means you configure a connection pool in your activity code exactly as you would for any other server process (a pg Pool in Node.js, a sql.DB in Go), and each activity invocation borrows a connection from the pool. The pool must be sized to handle the maximum concurrent activity count per worker, which you control through the `maxConcurrentActivityTaskExecutions` worker option.

Can I deploy Temporal workflows that span multiple programming languages in a single application?

Yes. Temporal's SDKs share a common wire protocol, so a workflow written in Go can invoke activities written in TypeScript and Python as long as all workers connect to the same Temporal namespace and each activity is scheduled on the task queue that its worker polls. This is useful for organizations where different teams own different services in different languages: the payment team writes workflows in Go, the notification team writes activities in TypeScript, and the ML team writes prediction activities in Python, all orchestrated by Temporal through task queue routing. The practical constraint is that each activity worker must be deployed and scaled independently, and cross-language type compatibility must be verified through integration testing, because Temporal does not enforce type contracts across SDKs.

What is the maximum duration of a Temporal workflow?

There is no hard limit on workflow duration, but there is one on history size: Temporal stores each workflow's event history in the database backend and, by default, terminates an execution whose history exceeds roughly 50,000 events or 50 MB. Within that limit a workflow can run for days, weeks, or months. What matters is event count rather than elapsed time, but in practice busy workflows that run longer than 30 days tend to accumulate enough event history (each activity result, timer, signal, and update) to slow down replay. Temporal provides `continue-as-new` to split a long-running workflow into sequential segments, each with a fresh event history, at application-defined boundaries. For my user onboarding workflow with a 7-day timer, the event history is approximately 40 events, and replay completes in under 50 milliseconds, well within Temporal's performance envelope.
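The segmentation idea behind `continue-as-new` can be shown with a toy model. Everything here is hypothetical and in-memory (`runSegment`, `runToCompletion`, and the one-event-per-item accounting are invented for illustration); it is not the SDK's API. Each segment starts with an empty history, does a bounded amount of work, and hands its accumulated state to the next segment.

```typescript
// Toy model: a segment either finishes or asks to continue with carried-over state.
type SegmentResult =
  | { kind: "done"; total: number }
  | { kind: "continueAsNew"; nextIndex: number; total: number };

// Processes at most maxEvents items, treating each item as one history event.
function runSegment(
  items: number[],
  startIndex: number,
  total: number,
  maxEvents: number,
): SegmentResult {
  const batch = items.slice(startIndex, startIndex + maxEvents);
  let sum = total;
  for (const item of batch) {
    sum += item;
  }
  const nextIndex = startIndex + batch.length;
  return nextIndex < items.length
    ? { kind: "continueAsNew", nextIndex, total: sum }
    : { kind: "done", total: sum };
}

// Drives segments until one reports completion. maxEvents must be positive.
function runToCompletion(items: number[], maxEvents: number): { total: number; segments: number } {
  let result = runSegment(items, 0, 0, maxEvents);
  let segments = 1;
  while (result.kind === "continueAsNew") {
    result = runSegment(items, result.nextIndex, result.total, maxEvents);
    segments += 1;
  }
  return { total: result.total, segments };
}

const segmentedSummary = runToCompletion([1, 2, 3, 4, 5, 6, 7], 3);
```

With seven items and a budget of three events per segment, the work completes in three segments and no single segment ever holds more than three events, which is the property that keeps replay fast.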
