How to Build an Autonomous AI Coding Agent That Opens GitHub PRs Overnight

A practical breakdown of the plan-execute-verify loop behind an autonomous AI coding agent, and how to wire it to GitHub so an issue becomes a reviewable pull request overnight.

You file an issue before bed: “Migrate the date helpers off moment.js.” You wake up to a draft pull request — branch created, files changed, tests green, waiting for review. That is the pitch for an autonomous AI coding agent, and the surprising part is how little of it is novel. The hard problem is not the model. It is the loop around the model: the harness that turns a task into a reviewable PR with nobody in the chair.

We built this pattern and ran it against real repositories. What follows is the architecture that held up, the GitHub wiring that kept it safe, and an honest account of which tasks it finishes and which it quietly botches.

Anatomy of the overnight loop

An autonomous coding agent is a state machine with a language model wired into a few of its transitions. Strip away the marketing and five stages remain:

  1. Ingest — pull the task (a GitHub issue, a queue row, a line in a file) and the repo into a clean working directory.
  2. Plan — one model call reads the task and the repo layout, then emits a concrete plan: which files change, in what order, and what “done” looks like.
  3. Execute — a separate model call edits files to match the plan, one coherent change at a time.
  4. Verify — run the test suite, the type checker, and the linter.
  5. Package — commit, push a branch, open a pull request.

The mistake most first attempts make is collapsing stages two through four into one enormous prompt: “here is the repo, here is the task, output the diff.” That works for a three-line fix and falls apart on anything larger. Chaining narrow steps buys you something a single prompt cannot — a checkpoint between each stage where the work can be inspected before the agent commits to it.
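The checkpoint idea can be sketched as a small stage runner. This is a minimal illustration, not the harness described in this article: RunState, run_stages, plan_is_usable, and the stand-in plan and execute functions are all hypothetical names, and a real plan or execute stage would call a model where these stubs do not.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    """What one run accumulates. All field names here are illustrative."""
    task: str
    plan: list[str] = field(default_factory=list)
    changed_files: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)


Stage = Callable[[RunState], RunState]
Checkpoint = Callable[[str, RunState], bool]


def run_stages(state: RunState,
               stages: list[tuple[str, Stage]],
               checkpoint: Checkpoint) -> RunState:
    """Run stages in order, stopping at the first checkpoint that rejects."""
    for name, stage in stages:
        state = stage(state)
        state.history.append(name)
        # The checkpoint sees the state *between* stages, before the next
        # stage is allowed to build on it.
        if not checkpoint(name, state):
            state.history.append(f"halted:{name}")
            break
    return state


# Stand-in stages (assumptions). Real ones would call a model.
def plan(state: RunState) -> RunState:
    state.plan = [f"edit files for: {state.task}"]
    return state


def execute(state: RunState) -> RunState:
    state.changed_files = ["src/dates.py"]
    return state


def plan_is_usable(name: str, state: RunState) -> bool:
    """Example checkpoint: refuse to execute an empty plan."""
    return name != "plan" or len(state.plan) > 0


if __name__ == "__main__":
    final = run_stages(RunState("migrate date helpers"),
                       [("plan", plan), ("execute", execute)],
                       plan_is_usable)
    print(final.history)
```

Because each stage is an ordinary function over the same state object, a checkpoint can be as cheap as a length check or as heavy as a second model call that critiques the plan.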

Stage four is what separates a coding agent from a code generator. A model with no feedback loop will cheerfully report success on code that does not compile. Wire the executor to the verifier so a failing test run feeds the actual error text back into the next edit. Bound the retries — three attempts is a sane ceiling — and if the agent still cannot reach green, it should stop, open the PR as a draft, and log the failure rather than push broken code or loop forever burning tokens.
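A bounded execute-verify loop might look like the sketch below. The names VerifyResult, Outcome, run_checks, and execute_until_green are assumptions made for illustration; edit stands in for whatever model call applies changes to the working tree, and the check commands are whatever your repository already uses.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class VerifyResult:
    ok: bool
    output: str  # combined stdout/stderr of the first failing check


@dataclass
class Outcome:
    green: bool
    attempts: int
    last_error: Optional[str]


def run_checks(commands: Sequence[Sequence[str]]) -> VerifyResult:
    """Run each check command in order; stop at the first failure."""
    for cmd in commands:
        proc = subprocess.run(list(cmd), capture_output=True, text=True)
        if proc.returncode != 0:
            return VerifyResult(False, proc.stdout + proc.stderr)
    return VerifyResult(True, "")


def execute_until_green(edit: Callable[[Optional[str]], None],
                        verify: Callable[[], VerifyResult],
                        max_attempts: int = 3) -> Outcome:
    """Edit, verify, and feed the real error text into the next edit."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        edit(feedback)       # first call gets None, later calls get the error
        result = verify()
        if result.ok:
            return Outcome(True, attempt, None)
        feedback = result.output
    # Out of attempts: the caller opens a draft PR and logs last_error.
    return Outcome(False, max_attempts, feedback)
```

A caller could pass verify as a lambda wrapping run_checks with its own test, type-check, and lint commands. When the outcome is not green, last_error is the text to put in the draft PR description.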

Wiring it to GitHub without losing a finger

Once the agent has a green working tree, the GitHub mechanics are routine. The pattern that held up for us:

  • One branch per task, named predictably — agent/142-moment-migration, keyed off the issue number. Predictable names make reruns idempotent: if the branch already exists, update it instead of spawning a duplicate.
  • Open the PR as a draft and assign yourself as reviewer. Draft status tells the rest of the team the change is not merge-ready and discourages a reflexive approval.
  • Label every bot-authored PR agent-generated or similar. That label is your provenance trail, and if the diff reaches users it is the basis for any disclosure you owe them.
  • Let CI run on the agent’s PR exactly as it would on a human’s. CI is the safety net the agent’s own verify stage cannot fully replace, because it runs in a clean environment you control rather than the agent’s sandbox.
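One way to get predictable, rerun-safe branch names is sketched below. The helper names (slugify, branch_for, existing_branch, resolve_branch) and the 40-character slug limit are assumptions for illustration. Note one design choice: the lookup matches on the issue number prefix rather than the full name, so a rerun still finds its branch if someone edits the issue title in between.

```python
import re
from typing import Iterable, Optional


def slugify(title: str, max_len: int = 40) -> str:
    """Lowercase the title and collapse non-alphanumeric runs to hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-")


def branch_for(issue_number: int, title: str) -> str:
    """Build a branch name keyed off the issue number."""
    slug = slugify(title)
    return f"agent/{issue_number}-{slug}" if slug else f"agent/{issue_number}"


def existing_branch(issue_number: int,
                    branches: Iterable[str]) -> Optional[str]:
    """Find a branch already created for this issue, if any."""
    exact, prefix = f"agent/{issue_number}", f"agent/{issue_number}-"
    for name in branches:
        # The trailing hyphen keeps issue 142 from matching agent/1420-...
        if name == exact or name.startswith(prefix):
            return name
    return None


def resolve_branch(issue_number: int, title: str,
                   branches: Iterable[str]) -> tuple[str, str]:
    """Return (branch, action), where action is 'update' or 'create'."""
    found = existing_branch(issue_number, branches)
    if found is not None:
        return found, "update"
    return branch_for(issue_number, title), "create"
```

Here branches would be the list of remote branch names, however your harness fetches them.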

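The gh CLI keeps packaging short: gh pr create --draft --base main --head agent/142-moment-migration --fill. When it runs non-interactively, as it does inside an agent, gh needs either --fill or an explicit --title and --body. For richer control, such as adding labels, requesting reviewers, or reading PR state back, the Octokit REST client earns its dependency.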
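If the harness is written in Python, the packaging step can build the gh argument list in one place and run it in another, which keeps the command easy to log and test. This is a sketch under assumptions: pr_create_argv and open_pr are hypothetical helpers, gh is assumed to be installed and authenticated, and any label passed is assumed to exist in the repository already.

```python
import subprocess
from typing import Sequence


def pr_create_argv(head: str, title: str, body: str, base: str = "main",
                   labels: Sequence[str] = (),
                   draft: bool = True) -> list[str]:
    """Build the argument list for `gh pr create` without running it."""
    argv = ["gh", "pr", "create", "--base", base, "--head", head,
            "--title", title, "--body", body]
    if draft:
        argv.append("--draft")
    for label in labels:
        argv += ["--label", label]  # one --label flag per label
    return argv


def open_pr(argv: Sequence[str]) -> str:
    """Run gh and return its stdout (on success, the new PR's URL)."""
    proc = subprocess.run(list(argv), capture_output=True, text=True,
                          check=True)  # raises CalledProcessError on failure
    return proc.stdout.strip()
```

Passing the arguments as a list rather than a shell string means an issue title containing quotes or semicolons cannot be interpreted as shell syntax.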

For the trigger you have two clean options. A cron job firing at 2 a.m. drains a task queue overnight. Or a GitHub webhook: label an issue agent-ready, and the labeling event starts a run. The webhook route is closer to a real workflow, because the task and its trigger live where your team already works.
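For the webhook route, the receiving end has two jobs before it starts a run: confirm the delivery really came from GitHub, and confirm it is the labeling event you care about. The sketch below assumes a webhook configured with a secret, so GitHub sends an X-Hub-Signature-256 header, and it reads the event name from the X-GitHub-Event header. The function names are hypothetical, and the HTTP server around them is left out.

```python
import hashlib
import hmac
import json
from typing import Optional


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())


def task_from_event(event_name: str, payload: dict,
                    trigger_label: str = "agent-ready") -> Optional[dict]:
    """Return a task for an 'issues' labeled event, else None."""
    if event_name != "issues" or payload.get("action") != "labeled":
        return None
    label = payload.get("label") or {}
    if label.get("name") != trigger_label:
        return None
    issue = payload["issue"]
    return {"number": issue["number"], "title": issue["title"]}


if __name__ == "__main__":
    body = json.dumps({
        "action": "labeled",
        "label": {"name": "agent-ready"},
        "issue": {"number": 142, "title": "Migrate date helpers"},
    }).encode()
    print(task_from_event("issues", json.loads(body)))
```

Verify the signature against the raw bytes of the request, before parsing the JSON. Re-serializing the payload can change whitespace and break the comparison.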

What it automates well — and where it stalls

The overnight agent earns its keep on work that is mechanical but tedious: bumping a dependency and fixing the fallout, adding test coverage to an untested module, migrating a deprecated API call across a codebase, running a codemod, converting config files, tidying stale documentation. These tasks share one trait — “done” is objectively checkable. A passing test suite or a clean type check is enough for the verify stage to know it succeeded.

It stalls on the opposite kind of work: ambiguous requests (“make the dashboard feel faster”), cross-cutting architecture changes, anything resting on product judgment, and, most important, any task in a repo whose test suite is thin. No verify gate means no safety, and the agent’s confidence in its own output is not a substitute for one.

Two numbers decide whether this is worth running. The first is cost per run: every stage is one or more model calls, and a non-trivial task with retries can reach dozens of calls. Set a hard token or dollar ceiling per run so a stuck agent cannot run up a bill while you sleep. The second is PR acceptance rate: the share of generated PRs you merge without substantial rewriting. If you rewrite more than you keep, the tasks are scoped too loosely; tighten them until the agent succeeds reliably, then widen the scope carefully from there.
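Both numbers are simple to track. The sketch below uses hypothetical names (RunBudget, BudgetExceeded, acceptance_rate) and assumes your model client reports token usage after each call. Because usage is only known after a call returns, this ceiling can be overshot by at most one call.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


class BudgetExceeded(RuntimeError):
    """Raised when a run has spent its token ceiling."""


@dataclass
class RunBudget:
    max_tokens: int
    used_tokens: int = 0

    def record(self, tokens: int) -> None:
        """Record one call's usage; raise once the ceiling is reached."""
        self.used_tokens += tokens
        if self.used_tokens >= self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} of {self.max_tokens} tokens")


def acceptance_rate(outcomes: Sequence[str]) -> Optional[float]:
    """Share of PRs marked 'merged', meaning merged without a rewrite.

    Any other outcome string (for example 'rewritten' or 'closed')
    counts against the rate. Returns None when there is no data yet.
    """
    if not outcomes:
        return None
    return sum(1 for o in outcomes if o == "merged") / len(outcomes)
```

Call record after every model call and let BudgetExceeded propagate to the top of the run, where the harness can stop, open the PR as a draft, and log how far it got.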

The morning review stays non-negotiable. The agent’s job is to hand you a pull request you can judge in a few minutes instead of work that would have cost you an hour. A bad PR you rubber-stamp at 9 a.m. is worse than no PR — so treat every overnight diff as untrusted until CI is green and you have read it yourself. Built this way, the agent is not a replacement for you. It is a night shift that handles only the boring parts.

FAQ

Do I need to fine-tune a model for this?
No. A general frontier model with careful prompting and a working verify loop handles the editing. The leverage lives in the harness — the plan, execute, and verify stages — not in a custom model. Fine-tuning adds cost and maintenance for a gain you will not notice on boilerplate tasks.
Is it safe to give an agent GitHub write access?
Treat it like an untrusted contributor. Use a fine-grained personal access token scoped to one repository, open pull requests as drafts, enforce branch protection on main, and require CI plus a human review before any merge. With those four controls the failure mode is a wasted review, not a production incident.
How much does one overnight run cost?
It depends on task size and model, and it varies enough that a fixed figure would mislead. Each stage is at least one model call, and retries multiply that. Budget by setting a hard per-run ceiling and logging token usage for the first few runs so you can size real tasks against real numbers.
