Automate Python Code Reviews with Free Local LLMs and GitHub Actions
Wire an open-weight model running in Ollama into a GitHub Actions workflow to get automated first-pass code-review comments on Python pull requests — no API bill required.
Paying for GPT-4o or Claude API calls every time someone opens a pull request adds up quickly on a busy repo. A self-hosted Ollama instance on a machine you already own — or a GPU-enabled self-hosted GitHub Actions runner — lets you run a capable open-weight model for the cost of electricity. The result is a first-pass automated review that catches common Python issues and leaves a comment on the PR before any human reads the diff.
This is not a replacement for human review. An open-weight 7B model running locally will miss subtle concurrency bugs, architectural problems, and context it has never seen. What it reliably does is reduce the amount of low-signal noise a human reviewer has to wade through: undocumented parameters, obvious type mismatches, functions that shadow builtins, missing error handling in obvious paths. That alone is worth setting up if your team is small and review time is scarce.
The Shape of the Workflow
The basic loop has four parts: a GitHub Actions workflow triggers on pull_request, a Python script fetches the diff via the GitHub REST API, the script sends that diff to a locally running Ollama server, and the model's response is posted back to the PR as a comment through the same API.
Here is a minimal workflow file for a self-hosted runner that has Ollama already installed and the model pre-pulled:
name: LLM Code Review

on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - '**.py'

jobs:
  review:
    runs-on: self-hosted  # requires GPU runner with Ollama installed
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Wait for Ollama
        run: curl --retry 10 --retry-delay 2 --retry-connrefused http://localhost:11434/api/tags

      - name: Run review script
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
          MODEL: qwen2.5-coder:7b
        run: python scripts/llm_review.py

The paths filter limits runs to PRs that touch Python files, which avoids burning runner time on documentation-only changes. If your runner is not persistent (for example, you spin it up on demand), remove the Wait for Ollama step and replace it with the Ollama install script before the review step.
The Python review script does three things: fetch the diff, prompt the model, post the comment. Here is a stripped-down version:
import json
import os
import textwrap
import urllib.request

GITHUB_API = "https://api.github.com"
OLLAMA_URL = "http://localhost:11434/api/generate"
MAX_DIFF_CHARS = 12000  # ceiling on how much of the diff is sent to the model


def gh(path, method="GET", body=None):
    """Call the GitHub REST API and return the parsed JSON response."""
    token = os.environ["GH_TOKEN"]
    req = urllib.request.Request(
        f"{GITHUB_API}{path}",
        data=json.dumps(body).encode() if body else None,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method=method,
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())


def get_diff():
    """Fetch the pull request as a unified diff via the diff media type."""
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    req = urllib.request.Request(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr}",
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github.v3.diff",
        },
    )
    with urllib.request.urlopen(req) as r:
        return r.read().decode()


def ask_ollama(diff):
    """Send the truncated diff to Ollama and return the model's reply."""
    # Dedent the instructions on their own. Interpolating the diff first
    # would leave textwrap.dedent with no common indentation to strip.
    instructions = textwrap.dedent("""\
        You are a Python code reviewer. Review the following git diff for:
        - Bugs or likely runtime errors
        - Missing or incorrect type annotations
        - Functions that shadow Python builtins
        - Missing error handling in obvious paths
        - Style issues that violate PEP 8

        Be concise. List specific findings only. Do not repeat the diff back.
        If the change looks correct, say so briefly.

        DIFF:
        """)
    prompt = instructions + diff[:MAX_DIFF_CHARS]
    payload = {"model": os.environ["MODEL"], "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as r:
        return json.loads(r.read())["response"]


def post_comment(body):
    """Post the review as a comment on the pull request conversation."""
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    gh(f"/repos/{repo}/issues/{pr}/comments", method="POST", body={"body": body})


if __name__ == "__main__":
    diff = get_diff()
    if not diff.strip():
        print("Empty diff, skipping.")
    else:
        review = ask_ollama(diff)
        post_comment(
            f"**LLM first-pass review** (model: `{os.environ['MODEL']}`)\n\n"
            f"{review}\n\n---\n"
            "*Automated review. Not a substitute for human review.*"
        )

The diff is truncated at 12,000 characters before being sent to the model. A 7B model served with a 4K–8K context window cannot take a 40-file diff wholesale: the input gets silently truncated or the output turns incoherent. The 12,000-character ceiling keeps the prompt within a safe range for 7B models while still covering most single-feature PRs. Ollama also applies its own context length (the num_ctx option), which can be smaller than what the model supports, so if prompts are still being cut off, set options.num_ctx in the request payload. For larger diffs, you can split by file and send one prompt per changed file, then aggregate.
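The split-by-file approach can be sketched as follows. This is a minimal illustration, not part of the original script: split_diff_by_file and python_chunks are hypothetical helper names, and the parsing assumes standard "diff --git a/... b/..." headers with paths that do not themselves contain " b/".

```python
import re

# Matches the per-file header that git emits at the start of each file's diff.
FILE_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split a unified git diff into {new_path: diff text for that file}."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in diff.splitlines(keepends=True):
        match = FILE_HEADER.match(line.rstrip("\r\n"))
        if match:
            current = match.group(2)  # path on the "b/" side
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "".join(lines) for path, lines in chunks.items()}


def python_chunks(diff: str, limit: int = 12000) -> list[tuple[str, str]]:
    """Return (path, diff) pairs for .py files, each capped at `limit` characters."""
    return [
        (path, text[:limit])
        for path, text in split_diff_by_file(diff).items()
        if path.endswith(".py")
    ]
```

Each pair would get its own ask_ollama call, with the replies joined under per-file headings before a single post_comment. Filtering on the .py suffix also keeps lock files and other non-Python changes out of the prompt.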
Choosing a Model
Three models are worth considering for this specific task. The tradeoffs map directly to the RAM available on your runner.
qwen2.5-coder:7b is the practical default. It runs in approximately 6–7 GB of VRAM or RAM, fits on a consumer GPU (RTX 3060 or similar), and performs well on Python-focused tasks. Alibaba’s Qwen2.5-Coder series was explicitly trained on code, which matters more for targeted review work than general instruction-following ability.
mistral:7b is an acceptable alternative if you already have it pulled or if you want a model with stronger general-language generation for more verbose review comments. It is not specifically trained on code, so it will miss some language-specific patterns that a coder model catches, but its instruction-following is reliable.
qwen2.5-coder:32b or similar 30B+ models produce noticeably better reviews (they can reason about multi-file interactions and catch subtler bugs) but require roughly 22–24 GB of VRAM. That fills a 24 GB card such as an RTX 3090 or 4090 with little headroom, and otherwise pushes you toward an A100 or a multi-GPU setup, which changes the cost calculus significantly.
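For a rough sense of where those memory figures come from, here is a back-of-the-envelope sketch. It covers the weights only and assumes about 4.5 bits per weight for a 4-bit quantized build; that figure, the nominal parameter counts, and the helper name weight_memory_gb are all assumptions for illustration. The context cache and runtime overhead come on top and grow with context length, which is why the totals quoted above are higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB.
    return params_billion * bits_per_weight / 8


# Assumption: ~4.5 bits per weight; parameter counts taken from the model tags.
for tag, params in [("7b", 7), ("32b", 32)]:
    print(f"{tag}: ~{weight_memory_gb(params, 4.5):.1f} GB for weights")
```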
For a first deployment, start with qwen2.5-coder:7b. You can upgrade the model string in the workflow env var without touching anything else.
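Before switching the model string, it can help to confirm the runner actually has that model pulled. A minimal sketch, assuming the /api/tags endpoint returns a "models" list whose entries carry a "name" field; installed_models, model_is_pulled, and fetch_tags are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def installed_models(tags_payload: dict) -> set[str]:
    """Extract model names from a parsed /api/tags response."""
    return {m["name"] for m in tags_payload.get("models", [])}


def model_is_pulled(model: str, tags_payload: dict) -> bool:
    """Return True if `model` appears in the list of locally available models."""
    # A bare name such as "mistral" is treated as "mistral:latest".
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in installed_models(tags_payload)


def fetch_tags(url: str = OLLAMA_TAGS_URL) -> dict:
    """Query the local Ollama server for its model list."""
    with urllib.request.urlopen(url, timeout=10) as r:
        return json.loads(r.read())
```

In the review script, a check like model_is_pulled(os.environ["MODEL"], fetch_tags()) could fail the job early with a clear message instead of surfacing as an HTTP error from the generate call.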
Self-Hosted Runners and the Cold-Start Problem
If you run Ollama on a persistent self-hosted runner (a spare workstation, a homelab server, or a cloud VM you control), the model can stay loaded between runs and job startup time drops to a few seconds. You register the runner from the repository's Settings > Actions > Runners page (or the organization-level equivalent), and it picks up jobs like any other runner.
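How long the model stays resident depends on Ollama's keep-alive setting; per Ollama's documentation, an idle model is unloaded after five minutes by default. The sketch below shows a warm-up call that loads the model and asks Ollama to keep it for longer. It assumes the same local endpoint as the review script; preload_payload and preload are hypothetical helper names, and the 30m value is an arbitrary choice.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def preload_payload(model: str, keep_alive: str = "30m") -> dict:
    """Build a request body that loads `model` without generating text."""
    # No "prompt" key: a generate request without a prompt only loads the model.
    return {"model": model, "keep_alive": keep_alive}


def preload(model: str, keep_alive: str = "30m") -> None:
    """Ask the local Ollama server to load the model and keep it resident."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(preload_payload(model, keep_alive)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as r:
        r.read()
```

The same keep_alive field can be added to the review request itself, or the server-wide OLLAMA_KEEP_ALIVE environment variable can be set on the runner instead.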
The cold-start problem appears when you do not have a persistent machine. In that case, you have two options. First, install Ollama and pull the model at the start of every job:
- name: Install Ollama
  run: curl -fsSL https://ollama.com/install.sh | sh

- name: Start Ollama server
  run: ollama serve &

- name: Wait for server
  run: curl --retry 15 --retry-delay 3 --retry-connrefused http://localhost:11434/api/tags

- name: Pull model
  run: ollama pull qwen2.5-coder:7b

The pull runs in the foreground and after the server is up, because ollama pull talks to the running server and the review step needs the download to have finished. This works on any Linux runner but adds several minutes per run.

Second, cache the model files. Ollama stores models under ~/.ollama/models by default. You can cache that directory with actions/cache keyed on the model name, which reduces subsequent pull times to a cache-restore operation, usually under 30 seconds for a warm cache. The cache approach is documented in community workflows and is the most practical path for ephemeral runners.
For GPU runners on cloud providers, actuated.dev offers GPU-enabled ephemeral runners with NVIDIA driver pre-installed. That cuts driver setup time to roughly 30 seconds (cached) and keeps the security model of ephemeral environments while giving you access to the hardware Ollama needs for sub-minute inference on 7B models.
Honest Limits
Automated LLM review works best as a first filter, not a gate. A few specific limits to plan around:
A 7B model will miss logic bugs that require understanding the broader codebase context — any bug that requires tracing through three or four files is unlikely to be caught. The model only sees the diff, not the full project.
Hallucinated findings are real. The model will occasionally flag something as a bug that is intentional. Human reviewers need to treat the output as a checklist to consider, not a verdict to accept. Adding the disclaimer line to the posted comment (as in the script above) makes that expectation explicit.
Diff truncation silently degrades quality. If your PR changes 3,000 lines, the model sees only the first portion. You either need to split by file, raise the truncation limit (and accept worse performance on smaller context models), or move to a 32B+ model with a longer context window.
The model has no knowledge of your codebase conventions. It will not flag violations of internal style guides, project-specific API contracts, or patterns that are acceptable in your context but look wrong in isolation. A .github/REVIEW_GUIDELINES.md pasted into the system prompt can help — up to a point.
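One way to feed project conventions to the model is the system field of Ollama's generate request. The sketch below is an illustration under assumptions: build_payload is a hypothetical replacement for the inline payload dict in the script above, the guidelines path is the one suggested in the text, and the 4,000-character cap is an arbitrary budget so the guidelines do not crowd out the diff.

```python
from pathlib import Path

GUIDELINES_PATH = Path(".github/REVIEW_GUIDELINES.md")


def build_payload(
    model: str,
    prompt: str,
    guidelines_path: Path = GUIDELINES_PATH,
    max_chars: int = 4000,
) -> dict:
    """Build the generate request, adding project guidelines as a system prompt."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if guidelines_path.is_file():
        # Cap the guidelines so they do not crowd the diff out of the context.
        text = guidelines_path.read_text(encoding="utf-8")[:max_chars]
        payload["system"] = "Follow these project review guidelines:\n\n" + text
    return payload
```

If the file is missing, the payload is identical to the original one, so the same script works in repositories that have not written guidelines yet.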
With those limits stated, the setup described here takes an afternoon to wire together and costs nothing ongoing if you have a machine to run it on. For teams where review bandwidth is the bottleneck, filtering out a third of the review noise before a human looks at a PR is a real productivity gain.