Automate Python Code Reviews with Free Local LLMs and GitHub Actions
Wire an open-weight model running in Ollama into a GitHub Actions workflow to get automated first-pass code-review comments on Python pull requests — no API bill required.
Paying for GPT-4o or Claude API calls every time someone opens a pull request adds up quickly on a busy repo. A self-hosted Ollama instance on a machine you already own — or a GPU-enabled self-hosted GitHub Actions runner — lets you run a capable open-weight model for the cost of electricity. The result is a first-pass automated review that catches common Python issues and leaves a comment on the PR before any human reads the diff.
This is not a replacement for human review. An open-weight 7B model running locally will miss subtle concurrency bugs, architectural problems, and context it has never seen. What it reliably does is reduce the amount of low-signal noise a human reviewer has to wade through: undocumented parameters, obvious type mismatches, functions that shadow builtins, missing error handling in obvious paths. That alone is worth setting up if your team is small and review time is scarce.
The Shape of the Workflow
The basic loop has four parts: a GitHub Actions workflow triggers on pull_request, a Python script fetches the diff via the GitHub REST API, the script sends that diff to a locally running Ollama server, and the response is posted back through the same API as a comment on the PR conversation (a regular comment, not an inline review comment).
Here is a minimal workflow file for a self-hosted runner that has Ollama already installed and the model pre-pulled:
name: LLM Code Review

on:
  pull_request:
    types: [opened, synchronize]
    paths:
      - '**.py'

jobs:
  review:
    runs-on: self-hosted  # requires GPU runner with Ollama installed
    permissions:
      pull-requests: write
      contents: read
    steps:
      - uses: actions/checkout@v4

      - name: Wait for Ollama
        run: curl --retry 10 --retry-delay 2 --retry-connrefused http://localhost:11434/api/tags

      - name: Run review script
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR_NUMBER: ${{ github.event.pull_request.number }}
          REPO: ${{ github.repository }}
          MODEL: qwen2.5-coder:7b
        run: python scripts/llm_review.py

The paths filter limits runs to PRs that touch Python files, which avoids burning runner time on documentation-only changes. If your runner is not persistent (for example, you spin it up on demand), replace the Wait for Ollama step with steps that install Ollama, start the server, and pull the model before the review step.
The Python review script does three things: fetch the diff, prompt the model, post the comment. Here is a stripped-down version:
import json
import os
import textwrap
import urllib.request

GITHUB_API = "https://api.github.com"
OLLAMA_URL = "http://localhost:11434/api/generate"


def gh(path, method="GET", body=None):
    """Call the GitHub REST API and return the decoded JSON response."""
    token = os.environ["GH_TOKEN"]
    req = urllib.request.Request(
        f"{GITHUB_API}{path}",
        data=json.dumps(body).encode() if body else None,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        method=method,
    )
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read())


def get_diff():
    """Fetch the pull request as a unified diff (selected via the Accept header)."""
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    req = urllib.request.Request(
        f"{GITHUB_API}/repos/{repo}/pulls/{pr}",
        headers={
            "Authorization": f"Bearer {os.environ['GH_TOKEN']}",
            "Accept": "application/vnd.github.v3.diff",
        },
    )
    with urllib.request.urlopen(req) as r:
        return r.read().decode()


def ask_ollama(diff):
    """Send the (truncated) diff to Ollama and return the model's reply."""
    # Dedent the instructions before appending the diff. Dedenting after
    # interpolation does nothing, because later diff lines start at column 0.
    instructions = textwrap.dedent("""\
        You are a Python code reviewer. Review the following git diff for:
        - Bugs or likely runtime errors
        - Missing or incorrect type annotations
        - Functions that shadow Python builtins
        - Missing error handling in obvious paths
        - Style issues that violate PEP 8

        Be concise. List specific findings only. Do not repeat the diff back.
        If the change looks correct, say so briefly.
        """)
    prompt = f"{instructions}\nDIFF:\n{diff[:12000]}\n"
    payload = {"model": os.environ["MODEL"], "prompt": prompt, "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as r:
        return json.loads(r.read())["response"]


def post_comment(body):
    """Post a comment on the PR conversation (PRs share the issues endpoint)."""
    repo = os.environ["REPO"]
    pr = os.environ["PR_NUMBER"]
    gh(f"/repos/{repo}/issues/{pr}/comments", method="POST", body={"body": body})


if __name__ == "__main__":
    diff = get_diff()
    if not diff.strip():
        print("Empty diff, skipping.")
    else:
        review = ask_ollama(diff)
        post_comment(
            f"**LLM first-pass review** (model: `{os.environ['MODEL']}`)\n\n"
            f"{review}\n\n---\n"
            "*Automated review. Not a substitute for human review.*"
        )

The diff is truncated at 12,000 characters before being sent to the model. At a rough rule of thumb of four characters per token, that is about 3,000 tokens. For a 7B model running with a 4K–8K context window, sending a 40-file diff wholesale will be silently truncated or produce incoherent output, while the 12,000-character ceiling leaves room for the instructions and the reply and still covers most single-feature PRs. Ollama's effective context length is controlled by its num_ctx option and may be smaller than the model's maximum, so check the value on your runner before raising the limit. For larger diffs, you can split by file and send one prompt per changed file, then aggregate.
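The split-by-file approach mentioned above can be sketched roughly as follows. This is a minimal illustration rather than part of the original script: the helper names split_diff_by_file and python_chunks are hypothetical, and the sketch assumes the diff is standard git output where each file section starts with a "diff --git a/... b/..." header.

```python
import re


def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split a unified git diff into {path: diff text for that file}.

    Hypothetical helper: assumes every file section begins with a
    'diff --git a/<path> b/<path>' header line, as git produces.
    """
    chunks: dict[str, str] = {}
    # Zero-width split: cut right before each header at the start of a line.
    for section in re.split(r"(?m)^(?=diff --git )", diff):
        if not section.startswith("diff --git "):
            continue  # skip the empty leading piece or any preamble
        header = section.splitlines()[0]
        # Take the "b/" side of the header, i.e. the path after the change.
        path = header.rsplit(" b/", 1)[-1]
        chunks[path] = section
    return chunks


def python_chunks(diff: str, limit: int = 12000) -> dict[str, str]:
    """Keep only .py files and cap each per-file chunk at `limit` characters."""
    return {
        path: text[:limit]
        for path, text in split_diff_by_file(diff).items()
        if path.endswith(".py")
    }
```

A caller would then loop over python_chunks(diff), pass each value to ask_ollama, and join the replies under per-file headings before posting a single comment. Capping each chunk separately means one very large file no longer crowds out every file that follows it in the diff.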
Choosing a Model
Three models are worth considering for this specific task. The tradeoffs map directly to the RAM available on your runner.
qwen2.5-coder:7b is the practical default. It runs in approximately 6–7 GB of VRAM or RAM, fits on a consumer GPU (RTX 3060 or similar), and performs well on Python-focused tasks. Alibaba’s Qwen2.5-Coder series was explicitly trained on code, which matters more for targeted review work than general instruction-following ability.
mistral:7b is an acceptable alternative if you already have it pulled or if you want a model with stronger general-language generation for more verbose review comments. It is not specifically trained on code, so it will miss some language-specific patterns that a coder model catches, but its instruction-following is reliable.
qwen2.5-coder:32b or similar 30B+ models produce noticeably better reviews (they can reason about multi-file interactions and catch subtler bugs) but require roughly 22–24 GB of VRAM. That means a 24 GB card such as an RTX 3090 or 4090 at minimum, and a datacenter GPU or multi-GPU setup if you want headroom for longer contexts, which changes the cost calculus significantly.
For a first deployment, start with qwen2.5-coder:7b. You can upgrade the model string in the workflow env var without touching anything else.
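Because the model is just a string in the workflow, a typo or an un-pulled tag only shows up when the generate call fails. One way to catch that earlier is to check the model against the same /api/tags endpoint the workflow already polls. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it assumes the response has a top-level "models" list whose entries carry a "name" field.

```python
import json
import urllib.request


def model_is_available(tags: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response.

    Assumes the response shape {"models": [{"name": "<tag>"}, ...]}.
    Matching is exact, so "mistral" will not match "mistral:7b".
    """
    names = {entry.get("name") for entry in tags.get("models", [])}
    return model in names


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally pulled models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as r:
        return json.loads(r.read())
```

In the review script this could run before get_diff(), exiting with a clear message if model_is_available(fetch_tags(), os.environ["MODEL"]) is false, so a misconfigured runner fails fast instead of timing out.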
Self-Hosted Runners and the Cold-Start Problem
If you run Ollama on a persistent self-hosted runner (a spare workstation, a homelab server, or a cloud VM you control), the model files stay on disk between runs and, if the keep-alive period is long enough, the model stays loaded in memory, so job startup time drops to a few seconds. You register the runner from the repository's or organization's Settings → Actions → Runners page, and it then picks up jobs like any other runner.
The cold-start problem appears when you do not have a persistent machine. In that case, you have two options. First, install Ollama and pull the model at the start of every job:
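- name: Install Ollama
  run: curl -fsSL https://ollama.com/install.sh | sh

- name: Start Ollama server
  run: ollama serve &

- name: Wait for server
  run: curl --retry 15 --retry-delay 3 --retry-connrefused http://localhost:11434/api/tags

- name: Pull model
  run: ollama pull qwen2.5-coder:7b

The server has to be running before the pull, because ollama pull talks to it, and the pull runs in the foreground so the review step does not start before the download finishes. On hosts with systemd the install script may already start Ollama as a service, in which case the explicit serve step is redundant. This works on any Linux runner but adds several minutes per run.

Second, cache the model files. Ollama stores models under ~/.ollama/models by default (the location can differ when Ollama runs as a system service). You can cache that directory with actions/cache keyed on the model name, which reduces subsequent pulls to a cache-restore operation, usually under 30 seconds for a warm cache. The cache approach is documented in community workflows and is the most practical path for ephemeral runners.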
For GPU runners on cloud providers, actuated.dev offers GPU-enabled ephemeral runners with NVIDIA driver pre-installed. That cuts driver setup time to roughly 30 seconds (cached) and keeps the security model of ephemeral environments while giving you access to the hardware Ollama needs for sub-minute inference on 7B models.
Honest Limits
Automated LLM review works best as a first filter, not a gate. A few specific limits to plan around:
A 7B model will miss logic bugs that require understanding the broader codebase context — any bug that requires tracing through three or four files is unlikely to be caught. The model only sees the diff, not the full project.
Hallucinated findings are real. The model will occasionally flag something as a bug that is intentional. Human reviewers need to treat the output as a checklist to consider, not a verdict to accept. Adding the disclaimer line to the posted comment (as in the script above) makes that expectation explicit.
Diff truncation silently degrades quality. If your PR changes 3,000 lines, the model sees only the first portion. You either need to split by file, raise the truncation limit (and accept worse performance on smaller context models), or move to a 32B+ model with a longer context window.
The model has no knowledge of your codebase conventions. It will not flag violations of internal style guides, project-specific API contracts, or patterns that are acceptable in your context but look wrong in isolation. A .github/REVIEW_GUIDELINES.md pasted into the system prompt can help — up to a point.
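Two of the limits above, silent truncation and missing project conventions, can be softened in the request itself. The sketch below is one possible approach, not the article's script: build_payload is a hypothetical helper, the guidelines path mirrors the file name mentioned above, and it assumes the generate endpoint accepts a "system" field alongside "prompt".

```python
from pathlib import Path


def build_payload(
    diff: str,
    model: str,
    guidelines_path: str = ".github/REVIEW_GUIDELINES.md",
    limit: int = 12000,
    max_guideline_chars: int = 4000,
) -> tuple[dict, bool]:
    """Build an Ollama generate payload and report whether the diff was cut.

    Hypothetical helper. Returns (payload, truncated) so the caller can say
    so in the posted comment instead of truncating silently.
    """
    truncated = len(diff) > limit
    payload = {
        "model": model,
        "prompt": "DIFF:\n" + diff[:limit],
        "stream": False,
    }
    path = Path(guidelines_path)
    if path.is_file():
        # Cap the guidelines too, since they compete with the diff for context.
        guidelines = path.read_text(encoding="utf-8")[:max_guideline_chars]
        payload["system"] = (
            "You are a Python code reviewer. "
            "Follow these project conventions:\n" + guidelines
        )
    return payload, truncated
```

When truncated is true, the script could append a line such as "Only the first part of this diff was reviewed" to the comment, which turns a hidden quality drop into something the human reviewer can see. The guidelines cap is an arbitrary illustrative number, not a tuned value.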
With those limits stated, the setup described here takes an afternoon to wire together and costs nothing ongoing if you have a machine to run it on. For teams where review bandwidth is the bottleneck, filtering out a third of the review noise before a human looks at a PR is a real productivity gain.