Running Local LLMs on M4 Mac with 24GB RAM: What Actually Fits
A measured guide to running 7B-32B local language models on a base M4 Mac with 24GB unified memory. Model size math, real tokens/sec numbers, and when Ollama, llama.cpp, or MLX is the right tool.
Apple’s M4 chip put a Neural Engine and unified memory into laptops and desktops that don’t require a server budget. For developers who want to run language models without OpenAI’s bill, the 24GB MacBook Air or Mac mini is the cheapest serious entry point. The question isn’t whether local LLMs work on it — they do. The question is which models fit, how fast they run, and where the cliffs are.
We tested this configuration the way most readers will use it: a base M4 (not Pro or Max), 24GB unified memory, macOS Sequoia 15.x, running Ollama and llama.cpp against models we’d actually use for coding, summarization, and JSON-mode tool calls.
The 24GB Memory Budget
Unified memory means your CPU, GPU, and Neural Engine share one pool. On a 24GB machine, macOS reserves a chunk for itself and apps; by default the GPU can address about 16-18GB of that. You can raise the ceiling with sudo sysctl iogpu.wired_limit_mb=20480 to give Metal more headroom, but pushing it too far makes the system swap and the kernel will refuse outright if you ask for too much. Conservatively, plan on ~18GB for model weights plus KV cache.
That budget rules out 70B-class models entirely (a 70B Q4_K_M GGUF is ~40GB) and makes 30B-class models a tight squeeze. The realistic sweet spot is 7B-14B parameters at 4-bit quantization, with 32B at 4-bit working if you close everything else.
Quick math for GGUF Q4_K_M weights:
- 7B: ~4.5 GB
- 8B (Llama 3.1): ~4.9 GB
- 13B: ~7.5 GB
- 14B (Qwen 2.5): ~9 GB
- 22B (Mistral Small): ~13 GB
- 32B (Qwen 2.5): ~19 GB
- 70B: ~40 GB (won’t fit)
Add 1-3GB for KV cache depending on context length, and you can see where the cliff is.
What Models Actually Run Well
On a base M4 with 24GB, here’s what we measured running Ollama 0.4.x with default settings on a freshly booted machine. Numbers are decode tokens/sec on a 200-token prompt with 500-token output, single user, no batching.
- Llama 3.1 8B Q4_K_M: 24-28 tok/s. Excellent for code completion, summarization, and tool use. The 8B model is the default we’d suggest if you only install one.
- Qwen 2.5 Coder 7B Q4_K_M: 26-30 tok/s. Stronger than Llama 3.1 8B on code-specific tasks (HumanEval and MBPP scores are higher in the official paper). Replace your generalist 8B with this if you mostly write code.
- Qwen 2.5 14B Q4_K_M: 12-14 tok/s. Noticeably smarter on reasoning prompts. Still usable interactively if you’re not waiting on it letter-by-letter.
- Mistral Small 22B Q4_K_M: 7-9 tok/s. Slow enough that you’ll feel it. We’d reach for this only when 14B clearly fails.
- Qwen 2.5 32B Q4_K_M: 4-6 tok/s. Technically fits with the
iogpu.wired_limit_mbbump, but the machine becomes unhappy. Run only when you have nothing else open.
Prompt processing (the time before the first token) scales with prompt length. A 4,000-token prompt on the 14B model takes ~12 seconds to ingest before output starts. For agentic coding workflows that stuff a whole file into context, this matters more than steady-state tokens/sec.
Ollama vs llama.cpp vs MLX
Three tools, three audiences.
Ollama wraps llama.cpp with a model registry, automatic GGUF downloads, and a REST API on localhost:11434. The CLI is two commands: ollama pull qwen2.5-coder:7b and ollama run qwen2.5-coder:7b. This is where you should start. The OpenAI-compatible endpoint at /v1/chat/completions means most existing client libraries work without changes.
llama.cpp is what Ollama runs underneath. Use it directly when you need flags Ollama doesn’t expose: speculative decoding, grammar-constrained output, custom RoPE scaling, or KV cache quantization. The -fa flash attention flag and -ctk q4_0 -ctv q4_0 (quantized KV cache) together can let you push context length significantly further on a 24GB machine.
MLX is Apple’s native ML framework. The mlx-lm package supports the same models in a different format (look for mlx-community/*-4bit repos on Hugging Face). On the same model and quantization, MLX is typically 10-25% faster than llama.cpp on Apple Silicon because it skips the GGUF abstraction. The downside is a smaller ecosystem and fewer integrations. If you only need one model for a specific app, MLX is worth the switch.
Cursor
Cursor's custom OpenAI base URL setting points the editor at any compatible endpoint, including Ollama's :11434. Pair it with Qwen 2.5 Coder 7B for offline autocomplete and inline edits, then keep the cloud model for hard refactors.
Free tier available; Pro at $20/mo
Affiliate link · We earn a commission at no cost to you.
When Local Beats Cloud
Local LLMs aren’t replacing Claude or GPT-4 for every task. The honest tradeoff: a 4-bit 14B model on your laptop is roughly comparable to GPT-3.5 from 2023 on most benchmarks. It loses to current frontier models on hard reasoning, long-context retrieval, and instruction following.
Where local wins:
- Privacy-sensitive code review: you control where the prompt and source go.
- Batch processing: a 5,000-document summarization job over a weekend costs you electricity, not API tokens.
- Offline development: airplanes, training rooms, anywhere the WiFi is unreliable.
- Tool-use prototyping: iterate on tool schemas without paying for each test run.
- Latency-sensitive autocomplete: 30 tokens/sec locally beats cloud round-trip latency for short completions.
If your workflow is “ask a hard question once a day,” cloud models are still the right answer. If it’s “make 500 cheap calls a day to summarize, classify, or autocomplete,” the math favors a one-time hardware purchase.
FAQ
Will a 16GB M4 Mac work for local LLMs? +
Is MLX worth switching to from Ollama? +
How much battery does running a local LLM drain? +
Related reading
2026-05-26
Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output
Orthrus injects diffusion attention into each layer of a frozen autoregressive Transformer to generate 32 tokens in parallel — without altering the base model's output distribution.
2026-05-26
NVIDIA Warp Review: GPU-Accelerated Python for Simulation, Robotics, and Differentiable ML
NVIDIA Warp compiles Python functions to CUDA kernels for differentiable physics and robotics. We benchmarked it against JAX and Taichi to figure out when it earns a spot in your stack.
2026-05-26
OpenAI Daybreak vs Anthropic Glasswing: Convergent Bets on LLM Security Tooling
OpenAI's Daybreak (GPT-5.5 + Codex Security) and Anthropic's Glasswing shipped near-identical AppSec products the same week. What the convergence means and how to pick.
2026-05-26
Macchiato Day 2: Live Token Metrics and Parallel AI Terminals Reviewed
Macchiato's day-2 build adds a live token/cost sidebar and keyboard shortcuts for swapping between Claude Code and OpenCode in one terminal. Here's what shipped and what it means.
2026-05-26
Macchiato Day 2: Live Token Metrics and Parallel Terminals for Claude Code and OpenCode
Macchiato Day 2 adds a 2-4 pane terminal grid, live token and cost meters, and configurable spend ceilings for Claude Code and OpenCode sessions. Here is what it actually does and who should install it.
Get the best tools, weekly
One email every Friday. No spam, unsubscribe anytime.