Running Local LLMs on M4 Mac with 24GB RAM: What Actually Fits

A measured guide to running 7B-32B local language models on a base M4 Mac with 24GB unified memory. Model size math, real tokens/sec numbers, and when Ollama, llama.cpp, or MLX is the right tool.

Apple’s M4 chip brings a faster Neural Engine and higher-bandwidth unified memory to laptops and desktops that don’t require a server budget. For developers who want to run language models without OpenAI’s bill, the 24GB MacBook Air or Mac mini is the cheapest serious entry point. The question isn’t whether local LLMs work on it; they do. The question is which models fit, how fast they run, and where the cliffs are.

We tested this configuration the way most readers will use it: a base M4 (not Pro or Max), 24GB unified memory, macOS Sequoia 15.x, running Ollama and llama.cpp against models we’d actually use for coding, summarization, and JSON-mode tool calls.

The 24GB Memory Budget

Unified memory means your CPU, GPU, and Neural Engine share one pool. On a 24GB machine, macOS reserves a chunk for itself and apps; by default the GPU can address about 16-18GB of that. You can raise the ceiling with sudo sysctl iogpu.wired_limit_mb=20480 to give Metal more headroom, but pushing it too far makes the system swap and the kernel will refuse outright if you ask for too much. Conservatively, plan on ~18GB for model weights plus KV cache.
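
To check the current ceiling and raise it, the commands below are a minimal sketch; the 20480 MB figure is the one used above, and the setting reverts to the default on reboot.

  # Read the current Metal wired-memory ceiling (0 means macOS is using its default split)
  sysctl iogpu.wired_limit_mb

  # Raise it to 20 GB for this boot; it resets when you restart
  sudo sysctl iogpu.wired_limit_mb=20480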

That budget rules out 70B-class models entirely (a 70B Q4_K_M GGUF is ~40GB) and makes 30B-class models a tight squeeze. The realistic sweet spot is 7B-14B parameters at 4-bit quantization, with 32B at 4-bit working if you close everything else.

Quick math for GGUF Q4_K_M weights:

  • 7B: ~4.5 GB
  • 8B (Llama 3.1): ~4.9 GB
  • 13B: ~7.5 GB
  • 14B (Qwen 2.5): ~9 GB
  • 22B (Mistral Small): ~13 GB
  • 32B (Qwen 2.5): ~19 GB
  • 70B: ~40 GB (won’t fit)

Add 1-3GB for KV cache depending on context length, and you can see where the cliff is.
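
Working backwards from those figures, Q4_K_M comes out to roughly 0.6 GB per billion parameters. A rough fit check against the ~18GB budget looks like this (a sketch; the flat 2GB KV cache figure is an assumption for a mid-sized context, not a measurement):

  # Q4_K_M weights ≈ 0.6 GB per billion params, plus ~2 GB assumed for KV cache,
  # compared against the ~18 GB of GPU-addressable memory on a 24 GB M4
  for b in 8 14 22 32 70; do
    echo "${b}B: $(echo "$b * 0.6 + 2" | bc) GB needed vs ~18 GB available"
  done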

What Models Actually Run Well

On a base M4 with 24GB, here’s what we measured running Ollama 0.4.x with default settings on a freshly booted machine. Numbers are decode tokens/sec on a 200-token prompt with 500-token output, single user, no batching.

  • Llama 3.1 8B Q4_K_M: 24-28 tok/s. Excellent for code completion, summarization, and tool use. The 8B model is the default we’d suggest if you only install one.
  • Qwen 2.5 Coder 7B Q4_K_M: 26-30 tok/s. Stronger than Llama 3.1 8B on code-specific tasks (HumanEval and MBPP scores are higher in the official paper). Replace your generalist 8B with this if you mostly write code.
  • Qwen 2.5 14B Q4_K_M: 12-14 tok/s. Noticeably smarter on reasoning prompts. Still usable interactively if you’re not waiting on it letter-by-letter.
  • Mistral Small 22B Q4_K_M: 7-9 tok/s. Slow enough that you’ll feel it. We’d reach for this only when 14B clearly fails.
  • Qwen 2.5 32B Q4_K_M: 4-6 tok/s. Technically fits with the iogpu.wired_limit_mb bump, but the rest of the system starts to swap. Run it only when you have nothing else open.

Prompt processing (the time before the first token) scales with prompt length. A 4,000-token prompt on the 14B model takes ~12 seconds to ingest before output starts. For agentic coding workflows that stuff a whole file into context, this matters more than steady-state tokens/sec.
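
If you want to sanity-check these numbers on your own hardware, Ollama prints a timing breakdown after each response when run with --verbose, including prompt eval rate (ingestion) and eval rate (decode tokens/sec). The prompt below is just a placeholder; the qwen2.5:14b tag matches the 14B model above.

  # One-shot benchmark: --verbose reports prompt eval rate and eval rate once the response finishes
  ollama run --verbose qwen2.5:14b "Summarize the tradeoffs of 4-bit quantization in three sentences."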

Ollama vs llama.cpp vs MLX

Three tools, three audiences.

Ollama wraps llama.cpp with a model registry, automatic GGUF downloads, and a REST API on localhost:11434. The CLI is two commands: ollama pull qwen2.5-coder:7b and ollama run qwen2.5-coder:7b. This is where you should start. The OpenAI-compatible endpoint at /v1/chat/completions means most existing client libraries work without changes.
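
A quick way to confirm that compatibility from the terminal, assuming you’ve already pulled qwen2.5-coder:7b:

  # Chat completion against Ollama's OpenAI-compatible endpoint
  curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "qwen2.5-coder:7b",
      "messages": [{"role": "user", "content": "Write a bash one-liner that counts lines of Python in a repo."}]
    }'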

llama.cpp is what Ollama runs underneath. Use it directly when you need flags Ollama doesn’t expose: speculative decoding, grammar-constrained output, custom RoPE scaling, or KV cache quantization. The -fa flash attention flag and -ctk q4_0 -ctv q4_0 (quantized KV cache) together can let you push context length significantly further on a 24GB machine.
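
Assuming you’ve built llama.cpp and downloaded a GGUF yourself, a server invocation with those flags looks roughly like this; the model path, context length, and port are placeholders.

  # Serve a GGUF with flash attention and a 4-bit-quantized KV cache; -ngl 99 keeps
  # every layer on the GPU, and the quantized cache buys a longer usable context
  ./llama-server -m ./models/qwen2.5-coder-7b-q4_k_m.gguf \
    -ngl 99 -c 16384 -fa -ctk q4_0 -ctv q4_0 --port 8080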

MLX is Apple’s native ML framework. The mlx-lm package supports the same models in a different format (look for mlx-community/*-4bit repos on Hugging Face). On the same model and quantization, MLX is typically 10-25% faster than llama.cpp on Apple Silicon because it skips the GGUF abstraction. The downside is a smaller ecosystem and fewer integrations. If you only need one model for a specific app, MLX is worth the switch.
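
Getting started is one pip install, and a generation call looks roughly like the sketch below. The mlx-community repo name is an example of the naming pattern, so check Hugging Face for the exact conversion you want; flag names follow the mlx-lm README.

  # Install Apple's mlx-lm tooling and generate from a 4-bit community conversion
  pip install mlx-lm
  python -m mlx_lm.generate \
    --model mlx-community/Qwen2.5-Coder-7B-Instruct-4bit \
    --prompt "Explain the difference between a mutex and a semaphore." \
    --max-tokens 200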

Cursor

Cursor's custom OpenAI base URL setting points the editor at any compatible endpoint, including Ollama's :11434. Pair it with Qwen 2.5 Coder 7B for offline autocomplete and inline edits, then keep the cloud model for hard refactors.

Free tier available; Pro at $20/mo

When Local Beats Cloud

Local LLMs aren’t replacing Claude or GPT-4 for every task. The honest tradeoff: a 4-bit 14B model on your laptop is roughly comparable to GPT-3.5 from 2023 on most benchmarks. It loses to current frontier models on hard reasoning, long-context retrieval, and instruction following.

Where local wins:

  • Privacy-sensitive code review: you control where the prompt and source go.
  • Batch processing: a 5,000-document summarization job over a weekend costs you electricity, not API tokens.
  • Offline development: airplanes, training rooms, anywhere the WiFi is unreliable.
  • Tool-use prototyping: iterate on tool schemas without paying for each test run.
  • Latency-sensitive autocomplete: 30 tokens/sec locally beats cloud round-trip latency for short completions.

If your workflow is “ask a hard question once a day,” cloud models are still the right answer. If it’s “make 500 cheap calls a day to summarize, classify, or autocomplete,” the math favors a one-time hardware purchase.

FAQ

Will a 16GB M4 Mac work for local LLMs?
Yes for 7B-8B models at 4-bit, but you'll be tight. macOS plus a browser eats 6-8GB, leaving roughly 8GB for the GPU. An 8B Q4 model fits with limited context length. 16GB is workable for experimentation; 24GB is the practical floor for serious use, and 36GB or 48GB Pro/Max configurations give you headroom for 14B-32B models.
Is MLX worth switching to from Ollama?
For exploration, no. Ollama's model registry and OpenAI-compatible API save real time. For a production app where you've picked one model and want the last 15% of throughput, MLX is worth the migration. The mlx-lm CLI mirrors most of what Ollama does.
How much battery does running a local LLM drain?
Continuous inference pegs the GPU and pulls roughly 25-40W on base M4. Expect 2-3 hours of battery life if a model is hot-loaded and serving requests. Idle Ollama with no active generation costs almost nothing — by default it unloads models after 5 minutes of inactivity.