Best CUDA Books for Learning GPU Programming in 2026
A review of nine CUDA programming books — which hold up against the CUDA 12 toolkit and Hopper architecture, which are out of date, and a working reading order to go from zero to writing your own kernels.
GPU programming sits in an awkward spot: NVIDIA’s official docs are exhaustive but assume you already understand the execution model, while most blog tutorials stop at “vector addition” and leave you guessing how shared memory, warp scheduling, and tensor cores actually fit together. Books fill that gap — if you pick the right ones.
We worked through nine widely recommended CUDA titles and cross-referenced public reading lists like awesome-cuda-books to figure out which still hold up against the CUDA 12 toolkit, the Hopper architecture, and the way most developers actually reach CUDA today: through PyTorch, Triton, or CuPy. The short answer: two books cover most of what you need, and the rest are situational.
Start here: the two books that haven’t aged
Programming Massively Parallel Processors by David Kirk and Wen-mei Hwu (joined by Izzat El Hajj for the 4th edition) is the textbook the field grew up on. The 4th edition (Morgan Kaufmann, 2022) is the one to buy: earlier editions lag well behind current CUDA on topics like unified memory and modern scheduling, and the rewrite is substantial, not cosmetic. It teaches you to think in terms of the hardware: thread blocks mapping to SMs, memory coalescing, occupancy, warp divergence. The pacing assumes a CS background but not prior parallel experience. If you only read one CUDA book, this is it.
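To make one of those ideas concrete, here is a toy model of memory coalescing in plain Python. It is a sketch of ours, not an example from the book, and it rests on simplifying assumptions: a 32-thread warp, 4-byte elements, and a 128-byte transaction segment. Real hardware is more nuanced (sector sizes and caching behaviour vary by architecture), and segments_touched is a hypothetical helper name.

```python
WARP_SIZE = 32        # threads per warp on NVIDIA GPUs
SEGMENT_BYTES = 128   # assumed transaction granularity for this toy model


def segments_touched(stride_elems, elem_bytes=4):
    """Count the distinct memory segments one warp touches.

    Each lane reads one element; lane i reads element i * stride_elems.
    Fewer segments means the loads coalesce into fewer transactions.
    """
    addresses = [lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]
    return len({addr // SEGMENT_BYTES for addr in addresses})


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride {stride:>2}: {segments_touched(stride)} segment(s)")
```

Under these assumptions, a stride of 1 keeps the whole warp inside a single segment, while a stride of 32 puts every thread in its own segment. That is the intuition behind the book's advice to have adjacent threads read adjacent addresses.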
Professional CUDA C Programming by John Cheng, Max Grossman, and Ty McKercher (Wrox, 2014) is older but still one of the clearest books on profiling and optimization. Chapters on Nsight, occupancy tuning, and stream concurrency translate directly to current tooling — the APIs they show are still in CUDA 12. Skip the sections on Kepler-specific quirks; everything else applies.
Going deeper: reference and architecture
Once you have the execution model in your head, three more books pay off.
The CUDA Handbook by Nicholas Wilt (Addison-Wesley, 2013) is the closest thing to an API-level reference in book form. Wilt worked on the CUDA driver at NVIDIA, and it shows — the chapters on streams, events, and the driver vs. runtime API answer questions that the official docs cover only obliquely. The book is dated on hardware, but the driver-level material has barely changed.
Programming in Parallel with CUDA: A Practical Guide by Richard Ansorge (Cambridge University Press, 2022) is the newest book on this list and the only one that consistently uses C++17 and CUDA 11+ idioms throughout. Ansorge writes for scientific computing readers, so the worked examples lean toward stencils, FFTs, and Monte Carlo — useful if you’re moving simulation code to GPU, less useful if your end goal is custom kernels for a PyTorch model.
Learn CUDA Programming by Jaegeun Han and Bharatkumar Sharma (Packt, 2019) covers Volta and Turing including tensor cores, mixed precision, and cuDNN integration. It’s the most ML-adjacent of the general-purpose books, though some of its NCCL and DGX content has been superseded by newer NVIDIA whitepapers.
If you’re coming from Python and ML
Most developers in 2026 hit CUDA through PyTorch custom ops, Triton kernels, or CuPy, not by writing raw .cu files from scratch. The book market hasn’t fully caught up.
Hands-On GPU Programming with Python and CUDA by Brian Tuomanen (Packt, 2018) is the only one of these books written for that path. It’s PyCUDA-centric and parts feel dated (it predates Triton entirely), but the chapters on kernel templating, ctypes interop, and debugging GPU code from Python are still useful as a bridge.
If your endgame is custom PyTorch ops, the realistic reading order is: Kirk & Hwu for the execution model, then PyTorch’s CUDA extension docs, then OpenAI’s Triton documentation, then CUTLASS examples on GitHub. A single book won’t get you there.
What to skip, and a working reading order
CUDA by Example by Sanders and Kandrot (2010) shows up on most lists and is the gentlest introduction, but it’s well over a decade old and predates unified memory, cooperative groups, and most of what makes modern CUDA modern. The first three chapters are fine as a one-evening orientation; after that, you’re learning patterns you’ll have to unlearn. CUDA Programming by Shane Cook (2013) is comprehensive but verbose, and its advice on memory hierarchies is now misleading for any card released after Pascal. Worth borrowing, not buying.
A reading order that actually works:
- Kirk & Hwu, 4th ed. — chapters 1–6 over two weeks, doing every exercise on a real GPU.
- NVIDIA’s CUDA C++ Programming Guide — sections 1–5, alongside the book.
- Professional CUDA C Programming — chapters on Nsight and optimization, skipping the architecture chapters.
- Ansorge or Wilt depending on goal (scientific computing vs. systems-level).
- Tuomanen only if you’re sticking with Python interop.
Budget roughly 60–80 hours of focused work to get from zero to writing a non-trivial kernel that actually beats a well-tuned library call. Most of that time is profiling, not coding.