Best CUDA Books for Learning GPU Programming in 2026

A review of nine CUDA programming books — which hold up against the CUDA 12 toolkit and Hopper architecture, which are out of date, and a working reading order to go from zero to writing your own kernels.

GPU programming sits in an awkward spot: NVIDIA’s official docs are exhaustive but assume you already understand the execution model, while most blog tutorials stop at “vector addition” and leave you guessing how shared memory, warp scheduling, and tensor cores actually fit together. Books fill that gap — if you pick the right ones.

We worked through nine widely-recommended CUDA titles and cross-referenced public reading lists like awesome-cuda-books to figure out which still hold up against the CUDA 12 toolkit, Hopper architecture, and the way most developers actually reach CUDA today — through PyTorch, Triton, or CuPy. The short answer: two books cover most of what you need, and the rest are situational.

Start here: the two books that haven’t aged

Programming Massively Parallel Processors by David Kirk and Wen-mei Hwu (joined by Izzat El Hajj for the current edition) is the textbook the field grew up on. The 4th edition (Morgan Kaufmann, 2022) is the one to buy: earlier editions predate unified memory and modern scheduling, and the rewrite is substantial, not cosmetic. It teaches you to think in terms of the hardware: thread blocks mapping to SMs, memory coalescing, occupancy, and warp divergence. The pacing assumes a CS background but not prior parallel experience. If you only read one CUDA book, this is it.
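Memory coalescing is the easiest of those ideas to see in numbers. The sketch below is our own CPU-side model, not code from the book: it counts how many memory sectors a single 32-thread warp touches when thread t reads element t * stride of a float32 array. The 32-byte sector size is an assumption chosen for illustration, and the names `sectors_touched` and `efficiency` are hypothetical.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # ASSUMPTION: memory is fetched in 32-byte sectors
ELEM_BYTES = 4      # one float32 per thread


def sectors_touched(stride, base=0):
    """Count the distinct sectors one warp touches when
    thread t reads element (base + t * stride)."""
    sectors = set()
    for t in range(WARP_SIZE):
        byte_addr = (base + t * stride) * ELEM_BYTES
        sectors.add(byte_addr // SECTOR_BYTES)
    return len(sectors)


def efficiency(stride):
    """Fraction of the fetched bytes that the warp actually uses."""
    useful = WARP_SIZE * ELEM_BYTES
    moved = sectors_touched(stride) * SECTOR_BYTES
    return useful / moved


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 32):
        print(stride, sectors_touched(stride), efficiency(stride))
```

Under these assumptions a unit-stride read uses every byte it fetches, while a stride of 8 or more elements puts each thread in its own sector and wastes most of the bandwidth. Real hardware adds caches and alignment effects that this model ignores, so treat it as intuition, not a performance predictor.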

Professional CUDA C Programming by John Cheng, Max Grossman, and Ty McKercher (Wrox, 2014) is older but still one of the clearest books on profiling and optimization. Its chapters on profiling, occupancy tuning, and stream concurrency carry over to current tooling: the book profiles with nvprof and the Visual Profiler, which Nsight Systems and Nsight Compute have since replaced, but the methodology is the same, and most of the runtime APIs it shows are still in CUDA 12. Skip the sections on Kepler-specific quirks; almost everything else applies.
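Occupancy tuning is at heart a resource-limit calculation: how many thread blocks fit on one SM before threads, registers, or shared memory run out. Here is a deliberately simplified sketch of that arithmetic. The per-SM limits in `SMLimits` are placeholder values we picked for illustration, not the specification of any particular GPU, and the model ignores the allocation granularity that real hardware applies. For real numbers, use the CUDA runtime's `cudaOccupancyMaxActiveBlocksPerMultiprocessor` or a profiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    # ASSUMPTION: illustrative per-SM limits, not taken from a datasheet.
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Return resident threads / max threads for one SM (0.0 to 1.0)."""
    if threads_per_block <= 0 or regs_per_thread <= 0:
        raise ValueError("threads_per_block and regs_per_thread must be positive")
    by_threads = sm.max_threads // threads_per_block
    by_regs = sm.registers // (regs_per_thread * threads_per_block)
    # A kernel that uses no shared memory is not limited by it.
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks
    resident_blocks = min(sm.max_blocks, by_threads, by_regs, by_smem)
    return resident_blocks * threads_per_block / sm.max_threads


if __name__ == "__main__":
    print(occupancy(256, 32, 0))       # limited by threads and registers equally
    print(occupancy(256, 64, 0))       # register pressure halves resident blocks
    print(occupancy(256, 32, 32768))   # shared memory becomes the limiter
```

The point of the exercise is to see which resource is the limiter for a given launch configuration, because that tells you which knob is worth turning.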

Going deeper: reference and architecture

Once you have the execution model in your head, three more books pay off: two reference-style titles and one that leans toward ML.

The CUDA Handbook by Nicholas Wilt (Addison-Wesley, 2013) is the closest thing to an API-level reference in book form. Wilt worked on the CUDA driver at NVIDIA, and it shows — the chapters on streams, events, and the driver vs. runtime API answer questions that the official docs cover only obliquely. The book is dated on hardware, but the driver-level material has barely changed.

Programming in Parallel with CUDA: A Practical Guide by Richard Ansorge (Cambridge University Press, 2022) is the newest book on this list and the only one that consistently uses C++17 and CUDA 11+ idioms throughout. Ansorge writes for scientific computing readers, so the worked examples lean toward stencils, FFTs, and Monte Carlo — useful if you’re moving simulation code to GPU, less useful if your end goal is custom kernels for a PyTorch model.

Learn CUDA Programming by Jaegeun Han and Bharatkumar Sharma (Packt, 2019) covers Volta and Turing, including tensor cores, mixed precision, and cuDNN integration. It’s the most ML-adjacent of the general-purpose books, though some of its NCCL and DGX content has been superseded by newer NVIDIA whitepapers.

If you’re coming from Python and ML

Most developers in 2026 reach CUDA through PyTorch custom ops, Triton kernels, or CuPy, not by writing raw .cu files from scratch. Book publishing hasn’t fully caught up with that shift.

Hands-On GPU Programming with Python and CUDA by Brian Tuomanen (Packt, 2019) is the only one of these books written for that path. It’s PyCUDA-centric and parts feel dated (it predates Triton entirely), but the chapters on kernel templating, ctypes interop, and debugging GPU code from Python are still useful as a bridge.
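To make "kernel templating" concrete, here is a minimal sketch of one common approach from the Python side: generating CUDA C source as a string, specialized per data type, before handing it to a runtime compiler. This is our illustration, not code from the book. The names `KERNEL_TEMPLATE` and `render_kernel` are hypothetical, and the sketch only builds the source text, so it runs without a GPU.

```python
from string import Template

# CUDA C source with placeholders for the element type.
# ASSUMPTION: a simple in-place scaling kernel chosen for illustration.
KERNEL_TEMPLATE = Template("""
extern "C" __global__ void scale_${name}(${dtype} *data, ${dtype} factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * factor;
    }
}
""")


def render_kernel(dtype):
    """Return CUDA source for a scaling kernel specialized to `dtype`."""
    name = dtype.replace(" ", "_")  # e.g. "unsigned int" -> "unsigned_int"
    return KERNEL_TEMPLATE.substitute(name=name, dtype=dtype)


if __name__ == "__main__":
    print(render_kernel("float"))
    print(render_kernel("double"))
```

In a PyCUDA workflow the rendered string would typically be passed to `pycuda.compiler.SourceModule` for compilation. We leave that step out here because it needs a GPU and an installed toolkit.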

If your endgame is custom PyTorch ops, the realistic reading order is: Kirk & Hwu for the execution model, then PyTorch’s CUDA extension docs, then OpenAI’s Triton documentation, then CUTLASS examples on GitHub. A single book won’t get you there.

What to skip, and a working reading order

CUDA by Example by Sanders and Kandrot (2010) shows up on most lists and is the gentlest introduction, but it’s well over a decade old and predates unified memory, cooperative groups, and most of what makes modern CUDA modern. The first three chapters are fine as a one-evening orientation; after that, you’re learning patterns you’ll have to unlearn. CUDA Programming by Shane Cook (2013) is comprehensive but verbose, and its advice on memory hierarchies is now misleading for any card released after Pascal. Worth borrowing, not buying.

A reading order that actually works:

  1. Kirk & Hwu, 4th ed.: chapters 1–6 over two weeks, doing every exercise on a real GPU.
  2. NVIDIA’s CUDA C++ Programming Guide: sections 1–5, alongside the book.
  3. Professional CUDA C Programming: the profiling and optimization chapters, skipping the architecture chapters.
  4. Ansorge or Wilt, depending on your goal (scientific computing vs. systems-level work).
  5. Tuomanen, only if you’re sticking with Python interop.

Budget roughly 60–80 hours of focused work to get from zero to writing a non-trivial kernel that actually beats a well-tuned library call. Most of that time is profiling, not coding.

FAQ

Is Kirk and Hwu's 4th edition worth buying if I own the 3rd?
Yes, if you target Volta or newer. The 4th edition rewrites the memory and scheduling chapters around unified memory and the modern execution model, and adds material on tensor cores and cooperative groups that the 3rd edition does not cover.
Do any of these books cover Triton or CUTLASS?
No. None of the books reviewed here covers either. The official OpenAI Triton tutorials and the CUTLASS examples directory on GitHub are the primary references, and they assume you already understand CUDA fundamentals, which is where the books pay off.
Is it still worth learning raw CUDA in 2026 if I work mostly in PyTorch?
Yes, in two cases: writing custom ops where existing kernels are a bottleneck, and reading the source of libraries you depend on. For everything else, Triton plus a working understanding of the execution model from one textbook is usually enough.
