NVIDIA CUTLASS: High-Performance CUDA Templates for AI Linear Algebra
A close read of NVIDIA CUTLASS — the header-only CUDA template library behind a surprising amount of modern AI infrastructure. What it is, how its kernel hierarchy works, where CuTe and the Python DSL fit, and when to reach for it.
If you’ve trained a transformer in the last three years, your GPU spent most of its wall-clock time inside a matrix multiplication. The kernels doing that work probably shipped with cuBLAS, were generated by a compiler stack like Triton, or were hand-assembled on top of NVIDIA’s CUTLASS templates. CUTLASS is the one most people don’t see directly, but it sits underneath a surprising amount of modern AI infrastructure, from FlashAttention to vLLM to several internal kernels inside PyTorch.
What CUTLASS actually is
CUTLASS (CUDA Templates for Linear Algebra Subroutines) is a header-only C++ template library NVIDIA publishes on GitHub under the BSD 3-Clause license. It is not a drop-in replacement for cuBLAS. cuBLAS gives you a closed-source binary with a stable API: you call cublasGemmEx and you get a tuned kernel. CUTLASS gives you the building blocks to write your own kernel, with control over tile sizes, data layouts, epilogues, and how the kernel decomposes work across the GPU’s memory hierarchy.
That control is the point. If you’re building a custom inference engine and your projection layer needs to fuse a GEMM with a SiLU activation and a residual add, cuBLAS can’t fuse the epilogue for you — you’d launch the GEMM, then a separate elementwise kernel, paying twice for global memory traffic. With CUTLASS, the epilogue is a template parameter. You write the fusion once, instantiate the template, and the compiler emits a single kernel.
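To make the "epilogue is a template parameter" idea concrete, here is a minimal sketch in plain host-side C++. It is not the CUTLASS API: gemm_fused, SiluResidualEpilogue, and IdentityEpilogue are hypothetical names, and the triple loop stands in for a real tiled GPU kernel. The point is only that the epilogue runs on each accumulator before it is written out, so the activation and residual add need no second pass over the output.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical epilogue: SiLU activation followed by a residual add.
struct SiluResidualEpilogue {
    const float* residual;  // same shape as the output, row-major

    float operator()(float acc, std::size_t idx) const {
        float silu = acc / (1.0f + std::exp(-acc));  // x * sigmoid(x)
        return silu + residual[idx];
    }
};

// Hypothetical epilogue that stores the accumulator unchanged.
struct IdentityEpilogue {
    float operator()(float acc, std::size_t) const { return acc; }
};

// C = epilogue(A * B). A is MxK, B is KxN, C is MxN, all row-major.
// The epilogue type is a template parameter, so the compiler can inline it
// into the store instead of emitting a separate elementwise pass.
template <typename Epilogue>
void gemm_fused(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K, Epilogue epilogue) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            // Fused epilogue: applied while the accumulator is still "in registers".
            C[i * N + j] = epilogue(acc, i * N + j);
        }
    }
}
```

Swapping IdentityEpilogue for SiluResidualEpilogue changes the kernel at compile time without touching the main loop, which is the same design choice the article describes, at a much smaller scale.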
This is why CUTLASS shows up wherever standard cuBLAS shapes don’t fit — unusual data types like FP8, custom epilogues, sparse or grouped GEMMs, attention-shaped matrix products. Anywhere the stock library doesn’t have what someone needs and the performance ceiling matters, you tend to find a CUTLASS kernel.
The hierarchy that makes CUTLASS work
A modern GPU is not flat. An NVIDIA H100 SXM has 132 streaming multiprocessors (SMs), each holding warps of 32 threads, with a tiered memory system spanning registers, shared memory, L2, and HBM. A well-tuned GEMM has to decompose the same multiplication problem at every level of that hierarchy and pick tile sizes that keep the tensor cores fed without spilling.
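A rough way to see why tile size matters at the top of that hierarchy is to count how many output tiles a problem produces and how they spread over the SMs. The sketch below is plain C++ arithmetic, not anything from CUTLASS. It assumes one thread block per output tile and one resident block per SM, which is a simplification, and the names plan_tiles and TilePlan are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Integer division rounded up.
inline std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

struct TilePlan {
    std::size_t tiles;       // output tiles, one per thread block
    std::size_t waves;       // rounds needed if each SM runs one block at a time
    std::size_t tail_tiles;  // tiles in the final, possibly partial, wave
};

// Count tiles for an M x N output split into tile_m x tile_n blocks,
// then spread them over sm_count SMs (assumption: one block per SM).
inline TilePlan plan_tiles(std::size_t M, std::size_t N,
                           std::size_t tile_m, std::size_t tile_n,
                           std::size_t sm_count) {
    assert(M > 0 && N > 0);
    TilePlan plan;
    plan.tiles = ceil_div(M, tile_m) * ceil_div(N, tile_n);
    plan.waves = ceil_div(plan.tiles, sm_count);
    plan.tail_tiles = plan.tiles - (plan.waves - 1) * sm_count;
    return plan;
}
```

For a 4096 x 4096 output with an illustrative 128 x 256 tile on 132 SMs, plan_tiles(4096, 4096, 128, 256, 132) gives 512 tiles, 4 waves, and 116 tiles in the last wave, so under these assumptions 16 SMs have nothing to do at the end.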
CUTLASS encodes this hierarchy directly into its type system:
- Device-level templates describe the full GEMM problem and dispatch to a kernel grid.
- Kernel-level templates describe how a single grid block divides its work.
- Threadblock-level templates describe the tile each block computes, plus the shared-memory staging pattern.
- Warp-level templates map onto tensor core MMA instructions: mma.sync on Ampere, wgmma on Hopper.
- Thread-level templates handle per-thread accumulation and the epilogue.
Each layer takes the layer below it as a template parameter. The compiler instantiates the whole stack at build time, so you pay no virtual-dispatch overhead at runtime — the cost is build time and binary size. A non-trivial CUTLASS kernel can take tens of seconds to compile and produce a multi-megabyte object file. Teams ship CUTLASS-based libraries with ahead-of-time-generated kernels for the shapes they care about, rather than JIT-compiling per request.
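The layering can be sketched in a few dozen lines of ordinary C++. This is an illustration of the pattern, not CUTLASS code: TileShape, WarpTile, BlockTile, and DeviceGemm are hypothetical names, the "warp" layer is a plain loop where a tensor core instruction would go, and matrix dimensions are assumed to be multiples of the tile sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compile-time tile shape (M x N x K).
template <int M, int N, int K>
struct TileShape {
    static constexpr int kM = M, kN = N, kK = K;
};

// "Warp" layer: accumulates one WarpShape-sized tile into C.
template <typename WarpShape>
struct WarpTile {
    using Shape = WarpShape;
    static void mma(const float* A, const float* B, float* C,
                    int lda, int ldb, int ldc) {
        for (int i = 0; i < WarpShape::kM; ++i)
            for (int j = 0; j < WarpShape::kN; ++j)
                for (int k = 0; k < WarpShape::kK; ++k)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
    }
};

// "Threadblock" layer: covers a BlockShape tile by stepping the warp tile.
template <typename BlockShape, typename Warp>
struct BlockTile {
    using Shape = BlockShape;
    static_assert(BlockShape::kM % Warp::Shape::kM == 0 &&
                  BlockShape::kN % Warp::Shape::kN == 0 &&
                  BlockShape::kK % Warp::Shape::kK == 0,
                  "block tile must be a multiple of the warp tile");
    static void mma(const float* A, const float* B, float* C,
                    int lda, int ldb, int ldc) {
        for (int i = 0; i < BlockShape::kM; i += Warp::Shape::kM)
            for (int j = 0; j < BlockShape::kN; j += Warp::Shape::kN)
                for (int k = 0; k < BlockShape::kK; k += Warp::Shape::kK)
                    Warp::mma(A + i * lda + k, B + k * ldb + j,
                              C + i * ldc + j, lda, ldb, ldc);
    }
};

// "Device" layer: walks block tiles over the whole row-major problem.
template <typename Block>
struct DeviceGemm {
    static void run(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int M, int N, int K) {
        assert(M % Block::Shape::kM == 0 && N % Block::Shape::kN == 0 &&
               K % Block::Shape::kK == 0);
        assert(A.size() == static_cast<std::size_t>(M) * K);
        assert(B.size() == static_cast<std::size_t>(K) * N);
        assert(C.size() == static_cast<std::size_t>(M) * N);
        for (float& c : C) c = 0.0f;
        for (int i = 0; i < M; i += Block::Shape::kM)
            for (int j = 0; j < N; j += Block::Shape::kN)
                for (int k = 0; k < K; k += Block::Shape::kK)
                    Block::mma(A.data() + i * K + k, B.data() + k * N + j,
                               C.data() + i * N + j, K, N, N);
    }
};

// The whole stack is fixed at compile time by nesting the types.
using Gemm = DeviceGemm<BlockTile<TileShape<4, 4, 4>, WarpTile<TileShape<2, 2, 2>>>>;
```

Changing a tile size here means changing a template argument and recompiling, which is the trade the paragraph above describes: no runtime dispatch, but every configuration is its own instantiation.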
The payoff is performance close to what NVIDIA’s own profiler reports as the achievable peak for a given shape, with full control over how the kernel behaves. cuBLAS will silently fall back to a generic kernel for unusual shapes; CUTLASS lets you write the specialized one and own the result.
CuTe and the Python DSL
CUTLASS 3.x, released around the Hopper launch, introduced CuTe — short for CUDA Tensors. CuTe is a lower-level tensor algebra library that replaces a lot of the hand-rolled layout math in earlier CUTLASS versions. Instead of writing pointer arithmetic and indexing logic by hand, you describe a layout as a composition of shapes and strides, and CuTe handles the rest.
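The core idea, a layout as a function from coordinates to memory offsets defined by a shape and a stride, fits in a short plain C++ sketch. This does not use the CuTe headers. Layout, row_major, col_major, and transpose below are hypothetical stand-ins, and real CuTe layouts are richer (they nest hierarchically and compose).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// A flat layout: offset = sum over dimensions of coord[r] * stride[r].
template <std::size_t Rank>
struct Layout {
    std::array<std::size_t, Rank> shape;
    std::array<std::size_t, Rank> stride;

    // Map a logical coordinate to a linear offset in memory.
    std::size_t operator()(const std::array<std::size_t, Rank>& coord) const {
        std::size_t offset = 0;
        for (std::size_t r = 0; r < Rank; ++r) {
            assert(coord[r] < shape[r]);
            offset += coord[r] * stride[r];
        }
        return offset;
    }

    // Number of logical elements the layout addresses.
    std::size_t size() const {
        std::size_t n = 1;
        for (std::size_t r = 0; r < Rank; ++r) n *= shape[r];
        return n;
    }
};

// Shape (rows, cols) with stride (cols, 1).
inline Layout<2> row_major(std::size_t rows, std::size_t cols) {
    return Layout<2>{{rows, cols}, {cols, 1}};
}

// Shape (rows, cols) with stride (1, rows).
inline Layout<2> col_major(std::size_t rows, std::size_t cols) {
    return Layout<2>{{rows, cols}, {1, rows}};
}

// A transposed view of the same memory: swap both shape and stride.
inline Layout<2> transpose(const Layout<2>& l) {
    return Layout<2>{{l.shape[1], l.shape[0]}, {l.stride[1], l.stride[0]}};
}
```

With a 4 x 8 row-major layout, coordinate (2, 3) maps to offset 19. The transposed view maps (3, 2) to the same offset 19 without moving any data, which is the kind of indexing logic that otherwise ends up as hand-written pointer arithmetic.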
If you’ve worked with Triton’s block-pointer API or with XLA’s HLO layouts, CuTe will feel familiar in spirit, but it operates at a lower level — it’s designed to give you the same control you’d have writing inline PTX, with composable abstractions instead of macros. Most new CUTLASS kernels targeting Hopper and Blackwell tensor cores are written using CuTe primitives rather than the older threadblock-level abstractions.
CUTLASS 4.x went further and added a Python DSL. You write kernels in a constrained subset of Python that JIT-compiles to GPU code and exposes the same CuTe layout and tensor abstractions as the C++ library, without going through C++ template instantiation. This is aimed at researchers who want to prototype a kernel shape without setting up an NVCC build environment, and at framework authors who want to generate kernels programmatically.
When to reach for CUTLASS — and when not to
CUTLASS is the right tool when three things are true at once: you need a GEMM-shaped computation, you need control cuBLAS doesn’t expose, and you’re willing to spend the engineering time to tune kernels. If any of those is false, reach for something else.
- For standard matrix multiplication in standard data types, cuBLAS is faster to integrate and usually within a few percent of a hand-tuned CUTLASS kernel.
- For experimentation with custom kernels in Python, Triton has a gentler ramp and a much faster compile loop.
- For attention specifically, FlashAttention and similar published kernels are likely already what you’d build.
- For non-NVIDIA hardware, CUTLASS is a non-starter — it’s CUDA-only by design.
The teams that get the most out of CUTLASS are the ones building inference engines, training frameworks, or specialized kernels for novel data types — the cases where the standard library doesn’t have what you need and the gap between “close to peak” and “actually at peak” shows up in the GPU bill.