This note, mostly generated by Gemini, briefly explores the different compilation paths a simple line of Python like `torch.matmul` can take before it executes on a specific hardware unit like a Tensor Core.
Understanding these paths is key to seeing why the ML compilation stack exists, what “programmability” really means, and how developers make crucial trade-offs between ease-of-use, flexibility, and raw hardware performance.
Core Problem: The Two-Language Dilemma
The entire ML compilation stack exists to solve one problem: bridging the gap between how humans write AI code and how hardware runs it.
- Python (High-Level): Dynamic, flexible, and easy to use. Variables can change types (e.g., `x = 10`, then `x = "hi"`). This is great for the programmer but terrible for a compiler, which can’t predict what the code will do.
- Hardware (Low-Level): Static, rigid, and fast. A Tensor Core is an ASIC (Application-Specific Integrated Circuit) that only does matrix math. It needs fixed, predictable instructions.
The stack’s job is to translate the dynamic, high-level code into static, low-level instructions.
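As a tiny illustration of the dynamic side of this gap, the same Python function can mean three completely different things at runtime, so there is no single block of machine code a compiler could emit for it ahead of time. A minimal sketch (the function name is just for illustration):

```python
import torch

def double(x):
    return x + x  # which "add" is this? A compiler can't know ahead of time.

double(10)                                 # integer addition on the CPU
double("hi")                               # string concatenation ("hihi")
double(torch.randn(4, 4, device="cuda"))   # an elementwise CUDA kernel launch
```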
Path A: The “Eager” Default Path (What happens without torch.compile)
This is the traditional, step-by-step path. It’s simple but has a major bottleneck.
The Workflow:
- You (Python): You write `torch.matmul(a, b)`.
- ATen Backend (CPU): Your call goes to the PyTorch C++ backend (ATen), which is a pre-compiled binary (`.so` or `.dll`) running on your CPU.
- Dispatcher (CPU): ATen acts as a “dispatcher.” It sees you’re on a GPU and need a `matmul`. It doesn’t write a new kernel; it makes a function call into a pre-compiled NVIDIA library:
  - cuBLAS: For math (e.g., `matmul`).
  - cuDNN: For neural network layers (e.g., `convolution`).
- cuBLAS (CPU): The cuBLAS library (also running on the CPU) looks at your matrix shapes (e.g., `1024x512`) and uses a heuristic to select the best pre-compiled `matmul` kernel (e.g., “Kernel #73”) from its collection.
- PTX to SASS (CPU): This kernel is stored as PTX (NVIDIA’s virtual assembly language). The NVIDIA driver (CPU software) performs a Just-in-Time (JIT) compilation to translate that PTX into SASS, the final binary machine code specific to your exact GPU (e.g., an H100).
- CUDA Runtime: Throughout this, the ATen/cuBLAS C++ code uses the CUDA Runtime API to manage the GPU. This API is the “middle manager” that sends commands to the driver, such as:
  - `cudaMalloc()` (to allocate memory)
  - `cudaMemcpy()` (to move data)
  - `cudaLaunchKernel()` (to tell the GPU to run the SASS binary)
- Tensor Core (GPU): The GPU’s scheduler finally executes this SASS binary on its Tensor Cores to get the result.
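Putting it together, the eager path looks like this from Python. This is a minimal sketch; the shapes and dtypes are arbitrary, and the comments summarize the pipeline described above rather than anything the code prints.

```python
import torch

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 256, device="cuda", dtype=torch.float16)

# This one eager-mode line triggers the whole Path A pipeline:
#   Python -> ATen dispatcher (C++) -> cuBLAS heuristic picks a pre-compiled kernel
#   -> driver JITs PTX to SASS (on first use) -> cudaLaunchKernel() -> Tensor Cores
c = torch.matmul(a, b)

# Kernel launches are asynchronous: the Python call returns as soon as the work
# is queued, so we synchronize before trusting any timing measurements.
torch.cuda.synchronize()
```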
Kernel Launch Overhead:
If your code is `torch.relu(torch.matmul(a, b))`, this path runs TWO full, separate kernels:

- Run the `matmul` kernel and write the entire intermediate result to GPU memory.
- Run the `relu` kernel: read that entire result back from memory, apply `relu`, and write it out again.

This memory round-trip is extremely slow.
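One way to see the two separate kernel launches is with the PyTorch profiler. A minimal sketch (the kernel names and the exact sort key vary by PyTorch/CUDA version, so treat the output details as illustrative):

```python
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

# Warm up once so one-time costs (driver JIT, library handle creation) are excluded.
torch.relu(torch.matmul(a, b))
torch.cuda.synchronize()

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CUDA]) as prof:
    torch.relu(torch.matmul(a, b))
    torch.cuda.synchronize()

# Expect roughly two distinct GPU kernels in the table: a cuBLAS GEMM and a
# separate elementwise relu kernel, with a full memory round-trip between them.
print(prof.key_averages().table(sort_by="cuda_time_total"))
```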
Path B: The “Compiled” Path (Using @torch.compile)
This is the modern, high-performance path. Its entire goal is to solve the bottleneck of Path A through operator fusion.
The Workflow:
- You (Python): You add `@torch.compile` to your function.
- TorchDynamo (Frontend): This is a “tracer.” It runs your dynamic Python code once and captures a static graph of all the PyTorch operations.
  - Symbolic Shapes: It’s smart enough to use “symbols” (like `s0` for batch size) so the same compiled graph can work for `batch_size=8` or `batch_size=16`.
  - Graph Breaks: If it sees Python that is too dynamic (like a complex, data-dependent `if` statement), it “breaks” the graph, runs that part in normal Python, and then starts a new graph (see the first sketch after this list).
- TorchInductor (Backend): This is the optimizer. It takes the graph from Dynamo and finds ways to make it faster.
  - Operator Fusion: This is its main job. It sees the `matmul -> relu` graph and decides to fuse them into one single kernel, which eliminates the memory round-trip.
  - Heuristics: If there are many fusion options (e.g., `(a+b) + (c+d)`), it uses a scoring system to pick the fusion that minimizes memory I/O.
  - Why JIT? Combinatorial explosion. It’s impossible to pre-compile a library of all possible fusions (`matmul+relu`, `matmul+bias+gelu`, etc.). TorchInductor JIT-compiles the exact kernel you need, when you need it.
- Triton (Compiler): TorchInductor uses Triton to generate this new, fused kernel (a minimal Triton kernel sketch follows this list).
  - Triton is both a language (with Python syntax) and a compiler.
  - It is NOT CUDA C++. The Triton compiler uses LLVM (a general-purpose compiler toolchain) to generate PTX directly.
- Driver & GPU: From here, the path is the same as before. The driver takes the new, custom-fused PTX code from Triton, compiles it to SASS, and the GPU runs it.
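A minimal sketch of both ideas (fusion and graph breaks), assuming a recent PyTorch with CUDA. The function names are illustrative; whether the `matmul` and `relu` really end up in one kernel depends on Inductor’s heuristics and settings.

```python
import torch

# "max-autotune" lets Inductor consider Triton GEMM templates, which makes it
# possible to fold the relu into the matmul kernel's epilogue instead of
# launching a second kernel (exact codegen depends on shapes and hardware).
@torch.compile(mode="max-autotune")
def matmul_relu(a, b):
    return torch.relu(torch.matmul(a, b))

@torch.compile
def with_graph_break(x):
    if x.sum() > 0:      # data-dependent Python branch: typically forces a graph break,
        return x * 2     # so Dynamo compiles the code before and after it separately
    return x - 1

a = torch.randn(1024, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 256, device="cuda", dtype=torch.float16)

out = matmul_relu(a, b)  # first call pays the JIT compile cost; later calls reuse the kernel
_ = with_graph_break(torch.randn(8, device="cuda"))

# To inspect the Triton code Inductor actually generated, one option is:
#   TORCH_LOGS="output_code" python this_script.py
```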
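For a feel of what a fused kernel looks like, here is a hand-written Triton kernel that fuses an add and a relu into a single pass over memory. This is a minimal, self-contained sketch (the names `add_relu_kernel`/`add_relu` and the block size are arbitrary), not the code Inductor would emit:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # The add and the relu happen in registers: one kernel, and the intermediate
    # sum is never written out to global memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
torch.testing.assert_close(add_relu(x, y), torch.relu(x + y))
```

The Triton compiler lowers this Python-syntax kernel through its LLVM-based pipeline straight to PTX; no CUDA C++ is involved.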
Main Benefit: Performance & Portability
- Performance: You get roughly 95% of the speed of a hand-tuned kernel essentially for free, largely by eliminating redundant memory round-trips.
- Portability: `torch.compile` is a standard “frontend,” and `TorchInductor` is a “pluggable backend.” It can use Triton for NVIDIA GPUs, a C++ backend for CPUs, or (in the future) a Triton backend for AMD GPUs. The backend can also be selected explicitly, as sketched below.
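A small sketch of what “pluggable backend” means in practice (backend availability depends on your PyTorch build; `inductor` is the default):

```python
import torch
import torch._dynamo

def f(x, y):
    return torch.relu(x @ y)

# The same frontend (Dynamo) can hand its captured graph to different backends.
f_inductor = torch.compile(f, backend="inductor")  # default: Triton on GPU, C++/OpenMP on CPU
f_eager    = torch.compile(f, backend="eager")     # debug backend: runs the graph with eager kernels

# List the backends registered in this environment.
print(torch._dynamo.list_backends())
```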
Path C: The “Hand-Tuned” Kernel Path
This is the “speed-of-light” path. It’s not a path most developers take, but it’s crucial for understanding the performance landscape.
- What it is: Instead of using an automated tool like `torch.compile`, a team of expert engineers manually writes a bespoke CUDA kernel (or the equivalent for other hardware).
- The Goal: This kernel is perfectly, painstakingly optimized for one specific model (like DeepSeek) running on one specific piece of hardware (like an H100 GPU).
The Performance vs. Effort Trade-off
This is the core dilemma for high-performance computing.
- Path B (`torch.compile`): Gives you ~95% of the maximum possible performance for ~0% engineering effort. It’s fast, flexible, and automated, and it provides the best performance-per-engineer-hour.
- Path C (Hand-Tuned): Lets you chase the final 5-10% of performance, but at an astronomical engineering cost.
Why Do It? When you are a company operating at a massive scale, that “tiny” 5% speedup on your main model saves millions of dollars in hardware and energy costs. The engineering cost is worth it.
It’s not for everyone
- Insane Engineering Cost: It can take a team of world-class kernel engineers months to write and debug a single kernel.
- Extremely “Brittle”: The kernel is only good for that exact situation. It becomes sub-optimal or completely breaks if you:
- Change the model architecture (e.g., DeepSeek-v2).
- Change the hardware (e.g., from an H100 to a B200).
- Even just change the batch size or input shape.
For 99.9% of all developers and researchers, Path B is the clear winner. It makes your code fast without requiring you to become an elite GPU kernel developer.