Why some code runs 100× faster on a graphics card — and how to write it.
Both CPU and GPU are processors, but they were designed for completely different jobs. Understanding that design is the entire explanation for why one crushes the other at certain tasks.
CPU: designed for latency-sensitive, serial work — execute a single complex task as fast as possible.
GPU: designed for throughput via massive parallelism — run thousands of simple tasks simultaneously.
A CPU is a Formula 1 car — a handful of incredibly fast, smart, independently capable engines. A GPU is a freight train with thousands of small engines — each one simple, but together they move an enormous amount simultaneously. If your job is to carry 10,000 identical boxes, you want the train.
Each square below represents one execution core. Notice how many more workers a GPU fields.
When a workload is dominated by branching if/else logic, the GPU's advantage shrinks dramatically: threads in a group that take different branches must execute each path in turn.
Multiplying two N×N matrices requires N³ multiply-add operations. For N = 4,096 that's ~68 billion operations. This is exactly the workload that defines deep learning — and why GPUs dominate it.
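As a quick sanity check on that figure, the operation count is pure arithmetic:

```python
# Naive N×N matrix multiplication does N³ multiply-add operations
N = 4096
ops = N ** 3
print(f"{ops:,} multiply-adds")  # 68,719,476,736 ≈ 68.7 billion
```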
Explicit nested loops. Every iteration is a separate Python bytecode dispatch. Painfully slow.
```python
import random
import time

N = 512  # keep small — this is O(N³)

# Build two NxN matrices filled with random floats
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

t0 = time.perf_counter()
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]
elapsed = time.perf_counter() - t0

print(f"Pure Python: {elapsed:.2f}s (N={N})")
# Typical output: ~70–120 seconds for N=512
```
NumPy delegates to BLAS/LAPACK — highly optimized C/Fortran routines. Single call, vectorized internally, uses all CPU cores via OpenBLAS.
```python
import time

import numpy as np

N = 4096  # now we can afford a much bigger matrix
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# warm-up (first call can be slower due to BLAS init)
_ = np.dot(A[:64, :64], B[:64, :64])

t0 = time.perf_counter()
C = np.matmul(A, B)  # one line — BLAS handles everything
elapsed = time.perf_counter() - t0

print(f"NumPy (CPU): {elapsed*1000:.1f} ms (N={N})")
# Typical output: ~600–1200 ms on a modern 8-core CPU
```
Move the data to VRAM once, call torch.matmul. PyTorch dispatches cuBLAS — NVIDIA's GPU-tuned BLAS. The GPU runs thousands of multiplications in parallel using Tensor Cores.
```python
import time

import torch

N = 4096

# ── host (CPU) tensors ──────────────────────────────────
A_cpu = torch.rand(N, N, dtype=torch.float32)
B_cpu = torch.rand(N, N, dtype=torch.float32)

# ── move to device (GPU VRAM) ───────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
A = A_cpu.to(device)
B = B_cpu.to(device)

# warm-up — first kernel launch has JIT overhead
_ = torch.matmul(A[:64, :64], B[:64, :64])
if device == "cuda":
    torch.cuda.synchronize()  # wait for GPU to finish

t0 = time.perf_counter()
C = torch.matmul(A, B)
if device == "cuda":
    torch.cuda.synchronize()  # ← CRITICAL: GPU is async; sync before timing
elapsed = time.perf_counter() - t0

print(f"PyTorch GPU: {elapsed*1000:.1f} ms (N={N})")
# Typical output: ~4–12 ms on a modern GPU (RTX 3090 / A100)
# That's ~100× faster than NumPy CPU
```
* Bar width is normalized to show relative time (longer = slower). Pure Python is off the chart.
Before reaching for a GPU, there's a lot of headroom left on a multi-core CPU. The pure-Python triple loop uses exactly one core. Here's how to use all of them — and what stops you.
CPython has a Global Interpreter Lock (GIL): only one thread can execute Python bytecode at a time. This means threading.Thread gives you concurrency (interleaved execution) but not true parallelism for CPU-bound loops. Threads help with I/O waits, not arithmetic.
Spawning 8 threads on a pure-Python loop gives you no speedup — the GIL serialises them. May even be slower due to lock contention overhead.
Each worker is a separate Python interpreter with its own GIL. True parallelism — one worker per core. Cost: data must be pickled and sent to each worker.
Each row i of the output matrix C depends only on row i of A and all of B. Rows are completely independent of each other — a textbook case of embarrassingly parallel work.
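That independence is easy to verify: compute the rows in any order — even shuffled — and the result is identical. A minimal sketch:

```python
import random

N = 16
random.seed(0)
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

def row_of_product(i):
    # Row i of C needs only row i of A plus all of B
    return [sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]

order = list(range(N))
random.shuffle(order)   # any schedule works — rows never interact
C = [None] * N
for i in order:
    C[i] = row_of_product(i)

serial = [row_of_product(i) for i in range(N)]
print(C == serial)  # True — safe to hand rows to independent workers
```

Because no row reads or writes another row's data, the scheduler (threads, processes, or a cluster) is free to assign rows however it likes.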
```python
import random
import threading
import time

N = 128
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

def compute_row(i):
    for j in range(N):
        s = 0.0
        for k in range(N):
            s += A[i][k] * B[k][j]
        C[i][j] = s

t0 = time.perf_counter()
threads = [threading.Thread(target=compute_row, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threading: {time.perf_counter()-t0:.2f}s")
# ⚠ No speedup — GIL serialises all threads on pure-Python loops
# May even be SLOWER than serial due to lock contention
```
```python
import os
import random
import time
from concurrent.futures import ProcessPoolExecutor

N = 256
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

# Each worker gets ONE row index + the full A, B matrices.
# B is pickled and sent to every worker — the main overhead.
def compute_row(args):
    i, A_w, B_w, n = args
    row = [0.0] * n
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A_w[i][k] * B_w[k][j]
        row[j] = s
    return i, row

if __name__ == "__main__":  # required: child processes re-import this module
    n_cores = os.cpu_count()  # use ALL cores on this machine
    print(f"Launching {n_cores} workers")

    t0 = time.perf_counter()
    C = [[0.0] * N for _ in range(N)]
    with ProcessPoolExecutor(max_workers=n_cores) as ex:
        args = [(i, A, B, N) for i in range(N)]
        # chunksize batches rows to reduce pickling round-trips
        for i, row in ex.map(compute_row, args, chunksize=N // n_cores):
            C[i] = row
    elapsed = time.perf_counter() - t0

    print(f"ProcessPool ({n_cores} cores): {elapsed:.2f}s")
    # Typical speedup: ~(n_cores - 1)× after pickling overhead
    # 8 cores → ~5–6× faster than serial pure-Python
```
When you call np.matmul(A, B), Python hands off to OpenBLAS (or Intel MKL), a library written in hand-tuned C and assembly. OpenBLAS does three things your Python loop never could: it tiles the matrix to fit the CPU cache, it spawns OS threads (not Python threads — real pthreads that bypass the GIL entirely) to work on tiles in parallel, and it uses SIMD instructions to process several floats per cycle inside each core.
Imagine you need to multiply two giant paper sheets together — each one covered in millions of numbers. Your pure-Python loop is one person who walks to the sheet, reads one number, walks back to their desk to calculate, then walks back again for the next — one cell at a time, on foot, every single time.

OpenBLAS does three smarter things. First, it cuts each sheet into small sections that can be carried back to your desk in one trip — so you're not walking back and forth for every number (that's tiling: keeping data in cache, not fetching from RAM each time). Second, it assigns each section to a different person, all working at the same time — no waiting for anyone else to finish (that's one OS thread per core). Third, each person uses a calculator that does 8 multiplications per button-press instead of one (that's SIMD/AVX).

The numbers are the same. The arithmetic is the same. But with many people, small sections, and fast calculators — it's done in a fraction of the time, and Python never had to know any of it happened.
The matrix is cut into tiles ~256×256 (tuned to fit L2 cache). A core can load one tile, crunch it fully, and never stall waiting for RAM.
OpenBLAS spawns one pthread per core. Each thread owns a horizontal band of tiles in C — no coordination needed mid-compute.
Within each tile, AVX2/AVX-512 instructions multiply 8–16 floats per clock cycle simultaneously. This is the innermost speedup — happening inside a single core.
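Of these three mechanisms, tiling is the one you can demonstrate in pure Python. Below is a minimal sketch of cache-blocked multiplication; the tile size is illustrative (real BLAS tunes it to the cache), and the point is that only the traversal order changes — the answer does not.

```python
import random

N, TILE = 32, 8  # tiny sizes for illustration; BLAS uses ~256×256 tiles
random.seed(1)
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

# Naive: walks all of B for every output cell — cache-hostile at scale
naive = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
         for i in range(N)]

# Tiled: process TILE×TILE blocks so each block stays hot in cache
C = [[0.0] * N for _ in range(N)]
for ii in range(0, N, TILE):
    for jj in range(0, N, TILE):
        for kk in range(0, N, TILE):
            for i in range(ii, ii + TILE):
                for j in range(jj, jj + TILE):
                    s = 0.0
                    for k in range(kk, kk + TILE):
                        s += A[i][k] * B[k][j]
                    C[i][j] += s

# Same arithmetic, same answer — only the memory-access order changed
ok = all(abs(C[i][j] - naive[i][j]) < 1e-9
         for i in range(N) for j in range(N))
print(ok)
```

In Python the reordering buys nothing (every access is slow anyway); in C, where a cached load is orders of magnitude cheaper than a RAM fetch, this access pattern is most of the speedup.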
```python
import time

import numpy as np

# See what BLAS backend numpy is using
np.show_config()  # look for openblas_info or blas_mkl_info

N = 4096
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

t0 = time.perf_counter()
C = np.matmul(A, B)
# ↑ Internally: OpenBLAS tiles A & B, spawns one pthread per core,
#   each thread works a band of tiles with AVX2 SIMD.
#   No GIL. No pickling. All in C/assembly.
elapsed = time.perf_counter() - t0
print(f"NumPy matmul: {elapsed*1000:.1f} ms")

# Control how many cores OpenBLAS uses:
#   $ OMP_NUM_THREADS=4 python matmul_numpy_parallel.py
# Or from Python — but it must be set BEFORE numpy is imported:
#   import os
#   os.environ["OMP_NUM_THREADS"] = "4"
#   import numpy as np
```
High-end servers have 2–4 physical CPU sockets, each a separate chip with its own cores and L3 cache, connected via a high-speed interconnect (AMD Infinity Fabric, Intel UPI). Accessing memory attached to a different socket incurs a latency penalty — this is called NUMA (Non-Uniform Memory Access).
CPU 0's cores are workers on floor 1 — they share a filing cabinet (L3 cache) right next to their desks. CPU 1's cores are on floor 2 with their own cabinet. If a floor-1 worker needs a file from floor 2's cabinet, they must take the lift — that's the NUMA penalty, typically 2–4× slower than local memory access. The fix: pin each process to one socket and keep its data local.
```python
# ── Check NUMA topology (Linux) ─────────────────────────
# $ numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7       node 0 size: 64338 MB
# node 1 cpus: 8 9 10 11 12 13 14 15  node 1 size: 64486 MB

# ── Pin a process to socket 0 only ──────────────────────
# $ numactl --cpunodebind=0 --membind=0 python my_script.py

# ── In Python: split work across sockets explicitly ─────
import os
from concurrent.futures import ProcessPoolExecutor

CORES_PER_SOCKET = 8  # adjust for your hardware
N_SOCKETS = 2

# Give socket 0 the top half of rows, socket 1 the bottom half.
# Each ProcessPoolExecutor is limited to one socket's cores,
# so the OS keeps memory access local.
def run_socket(socket_id, row_start, row_end):
    # Pin this process to the right cores (Linux)
    core_start = socket_id * CORES_PER_SOCKET
    core_end = core_start + CORES_PER_SOCKET - 1
    os.sched_setaffinity(0, range(core_start, core_end + 1))
    # ... compute rows row_start:row_end using those cores ...

# In practice, use numactl at the shell level — simpler and more reliable:
# $ numactl --cpunodebind=0 python worker.py --rows 0:2048 &
# $ numactl --cpunodebind=1 python worker.py --rows 2048:4096 &
```
Once you cross machine boundaries, shared memory is gone. Workers communicate over a network. Two dominant Python options:
Ray: decorate functions with @ray.remote. Ray handles scheduling, serialisation, and fault tolerance. Best for ML pipelines and heterogeneous tasks.
mpi4py (MPI): explicit send()/recv() calls. Near-zero overhead on InfiniBand interconnects. Standard in supercomputing. Steep learning curve.
```python
import numpy as np
import ray

# ray.init() on a laptop uses local cores
# ray.init(address="auto") connects to a Ray cluster
ray.init()

N = 4096
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# Put B in the shared object store ONCE — all workers read it
# without a separate copy per worker (unlike ProcessPoolExecutor)
B_ref = ray.put(B)

@ray.remote
def compute_block(A_block, B_local):
    # Ray resolves the ObjectRef passed as an argument: B_local
    # arrives as a plain ndarray fetched from the object store
    return A_block @ B_local  # NumPy BLAS on each node

# Split A into row-blocks — one task per available node/core
n_workers = int(ray.cluster_resources().get("CPU", 4))
block = N // n_workers
futures = [
    compute_block.remote(A[i : i + block], B_ref)
    for i in range(0, N, block)
]

C_blocks = ray.get(futures)  # collect results
C = np.vstack(C_blocks)      # reassemble full matrix
print(f"C shape: {C.shape}")  # (4096, 4096)

# Key design: B broadcast once, A partitioned by rows.
# No inter-worker communication during compute — only at scatter/gather.
```
| Scenario | Tool | Why |
|---|---|---|
| Pure-Python loop, 1 machine | ProcessPoolExecutor | Bypasses GIL; one process per core |
| NumPy arrays, 1 machine | np.matmul (just use it) | OpenBLAS already uses all cores; faster than manual multiprocessing |
| Multi-socket server, memory-sensitive | numactl + ProcessPool | Pin workers to sockets to avoid NUMA penalty |
| Multiple machines, ML workloads | Ray | Easy shared object store; handles scheduling automatically |
| Multiple machines, HPC / MPI cluster | mpi4py | Lowest overhead; standard in scientific computing |
| threading for compute | Avoid | GIL prevents true parallelism in pure-Python compute |
np.matmul beats hand-parallelised Python every time — OpenBLAS is written in hand-tuned assembly. Python-level multiprocessing only wins when your work can't be expressed as vectorised NumPy operations.
Three compounding factors drive the speedup:
Each output cell C[i,j] = sum(A[i,:] × B[:,j]) is independent of every other cell. A GPU with 6,912 CUDA cores computes thousands of cells simultaneously; an 8-core CPU works on a handful at a time.
VRAM bandwidth on an A100 is ~2 TB/s. System RAM bandwidth is ~50–80 GB/s. Keeping thousands of cores fed means streaming operand tiles constantly, and many tensor operations (element-wise ops, reductions) are outright memory-bound — raw bandwidth translates directly to speed.
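A rough back-of-envelope using the bandwidth figures quoted above (64 GB/s is an illustrative midpoint of the 50–80 GB/s range) shows the gap: even just streaming the operands of a 4,096² float32 matmul once takes far longer from system RAM than from VRAM.

```python
N = 4096
BYTES_PER_FLOAT32 = 4

# Bytes to read A and B plus write C, once each — a lower bound;
# without tiling, B alone could be re-read up to N times
traffic = 3 * N * N * BYTES_PER_FLOAT32
print(f"Minimum traffic: {traffic / 1e6:.0f} MB")

RAM_BW = 64e9    # ~64 GB/s system RAM (illustrative midpoint)
VRAM_BW = 2e12   # ~2 TB/s A100 HBM (figure from the text)
print(f"RAM:  {traffic / RAM_BW * 1e3:.2f} ms just to move the data")
print(f"VRAM: {traffic / VRAM_BW * 1e3:.3f} ms")
```

The ~30× bandwidth ratio shows up directly in these transfer times, before any arithmetic advantage is counted.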
Modern NVIDIA GPUs have dedicated Tensor Core units that compute a 4×4 matrix multiply in a single clock cycle. They're specifically wired for the A×B+C pattern that defines every neural network layer.
Even with BLAS, the CPU must orchestrate memory reads, cache misses, and core coordination. For large matrices the data simply can't reach the cores fast enough from RAM.
The GPU is a tool, not a universal accelerator. Apply it where data parallelism naturally exists.
| Workload | GPU? | Why |
|---|---|---|
| Deep learning training (PyTorch, TensorFlow) | Excellent | Matrix multiplies everywhere; batches of independent samples |
| Large matrix / vector ops (NumPy → CuPy) | Excellent | Element-wise and reduction ops are perfectly parallel |
| Monte Carlo simulation (many independent paths) | Very good | Paths are independent → launch one thread per path |
| Image / signal processing (convolution, FFT) | Very good | Each pixel/sample computed independently; cuFFT is tuned |
| Graph algorithms (BFS, PageRank) | Good | Works well with sparse libraries (cuSPARSE, cuGraph) |
| Database queries / SQL | Good | Aggregations over large tables (see cuDF / RAPIDS) |
| LP/MIP solvers (Gurobi, OR-Tools) | Limited | Branch-and-bound is inherently serial; GPU helps preprocessing only |
| Web servers / API handlers | No | I/O bound, low compute, lots of branching logic |
| Recursive algorithms (DFS, dynamic programming) | No | Each step depends on the previous; little parallelism to exploit |
| Very small arrays (N < ~1000) | Often no | Data-transfer overhead > compute savings; CPU wins |
You don't need to write CUDA kernels. The ecosystem has done that for you.
Replace import numpy as np with import cupy as cp. API is nearly identical. Arrays live in VRAM.
All operations on .to("cuda") tensors run on GPU automatically. The autograd engine also runs there.
GPU-accelerated DataFrames. Same pandas API, but groupby/merge/sort on millions of rows runs in milliseconds.
Write custom GPU kernels in Python with @cuda.jit. Best for irregular workloads that libraries don't cover.
Use torch.profiler or nsight to find the true bottleneck. Often it's data loading, not compute.
CPU↔GPU copies are expensive. Move data once, do all work on-device, bring results back at the end only.
```python
import time

import cupy as cp  # pip install cupy-cuda12x
import numpy as np

N = 100_000_000  # 100 million samples

# ── CPU version ─────────────────────────────────────────
t0 = time.perf_counter()
x, y = np.random.rand(N), np.random.rand(N)
inside = (x**2 + y**2) <= 1.0
pi_cpu = 4 * inside.mean()
t_cpu = time.perf_counter() - t0

# ── GPU version ─────────────────────────────────────────
t0 = time.perf_counter()
x, y = cp.random.rand(N), cp.random.rand(N)  # generated in VRAM
inside = (x**2 + y**2) <= 1.0
pi_gpu = 4 * inside.mean()
cp.cuda.Stream.null.synchronize()  # wait for GPU
t_gpu = time.perf_counter() - t0

print(f"CPU π ≈ {float(pi_cpu):.6f} in {t_cpu:.2f}s")
print(f"GPU π ≈ {float(pi_gpu):.6f} in {t_gpu:.3f}s")
print(f"Speedup: {t_cpu/t_gpu:.1f}×")
# Typical: CPU ~3.5s, GPU ~0.08s → ~40× speedup
```
"Can I apply the same operation
to a million independent things at once?"
If yes → GPU will likely help.
If no → stick with a fast CPU library.
The cost model is: transfer penalty + kernel launch overhead vs compute savings. For large, regular, embarrassingly parallel problems the savings dwarf the overhead. For small, branchy, sequential problems the overhead wins and you go slower.
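That cost model can be sketched numerically. Every constant below is an illustrative assumption (PCIe bandwidth, launch overhead, sustained throughput), not a measurement — the point is the shape of the curve: below some N, transfer plus launch overhead outweighs the compute savings.

```python
# Illustrative cost model for offloading an N×N float32 matmul.
# All constants are assumptions for the sketch, not measurements.
PCIE_BW = 16e9            # ~16 GB/s host↔device transfer
FIXED_OVERHEAD = 50e-6    # ~50 µs kernel launch + allocation
CPU_FLOPS = 5e11          # ~0.5 TFLOP/s sustained multi-core CPU
GPU_FLOPS = 5e13          # ~50 TFLOP/s sustained GPU

def cpu_time(n):
    return 2 * n**3 / CPU_FLOPS  # 2N³ FLOPs, no transfer

def gpu_time(n):
    transfer = 3 * n * n * 4 / PCIE_BW  # A, B over; C back
    compute = 2 * n**3 / GPU_FLOPS
    return transfer + FIXED_OVERHEAD + compute

for n in (64, 256, 1024, 4096):
    winner = "GPU" if gpu_time(n) < cpu_time(n) else "CPU"
    print(f"N={n:5d}: CPU {cpu_time(n)*1e3:8.3f} ms  "
          f"GPU {gpu_time(n)*1e3:8.3f} ms  → {winner}")
```

Under these assumptions the crossover lands somewhere in the hundreds-to-low-thousands range of N, which matches the "very small arrays: CPU wins" row in the table above; the exact break-even point depends entirely on your hardware.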
Most real-world acceleration in ML and scientific computing comes not from writing CUDA directly, but from restructuring your problem as batched tensor operations and letting libraries like PyTorch, CuPy, or RAPIDS handle the GPU dispatch.