A Practical Python Tutorial

CPU
vs
GPU

Why some code runs 100× faster on a graphics card — and how to write it.

01 — Architecture

The fundamental difference

Both CPU and GPU are processors, but they were designed for completely different jobs. Understanding that design is the entire explanation for why one crushes the other at certain tasks.

CPU

Central Processing Unit

Designed for latency-sensitive, serial work — execute a single complex task as fast as possible.

Cores: 4 – 64
Clock speed: 3 – 5 GHz
Cache (L3): 16 – 256 MB
Branch prediction: Deep / aggressive
Best at: Serial logic
GPU

Graphics Processing Unit

Designed for throughput — massive parallelism — run thousands of simple tasks simultaneously.

Cores (CUDA): 1,000 – 16,000+
Clock speed: 1.5 – 2.5 GHz
VRAM bandwidth: ~2 TB/s (H100)
Branch prediction: Minimal
Best at: Data parallelism
🏎️

The racing analogy

A CPU is a Formula 1 car — a handful of incredibly fast, smart, independently capable engines. A GPU is a freight train with thousands of small engines — each one simple, but together they move an enormous amount simultaneously. If your job is to carry 10,000 identical boxes, you want the train.

02 — Parallelism

Visualizing core counts

Picture one square per execution core and compare the two grids: the CPU fields a handful of workers, the GPU fields thousands.

CPU — 8 cores
GPU — 4,096 CUDA cores (e.g. an NVIDIA A100 has 6,912)
Key insight: GPU cores are not smarter — they are simpler. Each one only excels when given the same instruction applied to different data (SIMD — Single Instruction, Multiple Data). If your algorithm has lots of branching if/else logic, the GPU's advantage shrinks dramatically.
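The same principle shows up on the CPU side: vectorised code also wants branch-free, same-instruction form. A small illustrative sketch — a per-element if/else rewritten as a single `np.where`, which evaluates both arms for every element and selects with a mask, exactly the SIMD-friendly shape GPU hardware wants too:

```python
import numpy as np

x = np.arange(-5.0, 5.0, 1.0)   # 10 values, negative and positive

# Branchy form: a Python-level if/else runs once per element, serially
branchy = np.array([v * 2.0 if v > 0 else v / 2.0 for v in x])

# Branch-free form: one instruction applied to all the data at once.
# Both arms are computed for every element; a boolean mask picks the result.
vectorised = np.where(x > 0, x * 2.0, x / 2.0)

assert np.allclose(branchy, vectorised)
print(vectorised)
```

The trade is visible here: the branch-free form does redundant work (both arms everywhere) but keeps every lane executing the same instruction — which is why heavily branchy algorithms lose most of the GPU's advantage.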
03 — The Core Example

Matrix multiplication: 1 operation, 2 worlds

Multiplying two N×N matrices requires N³ multiply-add operations. For N = 4,096 that's ~68 billion operations. This is exactly the workload that defines deep learning — and why GPUs dominate it.
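A quick sanity check on that operation count (counting the multiply and the add separately gives 2 FLOPs per multiply-add):

```python
N = 4096
macs = N ** 3                          # one multiply-add per (i, j, k) triple
print(f"{macs:,} multiply-adds")       # 68,719,476,736 ≈ 68.7 billion
# At ~10 million Python-loop iterations/second that's roughly 2 hours.
# A GPU sustaining 100 TFLOP/s finishes the ~137 GFLOP (2 * macs) in ~1.4 ms.
```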

❶ Pure Python (baseline)

Explicit nested loops. Every iteration is a separate Python bytecode dispatch. Painfully slow.

matmul_python.py
import time
import random

N = 512  # keep small — this is O(N³)

# Build two NxN matrices filled with random floats
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0.0]  * N       for _ in range(N)]

t0 = time.perf_counter()

for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]

elapsed = time.perf_counter() - t0
print(f"Pure Python: {elapsed:.2f}s  (N={N})")
# Typical output: ~70–120 seconds for N=512

❷ NumPy on CPU

NumPy delegates to BLAS/LAPACK — highly optimized C/Fortran routines. Single call, vectorized internally, uses all CPU cores via OpenBLAS.

matmul_numpy.py
import numpy as np
import time

N = 4096  # now we can afford a much bigger matrix

A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# warm-up (first call can be slower due to BLAS init)
_ = np.dot(A[:64,:64], B[:64,:64])

t0 = time.perf_counter()
C = np.matmul(A, B)           # one line — BLAS handles everything
elapsed = time.perf_counter() - t0

print(f"NumPy (CPU): {elapsed*1000:.1f} ms  (N={N})")
# Typical output: ~600–1200 ms on a modern 8-core CPU

❸ PyTorch on GPU (CUDA)

Move the data to VRAM once, call torch.matmul. PyTorch dispatches cuBLAS — NVIDIA's GPU-tuned BLAS. The GPU runs thousands of multiplications in parallel using Tensor Cores.

matmul_gpu.py
import torch
import time

N = 4096

# ── host (CPU) tensors ──────────────────────────────────
A_cpu = torch.rand(N, N, dtype=torch.float32)
B_cpu = torch.rand(N, N, dtype=torch.float32)

# ── move to device (GPU VRAM) ───────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
A = A_cpu.to(device)
B = B_cpu.to(device)

# warm-up — first kernel launch has JIT overhead
_ = torch.matmul(A[:64,:64], B[:64,:64])
if device == "cuda":
    torch.cuda.synchronize()   # wait for GPU to finish

t0 = time.perf_counter()
C = torch.matmul(A, B)
if device == "cuda":
    torch.cuda.synchronize()   # ← CRITICAL: GPU is async, must sync before timing
elapsed = time.perf_counter() - t0

print(f"PyTorch GPU: {elapsed*1000:.1f} ms  (N={N})")
# Typical output: ~4–12 ms on a modern GPU (RTX 3090 / A100)
# That's ~100× faster than NumPy CPU

Relative timings — N=4096 matrix multiply

🐍 Pure Python loops: ~70,000,000 ms (extrapolated from the N=512 run — roughly 19 hours)
CPU — NumPy (BLAS): ~800 ms
GPU — PyTorch (cuBLAS): ~8 ms

04 — CPU Parallelisation

Squeezing more out of the CPU

Before reaching for a GPU, there's a lot of headroom left on a multi-core CPU. The pure-Python triple loop uses exactly one core. Here's how to use all of them — and what stops you.

The GIL — Python's hidden serialiser

CPython has a Global Interpreter Lock: only one thread can execute Python bytecode at a time. This means threading.Thread gives you concurrency (interleaved execution) but not true parallelism for CPU-bound loops. Threads help with I/O waits, not arithmetic.

threading — misleading

Threads share the GIL

Spawning 8 threads on a pure-Python loop gives you no speedup — the GIL serialises them. May even be slower due to lock contention overhead.

Cores used: 1 (effective)
Memory: Shared
Good for: I/O, not compute
ProcessPoolExecutor — the fix

Processes bypass the GIL

Each worker is a separate Python interpreter with its own GIL. True parallelism — one worker per core. Cost: data must be pickled and sent to each worker.

Cores used: All N cores
Memory: Copied per worker
Good for: CPU-bound compute

Why rows? The independence argument

Each row i of the output matrix C depends only on row i of A and all of B. Rows are completely independent of each other — a textbook case of embarrassingly parallel work.

Row → core assignment (N=512, 8 cores): each core computes its own independent set of rows of C.
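That independence is easy to verify numerically: any row of C can be computed in isolation from one row of A plus all of B, and it matches the full product. A quick NumPy sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
A = rng.random((N, N))
B = rng.random((N, N))

C_full = A @ B          # the full product, all rows at once

# Compute row 17 on its own — it needs only A[17] and B
row17 = A[17] @ B

assert np.allclose(C_full[17], row17)
print("row 17 computed independently matches the full product")
```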

❶ threading — looks parallel, isn't

matmul_threads.py
import threading, time, random

N = 128
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0.0] * N for _ in range(N)]

def compute_row(i):
    for j in range(N):
        s = 0.0
        for k in range(N):
            s += A[i][k] * B[k][j]
        C[i][j] = s

t0 = time.perf_counter()
threads = [threading.Thread(target=compute_row, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(f"Threading: {time.perf_counter()-t0:.2f}s")
# ⚠ No speedup — GIL serialises all threads on pure-Python loops
# May even be SLOWER than serial due to lock contention

❷ ProcessPoolExecutor — true multi-core

matmul_multiprocess.py
import os, time, random
from concurrent.futures import ProcessPoolExecutor

N = 256
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

# Each worker gets ONE row index + the full A, B matrices
# B is pickled and sent to every worker — the main overhead
def compute_row(args):
    i, A_w, B_w, n = args
    row = [0.0] * n
    for j in range(n):
        s = 0.0
        for k in range(n):
            s += A_w[i][k] * B_w[k][j]
        row[j] = s
    return i, row

n_cores = os.cpu_count()   # use ALL cores on this machine
print(f"Launching {n_cores} workers")

t0 = time.perf_counter()
C = [[0.0] * N for _ in range(N)]

with ProcessPoolExecutor(max_workers=n_cores) as ex:
    args = [(i, A, B, N) for i in range(N)]
    # chunksize batches rows to reduce pickling round-trips
    for i, row in ex.map(compute_row, args, chunksize=max(1, N // n_cores)):
        C[i] = row

elapsed = time.perf_counter() - t0
print(f"ProcessPool ({n_cores} cores): {elapsed:.2f}s")
# Typical speedup: ~(n_cores - 1)× after pickling overhead
# 8 cores → ~5–6× faster than serial pure-Python

❸ NumPy — the right answer for one machine

When you call np.matmul(A, B), Python hands off to OpenBLAS (or Intel MKL), a library written in hand-tuned C and assembly. OpenBLAS does two things your Python loop never could: it tiles the matrix to fit the CPU cache, and it spawns OS threads (not Python threads — real pthreads that bypass the GIL entirely) to work on tiles in parallel.

🍕

In plain English

Imagine you need to multiply two giant paper sheets together — each one covered in millions of numbers. Your pure-Python loop is one person who walks to the sheet, reads one number, walks back to their desk to calculate, then walks back again for the next — one cell at a time, on foot, every single time.

OpenBLAS does three smarter things. First, it cuts each sheet into small sections that can be carried back to your desk in one trip — so you're not walking back and forth for every number (that's tiling: keeping data in cache, not fetching from RAM each time). Second, it assigns each section to a different person, all working at the same time — no waiting for anyone else to finish (that's one OS thread per core). Third, each person uses a calculator that does 8 multiplications per button-press instead of one (that's SIMD/AVX).

The numbers are the same. The arithmetic is the same. But with many people, small sections, and fast calculators — it's done in a fraction of the time, and Python never had to know any of it happened.

STEP 1 — TILE

Slice into cache-sized blocks

The matrix is cut into tiles ~256×256 (tuned to fit L2 cache). A core can load one tile, crunch it fully, and never stall waiting for RAM.

STEP 2 — THREAD

Assign tile rows to OS threads

OpenBLAS spawns N pthreads (one per core). Each thread owns a horizontal band of tiles in C — no coordination needed mid-compute.

STEP 3 — SIMD

Vectorise inside each tile

Within each tile, AVX2/AVX-512 instructions multiply 8–16 floats per clock cycle simultaneously. This is the innermost speedup — happening inside a single core.
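The tiling step can be sketched in pure NumPy — far slower than BLAS, but it shows the access pattern: each tile of C accumulates products of cache-sized tiles of A and B. The tile size here is arbitrary and purely for illustration:

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: work on one cache-sized tile at a time."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # One tile-sized multiply-accumulate. In OpenBLAS this inner
                # product is what runs as SIMD code on a single core.
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C

rng = np.random.default_rng(1)
A = rng.random((256, 256))
B = rng.random((256, 256))

assert np.allclose(tiled_matmul(A, B), A @ B)
print("tiled result matches np.matmul")
```

Every element of a loaded tile is reused `tile` times before being evicted — that reuse, not any change in arithmetic, is what tiling buys.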

HOW OPENBLAS TILES & ASSIGNS TO CORES

Why this beats your Python ProcessPool: No pickling. No subprocess startup. Threads share the same memory — B is read once from RAM, cached in L3, and all cores read their tiles directly. The Python GIL doesn't apply because OpenBLAS is pure C executing below Python's runtime.
matmul_numpy_parallel.py
import numpy as np
import time

# See what BLAS backend numpy is using
np.show_config()   # look for openblas_info or blas_mkl_info

N = 4096
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

t0 = time.perf_counter()
C = np.matmul(A, B)
# ↑ Internally: OpenBLAS tiles A & B, spawns one pthread per core,
#   each thread works a band of tiles with AVX2 SIMD.
#   No GIL. No pickling. All in C/assembly.
elapsed = time.perf_counter() - t0
print(f"NumPy matmul: {elapsed*1000:.1f} ms")

# Control how many cores OpenBLAS uses:
# $ OMP_NUM_THREADS=4 python matmul_numpy_parallel.py
# Or from Python — but the variable must be set BEFORE numpy is
# first imported, i.e. at the very top of the script:
#   import os
#   os.environ["OMP_NUM_THREADS"] = "4"
#   import numpy as np

Multiple CPUs (sockets) on one machine

High-end servers have 2–4 physical CPU sockets, each a separate chip with its own cores and L3 cache, connected via a high-speed interconnect (AMD Infinity Fabric, Intel UPI). Accessing memory attached to a different socket incurs a latency penalty — this is called NUMA (Non-Uniform Memory Access).

🏢

The office floor analogy

CPU 0's cores are workers on floor 1 — they share a filing cabinet (L3 cache) right next to their desks. CPU 1's cores are on floor 2 with their own cabinet. If a floor-1 worker needs a file from floor 2's cabinet, they must take the lift — that's the NUMA penalty, typically 2–4× slower than local memory access. The fix: pin each process to one socket and keep its data local.

numa_aware.py
# ── Check NUMA topology (Linux) ──────────────────────────
# $ numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7
# node 0 size: 64338 MB
# node 1 cpus: 8 9 10 11 12 13 14 15
# node 1 size: 64486 MB

# ── Pin a process to socket 0 only ──────────────────────
# $ numactl --cpunodebind=0 --membind=0 python my_script.py

# ── In Python: split work across sockets explicitly ─────
import os
from concurrent.futures import ProcessPoolExecutor

CORES_PER_SOCKET = 8   # adjust for your hardware
N_SOCKETS        = 2

# Give socket 0 the top half of rows, socket 1 the bottom half
# Each ProcessPoolExecutor is limited to one socket's cores
# so the OS keeps memory access local

def run_socket(socket_id, row_start, row_end):
    # Pin this process to the right cores (Linux)
    core_start = socket_id * CORES_PER_SOCKET
    core_end   = core_start + CORES_PER_SOCKET - 1
    os.sched_setaffinity(0, range(core_start, core_end + 1))
    # ... compute rows row_start:row_end using those cores ...
    pass

# In practice, use numactl at the shell level — simpler and more reliable
# $ numactl --cpunodebind=0 python worker.py --rows 0:2048 &
# $ numactl --cpunodebind=1 python worker.py --rows 2048:4096 &

Multiple machines — distributed computing

Once you cross machine boundaries, shared memory is gone. Workers communicate over a network. Two dominant Python options:

Ray — Python-native

Easiest to write

Decorate functions with @ray.remote. Ray handles scheduling, serialisation, and fault tolerance. Best for ML pipelines and heterogeneous tasks.

Overhead: Medium
Fault tolerance: Built-in
mpi4py — HPC classic

Lowest overhead

Explicit send()/recv() calls. Near-zero overhead on InfiniBand interconnects. Standard in supercomputing. Steep learning curve.

Overhead: Minimal
Fault tolerance: Manual
matmul_ray_distributed.py
import ray
import numpy as np

# ray.init() on a laptop uses local cores
# ray.init(address="auto") connects to a Ray cluster
ray.init()

N = 4096
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)

# Put B in the shared object store ONCE — all workers read it
# without a separate copy per worker (unlike ProcessPoolExecutor)
B_ref = ray.put(B)

@ray.remote
def compute_block(A_block, B_ref):
    B_local = ray.get(B_ref)
    return A_block @ B_local   # NumPy BLAS on each node

# Split A into row-blocks — one task per available node/core
n_workers = int(ray.cluster_resources().get("CPU", 4))
block     = N // n_workers

futures = [
    compute_block.remote(A[i : i+block], B_ref)
    for i in range(0, N, block)
]

C_blocks = ray.get(futures)          # collect results
C = np.vstack(C_blocks)             # reassemble full matrix
print(f"C shape: {C.shape}")         # (4096, 4096)

# Key design: B broadcast once, A partitioned by rows
# No inter-worker communication during compute — only at scatter/gather

Decision table — which CPU strategy?

Scenario | Tool | Why
Pure-Python loop, 1 machine | ProcessPoolExecutor | Bypasses GIL; one process per core
NumPy arrays, 1 machine | np.matmul (just use it) | OpenBLAS already uses all cores; faster than manual multiprocessing
Multi-socket server, memory-sensitive | numactl + ProcessPool | Pin workers to sockets to avoid NUMA penalty
Multiple machines, ML workloads | Ray | Easy shared object store; handles scheduling automatically
Multiple machines, HPC / MPI cluster | mpi4py | Lowest overhead; standard in scientific computing
threading for compute | Avoid | GIL prevents true parallelism in pure-Python compute
The honest hierarchy: For matrix math on one machine, np.matmul beats hand-parallelised Python every time — OpenBLAS is written in hand-tuned assembly. Python-level multiprocessing only wins when your work can't be expressed as vectorised NumPy operations.
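That hierarchy is easy to confirm on a small case — identical results, wildly different cost (absolute timings vary by machine; only the gap matters):

```python
import time
import numpy as np

N = 128
rng = np.random.default_rng(2)
A = rng.random((N, N))
B = rng.random((N, N))

# Pure-Python triple loop (what the ProcessPool workers run internally)
t0 = time.perf_counter()
C_loop = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
          for i in range(N)]
t_loop = time.perf_counter() - t0

# One BLAS call
t0 = time.perf_counter()
C_blas = A @ B
t_blas = time.perf_counter() - t0

assert np.allclose(C_loop, C_blas)
print(f"loop: {t_loop*1000:.0f} ms   np.matmul: {t_blas*1000:.2f} ms")
```

Even an ideal 8-process pool only divides the loop time by ~8; BLAS is typically hundreds of times faster than that.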
05 — Mechanics

Why is GPU actually faster?

Three compounding factors drive the speedup:

Factor 1

Massive parallelism

Each output cell C[i,j] = sum(A[i,:] × B[:,j]) is independent of every other cell. A GPU with 6,912 cores computes 6,912 cells simultaneously. An 8-core CPU computes 8 at a time (a handful more per core via SIMD lanes).

Factor 2

Memory bandwidth

VRAM bandwidth on an A100 is ~2 TB/s; system RAM manages ~50–80 GB/s. Large matrix multiplies are compute-dense, but keeping thousands of cores fed still demands enormous bandwidth — and many surrounding tensor ops (element-wise math, normalisation, activations) are genuinely memory-bound, where bandwidth translates directly to speed.
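A useful lens here is arithmetic intensity — FLOPs performed per byte of data moved. A back-of-envelope sketch with idealised byte counts (float32, perfect caching; all figures illustrative):

```python
BYTES = 4  # float32

def intensity_matmul(n):
    flops = 2 * n**3                  # a multiply and an add per (i, j, k)
    bytes_moved = 3 * n * n * BYTES   # read A and B once, write C once
    return flops / bytes_moved

def intensity_elementwise(n):
    flops = n * n                     # one op per element
    bytes_moved = 3 * n * n * BYTES   # read two inputs, write one output
    return flops / bytes_moved

print(f"matmul      N=4096: {intensity_matmul(4096):.0f} FLOP/byte")      # 683
print(f"elementwise N=4096: {intensity_elementwise(4096):.2f} FLOP/byte")  # 0.08
# High intensity → compute-bound; low intensity → bandwidth-bound.
```

Element-wise ops do almost no arithmetic per byte, so their speed is set entirely by memory bandwidth — the ~25× VRAM-vs-RAM gap becomes the speedup.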

Factor 3

Tensor Cores

Modern NVIDIA GPUs have dedicated Tensor Core units that compute a 4×4 matrix multiply in a single clock cycle. They're specifically wired for the A×B+C pattern that defines every neural network layer.

CPU bottleneck

Sequential overhead

Even with BLAS, the CPU must orchestrate memory reads, cache misses, and core coordination. For large matrices the data simply can't reach the cores fast enough from RAM.

The GPU contract: You give up flexibility (no complex branching, all threads must execute the same instruction) and you accept a fixed data transfer overhead (CPU → GPU copy). In return, you get a 10–1000× throughput multiplier on the compute itself.
06 — Workload Guide

What can (and can't) be made faster?

The GPU is a tool, not a universal accelerator. Apply it where data parallelism naturally exists.

Workload | GPU? | Why
Deep learning training (PyTorch, TensorFlow) | Excellent | Matrix multiplies everywhere; batches of independent samples
Large matrix / vector ops (NumPy → CuPy) | Excellent | Element-wise and reduction ops are perfectly parallel
Monte Carlo simulation (many independent paths) | Very good | Paths are independent → launch one thread per path
Image / signal processing (convolution, FFT) | Very good | Each pixel/sample computed independently; cuFFT is tuned
Graph algorithms (BFS, PageRank) | Good | Works well with sparse libraries (cuSPARSE, cuGraph)
Database queries / SQL | Good | Aggregations over large tables (see cuDF / RAPIDS)
LP/MIP solvers (Gurobi, OR-Tools) | Limited | Branch-and-bound is inherently serial; GPU helps preprocessing only
Web servers / API handlers | No | I/O bound, low compute, lots of branching logic
Sequential algorithms (DFS, many dynamic programs) | No | Each step depends on the previous; little parallelism to extract
Very small arrays (N < ~1000) | Often no | Data-transfer overhead > compute savings; CPU wins
07 — In Practice

Practical Python patterns

You don't need to write CUDA kernels. The ecosystem has done that for you.

CuPy — drop-in NumPy

Replace import numpy as np with import cupy as cp. API is nearly identical. Arrays live in VRAM.

🔥

PyTorch tensors

All operations on .to("cuda") tensors run on GPU automatically. The autograd engine also runs there.

🐼

cuDF / RAPIDS

GPU-accelerated DataFrames. Same pandas API, but groupby/merge/sort on millions of rows runs in milliseconds.

🧮

Numba @cuda.jit

Write custom GPU kernels in Python with @cuda.jit. Best for irregular workloads that libraries don't cover.

🔬

Profile first

Use torch.profiler or nsight to find the true bottleneck. Often it's data loading, not compute.

📦

Minimize transfers

CPU↔GPU copies are expensive. Move data once, do all work on-device, bring results back at the end only.

CuPy example — Monte Carlo π estimation

monte_carlo_pi.py
import numpy as np
import cupy  as cp   # pip install cupy-cuda12x
import time

N = 100_000_000  # 100 million samples

# ── CPU version ─────────────────────────────────────────
t0 = time.perf_counter()
x, y   = np.random.rand(N), np.random.rand(N)
inside = (x**2 + y**2) <= 1.0
pi_cpu = 4 * inside.mean()
t_cpu  = time.perf_counter() - t0

# ── GPU version ─────────────────────────────────────────
t0 = time.perf_counter()
x, y   = cp.random.rand(N), cp.random.rand(N)  # generated in VRAM
inside = (x**2 + y**2) <= 1.0
pi_gpu = 4 * inside.mean()
cp.cuda.Stream.null.synchronize()               # wait for GPU
t_gpu  = time.perf_counter() - t0

print(f"CPU π ≈ {float(pi_cpu):.6f}  in {t_cpu:.2f}s")
print(f"GPU π ≈ {float(pi_gpu):.6f}  in {t_gpu:.3f}s")
print(f"Speedup: {t_cpu/t_gpu:.1f}×")
# Typical: CPU ~3.5s, GPU ~0.08s → ~40× speedup
08 — Mental Model

The one question to ask

"Can I apply the same operation
to a million independent things at once?"

If yes → GPU will likely help.
If no → stick with a fast CPU library.

The cost model is: transfer penalty + kernel launch overhead vs compute savings. For large, regular, embarrassingly parallel problems the savings dwarf the overhead. For small, branchy, sequential problems the overhead wins and you go slower.
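That trade can be put into rough numbers. The sketch below uses assumed hardware figures — ~16 GB/s PCIe transfer, ~50 TFLOP/s GPU, ~1 TFLOP/s multi-core CPU, ~50 µs kernel-launch overhead — all illustrative, not measurements:

```python
PCIE_BPS  = 16e9    # assumed PCIe transfer rate, bytes/s
GPU_FLOPS = 50e12   # assumed sustained GPU throughput
CPU_FLOPS = 1e12    # assumed sustained multi-core CPU throughput
LAUNCH_S  = 50e-6   # assumed kernel-launch overhead, seconds

def gpu_wins(n, dtype_bytes=4):
    """Crude cost model for an NxN float32 matmul: does the GPU pay off?"""
    transfer = 3 * n * n * dtype_bytes / PCIE_BPS  # A, B over; C back
    flops    = 2 * n**3
    t_gpu = transfer + LAUNCH_S + flops / GPU_FLOPS
    t_cpu = flops / CPU_FLOPS
    return t_gpu < t_cpu, t_cpu, t_gpu

for n in (256, 1024, 4096):
    wins, t_cpu, t_gpu = gpu_wins(n)
    print(f"N={n:5d}  CPU {t_cpu*1e3:8.2f} ms  GPU {t_gpu*1e3:8.2f} ms  "
          f"{'GPU wins' if wins else 'CPU wins'}")
```

Because compute grows as N³ while transfer grows as N², the GPU always wins past some crossover size — under these assumed numbers, somewhere below N=1024.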

Most real-world acceleration in ML and scientific computing comes not from writing CUDA directly, but from restructuring your problem as batched tensor operations and letting libraries like PyTorch, CuPy, or RAPIDS handle the GPU dispatch.