GPU Arc Consistency

Arc consistency is a simple rule: every value in a variable's domain must have at least one valid partner in each neighbour's domain. This is an exploration of what it actually takes to run that rule on a GPU — the design choices, the trade-offs, and what the hardware fights against.

What this is — and what it isn't

This is an exploration of what it takes to move arc consistency to a GPU — the design choices, the constraints, and the challenges that arise when translating a fundamentally sequential algorithm into parallel hardware. It is not a rigorous performance benchmark.

The scope is restricted to extensional binary constraints — constraints defined by explicit support tables rather than arithmetic expressions. This makes support mask precomputation straightforward and keeps the GPU data structures simple. Intensional or global constraints are out of scope.

The starting point is a basic arc consistency implementation, not the most efficient algorithm in the literature. Optimising the algorithm itself and comparing against state-of-the-art CPU solvers was not the goal. The purpose was to understand the shape of the problem: what has to change, what the GPU fights against, and where it can win.

The results do suggest that for sufficiently large problems, GPU acceleration can pay off — but that finding is a by-product of the exploration, not its conclusion.

This is early-stage work. The implementations are functional and produce correct results on tested cases, but have not been audited for edge cases or hardened for production use. There is likely room for bugs.

01 The Problem: Pruning Impossible Values

Before searching for a solution, we can eliminate huge numbers of impossible values. Arc consistency does exactly this — and doing it fast matters enormously.

What is a Binary CSP?

A Constraint Satisfaction Problem (CSP) has three ingredients: variables (things we need to assign), domains (the possible values for each variable), and constraints (rules that pairs of variables must satisfy).

Running Example — used throughout this page

Three variables: X, Y, Z. Constraint: X < Y < Z. Domains: X = {1,2,3,4}, Y = {1,2,3,4}, Z = {1,2,3,4}.

Before we search for an assignment, can we already tell some values are impossible?

The Arc Consistency Rule

A value is impossible if some neighbour of its variable has no remaining value that can coexist with it. The rule is:

The Core Rule

A value x of variable X survives only if every neighbor of X still has at least one alive value compatible with x.


Why Does This Help?

In real CSPs with 60 variables and 20 values each, a brute-force search over all assignments would check up to 20^60 combinations. Arc consistency can eliminate enough values that search becomes tractable — or it can immediately detect that no solution exists (UNSAT).

Key Insight

The solver doesn't find a solution directly. It just keeps deleting things that cannot possibly be in a solution, until either nothing more can be deleted (done!) or some variable runs out of values (UNSAT — no solution exists).
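The whole idea fits in a few lines of plain Python. The following is a hedged sketch — sets instead of boolean matrices, a naive repeat-until-stable sweep instead of a worklist, and `enforce_arc_consistency` is a hypothetical helper name, not the project's code:

```python
# Minimal sketch: arc consistency as "delete until fixed point",
# on the running example (X < Y < Z).
def enforce_arc_consistency(domains, arcs):
    """domains: {var: set(values)}; arcs: {(X, Y): fn(x, y) -> bool}."""
    changed = True
    while changed:                      # keep sweeping until nothing is deleted
        changed = False
        for (xi, yj), fn in arcs.items():
            for x in list(domains[xi]):
                # x survives only if some alive y in the neighbour supports it
                if not any(fn(x, y) for y in domains[yj]):
                    domains[xi].discard(x)
                    changed = True
    sat = all(domains[v] for v in domains)   # UNSAT if any domain emptied
    return sat, domains

domains = {"X": {1, 2, 3, 4}, "Y": {1, 2, 3, 4}, "Z": {1, 2, 3, 4}}
arcs = {("X", "Y"): lambda x, y: x < y, ("Y", "X"): lambda y, x: x < y,
        ("Y", "Z"): lambda y, z: y < z, ("Z", "Y"): lambda z, y: y < z}
sat, pruned = enforce_arc_consistency(domains, arcs)
# → sat=True, X={1,2}, Y={2,3}, Z={3,4}
```

This sequential, in-place version reaches the same fixed point as the buffered GPU formulations described later — only the route there differs.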

02 Data Structures and Propagation

Before considering GPUs, let's understand the data structures for arc consistency.

Data Structure 1: The Domain Matrix

Instead of storing lists, we use a boolean matrix: one row per variable, one column per value. true = alive, false = dead.

Domain Matrix — our running example
      v=1  v=2  v=3  v=4
X      T    T    T    T
Y      T    T    T    T
Z      T    T    T    T

After convergence, the matrix will show only the values that survive. This matrix is the core data structure uploaded to the GPU.
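As a minimal sketch of the layout (plain Python lists standing in for the real boolean array; `alive_values` is an illustrative helper, not the project's API):

```python
# Assumed layout: one boolean row per variable, one column per value.
# Values 1..d map to columns 0..d-1.
n_vars, d = 3, 4
domains = [[True] * d for _ in range(n_vars)]   # everything alive at the start

def alive_values(domains, var):
    """Recover the classic set view from the boolean row."""
    return {v + 1 for v, alive in enumerate(domains[var]) if alive}

domains[0][3] = False            # delete X=4 (row 0, column 3)
print(alive_values(domains, 0))  # → {1, 2, 3}
```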

Data Structure 2: Support Masks

Asking "does X=2 have a compatible partner in Y?" requires checking every Y value against the constraint. We'd have to call the constraint function thousands of times. Instead, we precompute a lookup table — the support mask.

Why Precompute?

Constraint functions are Python lambdas — they cannot run on the GPU. We must evaluate them on the CPU first and bake the results into boolean arrays that the GPU can read.

Support Mask for arc X → Y (constraint: X < Y)
       Y=1  Y=2  Y=3  Y=4
X=1     F    T    T    T
X=2     F    F    T    T
X=3     F    F    F    T
X=4     F    F    F    F

Read row X=3: it is compatible with Y=4 only. If Y=4 is later deleted, X=3 loses its last support and must also be deleted.

Optimization: Transpose Reuse

The mask for arc Y → X is just the transpose of the X → Y mask. This can halve the precomputation time.
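A sketch of the precomputation, using plain Python nested lists for illustration (the real code bakes the results into NumPy arrays / tensors):

```python
# Precompute the support mask for X→Y from a constraint function,
# then reuse its transpose for Y→X. Values 1..4 map to indices 0..3.
d = 4
lt = lambda x, y: x < y                          # the running example's X < Y

mask_xy = [[lt(x + 1, y + 1) for y in range(d)] for x in range(d)]
mask_yx = [list(col) for col in zip(*mask_xy)]   # transpose: Y→X for free

# Row X=3 (index 2) is supported only by Y=4 (index 3):
print(mask_xy[2])   # → [False, False, False, True]
```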

Data Structure 3: Arc List (CSR Format)

For each variable, we need to know its neighbors fast. We store this in CSR (Compressed Sparse Row) format — three flat arrays that are cache-friendly and GPU-uploadable. CSR was formalised in the Yale Sparse Matrix Package (Eisenstat et al., 1977) and has since become the standard in sparse linear algebra libraries from SciPy to NVIDIA's cuSPARSE.

CSR Arc Structures — X < Y < Z example
Array Values Meaning
arc_offsets [0, 1, 3] X starts at index 0, Y at 1, Z at 3
arc_counts [1, 2, 1] X has 1 arc, Y has 2, Z has 1
arc_list [arc0,Y], [arc1,X], [arc2,Z], [arc3,Y] X→Y, Y→X, Y→Z, Z→Y

Each entry in arc_list is [arc_id, neighbor_variable]. The arc_id indexes into the support mask array. This flat layout avoids pointer chasing — critical for GPU performance.
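A minimal sketch of the CSR build for the running example (plain Python, illustrative only — arc ids are assigned in discovery order, matching the table above):

```python
# Build CSR arc structures for X(0), Y(1), Z(2) with X<Y<Z.
neighbors = {0: [1], 1: [0, 2], 2: [1]}     # X→[Y], Y→[X,Z], Z→[Y]

arc_index, arc_offsets, arc_counts, arc_list = {}, [], [], []
for i in sorted(neighbors):
    arc_offsets.append(len(arc_list))       # where variable i's arcs start
    arc_counts.append(len(neighbors[i]))
    for j in neighbors[i]:
        arc_id = arc_index.setdefault((i, j), len(arc_index))
        arc_list.append((arc_id, j))        # [arc_id, neighbor_variable]

print(arc_offsets)  # → [0, 1, 3]
print(arc_counts)   # → [1, 2, 1]
print(arc_list)     # → [(0, 1), (1, 0), (2, 2), (3, 1)]
```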

Data Structure 4: Arc Flags (Active Worklist)

Instead of tracking "which variables need to be rechecked," v3 tracks which arcs need processing. This is finer-grained — we can skip an arc that hasn't been affected by any deletion.

Arc Flags — Active Worklist
All arcs start active.

One Round of Propagation

We keep two copies of the domain matrix — one to read from (domains_cur), one to write to (domains_nxt). This ensures that deletions made during a round don't affect other variables' checks in the same round.

# Two-buffer round — read from cur, write to nxt
for each active variable X:
    for each alive value x in X:
        survived = True
        for each arc (X → Y):
            # Check: does Y still have any value compatible with x?
            has_support = any(support_mask[X→Y][x, y] and domains_cur[Y, y]
                              for y in range(domain_size))
            if not has_support:
                survived = False
                break
        domains_nxt[X, x] = survived

# After all variables: swap buffers, detect changes, update arc flags
domains_cur, domains_nxt = domains_nxt, domains_cur
Why Two Buffers?

If we modified the domain matrix in place, an early deletion could affect later checks in the same pass — changing which values survive. Two buffers guarantee all variables see the same frozen snapshot within one round. This is essential for determinism on parallel hardware.
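A toy demonstration of why the buffering choice matters within a single round (plain Python; `one_round` is a hypothetical helper, not the project's code). With the running example, an in-place sweep lets Y's deletions influence Z's check in the same round:

```python
# One round of pruning: two-buffer (reads a frozen snapshot) vs in-place
# (reads its own writes). The fixed point is the same; the per-round
# intermediate states are not.
def one_round(domains, arcs, in_place):
    cur = {v: set(vals) for v, vals in domains.items()}   # frozen snapshot
    nxt = {v: set(vals) for v, vals in domains.items()}   # write target
    read = nxt if in_place else cur
    for (xi, yj), fn in arcs.items():
        for x in list(nxt[xi]):
            if not any(fn(x, y) for y in read[yj]):
                nxt[xi].discard(x)
    return nxt

domains = {"X": {1, 2, 3, 4}, "Y": {1, 2, 3, 4}, "Z": {1, 2, 3, 4}}
arcs = {("X", "Y"): lambda x, y: x < y, ("Y", "X"): lambda y, x: x < y,
        ("Y", "Z"): lambda y, z: y < z, ("Z", "Y"): lambda z, y: y < z}

print(one_round(domains, arcs, in_place=False)["Z"])  # → {2, 3, 4}
print(one_round(domains, arcs, in_place=True)["Z"])   # → {3, 4}
```

In the in-place variant, Z=2 dies in round 1 because Y=1 was already deleted earlier in the same sweep — a different intermediate state than the two-buffer round produces.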

Propagation Cascade

One deletion can trigger more deletions, which trigger more. The solver keeps running rounds until nothing changes. Here's why multiple rounds are needed:

Why Multiple Rounds? — X < Y < Z cascade
Start:   X={1,2,3,4} Y={1,2,3,4} Z={1,2,3,4}
Round 1: X=4 needs Y>4 → none. Dead.  Y=1 needs X<1 → none. Dead.
         Y=4 needs Z>4 → none. Dead.  Z=1 needs Y<1 → none. Dead.
Result 1: X={1,2,3} Y={2,3} Z={2,3,4}
Round 2: X=3 needs Y>3 → none (max Y=3). Dead.  Z=2 needs Y<2 → none. Dead.
Result 2: X={1,2} Y={2,3} Z={3,4}
Round 3: nothing more to delete → fixed point ✓

The key observation: a deletion in one variable can cascade to force deletions in its neighbors, which cascade further. On a benchmark instance (60 variables, 20 values), this takes 24 rounds before UNSAT is proven.

03 CPU vs GPU — Who Does What

Not everything can or should run on the GPU. The walkthrough below traces the full algorithm from setup to result, noting where each step executes; it follows the Apple MPS variant. Note: the CUDA variant was not tested on hardware.

Each step below names the operation and where it runs (CPU, GPU, or a CPU-GPU transfer/sync), gives its pseudocode, and shows its effect on the running example (X < Y < Z, domains {1,2,3,4}).
Phase 1 — Setup (once, before any propagation)

Step 1 (CPU): evaluate constraint(x, y) for all pairs → support masks of shape (n_arcs, d, d).

for (i, j), fn in constraints:
    mask[i, j][x, y] = fn(x, y)   # for every value pair (x, y)
    mask[j, i] = mask[i, j].T

Example: X<Y gives mask[X→Y][x, y] = (x < y):

       Y=1  Y=2  Y=3  Y=4
X=1:    F    T    T    T
X=2:    F    F    T    T
X=3:    F    F    F    T
X=4:    F    F    F    F

mask[Y→X] = mask[X→Y].T — same logic for Y<Z.

Step 2 (CPU): build the CSR arc structures — arc_offsets, arc_counts, arc_list.

arc_index = {}
global_arc_idx = 0
for i in vars:
    arc_offsets[i] = len(arc_list)
    for j in neighbors[i]:
        if (i, j) not in arc_index:
            arc_index[(i, j)] = global_arc_idx
            global_arc_idx += 1
        arc_id = arc_index[(i, j)]
        arc_list.append([arc_id, j])
        arc_counts[i] += 1

Example: 3 vars X(0), Y(1), Z(2) with neighbors X→[Y], Y→[X,Z], Z→[Y].
arc_index: (X,Y)→0, (Y,X)→1, (Y,Z)→2, (Z,Y)→3; arc_offsets = [0, 1, 3]; arc_counts = [1, 2, 1].

Step 3 (CPU): precompute the padded arc tensors — arc_idx_padded, valid_arc_mask.

for i in vars:
    for k in range(arc_counts[i]):
        arc_idx_padded[i, k] = arc_id      # from arc_list
        valid_arc_mask[i, k] = True
# trailing slots stay False (padding)

Example: max_degree = 2 (Y has 2 neighbors).
arc_idx_padded: X: [0, —], Y: [1, 2], Z: [3, —]; valid_arc_mask: X: [T, F], Y: [T, T], Z: [T, F].

Step 4 (CPU → GPU): upload all arrays to MPS device memory.

support_t = torch.tensor(masks, device='mps')
domains_t = torch.tensor(domains, device='mps')
arc_idx_t = torch.tensor(arc_idx_padded, device='mps')

Example: support_t: (4, 4, 4) → MPS; domains_t: (3, 4), all True; arc_idx_t: (3, 2), padded → MPS.

Step 5 (CPU): initialise the arc flags — all arcs active.

arc_active = np.ones(n_arcs, dtype=bool)

Example: arc_active = [T, T, T, T] — arcs 0: X→Y, 1: Y→X, 2: Y→Z, 3: Z→Y, all active before round 1.
Phase 2 — Propagation loop (repeats each round until no changes or UNSAT)

Step 1 (CPU): check the arc flags — which variables have at least one active arc?

active_vars = np.flatnonzero(
    np.any(arc_active[fwd_arc_flat].reshape(n_vars, -1), axis=1))

Example (round 1): all 4 arcs active → active_vars = [X, Y, Z].

Step 2 (CPU): gather the padded arc structures for the active variables.

active_t      = torch.tensor(active_vars)
batch_arc_idx = arc_idx_t[active_t]
batch_nbr_idx = nbr_idx_t[active_t]
batch_valid   = valid_mask_t[active_t]

Example: active = [X, Y, Z], batch_size = 3.
batch_arc_idx (3×2): X: [0, —], Y: [1, 2], Z: [3, —]; batch_nbr_idx (3×2): X: [Y, —], Y: [X, Z], Z: [Y, —].

Step 3 (CPU → GPU): dispatch the MPS kernel — PyTorch dispatches when the tensors are consumed.

support_batch = support_t[batch_arc_idx]
nbr_domains   = domains_t[batch_nbr_idx]

Example: support_batch: (3, 2, 4, 4) on MPS; nbr_domains: (3, 2, 4) — every variable still sees fully alive neighbor domains [T, T, T, T].

Step 4 (GPU): dense support check — all active variables processed simultaneously.

candidate = support_batch & nbr_domains[:, :, None, :]
# shape: (active, degree, d, d)
support_exists = torch.any(candidate, dim=3)
# shape: (active, degree, d)
survived = domain_batch & torch.all(
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None],
    dim=1)
domains_nxt[active_vars] = survived

Example: X=4: mask[X→Y][4, :] = [F,F,F,F] & Y = [T,T,T,T] → no support ✕. Y=1: mask[Y→X][1, :] = [F,F,F,F] ✕. Y=4: mask[Y→Z][4, :] = [F,F,F,F] ✕. Z=1: mask[Z→Y][1, :] = [F,F,F,F] ✕.
→ domains_nxt: X: {1,2,3}, Y: {2,3}, Z: {2,3,4}.

Step 5 (GPU → CPU, sync ①): read "any empty domain?" — a single scalar bool.

if torch.any(~torch.any(domains_nxt, dim=1)):
    return False, domains_nxt   # UNSAT

Example: every row of domains_nxt still contains a True → SAT so far, continue.

Step 6 (GPU → CPU, sync ②): read the changed-variable indices.

changed_t = torch.nonzero(
    torch.any(domains_t != domains_nxt, dim=1))
changed = changed_t.cpu().numpy()

Example: X lost value 4, Y lost 1 and 4, Z lost 1 → changed = [X, Y, Z].

Step 7 (CPU): update the arc flags — OFF for the forward arcs of processed variables, then ON for the reverse arcs of changed variables.

arc_active[fwd_arcs[active_vars]] = False
arc_active[rev_arcs[changed]] = True   # OFF before ON: preserves correct priority

Example: OFF the forward arcs of [X, Y, Z] → arcs 0–3 all False. ON the reverse arcs of the changed [X, Y, Z]: X changed → arc 1 (Y→X); Y changed → arcs 0, 3 (X→Y, Z→Y); Z changed → arc 2 (Y→Z). arc_active = [T, T, T, T] again.

Step 8 (CPU): swap buffers; if nothing changed, exit the loop.

domains_t, domains_nxt = domains_nxt, domains_t
if len(changed) == 0:
    break

Example: buffers swapped; domains now X: {1,2,3}, Y: {2,3}, Z: {2,3,4}. Three variables changed → round 2 (prunes X=3 and Z=2); round 3 finds no changes and the loop exits.

If len(changed) > 0, the next round starts again at step 1; otherwise → Phase 3.
Phase 3 — Result

Step 1 (GPU → CPU): download the final domains.

final = domains_t.to('cpu').numpy()

Example (after 3 rounds, MPS → CPU): X: [T,T,F,F] → {1, 2}; Y: [F,T,T,F] → {2, 3}; Z: [F,F,T,T] → {3, 4}.

Step 2 (CPU): return SAT plus the pruned domains.

return True, final

Example: SAT = True. X: {1,2} (pruned 3, 4); Y: {2,3} (pruned 1, 4); Z: {3,4} (pruned 1, 2).
MPS: 2 CPU-GPU syncs per round

With 24 rounds on the hard case: 48 total syncs. Each sync flushes the GPU pipeline. The round-trip overhead is the primary bottleneck for multi-round workloads.

04 Seven Lessons from Porting Arc Consistency to GPU

The CPU version is clean, natural code. Moving to a GPU requires rethinking almost everything. Here are the seven non-obvious lessons learned from this project.

① The GPU Checks Support for All Values Simultaneously

On CPU, arc consistency checks values sequentially — value x=0, then x=1, then x=2. For each value, it scans neighbour values y=0, y=1... stopping as soon as support is found. One value at a time, one neighbour at a time.

On GPU the computation is fundamentally different. In MPS, a single tensor operation simultaneously determines whether support exists for every value of every variable. The expression support_mask & nbr_domains followed by any(dim=3) covers the entire domain in one shot — potentially thousands of (variable, value) pairs checked in parallel. There is no x=0, then x=1. There is no y=0, then y=1. Whether support is found at y=0 or y=24 makes no difference to cost — both are part of the same operation.

In CUDA the model is different: the GPU assigns one thread to each value — thread 0 is responsible for x=0, thread 1 for x=1, and so on. All threads run simultaneously, so support is checked for all values in parallel. The neighbour scan is further parallelised using __ballot_sync: a group of 32 threads (called a warp) simultaneously check 32 Y values and vote — if any thread's Y value provides support, the whole group knows immediately.

This means the CPU implementation cannot be directly ported to GPU — it has to be rewritten. The natural CPU structure (loop over values, check each one, prune if no support) doesn't translate. The algorithm must be reformulated as a tensor operation that computes support for all values simultaneously. Same result, fundamentally different structure.
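The contrast can be shown in plain Python (illustrative only — list comprehensions here stand in for the tensor operation, and `survives`/`survived_all` are hypothetical names):

```python
# Sequential per-value check with early exit vs computing the whole
# "survived" vector at once, the way the GPU formulation does.
support = [[False, True, True, True],    # mask[X→Y] for X < Y, d = 4
           [False, False, True, True],
           [False, False, False, True],
           [False, False, False, False]]
nbr_alive = [True, True, True, True]

# CPU style: one value at a time, stop scanning at the first support.
def survives(x):
    for y, alive in enumerate(nbr_alive):
        if alive and support[x][y]:
            return True                   # early exit
    return False

# GPU style: no per-value loop to exit from — every (x, y) pair is
# evaluated as one bulk expression, then reduced over y.
survived_all = [any(s and a for s, a in zip(row, nbr_alive)) for row in support]
print(survived_all)   # → [True, True, True, False]  (X=4 has no support)
```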

How support is checked: CPU vs Bucketed MPS vs Single-Batch MPS
CPU — sequential
→ x=0: scan y=0,1,2… ✓
→ x=1: scan y=0,1,2… ✗
→ x=2: scan y=0,1,2… ✓
→ x=3: scan y=0,1… ✓
One value at a time. Early exit saves work. Residual cache helps.
Bucketed MPS — earlier approach
group vars by degree →
batch 1: deg-2 vars → op
batch 2: deg-3 vars → op
batch 3: deg-5 vars → op
… up to 13–19 batches/round
Each batch is a separate GPU operation. Many dispatches per round. GPU sits idle between batches.
Single-Batch MPS — current approach
pad all vars to max_degree →
support_mask & nbr_domains → any(dim=3)
One rectangular tensor covers all variables at once. Single GPU operation per round.
② No Python Lambdas on the GPU — Booleanise First

Constraints like lambda x, y: x < y are Python functions. GPUs cannot execute Python. Every constraint must be evaluated on the CPU first and stored in a boolean support mask. The GPU only reads from this precomputed table — it never evaluates constraints itself.

# CPU: evaluate all pairs once, store result as boolean matrix
for vi in range(domain_size):
    for vj in range(domain_size):
        mask[vi, vj] = constraint(vi, vj)   # e.g. lambda x,y: x < y

# Upload mask to GPU. From here, constraint is never called again.
support_t = torch.tensor(mask, dtype=torch.bool, device='mps')
③ The Residual Cache Loses Its Benefit on GPU

On CPU, caching the last known supporting value is powerful: if the cached value is still alive, you skip the full scan — O(1) instead of O(d).

On GPU this benefit evaporates — for two different reasons depending on backend. In CUDA, the 32 threads in a warp always advance together in lockstep (they cannot finish independently). Even if one thread hits the cache and could stop early, the entire group waits for the slowest. The round takes the same time whether the cache hits or not. In MPS, there is no per-value sequential scan at all — the support check is a single tensor operation over the entire domain. There is no position to cache and no scan to exit early from; whether support is at y=0 or y=24 makes no difference to cost.

The consequence is the same in both cases: optimisations that exploit sequential order — caching the last support position, early exit when support is found — do not carry over to GPU.
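For reference, this is the kind of CPU-side residual cache the lesson refers to (a hypothetical helper for illustration, not the project's actual code):

```python
# Residual cache: remember the last supporting neighbour value per x.
# If it is still alive, skip the O(d) scan entirely.
def has_support(x, support_row, nbr_alive, residual):
    """support_row[y] = mask[X→Y][x, y]; nbr_alive[y] = is y alive?"""
    cached = residual.get(x)
    if cached is not None and nbr_alive[cached] and support_row[cached]:
        return True                              # O(1) cache hit
    for y, alive in enumerate(nbr_alive):        # O(d) fallback scan
        if alive and support_row[y]:
            residual[x] = y                      # remember for next time
            return True
    return False

residual = {}
row = [False, False, True, True]                 # supports of some value x
print(has_support(1, row, [True] * 4, residual)) # → True (scan, caches y=2)
print(has_support(1, row, [True] * 4, residual)) # → True (cache hit)
```

On the GPU, neither the cache hit nor the early `return True` buys anything — which is exactly the point of this lesson.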

The Winning MPS Computation — 4 Lines of PyTorch
# support_batch: (n_active, max_degree, d, d) — support masks for active vars
# nbr_domains:   (n_active, max_degree, d)    — alive values in each neighbor

# "Is (vi, vj) a valid pair AND is vj alive?"
candidate = support_batch & nbr_domains[:, :, None, :]
#            ↑ shape: (n_active, max_degree, d, d)

# "Does any vj survive?" → True if ANY column in a vi-row is True
support_exists = torch.any(candidate, dim=3)
#                ↑ shape: (n_active, max_degree, d)

# "Did vi survive?" → alive AND all neighbors have support
survived = domain_batch & torch.all(
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None],
    dim=1,
)
Why This Works

No Python loops over values. No branching. One rectangular tensor operation that the MPS scheduler can dispatch as a single GPU kernel. The padding positions are masked by ~batch_valid so they don't affect the result.

④ Padding Variables to the Same Degree

Variables have different numbers of neighbors — this is called their degree. To process all variables together in a single GPU operation, every row of the tensor must be the same width. We pad every variable's arc list to max_degree slots using dummy entries and a validity mask. Padding slots are masked out and never affect results — but their presence is what makes the tensor rectangular so one operation covers everyone.

# Padded arc structures — filled for all variables up to max_degree
arc_idx_padded[i, k]      = arc_id           # real arc index, or 0 for padding
neighbor_idx_padded[i, k] = neighbor_var     # real neighbor, or 0 for padding
valid_arc_mask[i, k]      = True            # False for padding positions

# Variables with fewer arcs just have trailing False entries in valid_arc_mask
How padding works — 3 variables, max_degree = 5
variable    degree  slot 0  slot 1  slot 2  slot 3  slot 4
X (deg 2)      2    X→Y ✓   X→Z ✓   pad ✗   pad ✗   pad ✗
Y (deg 5)      5    Y→X ✓   Y→Z ✓   Y→A ✓   Y→B ✓   Y→C ✓
Z (deg 3)      3    Z→X ✓   Z→Y ✓   Z→W ✓   pad ✗   pad ✗

All three variables sit in the same rectangular tensor with width 5 (= max_degree). Slots marked ✓ are real arcs included in the support check. Slots marked ✗ are padding — their valid_arc_mask entry is False, so the ~batch_valid term in the final all() call treats them as "already satisfied" and they never prune a value.
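The padding itself is a one-liner per array. A sketch using the running example's arc ids (plain Python lists standing in for tensors):

```python
# Pad every variable's arc list to max_degree so all rows fit one
# rectangular tensor. Padding slots reuse id 0 but are masked out.
arc_ids = {0: [0], 1: [1, 2], 2: [3]}       # per-variable arc ids (X, Y, Z)
max_degree = max(len(a) for a in arc_ids.values())

arc_idx_padded = [a + [0] * (max_degree - len(a)) for a in arc_ids.values()]
valid_arc_mask = [[True] * len(a) + [False] * (max_degree - len(a))
                  for a in arc_ids.values()]

print(arc_idx_padded)   # → [[0, 0], [1, 2], [3, 0]]
print(valid_arc_mask)   # → [[True, False], [True, True], [True, False]]
```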

⑤ A Classic AC-3 Worklist Optimisation That Doesn't Transfer to GPU

AC-3 has a well-known refinement: when variable X is revised because of arc (Y→X), there's no need to recheck arc (X→Y) — Y just constrained X, so X can't have gained new values that would break Y. On CPU this saves real work.

On GPU this doesn't apply. All arcs in a round are processed simultaneously — the GPU has no record of why a variable changed or which arc caused it. Every changed variable wakes all its reverse arcs, conservatively. Tracking that provenance in parallel would require every thread to record what it changed and why, and multiple threads writing to the same record simultaneously would corrupt each other's data. The cost outweighs the benefit.
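The CPU-side refinement looks roughly like this (hypothetical worklist code, not the project's — the point is only which arcs get re-enqueued):

```python
# Classic AC-3 refinement: after revising X against Y deletes something,
# re-enqueue the arcs (Z → X) for every neighbour Z of X *except* Y.
from collections import deque

def arcs_to_requeue(neighbors, x, y):
    """Arcs woken when revising (x, y) deleted something from x's domain."""
    return [(z, x) for z in neighbors[x] if z != y]

neighbors = {"X": ["Y"], "Y": ["X", "Z"], "Z": ["Y"]}
worklist = deque(arcs_to_requeue(neighbors, "Y", "X"))
print(list(worklist))   # → [('Z', 'Y')] — (X → Y) is skipped
```

The GPU version drops the `!= y` exclusion and conservatively wakes all reverse arcs of every changed variable.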

⑥ Apple MPS: The Propagation Loop Cannot Stay on the GPU

This is the single most important constraint when targeting Apple Silicon. On NVIDIA GPUs, the entire propagation loop — all rounds until convergence — can run entirely on the GPU. The CPU hands off the problem and waits idle. On Apple MPS this is not possible. After every round, the GPU must hand results back to the CPU, the CPU checks whether any values were pruned and whether to continue, then sends the GPU into the next round. Every round, back and forth.

The consequence: MPS always pays the cost of 2 CPU-GPU handoffs per round, regardless of how the computation is structured. On a 24-round instance that is 48 handoffs vs NVIDIA's 2 total. This is an architectural ceiling — not an implementation choice that can be optimised away.

⑦ Two-Buffer Semantics: CPU and GPU Must Agree

At the start of each round, the GPU reads the current domains and writes pruned results to a separate copy. At round end the copies swap. This means all threads in a round see the same snapshot — no thread sees another thread's prunings mid-round. The CPU reference implementation does the same. If the GPU used in-place updates (writing back to the same copy it's reading) while the CPU used two-buffer, they would see different intermediate states within each round and disagree on the final result — not because either is wrong, but because they are operating on a subtly different version of the algorithm.

05 Four Implementations Compared

The solver was developed iteratively. Here are the four main variants, what each one does, and how they perform.

SLOWER

MPS Bucketed (earlier approach)

Variables grouped by degree into separate batches — one GPU operation per degree bucket per round. Typically 13–19 batches per round. Between batches the GPU sits idle while the CPU sets up the next group.

Replaced by single-batch approach. Too many small dispatches per round.

BEST MPS

MPS v3 Two-Buffer

Single rectangular batch per round. Padded to max_degree. Dense tensor ops: &, any(dim=3), all(dim=1). Arc flags on CPU. 2 CPU-GPU syncs per round.

Best on Apple MPS for multi-round (UNSAT) cases.

MPS VARIANT

MPS v3 In-Place

Single buffer. Writes (true→false only) are monotonic — no value ever comes back. Round count is identical to two-buffer because index_select snapshots neighbor domains before any writes, so within-round visibility is the same. The benefit is avoiding the end-of-round copy_ and buffer swap. Final domains at convergence are identical.

Why monotonic writes make races harmless: if two GPU threads race to write the same domain cell, both are writing False (a deletion). There is no winning or losing write — both threads reach the same outcome regardless of order. A cell can never be erroneously restored to True by a late write, because no thread ever writes True.

Faster on SAT cases. Mixed results on UNSAT.

SLOWER

MPS GPU-Resident

Attempted to move arc flag updates entirely to GPU using index_fill_ calls. Eliminated CPU round-trips. But: many small GPU dispatches + no MPS grid.sync() = slower than CPU-managed flags.

111ms vs 57ms on hard case. Reverted.

CUDA Cooperative Kernel — How It Works

Not yet experimented with on my machine

My machine runs Apple Silicon (M4-class). CUDA is unavailable here — the cooperative kernel path exists in gpu_ac_v2.py and gpu_ac_v3.py but has not been benchmarked or validated in my environment. The description below is based on the implementation and design intent, not measured results.

The CUDA kernel runs entirely on the GPU. All rounds happen inside a single kernel launch. grid.sync() acts as a barrier between phases — every thread waits until all threads globally are done before moving to the next phase.

// CUDA kernel — one block per variable, one warp per value
for (round = 0; round < max_rounds; round++) {

    // PHASE 1: Revise domains (all blocks in parallel)
    for (var_i = blockIdx.x; var_i < n_vars; var_i += gridDim.x) {
        // threadIdx.y = value vi, threadIdx.x = vj scan lane (0..31)
        for each arc (var_i → var_j):
            // Cache check: lane 0 checks residual, broadcasts to warp
            cache_hit = __shfl_sync(0xffffffff, check_residual(), 0);
            if (cache_hit) continue;

            // Vectorised scan: ALL 32 lanes check 32 vj positions at once
            ballot = __ballot_sync(0xffffffff, alive[vj] && supports[vi,vj]);
            // ballot is a 32-bit mask — any bit set = support found
    }

    grid.sync();   // wait for ALL threads globally

    // PHASE 2: Update arc flags
    // PHASE 3: Termination check

    grid.sync();
    if (result[0] != 0) return;   // done!
}

MPS Single-Batch — The Key Insight

An earlier "bucketed" approach grouped variables by degree, creating 13–19 separate batches per round. Each batch launched its own chain of MPS kernels. The dispatch overhead was enormous.

The current approach: one batch, one chain, every round. Variables with fewer arcs are padded to max_degree — the wasted computation on padding positions is negligible compared to the reduced dispatch overhead.

Bucketed Approach vs Single-Batch
❌ Bucketed: 13–19 batches/round
Degree-1 vars → batch 1 → kernel dispatch
Degree-2 vars → batch 2 → kernel dispatch
... × 17 more batches ...
Total: ~227ms on hard case
✓ Single-batch: 1 batch/round
All active vars padded to max_degree
One rectangular tensor → one kernel chain
Padding masked via valid_arc_mask
Total: ~57ms on hard case (4x faster)

06 Benchmark Results

Measured on Apple Silicon M4, PyTorch 2.11.0, MPS backend. 5 runs after 2 warmup runs. Times are averages in milliseconds.

What "Hard" Means

The "hard" case is Model B with n=60 variables, d=20 values, density=0.35, tightness=0.75, seed=0. It takes 24 propagation rounds before proving UNSAT. This is the real stress test — the solver that wins here is genuinely faster, not just faster at launching kernels.

(Table: solve times per instance for each variant — MPS single-batch, MPS in-place, GPU-resident, and bucketed MPS — against the CPU baseline, plus the shared one-off precompute cost. The key numbers are summarised below.)
Key Observations

Tiny instances: CPU wins. MPS launch overhead (~1–3ms) dominates when the actual work is <0.1ms.

Medium instances (1 round): MPS v3 achieves 14–40x speedup. Single-batch approach shines.

Hard instance (24 rounds, UNSAT): MPS v3 achieves 3.8x speedup and is 4x faster than the bucketed approach.

GPU-resident arc flags: Slower on the hard case (111ms vs 57ms). Apple MPS has no grid.sync() — many small dispatches add up.

Precompute cost: The support mask build (CPU, one-off) grows fast — 32ms for the hard case, 531ms for xlarge. For single-propagation use, this dominates total time regardless of which backend runs propagation. The GPU advantage is most meaningful when the same precomputed masks are reused across many propagation calls.

When Is GPU Worth It?

Based on the results above, the GPU tends to help when at least two of these are true:

  • Large problem — n ≥ 60 variables with d ≥ 20 values. Below this, MPS launch overhead dominates the actual work.
  • Multi-round propagation — the instance requires many rounds (e.g., UNSAT detection). Single-round instances give the GPU less to amortise its overhead over.
  • Masks reused — precomputed support masks are shared across multiple propagation calls (e.g., solving many instances with the same constraint structure, or repeated propagation inside a search). For single-call use, precompute time can dwarf the solve time on both backends.

The benchmarked instances are small by constraint-programming standards. In practice, a solver embedded in a branch-and-bound search would call propagation thousands of times per run — and reuse the same precomputed masks each time. That's the use case where the GPU advantage compounds.

End-to-End Time: Precompute + Solve

The table above shows solve-only times. Adding precompute (which is shared and one-off, but still costs time on first call) changes the picture considerably for large instances:

Instance                      CPU total   Best MPS total   Solve-only speedup   End-to-end speedup
XCSP3 (3 vars)                  0.04 ms         1.57 ms          0.02×                0.03×
Easy (n=40)                     12.3 ms          4.2 ms           9.1×                 2.9×
Medium (n=60, 1 round)          69.0 ms         30.5 ms          16.2×                 2.3×
Stress-test (n=60, 24 rds)     249.6 ms         83.3 ms           3.8×                 3.0×
Large (n=120)                  422.9 ms        221.2 ms          29.5×                 1.9×
XLarge (n=180)                1033.1 ms        541.0 ms          51.1×                 1.9×

The 29–51× solve-only speedup on large instances collapses to ~1.9× end-to-end because precompute (531ms for xlarge) dominates both CPU and MPS total time equally. The GPU advantage is real only when masks are built once and reused across many calls — for example, solving many instances with the same constraint structure, or repeated propagation within a search.

07 What Failed and Why

Every failed experiment taught something. These are the paths that were tried, measured, and rejected.

① Per-Value Residual Checks in PyTorch — Too Branchy for MPS

The first attempt implemented arc consistency the natural way in PyTorch: for each value, check the cached support position, then scan the neighbour domain. This involved Python-level loops and per-value conditionals. MPS does not handle this kind of branchy, value-by-value logic efficiently — it essentially serialised the computation. Result: barely faster than CPU (227ms vs 218ms) on the hard case, and slower on easy cases.

Lesson: MPS needs dense uniform tensor operations, not per-value conditional logic.

② In-Place MPS Schedule (First Attempt)

The first in-place attempt wrote domain deletions directly into the single buffer during the sweep — no snapshot. This made early deletions immediately visible to later checks in the same round. The behavior genuinely differed from two-buffer semantics: round count shifted from 24 to 14 on the hard case, and the final domains diverged from the CPU reference.

Lesson: "Fewer rounds" is not the acceptance bar. matches_cpu=True on the hard case is.

The v3 In-Place Variant Is a Different Thing

The v3 in-place solver uses index_select to gather neighbor domains before the support check. This creates a snapshot — neighbor domains are captured at round start, before any writes land. So v3 in-place and v3 two-buffer see exactly the same intermediate states and produce exactly the same round count. The only difference is that in-place skips the end-of-round copy_ and buffer swap. Both variants agree with the CPU reference on the hard case.

③ GPU-Resident Arc Flag Updates

Arc flags live on CPU and get updated with NumPy after each round (2 CPU-GPU syncs). An experiment moved flag updates to the GPU using index_fill_ calls. This eliminated CPU computation but required many small GPU dispatches. Result: 111ms vs 57ms on the hard case — twice as slow.

Lesson: Apple MPS has no grid.sync(). Small GPU dispatches are expensive. CPU-managed flags with 2 syncs per round is cheaper than many GPU dispatches.

④ Sparse Neighbor Gathering

Instead of passing the full neighbor domain (all d values including dead ones), gather only the alive neighbor values. This reduces the size of the support check. But building the gather index tensors cost more than the saving from reduced computation. The dense approach (mask dead values in the support check) was faster.

Lesson: Index-gather overhead on MPS is not free. Dense masked operations often beat sparse ones.

⑤ Vectorised Neighbor-Load Estimation

The bucketed approach estimated "neighbor load" (total alive values across all neighbors) with a matrix multiply, then grouped variables into similar-cost batches. The estimation overhead was not worth it: the groups were not uniform enough to help, and computing the estimate was itself measurable.

Lesson: Fancy load balancing helps only if batches are otherwise very unbalanced. The single-batch approach sidestepped this entirely.
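The shape of the estimate itself is a one-liner; this hedged NumPy sketch (adjacency matrix and counts are made up) shows the matmul that replaced per-variable loops:

```python
import numpy as np

# Adjacency (n x n) times the per-variable alive-value counts gives each
# variable's total alive neighbour values in one matmul.
adj = np.array([[0, 1, 1],
                [1, 0, 0],
                [1, 0, 0]], dtype=np.int64)
alive_counts = np.array([4, 2, 3])       # |alive values| per variable

neighbour_load = adj @ alive_counts      # load estimate per variable
```

The estimate was cheap to express but, as noted above, the resulting buckets were not uniform enough to repay even this cost.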

⑥ Degree Bucketing (Multi-Batch)

The original MPS approach grouped variables by degree (number of neighbors) and processed each degree level separately, creating 13–19 batches per round. Each batch launched its own chain of MPS kernels. Kernel launch overhead dominated. The single padded batch replaced all of them.

Lesson: One large regular batch beats many small irregular ones, even with wasted computation on padding positions.
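The padding step that makes a single regular batch possible looks roughly like this (NumPy sketch; the dummy-index convention and ragged neighbour lists are illustrative):

```python
import numpy as np

# Pad every variable's neighbour list to the max degree with a dummy slot,
# so one regular (n x max_deg) batch replaces per-degree buckets.
neighbours = [[1, 2, 3], [0], [0, 3]]          # ragged: degrees 3, 1, 2
max_deg = max(len(ns) for ns in neighbours)
DUMMY = -1                                      # padding index (assumed convention)

padded = np.full((len(neighbours), max_deg), DUMMY)
valid = np.zeros_like(padded, dtype=bool)
for i, ns in enumerate(neighbours):
    padded[i, :len(ns)] = ns
    valid[i, :len(ns)] = True

# Downstream, `valid` masks the padding positions out of the all-reduction,
# trading wasted lanes for a single uniform kernel chain.
```

The padding is built once per problem, so its cost amortises to zero across rounds, unlike the per-round batching of the degree-bucket scheme.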

08 The Apple GPU Limitation

The most fundamental difference between NVIDIA CUDA and Apple MPS for this algorithm is that one can keep the entire propagation loop on the GPU and the other cannot.

The Constraint: Every Round Returns to the CPU

On NVIDIA, the entire propagation loop — all rounds until convergence — runs inside a single GPU invocation. The CPU hands off the problem and waits. On Apple MPS, this is not possible. After every round the GPU hands results back to the CPU, the CPU checks convergence and schedules the next round, then the GPU starts again. (The CUDA implementation detail is covered in section 05.)

CPU-GPU Sync Pattern

What Happens Each Round

Because MPS must return to the CPU between rounds, the CPU also handles arc-flag bookkeeping — which turns out to be cheaper than dispatching extra GPU operations for it:

1. CPU checks arc flags: which variables have at least one active arc? A fast NumPy operation.
2. GPU runs the dense support check: one operation chain covering all variables (gather → & → any → all → write).
3. GPU → CPU, 2 handoffs: read whether any domain went empty, and which variables changed.
4. CPU updates arc flags: turn off processed arcs, wake the reverse arcs of changed variables.
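The round structure above can be sketched end to end. This is a serial NumPy stand-in for the real thing (function and variable names are illustrative, and the inner loops stand in for the single dense GPU dispatch):

```python
import numpy as np

def propagate(domains, supports, adj, max_rounds=100):
    """Schematic MPS-style loop: CPU-managed arc flags, 2 'handoffs' per round.

    domains:  (n x d) bool, alive-value masks
    supports: dict (x, y) -> (d x d) bool support table
    adj:      (n x n) bool constraint graph
    """
    # arc_active[x, y]: variable x must be re-checked against neighbour y.
    arc_active = adj.copy()
    for _ in range(max_rounds):
        todo = np.flatnonzero(arc_active.any(axis=1))    # CPU: flag check
        if todo.size == 0:
            return True                                  # converged
        old = domains.copy()
        for x in todo:                                   # "GPU": dense check
            for y in np.flatnonzero(arc_active[x]):
                domains[x] &= (supports[(x, y)] & domains[y][None, :]).any(axis=1)
        empty = ~domains.any(axis=1)                     # handoff 1
        changed = (domains != old).any(axis=1)           # handoff 2
        if empty.any():
            return False                                 # wipeout: inconsistent
        arc_active[todo, :] = False                      # CPU: retire arcs
        arc_active[:, changed] |= adj[:, changed]        # wake reverse arcs
    return True

# Tiny demo: equality constraint between two variables.
doms = np.array([[True, False], [True, True]])
eye2 = np.eye(2, dtype=bool)
ok = propagate(doms, {(0, 1): eye2, (1, 0): eye2},
               np.array([[False, True], [True, False]]))
```

Everything between the two comments marked "handoff" is what would stay device-side; the flag bookkeeping at top and bottom is the per-round CPU work.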

What Would Change Everything

If Apple added a grid.sync() equivalent to Metal, the propagation loop could stay GPU-resident end to end: arc flags updated in-kernel, convergence checked on-device, and the per-round CPU round-trips eliminated.

Until then, the best strategy is to minimize round-trips and maximize work per GPU dispatch.

The Fundamental Trade-off

For arc consistency on a single CSP instance, the hard case (24 rounds) is the real challenge. MPS achieves 3.8x speedup by doing heavy tensor work each round. CUDA achieves more by never leaving the GPU at all. The GPU-resident cooperative kernel pattern is simply not available on Apple Silicon today.

09 What Comes Next

The MPS implementation is working and the design is understood. The natural next step is to validate the same approach on hardware that removes the core constraint.

Try on CUDA

The entire MPS implementation was built under one constraint: no grid.sync(). Every design decision — single-batch, CPU-managed arc flags, 2 syncs per round — is a response to that limitation. CUDA removes it.

The open question is whether the MPS-style dense single-batch approach outperforms the CUDA cooperative kernel on actual NVIDIA hardware, or whether keeping the loop GPU-resident dominates once the architectural ceiling is gone. The two approaches have never been benchmarked head-to-head on the same machine.

Current state (MPS)
  • 2 CPU-GPU syncs per round
  • CPU manages arc flags each round
  • 48 sync points on a 24-round instance
  • Dense single-batch tensor ops
  • Loop cannot stay on GPU
What CUDA unlocks
  • 2 CPU-GPU syncs total via grid.sync()
  • Arc flags live on GPU, updated in-kernel
  • Loop never leaves the GPU
  • Convergence check happens inside the kernel
  • Direct test of whether the design holds
Adaptive CPU–GPU Switching Per Search Node

Arc consistency is rarely used in isolation — it sits inside a backtracking search, called at every node after a variable assignment. Near the root of the search tree, domains are large and the GPU earns its keep. But each assignment shrinks domains further. By the time you are 20 decisions deep, a variable might have 2 or 3 values left, and launching a GPU kernel to check 3 values costs more than just doing it on the CPU.

A natural question is whether the solver should track average domain size at each node and switch between GPU and CPU arc consistency dynamically — GPU when domains are dense, CPU when they are not. The crossover point is roughly when alive values per variable drops below the warp size (32), at which point threads are no longer filling warps. The complication is transfer cost: if domain state lives on the GPU, switching to CPU means copying it back. Whether that cost is worth paying depends on how many rounds remain and how quickly domains are shrinking.
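One plausible shape for that switch, as a hedged sketch (the function, the cost inputs, and the exact crossover rule are hypothetical, not the project's code):

```python
# Illustrative crossover rule: route propagation to the GPU only while the
# average alive domain size keeps warps busy, and only if the estimated gain
# covers the cost of moving domain state between devices.
WARP_SIZE = 32

def choose_backend(alive_per_var, transfer_cost_ms=0.0, est_gpu_gain_ms=0.0):
    """alive_per_var: alive-value counts at the current search node."""
    avg_alive = sum(alive_per_var) / len(alive_per_var)
    if avg_alive < WARP_SIZE:
        return "cpu"            # threads no longer fill warps
    # Staying on GPU is only worth it if the per-node gain beats the copy cost.
    return "gpu" if est_gpu_gain_ms >= transfer_cost_ms else "cpu"

backend = choose_backend([64, 48, 80], transfer_cost_ms=0.5, est_gpu_gain_ms=2.0)
```

In a real solver the two cost estimates would come from running averages over recent nodes, since both shrink as the search deepens.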

10 References

The arc consistency literature spans five decades. These are the papers that shaped how the algorithm is understood today, plus the GPU programming resources that informed this project.

Arc Consistency Algorithms
Mackworth, A.K. (1977)
Consistency in Networks of Relations. Artificial Intelligence, 8(1), 99–118.
AC-1 · AC-2 · AC-3 — the founding paper. Introduced the arc consistency concept and the first three algorithms.
Mohr, R. & Henderson, T.C. (1986)
Arc and Path Consistency Revisited. Artificial Intelligence, 28(2), 225–233.
AC-4 — optimal worst-case complexity O(ed²); processes supports rather than arcs.
Bessière, C. (1994)
Arc-Consistency and Arc-Consistency Again. Artificial Intelligence, 65(1), 179–190.
AC-6 — reduces redundant support checks; maintains a single support witness per value.
Zhang, Y. & Yap, R.H.C. (2001)
Making AC-3 an Optimal Algorithm. IJCAI 2001, 316–321.
AC-3.1 — patches AC-3 to achieve optimal O(ed²) by avoiding redundant constraint re-checks.
Bessière, C., Régin, J.C., Yap, R.H.C., & Zhang, Y. (2005)
An Optimal Coarse-Grained Arc Consistency Algorithm. Artificial Intelligence, 165(2), 165–185.
AC-2001 — practically fast and theoretically optimal; widely used as a baseline in modern solvers.
Lecoutre, C. (2009)
Constraint Networks: Techniques and Algorithms. ISTE / Wiley.
Comprehensive textbook covering AC variants, complexity analysis, and solver design in depth.
GPU Architecture & Programming
NVIDIA Corporation (2023)
CUDA C++ Programming Guide. NVIDIA Developer Documentation.
Warps, memory coalescing, shared memory, and cooperative groups — the reference for everything GPU-side in this project.
Apple Inc. (2023)
Metal Performance Shaders Framework. Apple Developer Documentation.
MPS tensor ops and the absence of grid-wide synchronisation — the constraint that shaped every design decision here.
Owens, J.D. et al. (2008)
GPU Computing. Proceedings of the IEEE, 96(5), 879–899.
Foundational survey of the GPU programming model — useful context for understanding where constraint propagation fits.