Arc consistency is a simple rule: every value in a variable's domain must have at least one valid partner in each neighbour's domain. This is an exploration of what it actually takes to run that rule on a GPU — the design choices, the trade-offs, and the challenges that arise when translating a fundamentally sequential algorithm onto parallel hardware. It is not a rigorous performance benchmark.
The scope is restricted to extensional binary constraints — constraints defined by explicit support tables rather than arithmetic expressions. This makes support mask precomputation straightforward and keeps the GPU data structures simple. Intensional or global constraints are out of scope.
The starting point is a basic arc consistency implementation, not the most efficient algorithm in the literature. Optimising the algorithm itself and comparing against state-of-the-art CPU solvers was not the goal. The purpose was to understand the shape of the problem: what has to change, what the GPU fights against, and where it can win.
The results do suggest that for sufficiently large problems, GPU acceleration can pay off — but that finding is a by-product of the exploration, not its conclusion.
This is early-stage work. The implementations are functional and produce correct results on tested cases, but have not been audited for edge cases or hardened for production use. There is likely room for bugs.
Before searching for a solution, we can eliminate huge numbers of impossible values. Arc consistency does exactly this — and doing it fast matters enormously.
A Constraint Satisfaction Problem (CSP) has three ingredients: variables (things we need to assign), domains (the possible values for each variable), and constraints (rules that pairs of variables must satisfy).
Three variables: X, Y, Z. Constraint: X < Y < Z. Domains: X = {1,2,3,4}, Y = {1,2,3,4}, Z = {1,2,3,4}.
Before we search for an assignment, can we already tell some values are impossible?
A value is impossible if some neighbour offers it no compatible partner. The rule: keep X = x only if, for every constraint between X and a neighbour Y, at least one value y still in Y's domain satisfies the constraint together with x.
Click Next Step to walk through pruning X, Y, Z with the constraint X < Y < Z. Watch dead values (❌) get eliminated as we scan neighbors.
In real CSPs with 60 variables and 20 values each, a brute-force search over all assignments would check up to 20^60 (roughly 10^78) combinations. Arc consistency can eliminate enough values that search becomes tractable — or it can immediately detect that no solution exists (UNSAT).
The solver doesn't find a solution directly. It just keeps deleting things that cannot possibly be in a solution, until either nothing more can be deleted (done!) or some variable runs out of values (UNSAT — no solution exists).
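This deletion loop can be sketched in a few lines of Python — an AC-1-style fixpoint for the X &lt; Y &lt; Z example, not the optimised version discussed later. The `prune` helper and the constraint dictionary are hypothetical names for illustration:

```python
def prune(domains, constraints):
    """AC-1-style fixpoint: delete unsupported values, repeat until stable."""
    changed = True
    while changed:
        changed = False
        for (a, b), ok in constraints.items():
            for x in list(domains[a]):
                # x survives only if some alive y in b's domain is compatible
                if not any(ok(x, y) for y in domains[b]):
                    domains[a].discard(x)
                    changed = True
    return domains

doms = {v: {1, 2, 3, 4} for v in "XYZ"}
cons = {
    ("X", "Y"): lambda x, y: x < y, ("Y", "X"): lambda y, x: y > x,
    ("Y", "Z"): lambda y, z: y < z, ("Z", "Y"): lambda z, y: z > y,
}
print(prune(doms, cons))  # X keeps {1, 2}, Y keeps {2, 3}, Z keeps {3, 4}
```

The fixpoint deletes X = 3, 4 (no larger Y survives), Y = 1, 4 (no smaller X / larger Z), and Z = 1, 2 — exactly the ❌ values from the walkthrough.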
Before considering GPUs, let's understand the data structures for arc consistency.
Instead of storing lists, we use a boolean matrix: one row per variable, one column per value. true = alive, false = dead.
| | v=1 | v=2 | v=3 | v=4 |
|---|---|---|---|---|
| X | T | T | T | T |
| Y | T | T | T | T |
| Z | T | T | T | T |
After convergence, the matrix will show only the values that survive. This matrix is the core data structure uploaded to the GPU.
Asking "does X=2 have a compatible partner in Y?" requires checking every Y value against the constraint. We'd have to call the constraint function thousands of times. Instead, we precompute a lookup table — the support mask.
Constraint functions are Python lambdas — they cannot run on the GPU. We must evaluate them on the CPU first and bake the results into boolean arrays that the GPU can read.
| | Y=1 | Y=2 | Y=3 | Y=4 |
|---|---|---|---|---|
| X=1 | F | T | T | T |
| X=2 | F | F | T | T |
| X=3 | F | F | F | T |
| X=4 | F | F | F | F |
Read row X=3: it is compatible with Y=4 only. If Y=4 is later deleted, X=3 loses its last support and must also be deleted.
The mask for arc Y → X is just the transpose of the X → Y mask. This can halve the precomputation time.
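As a sketch, here is the mask build for the running X &lt; Y example, with the transpose reuse (variable names are illustrative, not the project's actual API):

```python
import numpy as np

d = 4
constraint = lambda x, y: x < y  # the X < Y constraint from the table above

# Evaluate the Python lambda once per value pair, on the CPU
mask_xy = np.array([[constraint(x, y) for y in range(1, d + 1)]
                    for x in range(1, d + 1)], dtype=bool)

# The Y→X mask is just the transpose — no second pass over the lambda
mask_yx = mask_xy.T

print(mask_xy.astype(int))  # row for X=3 (index 2) is [0, 0, 0, 1]
```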
For each variable, we need to know its neighbors fast. We store this in CSR (Compressed Sparse Row) format — three flat arrays that are cache-friendly and GPU-uploadable. CSR was formalised in the Yale Sparse Matrix Package (Eisenstat et al., 1977) and has since become the standard in sparse linear algebra libraries from SciPy to NVIDIA's cuSPARSE.
| Array | Values | Meaning |
|---|---|---|
| arc_offsets | [0, 1, 3] | X starts at index 0, Y at 1, Z at 3 |
| arc_counts | [1, 2, 1] | X has 1 arc, Y has 2, Z has 1 |
| arc_list | [arc0,Y], [arc1,X], [arc2,Z], [arc3,Y] | X→Y, Y→X, Y→Z, Z→Y |
Each entry in arc_list is [arc_id, neighbor_variable]. The arc_id indexes into the support mask array. This flat layout avoids pointer chasing — critical for GPU performance.
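A minimal sketch of the CSR lookup using the arrays from the table above (the tuple encoding of `arc_list` entries is an assumption for illustration):

```python
arc_offsets = [0, 1, 3]   # start index of each variable's arc run
arc_counts  = [1, 2, 1]   # number of arcs per variable
arc_list    = [(0, "Y"), (1, "X"), (2, "Z"), (3, "Y")]  # (arc_id, neighbour)

def arcs_of(var_index):
    """All outgoing arcs of a variable: one contiguous slice, no pointer chasing."""
    start = arc_offsets[var_index]
    return arc_list[start:start + arc_counts[var_index]]

print(arcs_of(1))  # → [(1, 'X'), (2, 'Z')] — Y's two arcs
```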
Instead of tracking "which variables need to be rechecked," v3 tracks which arcs need processing. This is finer-grained — we can skip an arc that hasn't been affected by any deletion.
We keep two copies of the domain matrix — one to read from (domains_cur), one to write to (domains_nxt). This ensures that deletions made during a round don't affect other variables' checks in the same round.
# Two-buffer round — read from cur, write to nxt
for each active variable X:
    for each alive value x in X:
        survived = True
        for each arc (X → Y):
            # Check: does Y still have any value compatible with x?
            has_support = any(support_mask[X→Y][x, y] and domains_cur[Y, y]
                              for y in range(domain_size))
            if not has_support:
                survived = False
                break
        domains_nxt[X, x] = survived
# After all variables: swap buffers, detect changes, update arc flags
domains_cur, domains_nxt = domains_nxt, domains_cur
If we modified the domain matrix in place, an early deletion could affect later checks in the same pass — changing which values survive. Two buffers guarantee all variables see the same frozen snapshot within one round. This is essential for determinism on parallel hardware.
One deletion can trigger more deletions, which trigger more. The solver keeps running rounds until nothing changes. Here's why multiple rounds are needed:
The key observation: a deletion in one variable can cascade to force deletions in its neighbors, which cascade further. On a benchmark instance (60 variables, 20 values), this takes 24 rounds before proving UNSAT.
Not everything can or should run on the GPU. This diagram shows the full algorithm from setup to result, with each step coloured by where it executes. Toggle between the CUDA and MPS variants to see how the split changes. Note: the CUDA variant shown here was not tested on hardware.
With 24 rounds on the hard case, that is 48 total syncs. Each sync flushes the GPU pipeline; the round-trip overhead is the primary bottleneck for multi-round workloads.
The CPU version is clean, natural code. Moving to a GPU requires rethinking almost everything. Here are the seven non-obvious lessons learned from this project.
On CPU, arc consistency checks values sequentially — value x=0, then x=1, then x=2. For each value, it scans neighbour values y=0, y=1... stopping as soon as support is found. One value at a time, one neighbour at a time.
On GPU the computation is fundamentally different. In MPS, a single tensor operation simultaneously determines whether support exists for every value of every variable. The expression support_mask & nbr_domains followed by any(dim=3) covers the entire domain in one shot — potentially thousands of (variable, value) pairs checked in parallel. There is no x=0, then x=1. There is no y=0, then y=1. Whether support is found at y=0 or y=24 makes no difference to cost — both are part of the same operation.
In CUDA the model is different: the GPU assigns one thread to each value — thread 0 is responsible for x=0, thread 1 for x=1, and so on. All threads run simultaneously, so support is checked for all values in parallel. The neighbour scan is further parallelised using __ballot_sync: a group of 32 threads (called a warp) simultaneously check 32 Y values and vote — if any thread's Y value provides support, the whole group knows immediately.
This means the CPU implementation cannot be directly ported to GPU — it has to be rewritten. The natural CPU structure (loop over values, check each one, prune if no support) doesn't translate. The algorithm must be reformulated as a tensor operation that computes support for all values simultaneously. Same result, fundamentally different structure.
support_mask & nbr_domains → any(dim=3)
Constraints like lambda x, y: x < y are Python functions. GPUs cannot execute Python. Every constraint must be evaluated on the CPU first and stored in a boolean support mask. The GPU only reads from this precomputed table — it never evaluates constraints itself.
# CPU: evaluate all pairs once, store result as boolean matrix
mask = np.zeros((domain_size, domain_size), dtype=bool)
for vi in range(domain_size):
    for vj in range(domain_size):
        mask[vi, vj] = constraint(vi, vj)  # e.g. lambda x,y: x < y
# Upload mask to GPU. From here, constraint is never called again.
support_t = torch.tensor(mask, dtype=torch.bool, device='mps')
On CPU, caching the last known supporting value is powerful: if the cached value is still alive, you skip the full scan — O(1) instead of O(d).
On GPU this benefit evaporates — for two different reasons depending on backend. In CUDA, the 32 threads in a warp always advance together in lockstep (they cannot finish independently). Even if one thread hits the cache and could stop early, the entire group waits for the slowest. The round takes the same time whether the cache hits or not. In MPS, there is no per-value sequential scan at all — the support check is a single tensor operation over the entire domain. There is no position to cache and no scan to exit early from; whether support is at y=0 or y=24 makes no difference to cost.
The consequence is the same in both cases: optimisations that exploit sequential order — caching the last support position, early exit when support is found — do not carry over to GPU.
# support_batch: (n_active, max_degree, d, d) — support masks for active vars
# nbr_domains:   (n_active, max_degree, d)    — alive values in each neighbor

# "Is (vi, vj) a valid pair AND is vj alive?"
candidate = support_batch & nbr_domains[:, :, None, :]
#   ↑ shape: (n_active, max_degree, d, d)

# "Does any vj survive?" → True if ANY column in a vi-row is True
support_exists = torch.any(candidate, dim=3)
#   ↑ shape: (n_active, max_degree, d)

# "Did vi survive?" → alive AND all neighbors have support
survived = domain_batch & torch.all(
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None],
    dim=1,
)
No Python loops over values. No branching. One rectangular tensor operation that the MPS scheduler can dispatch as a single GPU kernel. The padding positions are masked by ~batch_valid so they don't affect the result.
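The same chain can be mirrored in NumPy on the CPU to sanity-check the shape logic — a toy sketch with random data, not the actual MPS path:

```python
import numpy as np

n_active, max_degree, d = 2, 2, 3
rng = np.random.default_rng(0)

# Toy batch: random support masks and neighbour domains (illustrative data)
support_batch = rng.random((n_active, max_degree, d, d)) < 0.5
nbr_domains   = rng.random((n_active, max_degree, d)) < 0.8
domain_batch  = np.ones((n_active, d), dtype=bool)
batch_valid   = np.array([[True, True],
                          [True, False]])   # second variable has one padded slot

# Same chain as above: & → any(axis=3) → all(axis=1)
candidate = support_batch & nbr_domains[:, :, None, :]
support_exists = candidate.any(axis=3)
survived = domain_batch & (
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None]
).all(axis=1)

print(survived.shape)  # (2, 3) — one survival flag per (variable, value)
```

Because padded slots are forced True by `~batch_valid`, garbage in their support masks can never prune a value.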
Variables have different numbers of neighbors — this is called their degree. To process all variables together in a single GPU operation, every row of the tensor must be the same width. We pad every variable's arc list to max_degree slots using dummy entries and a validity mask. Padding slots are masked out and never affect results — but their presence is what makes the tensor rectangular so one operation covers everyone.
# Padded arc structures — filled for all variables up to max_degree
arc_idx_padded[i, k] = arc_id # real arc index, or 0 for padding
neighbor_idx_padded[i, k] = neighbor_var # real neighbor, or 0 for padding
valid_arc_mask[i, k] = True # False for padding positions
# Variables with fewer arcs just have trailing False entries in valid_arc_mask
| variable | degree | arc slot 0 | arc slot 1 | arc slot 2 | arc slot 3 | arc slot 4 |
|---|---|---|---|---|---|---|
| X (deg 2) | 2 | X→Y ✓ | X→Z ✓ | pad ✗ | pad ✗ | pad ✗ |
| Y (deg 5) | 5 | Y→X ✓ | Y→Z ✓ | Y→A ✓ | Y→B ✓ | Y→C ✓ |
| Z (deg 3) | 3 | Z→X ✓ | Z→Y ✓ | Z→W ✓ | pad ✗ | pad ✗ |
All three variables sit in the same rectangular tensor with width 5 (= max_degree). Green slots are real arcs included in the support check. Red slots are padding — their valid_arc_mask entry is False, so the ~batch_valid term in the final all() call treats them as "already satisfied" and they never prune a value.
AC-3 has a well-known refinement: when variable X is revised because of arc (Y→X), there's no need to recheck arc (X→Y) — Y just constrained X, so X can't have gained new values that would break Y. On CPU this saves real work.
On GPU this doesn't apply. All arcs in a round are processed simultaneously — the GPU has no record of why a variable changed or which arc caused it. Every changed variable wakes all its reverse arcs, conservatively. Tracking that provenance in parallel would require every thread to record what it changed and why, and multiple threads writing to the same record simultaneously would corrupt each other's data. The cost outweighs the benefit.
This is the single most important constraint when targeting Apple Silicon. On NVIDIA GPUs, the entire propagation loop — all rounds until convergence — can run entirely on the GPU. The CPU hands off the problem and waits idle. On Apple MPS this is not possible. After every round, the GPU must hand results back to the CPU, the CPU checks whether any values were pruned and whether to continue, then sends the GPU into the next round. Every round, back and forth.
The consequence: MPS always pays the cost of 2 CPU-GPU handoffs per round, regardless of how the computation is structured. On a 24-round instance that is 48 handoffs vs NVIDIA's 2 total. This is an architectural ceiling — not an implementation choice that can be optimised away.
At the start of each round, the GPU reads the current domains and writes pruned results to a separate copy. At round end the copies swap. This means all threads in a round see the same snapshot — no thread sees another thread's prunings mid-round. The CPU reference implementation does the same. If the GPU used in-place updates (writing back to the same copy it's reading) while the CPU used two-buffer, they would see different intermediate states within each round and disagree on the final result — not because either is wrong, but because they are operating on a subtly different version of the algorithm.
The solver was developed iteratively. Here are the four main variants, what each one does, and how they perform.
Variables grouped by degree into separate batches — one GPU operation per degree bucket per round. Typically 13–19 batches per round. Between batches the GPU sits idle while the CPU sets up the next group.
Replaced by single-batch approach. Too many small dispatches per round.
Single rectangular batch per round. Padded to max_degree. Dense tensor ops: &, any(dim=3), all(dim=1). Arc flags on CPU. 2 CPU-GPU syncs per round.
Best on Apple MPS for multi-round (UNSAT) cases.
Single buffer. Writes (true→false only) are monotonic — no value ever comes back. Round count is identical to two-buffer because index_select snapshots neighbor domains before any writes, so within-round visibility is the same. The benefit is avoiding the end-of-round copy_ and buffer swap. Final domains at convergence are identical.
Why monotonic writes make races harmless: if two GPU threads race to write the same domain cell, both are writing False (a deletion). There is no winning or losing write — both threads reach the same outcome regardless of order. A cell can never be erroneously restored to True by a late write, because no thread ever writes True.
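A toy illustration of this monotonicity, with CPU threads standing in for GPU threads (Python's GIL means this is not a true data race, but the argument is the same — every racing write stores False, so order cannot matter):

```python
import threading

import numpy as np

domains = np.ones(1000, dtype=bool)
to_delete = np.arange(0, 1000, 3)  # the values propagation decides to prune

def worker(indices):
    # Every write is False — racing threads can only agree on the outcome
    for i in indices:
        domains[i] = False

# Four threads all race on the SAME cells
threads = [threading.Thread(target=worker, args=(to_delete,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No deletion is lost, and no cell is ever restored to True
print(domains[to_delete].any())  # False
```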
Faster on SAT cases. Mixed results on UNSAT.
Attempted to move arc flag updates entirely to GPU using index_fill_ calls. Eliminated CPU round-trips. But: many small GPU dispatches + no MPS grid.sync() = slower than CPU-managed flags.
111ms vs 57ms on hard case. Reverted.
My machine runs Apple Silicon (M4-class). CUDA is unavailable here — the cooperative kernel path exists in gpu_ac_v2.py and gpu_ac_v3.py but has not been benchmarked or validated in my environment. The description below is based on the implementation and design intent, not measured results.
The CUDA kernel runs entirely on the GPU. All rounds happen inside a single kernel launch. grid.sync() acts as a barrier between phases — every thread waits until all threads globally are done before moving to the next phase.
// CUDA kernel — one block per variable, one warp per value
for (round = 0; round < max_rounds; round++) {
    // PHASE 1: Revise domains (all blocks in parallel)
    for (var_i = blockIdx.x; var_i < n_vars; var_i += gridDim.x) {
        // threadIdx.y = value vi, threadIdx.x = vj scan lane (0..31)
        for each arc (var_i → var_j):
            // Cache check: lane 0 checks residual, broadcasts to warp
            cache_hit = __shfl_sync(0xffffffff, check_residual(), 0);
            if (cache_hit) continue;
            // Vectorised scan: ALL 32 lanes check 32 vj positions at once
            ballot = __ballot_sync(0xffffffff, alive[vj] && supports[vi,vj]);
            // ballot is a 32-bit mask — any bit set = support found
    }
    grid.sync();  // wait for ALL threads globally
    // PHASE 2: Update arc flags
    // PHASE 3: Termination check
    grid.sync();
    if (result[0] != 0) return;  // done!
}
An earlier "bucketed" approach grouped variables by degree, creating 13–19 separate batches per round. Each batch launched its own chain of MPS kernels. The dispatch overhead was enormous.
The current approach: one batch, one chain, every round. Variables with fewer arcs are padded to max_degree — the wasted computation on padding positions is negligible compared to the reduced dispatch overhead.
Measured on Apple Silicon M4, PyTorch 2.11.0, MPS backend. 5 runs after 2 warmup runs. Times are averages in milliseconds.
The "hard" case is Model B with n=60 variables, d=20 values, density=0.35, tightness=0.75, seed=0. It takes 24 propagation rounds before proving UNSAT. This is the real stress test — the solver that wins here is genuinely faster, not just faster at launching kernels.
(Benchmark table — columns: Instance, Precompute (shared, one-off), MPS single-batch, MPS in-place, GPU-resident, Bucketed MPS baseline. The per-instance timings are summarised below.)
Tiny instances: CPU wins. MPS launch overhead (~1–3ms) dominates when the actual work is <0.1ms.
Medium instances (1 round): MPS v3 achieves 14–40x speedup. Single-batch approach shines.
Hard instance (24 rounds, UNSAT): MPS v3 achieves 3.8x speedup and is 4x faster than the bucketed approach.
GPU-resident arc flags: Slower on the hard case (111ms vs 57ms). Apple MPS has no grid.sync() — many small dispatches add up.
Precompute cost: The support mask build (CPU, one-off) grows fast — 32ms for the hard case, 531ms for xlarge. For single-propagation use, this dominates total time regardless of which backend runs propagation. The GPU advantage is most meaningful when the same precomputed masks are reused across many propagation calls.
Based on the results above, the GPU tends to help when at least two of these are true: the instance is large (many variables or large domains), propagation runs for many rounds, and the precomputed support masks are reused across calls.
The benchmarked instances are small by constraint-programming standards. In practice, a solver embedded in a branch-and-bound search would call propagation thousands of times per run — and reuse the same precomputed masks each time. That's the use case where the GPU advantage compounds.
The table above shows solve-only times. Adding precompute (which is shared and one-off, but still costs time on first call) changes the picture considerably for large instances:
| Instance | CPU total | Best MPS total | Solve-only speedup | End-to-end speedup |
|---|---|---|---|---|
| XCSP3 (3 vars) | 0.04 ms | 1.57 ms | 0.02× | 0.03× |
| Easy (n=40) | 12.3 ms | 4.2 ms | 9.1× | 2.9× |
| Medium (n=60, 1 round) | 69.0 ms | 30.5 ms | 16.2× | 2.3× |
| Stress-test (n=60, 24 rds) | 249.6 ms | 83.3 ms | 3.8× | 3.0× |
| Large (n=120) | 422.9 ms | 221.2 ms | 29.5× | 1.9× |
| XLarge (n=180) | 1033.1 ms | 541.0 ms | 51.1× | 1.9× |
The 29–51× solve-only speedup on large instances collapses to ~1.9× end-to-end because precompute (531ms for xlarge) dominates both CPU and MPS total time equally. The GPU advantage is real only when masks are built once and reused across many calls — for example, solving many instances with the same constraint structure, or repeated propagation within a search.
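The amortisation argument can be made concrete. A sketch using the xlarge row — the solve-only times below are derived from the table (1033.1 − 531 ≈ 502 ms CPU; 502 / 51.1 ≈ 9.8 ms GPU) and are approximate:

```python
def end_to_end_speedup(calls, precompute=531.0, cpu_solve=502.1, gpu_solve=9.8):
    """CPU/GPU end-to-end time ratio after `calls` propagations reuse the masks.

    Both backends pay the one-off mask build; only the per-call solve differs.
    """
    return (precompute + calls * cpu_solve) / (precompute + calls * gpu_solve)

for k in (1, 10, 100, 1000):
    # Speedup climbs from ~1.9× at one call toward the 51× solve-only ceiling
    print(k, round(end_to_end_speedup(k), 1))
```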
Every failed experiment taught something. These are the paths that were tried, measured, and rejected.
The first attempt implemented arc consistency the natural way in PyTorch: for each value, check the cached support position, then scan the neighbour domain. This involved Python-level loops and per-value conditionals. MPS does not handle this kind of branchy, value-by-value logic efficiently — it essentially serialised the computation. Result: barely faster than CPU (227ms vs 218ms) on the hard case, and slower on easy cases.
Lesson: MPS needs dense uniform tensor operations, not per-value conditional logic.
The first in-place attempt wrote domain deletions directly into the single buffer during the sweep — no snapshot. This made early deletions immediately visible to later checks in the same round. The behavior genuinely differed from two-buffer semantics: round count shifted from 24 to 14 on the hard case, and the final domains diverged from the CPU reference.
Lesson: "Fewer rounds" is not the acceptance bar. matches_cpu=True on the hard case is.
The v3 in-place solver uses index_select to gather neighbor domains before the support check. This creates a snapshot — neighbor domains are captured at round start, before any writes land. So v3 in-place and v3 two-buffer see exactly the same intermediate states and produce exactly the same round count. The only difference is that in-place skips the end-of-round copy_ and buffer swap. Both variants agree with the CPU reference on the hard case.
Arc flags live on CPU and get updated with NumPy after each round (2 CPU-GPU syncs). An experiment moved flag updates to the GPU using index_fill_ calls. This eliminated CPU computation but required many small GPU dispatches. Result: 111ms vs 57ms on the hard case — twice as slow.
Lesson: Apple MPS has no grid.sync(). Small GPU dispatches are expensive. CPU-managed flags with 2 syncs per round is cheaper than many GPU dispatches.
Instead of passing the full neighbor domain (all d values including dead ones), gather only the alive neighbor values. This reduces the size of the support check. But building the gather index tensors cost more than the saving from reduced computation. The dense approach (mask dead values in the support check) was faster.
Lesson: Index-gather overhead on MPS is not free. Dense masked operations often beat sparse ones.
The bucketed approach estimated "neighbor load" (total alive values across all neighbors) using a matrix multiply to group variables into similar-cost batches. This overhead was not worth it — the groups weren't uniform enough to help, and the estimation cost was measurable.
Lesson: Fancy load balancing helps only if batches are otherwise very unbalanced. The single-batch approach sidestepped this entirely.
The original MPS approach grouped variables by degree (number of neighbors) and processed each degree level separately, creating 13–19 batches per round. Each batch launched its own chain of MPS kernels. Kernel launch overhead dominated. The single padded batch replaced all of them.
Lesson: One large regular batch beats many small irregular ones, even with wasted computation on padding positions.
The most fundamental difference between NVIDIA CUDA and Apple MPS for this algorithm: one can stay on the GPU forever, the other cannot.
On NVIDIA, the entire propagation loop — all rounds until convergence — runs inside a single GPU invocation. The CPU hands off the problem and waits. On Apple MPS, this is not possible. After every round the GPU hands results back to the CPU, the CPU checks convergence and schedules the next round, then the GPU starts again. (The CUDA implementation detail is covered in section 05.)
Because MPS must return to the CPU between rounds, the CPU also handles arc-flag bookkeeping — which turns out to be cheaper than dispatching extra GPU operations for it:
1. Which variables have at least one active arc? Fast NumPy operation.
2. One operation chain covering all variables: gather → & → any → all → write.
3. Read whether any domain went empty. Read which variables changed.
4. Turn off processed arcs, wake reverse arcs of changed variables.
If Apple added a grid.sync() equivalent to Metal, the entire propagation loop could stay resident on the GPU, as it does in CUDA. Until then, the best strategy is to minimise round-trips and maximise work per GPU dispatch.
For arc consistency on a single CSP instance, the hard case (24 rounds) is the real challenge. MPS achieves 3.8x speedup by doing heavy tensor work each round. CUDA achieves more by never leaving the GPU at all. The GPU-resident cooperative kernel pattern is simply not available on Apple Silicon today.
The MPS implementation is working and the design is understood. The natural next step is to validate the same approach on hardware that removes the core constraint.
The entire MPS implementation was built under one constraint: no grid.sync(). Every design decision — single-batch, CPU-managed arc flags, 2 syncs per round — is a response to that limitation. CUDA removes it.
The open question is whether the MPS-style dense single-batch approach outperforms the CUDA cooperative kernel on actual NVIDIA hardware, or whether keeping the loop GPU-resident dominates once the architectural ceiling is gone. The two approaches have never been benchmarked head-to-head on the same machine.
Arc consistency is rarely used in isolation — it sits inside a backtracking search, called at every node after a variable assignment. Near the root of the search tree, domains are large and the GPU earns its keep. But each assignment shrinks domains further. By the time you are 20 decisions deep, a variable might have 2 or 3 values left, and launching a GPU kernel to check 3 values costs more than just doing it on the CPU.
A natural question is whether the solver should track average domain size at each node and switch between GPU and CPU arc consistency dynamically — GPU when domains are dense, CPU when they are not. The crossover point is roughly when alive values per variable drops below the warp size (32), at which point threads are no longer filling warps. The complication is transfer cost: if domain state lives on the GPU, switching to CPU means copying it back. Whether that cost is worth paying depends on how many rounds remain and how quickly domains are shrinking.
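A minimal sketch of such a dispatcher, using the warp-size heuristic above (`choose_backend` is a hypothetical name; real code would also weigh the transfer cost of moving domain state off the GPU):

```python
WARP_SIZE = 32  # rough crossover point: below this, warps run half-empty

def choose_backend(domains):
    """Hypothetical dispatcher: GPU while domains are dense, CPU once they thin out.

    `domains` is the boolean matrix from earlier — one row of alive-flags per variable.
    """
    alive_per_var = [sum(row) for row in domains]
    avg_alive = sum(alive_per_var) / len(alive_per_var)
    return "gpu" if avg_alive >= WARP_SIZE else "cpu"

dense  = [[True] * 40 for _ in range(60)]               # near the root
sparse = [[True] * 3 + [False] * 37 for _ in range(60)]  # 20 decisions deep
print(choose_backend(dense), choose_backend(sparse))  # gpu cpu
```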
The arc consistency literature spans five decades. These are the papers that shaped how the algorithm is understood today, plus the GPU programming resources that informed this project.