Arc consistency is a simple rule: every value in a variable's domain must have at least one valid partner in each neighbour's domain. This is an exploration of what it actually takes to run that rule on a GPU — the design choices, the trade-offs, and the challenges that arise when translating a fundamentally sequential algorithm onto parallel hardware. It is not a rigorous performance benchmark.
The scope is restricted to extensional binary constraints — constraints defined by explicit support tables rather than arithmetic expressions. This makes support mask precomputation straightforward and keeps the GPU data structures simple. Intensional or global constraints are out of scope.
The starting point is a basic arc consistency implementation, not the most efficient algorithm in the literature. Optimising the algorithm itself and comparing against state-of-the-art CPU solvers was not the goal. The purpose was to understand the shape of the problem: what has to change, what the GPU fights against, and where it can win.
The results do suggest that for sufficiently large problems, GPU acceleration can pay off — but that finding is a by-product of the exploration, not its conclusion.
This is early-stage work. The implementations are functional and produce correct results on tested cases, but have not been audited for edge cases or hardened for production use. There is likely room for bugs.
Before searching for a solution, we can eliminate huge numbers of impossible values. Arc consistency does exactly this — and doing it fast matters enormously.
A Constraint Satisfaction Problem (CSP) has three ingredients: variables (things we need to assign), domains (the possible values for each variable), and constraints (rules that pairs of variables must satisfy).
Three variables: X, Y, Z. Constraint: X < Y < Z. Domains: X = {1,2,3,4}, Y = {1,2,3,4}, Z = {1,2,3,4}.
Before we search for an assignment, can we already tell some values are impossible?
A value is impossible if some neighbour offers it no compatible partner. The rule: keep X = x only if, for every constraint between X and a neighbour Y, at least one value y still in Y's domain satisfies the constraint together with x.
Click Next Step to walk through pruning X, Y, Z with the constraint X < Y < Z. Watch dead values (❌) get eliminated as we scan neighbors.
In real CSPs with 60 variables and 20 values each, a brute-force search over all assignments would check up to 20^60 (roughly 10^78) combinations. Arc consistency can eliminate enough values that search becomes tractable — or it can immediately detect that no solution exists (UNSAT).
The solver doesn't find a solution directly. It just keeps deleting things that cannot possibly be in a solution, until either nothing more can be deleted (done!) or some variable runs out of values (UNSAT — no solution exists).
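This deletion loop can be sketched in a few lines of Python — an AC-1-style fixpoint for the X &lt; Y &lt; Z example, not the optimised version discussed later. The `prune` helper and the constraint dictionary are hypothetical names for illustration:

```python
def prune(domains, constraints):
    """AC-1-style fixpoint: delete unsupported values, repeat until stable."""
    changed = True
    while changed:
        changed = False
        for (a, b), ok in constraints.items():
            for x in list(domains[a]):
                # x survives only if some alive y in b's domain is compatible
                if not any(ok(x, y) for y in domains[b]):
                    domains[a].discard(x)
                    changed = True
    return domains

doms = {v: {1, 2, 3, 4} for v in "XYZ"}
cons = {
    ("X", "Y"): lambda x, y: x < y, ("Y", "X"): lambda y, x: y > x,
    ("Y", "Z"): lambda y, z: y < z, ("Z", "Y"): lambda z, y: z > y,
}
print(prune(doms, cons))  # X keeps {1, 2}, Y keeps {2, 3}, Z keeps {3, 4}
```

The fixpoint deletes X = 3, 4 (no larger Y survives), Y = 1, 4 (no smaller X / larger Z), and Z = 1, 2 — exactly the ❌ values from the walkthrough.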
Before considering GPUs, let's understand the data structures for arc consistency.
Instead of storing lists, we use a boolean matrix: one row per variable, one column per value. true = alive, false = dead.
| | v=1 | v=2 | v=3 | v=4 |
|---|---|---|---|---|
| X | T | T | T | T |
| Y | T | T | T | T |
| Z | T | T | T | T |
After convergence, the matrix will show only the values that survive. This matrix is the core data structure uploaded to the GPU.
Asking "does X=2 have a compatible partner in Y?" requires checking every Y value against the constraint. We'd have to call the constraint function thousands of times. Instead, we precompute a lookup table — the support mask.
Constraint functions are Python lambdas — they cannot run on the GPU. We must evaluate them on the CPU first and bake the results into boolean arrays that the GPU can read.
| | Y=1 | Y=2 | Y=3 | Y=4 |
|---|---|---|---|---|
| X=1 | F | T | T | T |
| X=2 | F | F | T | T |
| X=3 | F | F | F | T |
| X=4 | F | F | F | F |
Read row X=3: it is compatible with Y=4 only. If Y=4 is later deleted, X=3 loses its last support and must also be deleted.
The mask for arc Y → X is just the transpose of the X → Y mask. This can halve the precomputation time.
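As a sketch, here is the mask build for the running X &lt; Y example, with the transpose reuse (variable names are illustrative, not the project's actual API):

```python
import numpy as np

d = 4
constraint = lambda x, y: x < y  # the X < Y constraint from the table above

# Evaluate the Python lambda once per value pair, on the CPU
mask_xy = np.array([[constraint(x, y) for y in range(1, d + 1)]
                    for x in range(1, d + 1)], dtype=bool)

# The Y→X mask is just the transpose — no second pass over the lambda
mask_yx = mask_xy.T

print(mask_xy.astype(int))  # row for X=3 (index 2) is [0, 0, 0, 1]
```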
For each variable, we need to know its neighbors fast. We store this in CSR (Compressed Sparse Row) format — three flat arrays that are cache-friendly and GPU-uploadable. CSR was formalised in the Yale Sparse Matrix Package (Eisenstat et al., 1977) and has since become the standard in sparse linear algebra libraries from SciPy to NVIDIA's cuSPARSE.
| Array | Values | Meaning |
|---|---|---|
| arc_offsets | [0, 1, 3] | X starts at index 0, Y at 1, Z at 3 |
| arc_counts | [1, 2, 1] | X has 1 arc, Y has 2, Z has 1 |
| arc_list | [arc0,Y], [arc1,X], [arc2,Z], [arc3,Y] | X→Y, Y→X, Y→Z, Z→Y |
Each entry in arc_list is [arc_id, neighbor_variable]. The arc_id indexes into the support mask array. This flat layout avoids pointer chasing — critical for GPU performance.
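A minimal sketch of the CSR lookup using the arrays from the table above (the tuple encoding of `arc_list` entries is an assumption for illustration):

```python
arc_offsets = [0, 1, 3]   # start index of each variable's arc run
arc_counts  = [1, 2, 1]   # number of arcs per variable
arc_list    = [(0, "Y"), (1, "X"), (2, "Z"), (3, "Y")]  # (arc_id, neighbour)

def arcs_of(var_index):
    """All outgoing arcs of a variable: one contiguous slice, no pointer chasing."""
    start = arc_offsets[var_index]
    return arc_list[start:start + arc_counts[var_index]]

print(arcs_of(1))  # → [(1, 'X'), (2, 'Z')] — Y's two arcs
```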
Instead of tracking "which variables need to be rechecked," v3 tracks which arcs need processing. This is finer-grained — we can skip an arc that hasn't been affected by any deletion.
We keep two copies of the domain matrix — one to read from (domains_cur), one to write to (domains_nxt). This ensures that deletions made during a round don't affect other variables' checks in the same round.
# Two-buffer round — read from cur, write to nxt
for each active variable X:
    for each alive value x in X:
        survived = True
        for each arc (X → Y):
            # Check: does Y still have any value compatible with x?
            has_support = any(support_mask[X→Y][x, y] and domains_cur[Y, y]
                              for y in range(domain_size))
            if not has_support:
                survived = False
                break
        domains_nxt[X, x] = survived
# After all variables: swap buffers, detect changes, update arc flags
domains_cur, domains_nxt = domains_nxt, domains_cur
If we modified the domain matrix in place, an early deletion could affect later checks in the same pass — changing which values survive. Two buffers guarantee all variables see the same frozen snapshot within one round. This is essential for determinism on parallel hardware.
One deletion can trigger more deletions, which trigger more. The solver keeps running rounds until nothing changes. Here's why multiple rounds are needed:
The key observation: a deletion in one variable can cascade to force deletions in its neighbors, which cascade further. On a benchmark instance (60 variables, 20 values), this takes 24 rounds before proving UNSAT.
Not everything can or should run on the GPU. This diagram shows the full algorithm from setup to result, with each step coloured by where it executes. Toggle between the CUDA and MPS variants to see how the split changes. Note: the CUDA variant shown here was not tested on hardware.
With 24 rounds on the hard case, that is 48 total syncs. Each sync flushes the GPU pipeline; the round-trip overhead is the primary bottleneck for multi-round workloads.
The CPU version is clean, natural code. Moving to a GPU requires rethinking almost everything. Here are the seven non-obvious lessons learned from this project.
On CPU, arc consistency checks values sequentially — value x=0, then x=1, then x=2. For each value, it scans neighbour values y=0, y=1... stopping as soon as support is found. One value at a time, one neighbour at a time.
On GPU the computation is fundamentally different. In MPS, a single tensor operation simultaneously determines whether support exists for every value of every variable. The expression support_mask & nbr_domains followed by any(dim=3) covers the entire domain in one shot — potentially thousands of (variable, value) pairs checked in parallel. There is no x=0, then x=1. There is no y=0, then y=1. Whether support is found at y=0 or y=24 makes no difference to cost — both are part of the same operation.
In CUDA the model is different: the GPU assigns one thread to each value — thread 0 is responsible for x=0, thread 1 for x=1, and so on. All threads run simultaneously, so support is checked for all values in parallel. The neighbour scan is further parallelised using __ballot_sync: a group of 32 threads (called a warp) simultaneously check 32 Y values and vote — if any thread's Y value provides support, the whole group knows immediately.
This means the CPU implementation cannot be directly ported to GPU — it has to be rewritten. The natural CPU structure (loop over values, check each one, prune if no support) doesn't translate. The algorithm must be reformulated as a tensor operation that computes support for all values simultaneously. Same result, fundamentally different structure.
support_mask & nbr_domains → any(dim=3)
Constraints like lambda x, y: x < y are Python functions. GPUs cannot execute Python. Every constraint must be evaluated on the CPU first and stored in a boolean support mask. The GPU only reads from this precomputed table — it never evaluates constraints itself.
# CPU: evaluate all pairs once, store result as boolean matrix
mask = np.zeros((domain_size, domain_size), dtype=bool)
for vi in range(domain_size):
    for vj in range(domain_size):
        mask[vi, vj] = constraint(vi, vj)  # e.g. lambda x,y: x < y
# Upload mask to GPU. From here, constraint is never called again.
support_t = torch.tensor(mask, dtype=torch.bool, device='mps')
On CPU, caching the last known supporting value is powerful: if the cached value is still alive, you skip the full scan — O(1) instead of O(d).
On GPU this benefit evaporates — for two different reasons depending on backend. In CUDA, the 32 threads in a warp always advance together in lockstep (they cannot finish independently). Even if one thread hits the cache and could stop early, the entire group waits for the slowest. The round takes the same time whether the cache hits or not. In MPS, there is no per-value sequential scan at all — the support check is a single tensor operation over the entire domain. There is no position to cache and no scan to exit early from; whether support is at y=0 or y=24 makes no difference to cost.
The consequence is the same in both cases: optimisations that exploit sequential order — caching the last support position, early exit when support is found — do not carry over to GPU.
# support_batch: (n_active, max_degree, d, d) — support masks for active vars
# nbr_domains:   (n_active, max_degree, d)    — alive values in each neighbor

# "Is (vi, vj) a valid pair AND is vj alive?"
candidate = support_batch & nbr_domains[:, :, None, :]
#   ↑ shape: (n_active, max_degree, d, d)

# "Does any vj survive?" → True if ANY column in a vi-row is True
support_exists = torch.any(candidate, dim=3)
#   ↑ shape: (n_active, max_degree, d)

# "Did vi survive?" → alive AND all neighbors have support
survived = domain_batch & torch.all(
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None],
    dim=1,
)
No Python loops over values. No branching. One rectangular tensor operation that the MPS scheduler can dispatch as a single GPU kernel. The padding positions are masked by ~batch_valid so they don't affect the result.
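The same chain can be mirrored in NumPy on the CPU to sanity-check the shape logic — a toy sketch with random data, not the actual MPS path:

```python
import numpy as np

n_active, max_degree, d = 2, 2, 3
rng = np.random.default_rng(0)

# Toy batch: random support masks and neighbour domains (illustrative data)
support_batch = rng.random((n_active, max_degree, d, d)) < 0.5
nbr_domains   = rng.random((n_active, max_degree, d)) < 0.8
domain_batch  = np.ones((n_active, d), dtype=bool)
batch_valid   = np.array([[True, True],
                          [True, False]])   # second variable has one padded slot

# Same chain as above: & → any(axis=3) → all(axis=1)
candidate = support_batch & nbr_domains[:, :, None, :]
support_exists = candidate.any(axis=3)
survived = domain_batch & (
    support_exists | (~domain_batch)[:, None, :] | (~batch_valid)[:, :, None]
).all(axis=1)

print(survived.shape)  # (2, 3) — one survival flag per (variable, value)
```

Because padded slots are forced True by `~batch_valid`, garbage in their support masks can never prune a value.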
Variables have different numbers of neighbors — this is called their degree. To process all variables together in a single GPU operation, every row of the tensor must be the same width. We pad every variable's arc list to max_degree slots using dummy entries and a validity mask. Padding slots are masked out and never affect results — but their presence is what makes the tensor rectangular so one operation covers everyone.
# Padded arc structures — filled for all variables up to max_degree
arc_idx_padded[i, k] = arc_id # real arc index, or 0 for padding
neighbor_idx_padded[i, k] = neighbor_var # real neighbor, or 0 for padding
valid_arc_mask[i, k] = True # False for padding positions
# Variables with fewer arcs just have trailing False entries in valid_arc_mask
| variable | degree | arc slot 0 | arc slot 1 | arc slot 2 | arc slot 3 | arc slot 4 |
|---|---|---|---|---|---|---|
| X (deg 2) | 2 | X→Y ✓ | X→Z ✓ | pad ✗ | pad ✗ | pad ✗ |
| Y (deg 5) | 5 | Y→X ✓ | Y→Z ✓ | Y→A ✓ | Y→B ✓ | Y→C ✓ |
| Z (deg 3) | 3 | Z→X ✓ | Z→Y ✓ | Z→W ✓ | pad ✗ | pad ✗ |
All three variables sit in the same rectangular tensor with width 5 (= max_degree). Green slots are real arcs included in the support check. Red slots are padding — their valid_arc_mask entry is False, so the ~batch_valid term in the final all() call treats them as "already satisfied" and they never prune a value.
AC-3 has a well-known refinement: when variable X is revised because of arc (Y→X), there's no need to recheck arc (X→Y) — Y just constrained X, so X can't have gained new values that would break Y. On CPU this saves real work.
On GPU this doesn't apply. All arcs in a round are processed simultaneously — the GPU has no record of why a variable changed or which arc caused it. Every changed variable wakes all its reverse arcs, conservatively. Tracking that provenance in parallel would require every thread to record what it changed and why, and multiple threads writing to the same record simultaneously would corrupt each other's data. The cost outweighs the benefit.
This is the single most important constraint when targeting Apple Silicon. On NVIDIA GPUs, the entire propagation loop — all rounds until convergence — can run entirely on the GPU. The CPU hands off the problem and waits idle. On Apple MPS this is not possible. After every round, the GPU must hand results back to the CPU, the CPU checks whether any values were pruned and whether to continue, then sends the GPU into the next round. Every round, back and forth.
The consequence: MPS always pays the cost of 2 CPU-GPU handoffs per round, regardless of how the computation is structured. On a 24-round instance that is 48 handoffs vs NVIDIA's 2 total. This is an architectural ceiling — not an implementation choice that can be optimised away.
At the start of each round, the GPU reads the current domains and writes pruned results to a separate copy. At round end the copies swap. This means all threads in a round see the same snapshot — no thread sees another thread's prunings mid-round. The CPU reference implementation does the same. If the GPU used in-place updates (writing back to the same copy it's reading) while the CPU used two-buffer, they would see different intermediate states within each round and disagree on the final result — not because either is wrong, but because they are operating on a subtly different version of the algorithm.
The solver was developed iteratively. Here are the four main variants, what each one does, and how they perform.
Variables grouped by degree into separate batches — one GPU operation per degree bucket per round. Typically 13–19 batches per round. Between batches the GPU sits idle while the CPU sets up the next group.
Replaced by single-batch approach. Too many small dispatches per round.
Single rectangular batch per round. Padded to max_degree. Dense tensor ops: &, any(dim=3), all(dim=1). Arc flags on CPU. 2 CPU-GPU syncs per round.
Best on Apple MPS for multi-round (UNSAT) cases.
Single buffer. Writes (true→false only) are monotonic — no value ever comes back. Round count is identical to two-buffer because index_select snapshots neighbor domains before any writes, so within-round visibility is the same. The benefit is avoiding the end-of-round copy_ and buffer swap. Final domains at convergence are identical.
Why monotonic writes make races harmless: if two GPU threads race to write the same domain cell, both are writing False (a deletion). There is no winning or losing write — both threads reach the same outcome regardless of order. A cell can never be erroneously restored to True by a late write, because no thread ever writes True.
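A toy illustration of this monotonicity, with CPU threads standing in for GPU threads (Python's GIL means this is not a true data race, but the argument is the same — every racing write stores False, so order cannot matter):

```python
import threading

import numpy as np

domains = np.ones(1000, dtype=bool)
to_delete = np.arange(0, 1000, 3)  # the values propagation decides to prune

def worker(indices):
    # Every write is False — racing threads can only agree on the outcome
    for i in indices:
        domains[i] = False

# Four threads all race on the SAME cells
threads = [threading.Thread(target=worker, args=(to_delete,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No deletion is lost, and no cell is ever restored to True
print(domains[to_delete].any())  # False
```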
Faster on SAT cases. Mixed results on UNSAT.
Attempted to move arc flag updates entirely to GPU using index_fill_ calls. Eliminated CPU round-trips. But: many small GPU dispatches + no MPS grid.sync() = slower than CPU-managed flags.
111ms vs 57ms on hard case. Reverted.
My machine runs Apple Silicon (M4-class). CUDA is unavailable here — the cooperative kernel path exists in gpu_ac_v2.py and gpu_ac_v3.py but has not been benchmarked or validated in my environment. The description below is based on the implementation and design intent, not measured results.
The CUDA kernel runs entirely on the GPU. All rounds happen inside a single kernel launch. grid.sync() acts as a barrier between phases — every thread waits until all threads globally are done before moving to the next phase.
// CUDA kernel — one block per variable, one warp per value
for (round = 0; round < max_rounds; round++) {
    // PHASE 1: Revise domains (all blocks in parallel)
    for (var_i = blockIdx.x; var_i < n_vars; var_i += gridDim.x) {
        // threadIdx.y = value vi, threadIdx.x = vj scan lane (0..31)
        for each arc (var_i → var_j):
            // Cache check: lane 0 checks residual, broadcasts to warp
            cache_hit = __shfl_sync(0xffffffff, check_residual(), 0);
            if (cache_hit) continue;
            // Vectorised scan: ALL 32 lanes check 32 vj positions at once
            ballot = __ballot_sync(0xffffffff, alive[vj] && supports[vi,vj]);
            // ballot is a 32-bit mask — any bit set = support found
    }
    grid.sync();  // wait for ALL threads globally
    // PHASE 2: Update arc flags
    // PHASE 3: Termination check
    grid.sync();
    if (result[0] != 0) return;  // done!
}
An earlier "bucketed" approach grouped variables by degree, creating 13–19 separate batches per round. Each batch launched its own chain of MPS kernels. The dispatch overhead was enormous.
The current approach: one batch, one chain, every round. Variables with fewer arcs are padded to max_degree — the wasted computation on padding positions is negligible compared to the reduced dispatch overhead.
Measured on Apple Silicon M4, PyTorch 2.11.0, MPS backend. 5 runs after 2 warmup runs. Times are averages in milliseconds.
The "hard" case is Model B with n=60 variables, d=20 values, density=0.35, tightness=0.75, seed=0. It takes 24 propagation rounds before proving UNSAT. This is the real stress test — the solver that wins here is genuinely faster, not just faster at launching kernels.
(Benchmark table — columns: Instance, Precompute (shared, one-off), MPS single-batch, MPS in-place, GPU-resident, Bucketed MPS baseline. The per-instance timings are summarised below.)
Tiny instances: CPU wins. MPS launch overhead (~1–3ms) dominates when the actual work is <0.1ms.
Medium instances (1 round): MPS v3 achieves 14–40x speedup. Single-batch approach shines.
Hard instance (24 rounds, UNSAT): MPS v3 achieves 3.8x speedup and is 4x faster than the bucketed approach.
GPU-resident arc flags: Slower on the hard case (111ms vs 57ms). Apple MPS has no grid.sync() — many small dispatches add up.
Precompute cost: The support mask build (CPU, one-off) grows fast — 32ms for the hard case, 531ms for xlarge. For single-propagation use, this dominates total time regardless of which backend runs propagation. The GPU advantage is most meaningful when the same precomputed masks are reused across many propagation calls.
Based on the results above, the GPU tends to help when at least two of these are true: the instance is large (many variables or large domains), propagation runs for many rounds, and the precomputed support masks are reused across calls.
The benchmarked instances are small by constraint-programming standards. In practice, a solver embedded in a branch-and-bound search would call propagation thousands of times per run — and reuse the same precomputed masks each time. That's the use case where the GPU advantage compounds.
The table above shows solve-only times. Adding precompute (which is shared and one-off, but still costs time on first call) changes the picture considerably for large instances:
| Instance | CPU total | Best MPS total | Solve-only speedup | End-to-end speedup |
|---|---|---|---|---|
| XCSP3 (3 vars) | 0.04 ms | 1.57 ms | 0.02× | 0.03× |
| Easy (n=40) | 12.3 ms | 4.2 ms | 9.1× | 2.9× |
| Medium (n=60, 1 round) | 69.0 ms | 30.5 ms | 16.2× | 2.3× |
| Stress-test (n=60, 24 rds) | 249.6 ms | 83.3 ms | 3.8× | 3.0× |
| Large (n=120) | 422.9 ms | 221.2 ms | 29.5× | 1.9× |
| XLarge (n=180) | 1033.1 ms | 541.0 ms | 51.1× | 1.9× |
The 29–51× solve-only speedup on large instances collapses to ~1.9× end-to-end because precompute (531ms for xlarge) dominates both CPU and MPS total time equally. The GPU advantage is real only when masks are built once and reused across many calls — for example, solving many instances with the same constraint structure, or repeated propagation within a search.
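The amortisation argument can be made concrete. A sketch using the xlarge row — the solve-only times below are derived from the table (1033.1 − 531 ≈ 502 ms CPU; 502 / 51.1 ≈ 9.8 ms GPU) and are approximate:

```python
def end_to_end_speedup(calls, precompute=531.0, cpu_solve=502.1, gpu_solve=9.8):
    """CPU/GPU end-to-end time ratio after `calls` propagations reuse the masks.

    Both backends pay the one-off mask build; only the per-call solve differs.
    """
    return (precompute + calls * cpu_solve) / (precompute + calls * gpu_solve)

for k in (1, 10, 100, 1000):
    # Speedup climbs from ~1.9× at one call toward the 51× solve-only ceiling
    print(k, round(end_to_end_speedup(k), 1))
```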
Every failed experiment taught something. These are the paths that were tried, measured, and rejected.
The first attempt implemented arc consistency the natural way in PyTorch: for each value, check the cached support position, then scan the neighbour domain. This involved Python-level loops and per-value conditionals. MPS does not handle this kind of branchy, value-by-value logic efficiently — it essentially serialised the computation. Result: barely faster than CPU (227ms vs 218ms) on the hard case, and slower on easy cases.
Lesson: MPS needs dense uniform tensor operations, not per-value conditional logic.
The first in-place attempt wrote domain deletions directly into the single buffer during the sweep — no snapshot. This made early deletions immediately visible to later checks in the same round. The behavior genuinely differed from two-buffer semantics: round count shifted from 24 to 14 on the hard case, and the final domains diverged from the CPU reference.
Lesson: "Fewer rounds" is not the acceptance bar. matches_cpu=True on the hard case is.
The v3 in-place solver uses index_select to gather neighbor domains before the support check. This creates a snapshot — neighbor domains are captured at round start, before any writes land. So v3 in-place and v3 two-buffer see exactly the same intermediate states and produce exactly the same round count. The only difference is that in-place skips the end-of-round copy_ and buffer swap. Both variants agree with the CPU reference on the hard case.
Arc flags live on CPU and get updated with NumPy after each round (2 CPU-GPU syncs). An experiment moved flag updates to the GPU using index_fill_ calls. This eliminated CPU computation but required many small GPU dispatches. Result: 111ms vs 57ms on the hard case — twice as slow.
Lesson: Apple MPS has no grid.sync(). Small GPU dispatches are expensive. CPU-managed flags with 2 syncs per round is cheaper than many GPU dispatches.
Instead of passing the full neighbor domain (all d values including dead ones), gather only the alive neighbor values. This reduces the size of the support check. But building the gather index tensors cost more than the saving from reduced computation. The dense approach (mask dead values in the support check) was faster.
Lesson: Index-gather overhead on MPS is not free. Dense masked operations often beat sparse ones.
The bucketed approach estimated "neighbor load" (total alive values across all neighbors) using a matrix multiply to group variables into similar-cost batches. This overhead was not worth it — the groups weren't uniform enough to help, and the estimation cost was measurable.
Lesson: Fancy load balancing helps only if batches are otherwise very unbalanced. The single-batch approach sidestepped this entirely.
The original MPS approach grouped variables by degree (number of neighbors) and processed each degree level separately, creating 13–19 batches per round. Each batch launched its own chain of MPS kernels. Kernel launch overhead dominated. The single padded batch replaced all of them.
Lesson: One large regular batch beats many small irregular ones, even with wasted computation on padding positions.
The most fundamental difference between NVIDIA CUDA and Apple MPS for this algorithm: one can stay on the GPU forever, the other cannot.
On NVIDIA, the entire propagation loop — all rounds until convergence — runs inside a single GPU invocation. The CPU hands off the problem and waits. On Apple MPS, this is not possible. After every round the GPU hands results back to the CPU, the CPU checks convergence and schedules the next round, then the GPU starts again. (The CUDA implementation detail is covered in section 05.)
Because MPS must return to the CPU between rounds, the CPU also handles arc-flag bookkeeping — which turns out to be cheaper than dispatching extra GPU operations for it:
1. Which variables have at least one active arc? Fast NumPy operation.
2. One operation chain covering all variables: gather → & → any → all → write.
3. Read whether any domain went empty. Read which variables changed.
4. Turn off processed arcs, wake reverse arcs of changed variables.
If Apple added a grid.sync() equivalent to Metal, the entire propagation loop could stay resident on the GPU, as it does in CUDA. Until then, the best strategy is to minimise round-trips and maximise work per GPU dispatch.
For arc consistency on a single CSP instance, the hard case (24 rounds) is the real challenge. MPS achieves 3.8x speedup by doing heavy tensor work each round. CUDA achieves more by never leaving the GPU at all. The GPU-resident cooperative kernel pattern is simply not available on Apple Silicon today.
The MPS implementation is working and the design is understood. The natural next step is to validate the same approach on hardware that removes the core constraint.
The entire MPS implementation was built under one constraint: no grid.sync(). Every design decision — single-batch, CPU-managed arc flags, 2 syncs per round — is a response to that limitation. CUDA removes it.
The open question is whether the MPS-style dense single-batch approach outperforms the CUDA cooperative kernel on actual NVIDIA hardware, or whether keeping the loop GPU-resident dominates once the architectural ceiling is gone. The two approaches have never been benchmarked head-to-head on the same machine.
Arc consistency is rarely used in isolation — it sits inside a backtracking search, called at every node after a variable assignment. Near the root of the search tree, domains are large and the GPU earns its keep. But each assignment shrinks domains further. By the time you are 20 decisions deep, a variable might have 2 or 3 values left, and launching a GPU kernel to check 3 values costs more than just doing it on the CPU.
A natural question is whether the solver should track average domain size at each node and switch between GPU and CPU arc consistency dynamically — GPU when domains are dense, CPU when they are not. The crossover point is roughly when alive values per variable drops below the warp size (32), at which point threads are no longer filling warps. The complication is transfer cost: if domain state lives on the GPU, switching to CPU means copying it back. Whether that cost is worth paying depends on how many rounds remain and how quickly domains are shrinking.
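A minimal sketch of such a dispatcher, using the warp-size heuristic above (`choose_backend` is a hypothetical name; real code would also weigh the transfer cost of moving domain state off the GPU):

```python
WARP_SIZE = 32  # rough crossover point: below this, warps run half-empty

def choose_backend(domains):
    """Hypothetical dispatcher: GPU while domains are dense, CPU once they thin out.

    `domains` is the boolean matrix from earlier — one row of alive-flags per variable.
    """
    alive_per_var = [sum(row) for row in domains]
    avg_alive = sum(alive_per_var) / len(alive_per_var)
    return "gpu" if avg_alive >= WARP_SIZE else "cpu"

dense  = [[True] * 40 for _ in range(60)]               # near the root
sparse = [[True] * 3 + [False] * 37 for _ in range(60)]  # 20 decisions deep
print(choose_backend(dense), choose_backend(sparse))  # gpu cpu
```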
The arc consistency literature spans five decades. These are the papers that shaped how the algorithm is understood today, plus the GPU programming resources that informed this project.