Speculative Parallel Rejection Sampling in ML-DSA Signing
Speculative Parallel Rejection Sampling in ML-DSA Signing
ML-DSA signing uses the Fiat-Shamir with Aborts construction. The signer picks a random masking vector y, computes a candidate signature, checks norm bounds, and either accepts or discards and retries. The acceptance probability per attempt is roughly 19.6% for ML-DSA-65, meaning the signer needs about 5.1 iterations on average. The distribution is geometric with a real tail - reaching the 99th percentile requires around 21 iterations.
This is a latency problem. Each iteration is fast (50-200us with AVX2-optimized NTT and 4-way Keccak), but the variable number of retries creates unpredictable signing times. For a system budgeting 1ms and hitting a 99th-percentile case at 2.1ms, that’s a missed deadline.
The idea: speculative parallel attempts
Each rejection sampling iteration is independent. Different iterations use different random y values (parameterized by the loop counter kappa in FIPS 204). Nothing prevents running multiple iterations concurrently and taking the first one that succeeds.
With k=8 parallel threads, the probability that at least one succeeds per round is 1-(1-p)^8. For ML-DSA-65 (p=0.196), that’s about 83% per round, giving an expected 1.2 rounds to produce a valid signature. Wall-clock speedup: roughly 4.2x. The tail collapses dramatically - what sequentially required 10+ iterations to reach 89% confidence is achieved in a single parallel round.
The tradeoff is compute: 8 threads x 1.2 rounds = ~9.6 iterations of total work versus ~5.1 sequentially. Throughput drops by about 1.9x. This is a latency-for-throughput trade, worth it when cores are available and signing latency matters.
Implementation
The computation per iteration is small, so thread management overhead matters. Two practical approaches:
In C, OpenMP with a persistent thread pool and active wait policy (OMP_WAIT_POLICY=active) gives near-zero dispatch latency. Threads spin between parallel regions, an atomic flag signals completion, and losers drain out. This is three lines of pragmas on top of existing signing code.
In Rust, rayon’s scoped parallelism with an AtomicBool cancellation flag is the idiomatic choice. Threads park via futex rather than spinning, adding a few microseconds of wake-up latency – negligible against a 50-200us iteration. No spin-wait thread pool crate currently exists in the Rust ecosystem; a custom one is ~50 lines but rarely justified.
The critical implementation detail: each thread needs its own RNG state and keccak context. Assign non-overlapping kappa ranges (thread 0 gets kappa=0,8,16,…; thread 1 gets kappa=1,9,17,…) to ensure independent randomness without sharing mutable state.
What about SIMD?
SIMD parallelism (AVX2 4-way Keccak, AVX-512 NTT vectorization) is already used within each iteration to accelerate the internal polynomial arithmetic and hashing. It is orthogonal to thread-level parallelism across iterations. The two compose: 8 threads each running AVX2-optimized iterations gives you both fast individual iterations and parallel speculative attempts.
Prior work
The idea of parallel speculative rejection sampling appears in the threshold ML-DSA literature. TOPCOAT (Agrawal et al.) runs theta parallel signing iterations to avoid expensive communication-round restarts. The “Efficient Threshold ML-DSA” paper (NIST 6th PQC Conference, 2025) similarly runs K>1 iterations in parallel per party. Seo and An implemented parallel Dilithium signing on an embedded GPU for autonomous driving, achieving 19.41x speedup on NVIDIA Jetson AGX Xavier using rejection sequence tables and parallel NTT.
For standard (non-threshold) ML-DSA on multi-core CPUs, the approach is straightforward and effective but appears underexplored in the literature. The geometric distribution of signing attempts makes it a natural fit for speculative execution, and modern hardware with 8+ cores and per-core L1/L2 caches eliminates the traditional objections around cache contention.
Summary
The goal of this blog is to describe my next project. The results will be reported shortly, hopefully with the Rust implementation.