Ticket Locks: The Architecture of Predictability
In the deep waters of systems programming, raw speed is often less important than predictable latency.
Standard spinlocks, typically built around atomic Compare-and-Swap (CAS) loops, suffer from a critical flaw under high contention: unfairness. When a lock is released, the hardware arbiter decides who gets it next based on processor topology or sheer luck, not waiting time. This leads to starvation and erratic P99 latency spikes.
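For contrast, here is what such an unfair lock looks like; a minimal C11 sketch (illustrative, not tied to any particular library) of a CAS-based spinlock where every waiter hammers the same flag:

```c
#include <stdatomic.h>
#include <stdbool.h>

// Minimal unfair spinlock: all waiters retry CAS on one shared flag.
// Whoever's CAS lands first after a release wins -- arrival order is ignored.
typedef struct { atomic_bool held; } spin_lock;

static void spin_acquire(spin_lock *l) {
    bool expected = false;
    while (!atomic_compare_exchange_weak_explicit(
            &l->held, &expected, true,
            memory_order_acquire, memory_order_relaxed)) {
        expected = false; // a failed CAS overwrote `expected`; reset and retry
    }
}

static void spin_release(spin_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```

Nothing here records how long a thread has been waiting, which is exactly the gap the ticket lock closes.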
Enter the Ticket Lock. It trades the chaotic scramble of CAS for the orderly discipline of a FIFO queue, ensuring that every thread gets its turn.
1. The Deli Counter Model
The mechanism is simple, mirroring a physical "take a number" system:
- next_ticket: The dispenser. Threads atomically fetch-and-increment this value to get their position in line.
- current_ticket: The display. Shows who is currently being served.
Threads spin on current_ticket until it matches their hand-held number. This guarantees strict FIFO ordering—no one cuts in line.
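The protocol can be modeled in a few lines; a minimal C11 sketch (names like deli_counter and take_ticket are illustrative, standing in for the Zig implementation shown in the next section):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Minimal model of the deli counter: two monotonically increasing counters.
typedef struct {
    atomic_uint_fast64_t next_ticket;    // the dispenser
    atomic_uint_fast64_t current_ticket; // the display
} deli_counter;

// Take a number: one atomic fetch-and-increment fixes our place in line.
static uint64_t take_ticket(deli_counter *d) {
    return atomic_fetch_add_explicit(&d->next_ticket, 1, memory_order_relaxed);
}

// The predicate each waiter spins on: does the display show my number?
static bool is_my_turn(deli_counter *d, uint64_t my_ticket) {
    return atomic_load_explicit(&d->current_ticket,
                                memory_order_acquire) == my_ticket;
}

// Serving done: call the next number.
static void serve_next(deli_counter *d) {
    atomic_fetch_add_explicit(&d->current_ticket, 1, memory_order_release);
}
```

Because take_ticket hands out strictly increasing numbers and serve_next advances the display by exactly one, service order equals arrival order.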
2. Zig Implementation: Explicit Memory Layout
One of Zig’s superpowers is its explicit control over memory alignment. In high-performance locking, where variables sit in memory matters as much as the algorithm itself.
Here is a production-grade implementation that respects CPU cache lines:
const std = @import("std");
const atomic = std.atomic;

/// TicketLock: A fair synchronization primitive.
/// Enforces strict FIFO ordering via dual counters.
pub const TicketLock = struct {
    // Critical optimization: align counters to 64 bytes (a typical cache line)
    // to prevent false sharing between the dispenser and the display.
    next_ticket: atomic.Value(u64) align(64) = atomic.Value(u64).init(0),
    current_ticket: atomic.Value(u64) align(64) = atomic.Value(u64).init(0),

    /// Acquire: take a ticket and wait.
    pub fn acquire(self: *TicketLock) void {
        // 1. Atomic fetch-and-add.
        // This single atomic read-modify-write establishes the global order.
        const my_ticket = self.next_ticket.fetchAdd(1, .monotonic);

        // 2. Spin wait (read-only).
        // Spin until the display shows our number.
        while (self.current_ticket.load(.acquire) != my_ticket) {
            // Hint to the CPU that we are spinning, to save power/pipeline slots.
            std.atomic.spinLoopHint();
        }
    }

    /// Release: call the next number.
    pub fn release(self: *TicketLock) void {
        // 3. Store-release.
        // Only the holder writes current_ticket, so an unordered load is safe;
        // the release store unblocks the thread holding (my_ticket + 1).
        const current = self.current_ticket.load(.unordered);
        self.current_ticket.store(current + 1, .release);
    }
};
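For readers more at home in C, the same algorithm translates almost mechanically to C11 atomics (a sketch for comparison, not the author's code; _Alignas(64) mirrors Zig's align(64), and 64 bytes is an assumption about the target CPU):

```c
#include <stdatomic.h>
#include <stdint.h>

// Illustrative C11 counterpart of the Zig TicketLock.
typedef struct {
    _Alignas(64) atomic_uint_fast64_t next_ticket;
    _Alignas(64) atomic_uint_fast64_t current_ticket;
} ticket_lock;

static void ticket_lock_acquire(ticket_lock *l) {
    // Fetch-and-add establishes the global FIFO order.
    uint64_t my = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    // Read-only spin until the display shows our number.
    while (atomic_load_explicit(&l->current_ticket,
                                memory_order_acquire) != my) {
        // (a real implementation would issue a pause hint here)
    }
}

static void ticket_lock_release(ticket_lock *l) {
    // Only the holder writes current_ticket, so a relaxed read is safe;
    // the release store publishes the critical section to the next waiter.
    uint64_t cur = atomic_load_explicit(&l->current_ticket,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->current_ticket, cur + 1, memory_order_release);
}
```

After N balanced acquire/release pairs, both counters read N, which makes the invariant easy to sanity-check in a test.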
3. The Cost of False Sharing
Notice the align(64) directives. These aren't just for show.
In modern multi-core processors, the L1/L2 cache operates on "cache lines" (usually 64 bytes). If next_ticket (written by arriving threads) and current_ticket (written by the lock holder) reside on the same cache line, cores will constantly invalidate each other's caches via the MESI protocol, even though they are touching different variables. This phenomenon is called False Sharing, and it can decimate performance.
By forcing alignment, we ensure these hot variables live in separate cache lines, minimizing unnecessary bus traffic.
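The effect of the alignment directive is visible in the struct layout itself; a small C sketch (assuming a 64-byte line, as the Zig code does):

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

// Without alignment: both counters sit 8 bytes apart,
// inside the same 64-byte cache line.
typedef struct {
    uint64_t next_ticket;
    uint64_t current_ticket;
} packed_lock;

// With alignment: each counter starts its own 64-byte line,
// mirroring the align(64) directives in the Zig struct.
typedef struct {
    alignas(64) uint64_t next_ticket;
    alignas(64) uint64_t current_ticket;
} padded_lock;
```

The padded variant costs 128 bytes instead of 16, a classic space-for-coherence-traffic trade.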
4. The Trade-off: Global Spinning
While Ticket Locks solve fairness, they are not a silver bullet. They still suffer from global spinning:
- All waiting threads spin on the same variable (current_ticket).
- When current_ticket is updated, every waiting core invalidates its cache line and re-reads from memory.
- On a machine with 100+ cores, this "thundering herd" of cache coherence traffic can saturate the interconnect.
For extreme scale, MCS locks (where each thread spins on its own local variable) are the superior choice, albeit at the cost of complexity.
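To make the contrast concrete, here is a sketch of the MCS idea in C11 (names like mcs_node and mcs_acquire are illustrative): each waiter enqueues a node and spins on a flag in its own node, so a release touches exactly one waiter's cache line instead of invalidating every spinning core.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked; // each waiter spins on its OWN flag
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail; // last waiter in the queue, or NULL if free
} mcs_lock;

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    // Swap ourselves in as the new tail; the old tail is our predecessor.
    mcs_node *prev = atomic_exchange_explicit(&l->tail, me,
                                              memory_order_acq_rel);
    if (prev == NULL) return; // queue was empty: lock acquired
    atomic_store_explicit(&prev->next, me, memory_order_release);
    // Local spinning: only our own node's flag, nobody else's.
    while (atomic_load_explicit(&me->locked, memory_order_acquire)) {}
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        // No visible successor: try to swing tail back to NULL.
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        // A waiter is mid-enqueue; wait for it to link itself in.
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {}
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

The complexity cost is visible in the API: callers must supply a queue node per acquisition, which is why ticket locks remain attractive at moderate core counts.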
Conclusion
The Ticket Lock is a beautiful example of how algorithmic fairness can reduce tail latency. It teaches us that:
- Fairness is an architectural choice: We accept slightly higher overhead for stable, predictable performance.
- Hardware empathy is non-negotiable: Simple alignment tweaks can prevent invisible performance cliffs.
- Order matters: In a chaotic system, a simple queue is often the most robust solution.