High-Performance Rate Limiting: The Art of Atomic Bit-Packing
In the world of high-throughput gateways and load balancers, accurately and efficiently counting requests per second (RPS) is a core engineering challenge. The classic "Sliding Window" algorithm offers precision but requires storing a timestamp for every request in the window, which becomes prohibitively expensive in terms of memory and computation at scale.
A common industry optimization is the Approximated Sliding Window. Instead of a full history, it tracks just two counters: the count for the "current window" and the count for the "previous window." By linearly interpolating between them, we can estimate the current rate with $O(1)$ complexity.
However, in high-concurrency scenarios, safely updating these two counters—along with their associated timestamp—without introducing heavyweight locks (Mutexes) is the critical bottleneck. This post explores a lock-free solution using Atomic Bit-Packing and analyzes the trade-offs involved in squeezing state into CPU registers.
The Core Challenge: Multi-Field Atomicity
We need to maintain a consistent state comprising three fields:
- Current Window Count
- Previous Window Count
- Window Timestamp
In a multithreaded environment, these three fields must be updated atomically. If we protect the struct with a Mutex, lock contention will degrade performance severely as core counts rise.
To achieve a Lock-Free design, we must leverage CPU atomic instructions like Compare-And-Swap (CAS). But standard CAS operations typically only support a single machine word (64-bit or 128-bit). How do we update three logical fields at once? The answer is Bit-Packing.
Design Pattern: Atomic Bit-Packing
The essence of bit-packing is compressing multiple small-width integers into a single physical machine word.
For example, within a 64-bit atomic integer, we can lay out the data as:
- High 32 bits: Store the timestamp (in seconds, enough for ~136 years).
- Middle 16 bits: Store the previous window's count.
- Low 16 bits: Store the current window's count.
This layout allows us to update the timestamp and both counters simultaneously via a single CAS instruction, guaranteeing an atomic snapshot of the rate limiter's state.
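A minimal sketch of the packing arithmetic for this layout (the helper names are illustrative):

```go
package main

import "fmt"

// pack compresses the three fields into the 32/16/16 layout described above.
func pack(ts uint32, prev, curr uint16) uint64 {
	return uint64(ts)<<32 | uint64(prev)<<16 | uint64(curr)
}

// unpack reverses the layout with shifts and truncating conversions.
func unpack(v uint64) (ts uint32, prev, curr uint16) {
	return uint32(v >> 32), uint16(v >> 16), uint16(v)
}

func main() {
	v := pack(1700000000, 7, 9)
	ts, prev, curr := unpack(v)
	fmt.Println(ts, prev, curr) // prints 1700000000 7 9
}
```

Because the whole state round-trips through one `uint64`, a single CAS on that word is enough to commit all three fields at once.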
The Read-Modify-Write Loop
- Load: Atomically read the 64-bit integer.
- Unpack: Use bitwise shifts and masks to separate Time, Prev, and Curr.
- Calculate:
- If the current time has moved into a new window, shift Curr to Prev, reset Curr, and update the timestamp.
- If still in the same window, simply increment Curr.
- Pack: Re-compress the new values into a 64-bit integer.
- Commit (CAS): Attempt to swap the old value with the new one. If it fails (meaning another thread updated it first), retry the loop.
This "optimistic locking" pattern avoids the overhead of thread context switching and works exceptionally well for read-heavy or moderately contended write scenarios.
Trade-offs and Limitations
1. Counter Overflow
In a 64-bit layout, bits are scarce. Allocating 16 bits per counter limits the max count per window to 65,535. For massive traffic, this is insufficient.
Solutions:
- Wider Word Size: Use 128-bit atomic instructions (e.g., CMPXCHG16B on x86). This allows 64 bits for the timestamp and two 32-bit counters, supporting billions of RPS. C++'s std::atomic<__int128> supports this natively on modern compilers.
- Reduced Precision: Store logarithmic counts or compressed formats, sacrificing accuracy for range.
2. The ABA Problem
While CAS detects value changes, in pure numerical packing, the ABA problem (where a value changes from A to B and back to A) is usually benign because we care about state transitions. However, if pointers were packed, version tags would be mandatory to prevent use-after-free bugs.
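To illustrate the version-tag idea for pointer-like payloads, one common sketch reserves the top bits of the word for a monotonically increasing tag, so a recycled payload never compares equal under CAS. The names and the 16/48 split here are illustrative assumptions:

```go
package main

import "fmt"

// packTagged pairs a 48-bit payload with a 16-bit version tag; each update
// bumps the tag, so an A -> B -> A payload still fails the CAS comparison.
func packTagged(tag uint16, payload uint64) uint64 {
	return uint64(tag)<<48 | payload&((1<<48)-1)
}

func tagOf(v uint64) uint16     { return uint16(v >> 48) }
func payloadOf(v uint64) uint64 { return v & ((1 << 48) - 1) }

func main() {
	a := packTagged(1, 0xDEAD)
	b := packTagged(2, 0xDEAD) // same payload, different version tag
	fmt.Println(a == b, tagOf(b), payloadOf(b) == 0xDEAD) // prints false 2 true
}
```

The 48-bit payload restriction is workable on current x86-64/ARM64 virtual address layouts, but it is an assumption that must be revisited as address spaces grow.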
3. CPU Overhead
Bitwise operations are extremely fast, but 128-bit atomics can be slower on some microarchitectures compared to 64-bit ones. Additionally, high contention can lead to "bus storms" from failed CAS attempts.
Demonstration (Go)
While C++ is the usual battleground for such low-level optimizations, Go can also implement this pattern using sync/atomic. Since Go's standard library primarily supports up to 64-bit atomics natively, our demo uses a 64-bit packing strategy: 32-bit Time + 16-bit Prev + 16-bit Curr.
```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// PackedCounter demonstrates packing multiple states into a single uint64.
// Layout: [32-bit Timestamp] [16-bit Prev] [16-bit Curr]
type PackedCounter struct {
	state uint64
}

const (
	mask16    = (1 << 16) - 1
	shiftTime = 32
	shiftPrev = 16
)

func (c *PackedCounter) Inc(now time.Time, winSize time.Duration) {
	nowSec := uint32(now.Unix())
	winSec := uint32(winSize.Seconds())
	if winSec == 0 {
		winSec = 1 // guard against sub-second windows causing division by zero
	}
	for {
		// 1. Atomic Load
		current := atomic.LoadUint64(&c.state)

		// 2. Unpack State
		ts := uint32(current >> shiftTime)
		prev := uint16((current >> shiftPrev) & mask16)
		curr := uint16(current & mask16)

		var nextTs uint32
		var nextPrev, nextCurr uint16

		// 3. Calculate New State: how many whole windows have elapsed?
		diff := (nowSec - ts) / winSec
		if ts == 0 { // Initialization
			nextTs = nowSec
			nextCurr = 1
		} else if diff == 0 { // Same window
			nextTs = ts
			nextPrev = prev
			nextCurr = curr
			if curr < mask16 {
				nextCurr = curr + 1 // saturate at 65,535 instead of wrapping to 0
			}
		} else { // Window rotation
			nextTs = nowSec
			// If exactly one window passed, Prev inherits Curr; otherwise reset to 0.
			if diff == 1 {
				nextPrev = curr
			} else {
				nextPrev = 0
			}
			nextCurr = 1
		}

		// 4. Pack and CAS Commit
		nextState := (uint64(nextTs) << shiftTime) |
			(uint64(nextPrev) << shiftPrev) |
			uint64(nextCurr)
		if atomic.CompareAndSwapUint64(&c.state, current, nextState) {
			return
		}
		// CAS failed: another goroutine updated the state first; retry the loop.
	}
}

func main() {
	var c PackedCounter
	for i := 0; i < 3; i++ {
		c.Inc(time.Now(), time.Second)
	}
	s := atomic.LoadUint64(&c.state)
	fmt.Printf("curr=%d prev=%d\n", s&mask16, (s>>shiftPrev)&mask16)
}
```
Conclusion
Atomic bit-packing is a powerful technique in systems programming. It breaks the "structs need locks" mindset by shrinking state until it fits in a single machine word, making genuinely lock-free concurrent updates possible.
Of course, this optimization comes at a cost. Code readability suffers, debugging becomes harder, and strict hardware limits apply. But for infrastructure like load balancers and high-frequency trading gateways where every nanosecond counts, this spirit of "squeezing every bit" is exactly what defines high-performance engineering.