The Game of Distributed ID Generators: Entropy, Collisions, and the 64-bit Truncation Trade-off
When building distributed systems (like crawlers, message queues, or database sharding), generating globally unique IDs (GUIDs) is a fundamental requirement. The most common solution is UUID (128-bit), which boasts a collision probability so low it is practically considered absolutely unique. However, in scenarios sensitive to high performance or storage, a 128-bit ID can be excessively bulky.
This leads to a classic trade-off: Can we truncate IDs to 64 bits (e.g., u64) while maintaining sufficient uniqueness?
This article explores the mathematical principles behind this design decision and implements an ID generator in Rust that balances entropy and space.
Why 64 Bits?
- Storage Efficiency: A 64-bit integer (8 bytes) is exactly half the size of a 128-bit UUID (16 bytes). When indexing hundreds of millions of rows, this saves significant memory and disk space.
- CPU Friendliness: Modern CPU registers are 64-bit. Handling comparisons, sorting, and hashing for
u64is significantly faster than for 128-bit integers or strings. - Database Compatibility: Most databases have excellent support for 64-bit integer primary keys (e.g., MySQL's
BIGINT).
The Risk of Truncation: The Birthday Paradox
Simply truncating a UUID to 64 bits is dangerous. According to the Birthday Paradox, if you generate 64-bit IDs purely randomly, the probability of a collision rises to 50% after generating just $2^{32}$ (about 4.3 billion) IDs. For large-scale distributed systems, this number is not out of reach.
Therefore, we need to introduce structured design rather than pure randomness.
Rust Implementation: Structured 64-bit ID Generator
We can borrow ideas from Twitter Snowflake but simplify them for different contexts. A typical 64-bit ID structure might contain:
- Timestamp: Occupies the high bits, ensuring roughly ordered IDs.
- Worker ID: Distinguishes between different nodes.
- Sequence: Prevents concurrency conflicts within the same millisecond.
- Entropy: Adds randomness.
Below is a Rust implementation example that attempts to balance these factors:
use std::sync::atomic::{AtomicU16, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};
use rand::Rng;
/// Distributed ID Generator
/// Structure: [41 bits Timestamp] [10 bits Worker ID] [13 bits Sequence/Entropy]
pub struct IdGenerator {
worker_id: u64,
sequence: AtomicU16,
}
impl IdGenerator {
/// Create a new generator, requires a unique worker_id (0-1023)
pub fn new(worker_id: u64) -> Self {
if worker_id > 1023 {
panic!("Worker ID must be between 0 and 1023");
}
IdGenerator {
worker_id,
sequence: AtomicU16::new(0),
}
}
/// Generate the next unique 64-bit ID
pub fn next_id(&self) -> u64 {
// 1. Get millisecond timestamp (41 bits)
// 2026-01-01 00:00:00 UTC = 1767225600000
const EPOCH: u64 = 1767225600000;
let start = SystemTime::now();
let since_the_epoch = start
.duration_since(UNIX_EPOCH)
.expect("Time went backwards");
let timestamp = since_the_epoch.as_millis() as u64;
// Simple defense: if timestamp < EPOCH, config is wrong or clock skew is severe
let time_part = if timestamp > EPOCH {
timestamp - EPOCH
} else {
0
};
// 2. Sequence & Entropy (13 bits)
// Here we use an incrementing sequence.
// Note: In high concurrency, strict incrementing is needed to avoid conflicts.
let seq = self.sequence.fetch_add(1, Ordering::SeqCst) & 0x1FFF;
// 3. Combine ID
// Format: [Timestamp 41 bit] [Worker ID 10 bit] [Seq 13 bit]
(time_part << 23) | (self.worker_id << 13) | (seq as u64)
}
/// High-entropy version for testing/low-frequency scenarios only
/// Fills low bits with randomness, sacrificing strict ordering for statelessness
pub fn next_id_randomized(&self) -> u64 {
let mut rng = rand::thread_rng();
let random_part: u64 = rng.gen::<u64>() & 0x7FFFFF; // Low 23 bits random
let start = SystemTime::now();
let timestamp = start.duration_since(UNIX_EPOCH).unwrap().as_millis() as u64;
let time_part = timestamp & 0x1FFFFFFFFFF; // Take 41 bits
(time_part << 23) | random_part
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::collections::HashSet;
#[test]
fn test_uniqueness() {
let generator = IdGenerator::new(1);
let mut set = HashSet::new();
// Generate 100,000 IDs to check for collisions
for _ in 0..100000 {
let id = generator.next_id();
assert!(set.insert(id), "Collision detected: {}", id);
}
}
}
Design Trade-off Analysis
1. Entropy vs. Order
In the next_id implementation, we adopted the classic Snowflake structure. Its advantage is excellent ordering and high database insertion performance (especially B+ tree indexing). Its disadvantage is the reliance on state (sequence) and a unique worker_id. If worker_id configuration conflicts or clock rollback occurs, collisions are inevitable.
In next_id_randomized, we abandon worker_id and sequence for 23 bits of randomness. This means we don't need to maintain any state and can run on any machine at any time. However, the 23-bit random space (8.3 million) is very small, suitable only for very low-frequency generation scenarios (a few per millisecond). In high-frequency scenarios, the collision probability is extremely high.
2. The Cost of 64-bit Truncation
For many applications, 64 bits is the perfect sweet spot. But the cost must be clear: You lose the mathematical guarantee of global uniqueness. UUIDs (128-bit) can be generated in completely isolated systems without conflict, whereas 64-bit IDs must rely on coordination mechanisms (like assigning Worker IDs or a centralized issuer) to ensure uniqueness.
Conclusion
In the game of distributed IDs, there is no free lunch.
- If you need absolute decentralization and zero configuration, stick with 128-bit UUIDs.
- If you pursue extreme performance and storage efficiency, and can manage Worker ID assignment, then a structured 64-bit ID generator based on Rust is a powerful tool.
By precisely controlling the bit-width of timestamps, machine identifiers, and sequence numbers, we can build a robust skeleton for our systems within the narrow space of 64 bits.