The Art of Locking Granularity: Segmented Sharding and Pre-Hashing in Concurrent Sets
In high-concurrency systems, the standard library's Mutex<HashMap<K, V>> is often a performance bottleneck. When dozens of threads contend for a single global lock, CPU time is wasted on context switching and lock waiting rather than actual business logic.
Today, we explore a pattern widely proven in the industry: the combination of Lock Sharding and Pre-hashing. We will use Rust to demonstrate how to build a ShardedHashSet that minimizes lock contention while maintaining thread safety.
Core Design: Divide and Conquer
The core idea of lock sharding is incredibly simple: if one big lock is too slow, break it into $N$ small locks.
We divide the entire hash space into $N$ Shards. Each shard is an independent RwLock<HashSet<T>>. When inserting or looking up an element:
- Calculate the hash of the element.
- Determine which shard it belongs to via `hash % shard_count`.
- Lock only that specific shard.
As long as the data distribution is uniform, concurrency can theoretically improve by a factor of $N$.
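The routing step above can be sketched as a tiny helper. The name `shard_for` is hypothetical, and `DefaultHasher` stands in for whatever hasher your application uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper: route any hashable value to one of `shard_count` shards.
fn shard_for<T: Hash>(value: &T, shard_count: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    (hasher.finish() as usize) % shard_count
}

fn main() {
    let idx = shard_for(&"user-42", 16);
    assert!(idx < 16);
    // The same key always maps to the same shard.
    assert_eq!(idx, shard_for(&"user-42", 16));
}
```

Because routing is deterministic, two threads only contend when their keys happen to land in the same shard.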
Pre-hashing: Compute Once, Use Everywhere
In a sharded design, hash calculation is critical. A common pitfall involves calculating the hash once to locate the shard, and then calculating it again inside the shard's internal HashSet insertion logic. If the Key is a long string or complex object, this duplicate computation is expensive.
For extreme performance, we introduce a PreHashed structure to cache the hash value and pass it along with the Key.
Rust Implementation
We will use the parking_lot crate because its RwLock is lighter-weight than the standard library's and uses more efficient thread-parking strategies.
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

use parking_lot::RwLock;

// Pre-hashed wrapper: carries the pre-computed hash value alongside the key.
#[derive(Debug, Clone)]
struct PreHashed<T> {
    value: T,
    hash: u64,
}

impl<T: Hash> PreHashed<T> {
    fn new(value: T) -> Self {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        Self { value, hash }
    }
}
```
To let the HashSet reuse our pre-computed hash, we would ideally need a custom Hasher. In the implementation below, the shard's internal HashSet may still re-hash the key; the cached hash is used primarily for shard routing. In a strict production environment, you might plug a "pass-through" hasher into each shard to avoid re-hashing entirely. The code that follows focuses on the sharding logic.
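For the curious, a pass-through hasher is only a few lines. This is a hedged sketch, not the article's implementation: `PassThroughHasher` is a hypothetical type that assumes the key is already a high-quality 64-bit hash and simply forwards it, so the set's own hashing step becomes a no-op:

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

// Hypothetical pass-through hasher: assumes the key IS the hash (a u64)
// and forwards it unchanged instead of hashing again.
#[derive(Default)]
struct PassThroughHasher(u64);

impl Hasher for PassThroughHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        // Only u64 keys are supported in this sketch; any other key type
        // would route through `write`, which we deliberately reject.
        unimplemented!("PassThroughHasher only supports u64 keys");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

type PassThroughBuildHasher = BuildHasherDefault<PassThroughHasher>;

fn main() {
    // The set stores pre-computed u64 hashes; "hashing" them is now free.
    let mut set: HashSet<u64, PassThroughBuildHasher> = HashSet::default();
    set.insert(0xDEAD_BEEF);
    assert!(set.contains(&0xDEAD_BEEF));
}
```

Each shard would then be a `HashSet<u64, PassThroughBuildHasher>` (or a map keyed by the hash), trading a small risk of hash collisions for zero duplicate hashing work.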
```rust
struct ShardedHashSet<T> {
    shards: Vec<RwLock<HashSet<T>>>,
    shard_mask: usize,
}

impl<T: Hash + Eq + Clone> ShardedHashSet<T> {
    pub fn new(concurrency: usize) -> Self {
        // Ensure the shard count is a power of two for efficient bitwise masking.
        let shard_count = concurrency.next_power_of_two();
        let mut shards = Vec::with_capacity(shard_count);
        for _ in 0..shard_count {
            shards.push(RwLock::new(HashSet::new()));
        }
        Self {
            shards,
            shard_mask: shard_count - 1,
        }
    }

    fn get_shard_index(&self, hash: u64) -> usize {
        (hash as usize) & self.shard_mask
    }

    pub fn insert(&self, value: T) -> bool {
        // 1. Pre-compute the hash (done only once on our side).
        let pre_hashed = PreHashed::new(value);
        // 2. Locate the shard.
        let idx = self.get_shard_index(pre_hashed.hash);
        let shard = &self.shards[idx];
        // 3. Lock only that shard and write.
        // Note: the std HashSet will re-hash internally, but the lock
        // granularity is already fine-grained.
        let mut guard = shard.write();
        guard.insert(pre_hashed.value)
    }

    pub fn contains(&self, value: &T) -> bool {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        let hash = hasher.finish();
        let idx = self.get_shard_index(hash);
        let shard = &self.shards[idx];
        let guard = shard.read();
        guard.contains(value)
    }
}
```
```rust
// Simulating a high-concurrency scenario.
fn main() {
    let set = Arc::new(ShardedHashSet::new(64));
    let mut handles = vec![];
    for i in 0..10 {
        let set_ref = set.clone();
        handles.push(std::thread::spawn(move || {
            for j in 0..1000 {
                set_ref.insert(format!("key-{}-{}", i, j));
            }
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    println!("Done. Computing the logical size requires iterating over all shards.");
}
```
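As the final println hints, a sharded set has no single size counter. A hypothetical `logical_len` helper would read-lock one shard at a time and sum the sizes; the result is a per-shard snapshot, not a globally atomic count (std's RwLock is used here so the sketch is self-contained):

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// Hypothetical helper: sum the sizes of all shards. Each shard is
// read-locked briefly in turn, so concurrent inserts into shards that
// were already counted may be missed.
fn logical_len<T>(shards: &[RwLock<HashSet<T>>]) -> usize {
    shards.iter().map(|s| s.read().unwrap().len()).sum()
}

fn main() {
    let shards: Vec<RwLock<HashSet<u32>>> =
        (0..4).map(|_| RwLock::new(HashSet::new())).collect();
    shards[0].write().unwrap().insert(1);
    shards[3].write().unwrap().insert(2);
    assert_eq!(logical_len(&shards), 2);
}
```

If an exact live count matters, a common alternative is a separate `AtomicUsize` bumped on every successful insert or remove.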
Deep Dive
1. Granularity and Overhead
In the code above, setting concurrency to 64 divides the probability of two threads colliding on the same lock by roughly 64, assuming a uniform key distribution. This is extremely beneficial for read-heavy or mixed workloads. Compared to complex lock-free CAS schemes (like Java's ConcurrentHashMap), lock sharding strikes an excellent balance between implementation complexity and performance: it avoids subtle memory-ordering issues while offering scalability far superior to a global lock.
2. The Resizing Challenge
The biggest pain point of sharded design is global resizing. When a HashSet inside a single shard needs to grow, it only locks itself, which is fine. However, if the ShardedHashSet itself needs to increase the number of shards (e.g., from 64 to 128), it typically requires a "Stop-the-World" event, locking all shards to re-distribute elements.
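To make the cost concrete, here is a hedged sketch of such a global re-shard. `regrow` is a hypothetical function; taking ownership of the old shard vector stands in for holding every write lock at once, since no reader or writer can observe the set mid-migration (std's RwLock is used so the sketch is self-contained):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

fn hash_of<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

// Hypothetical "stop-the-world" regrow: drain every old shard and
// redistribute all elements into a larger shard vector. Every element
// in the whole set is touched exactly once.
fn regrow<T: Hash + Eq>(
    old: Vec<RwLock<HashSet<T>>>,
    new_count: usize,
) -> Vec<RwLock<HashSet<T>>> {
    assert!(new_count.is_power_of_two());
    let new: Vec<RwLock<HashSet<T>>> =
        (0..new_count).map(|_| RwLock::new(HashSet::new())).collect();
    for shard in old {
        for item in shard.into_inner().unwrap() {
            let idx = (hash_of(&item) as usize) & (new_count - 1);
            new[idx].write().unwrap().insert(item);
        }
    }
    new
}

fn main() {
    let old: Vec<RwLock<HashSet<String>>> =
        (0..4).map(|_| RwLock::new(HashSet::new())).collect();
    for i in 0..100 {
        let s = format!("key-{}", i);
        let idx = (hash_of(&s) as usize) & 3;
        old[idx].write().unwrap().insert(s);
    }
    let new = regrow(old, 8);
    let total: usize = new.iter().map(|s| s.read().unwrap().len()).sum();
    assert_eq!(total, 100);
}
```

The O(total elements) migration under exclusive access is exactly why this operation is avoided in practice.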
Therefore, in industrial practice (as in certain C++ TConcurrentHashSet implementations), it is common to pre-allocate a sufficient number of buckets (BucketCount) up front, or to accept that the hash table inside each shard grows independently, avoiding the expense of global re-hashing.
3. Use Cases
- Caching Systems: In-memory object caches with high read/write ratios and random key distribution.
- Deduplication Services: High-concurrency ingestion of data streams for real-time deduplication.
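The deduplication use case follows directly from the fact that insert returns whether the element was new. A minimal self-contained sketch (the `Dedup` type and `first_seen` method are hypothetical names, built on std's Mutex rather than parking_lot):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

// Hypothetical dedup filter: a sharded set where `insert` returning true
// means "first time this id has been seen".
struct Dedup {
    shards: Vec<Mutex<HashSet<String>>>,
}

impl Dedup {
    fn new(concurrency: usize) -> Self {
        let n = concurrency.next_power_of_two();
        Self {
            shards: (0..n).map(|_| Mutex::new(HashSet::new())).collect(),
        }
    }

    // Returns true if `id` has not been seen before.
    fn first_seen(&self, id: String) -> bool {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        let idx = (h.finish() as usize) & (self.shards.len() - 1);
        self.shards[idx].lock().unwrap().insert(id)
    }
}

fn main() {
    let d = Dedup::new(8);
    assert!(d.first_seen("event-1".to_string()));
    assert!(!d.first_seen("event-1".to_string())); // duplicate is rejected
    assert!(d.first_seen("event-2".to_string()));
}
```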
By employing this strategy of "trading space for time" and "divide and conquer," we can build high-performance concurrent data structures using the simplest synchronization primitives. This is the essence of "Simple is Beautiful" in systems programming.