Hash Trie Iterator and Serialization: Design Trade-offs in Industrial Prefix Trees
In traditional Trie implementations, each node typically needs to maintain a fixed-size array of child pointers. While this design offers fast lookups, it consumes enormous space—especially when dealing with large character sets. Industrial systems need a more space-efficient solution. Hash Trie emerges as a clever balance between space and lookup speed. Today we'll analyze a Hash Trie implementation from an industrial C++ library.
Space Problems of Traditional Tries
The dilemma of traditional tries:
- Fixed array overhead: Each node needs a pointer array for the entire character set
- Sparse storage waste: Most nodes have only a few children
- Memory fragmentation:大量指针导致内存管理复杂
The industrial solution is hash buckets + parent-child pointers:
- Use hash buckets instead of arrays to save space
- Use parent-child pointers instead of child pointers to support backtracking
- Support serialization for persistence
Core Design in Industrial Implementation
In an industrial C++ library, I found a concise and efficient Hash Trie implementation:
struct TSimpleHashTrie {
struct TLeaf {
int ParentId, Char;
int WordId;
TLeaf() : ParentId(-1), Char(0), WordId(-1) {}
};
TVector<TLeaf> Leafs;
TVector<int> LeafPtrs; // bucket boundaries
static ui32 HashAddChar(ui32 h, int Char) {
return 27214361 + h * 821345 + Char;
}
class TIterator {
const TSimpleHashTrie& Trie;
int LeafId;
ui32 Hash;
int WordId;
public:
TIterator(const TSimpleHashTrie& trie) : Trie(trie) { Reset(); }
bool NextChar(int c) {
ui32 hashVal = TSimpleHashTrie::HashAddChar(Hash, c);
int bucket = hashVal & (Trie.LeafPtrs.size() - 2);
// Linear search in bucket
for (int b = Trie.LeafPtrs[bucket], k = Trie.LeafPtrs[bucket + 1]; b < k; ++b) {
const TLeaf& curLeaf = Trie.Leafs[b];
if (curLeaf.ParentId == LeafId && curLeaf.Char == c) {
LeafId = b;
Hash = hashVal;
WordId = curLeaf.WordId;
return true;
}
}
return false;
}
int GetWordId() const { return WordId; }
};
};
Choice: Using hash buckets + parent-child pointer design
Trade-off considerations:
- Pros: Space-efficient, supports serialization
- Cons: Slightly slower lookup due to hash collisions
- Use cases: Large-scale string collection storage
Clean-room Reimplementation: Rust Implementation
To demonstrate the design thinking, I reimplemented the core logic in Rust:
/// Hash Trie node
struct Leaf {
parent_id: i32,
ch: i32,
word_id: i32,
}
/// Hash Trie structure
struct HashTrie {
leaves: Vec<Leaf>,
leaf_ptrs: Vec<usize>, // bucket boundaries
}
impl HashTrie {
/// Hash function: h = 27214361 + h * 821345 + c
fn hash_add_char(h: u32, ch: i32) -> u32 {
27214361 + h * 821345 + (ch as u32)
}
/// Build Trie
fn build(words: &[Vec<i32>]) -> HashTrie {
// ... implementation
}
/// Serialize
fn serialize(&self) -> Vec<u8> {
let mut bytes = Vec::new();
bytes.extend_from_slice(&(self.leaves.len() as u32).to_le_bytes());
for leaf in &self.leaves {
bytes.extend_from_slice(&leaf.parent_id.to_le_bytes());
bytes.extend_from_slice(&leaf.ch.to_le_bytes());
bytes.extend_from_slice(&leaf.word_id.to_le_bytes());
}
for ptr in &self.leaf_ptrs {
bytes.extend_from_slice(&(*ptr as u32).to_le_bytes());
}
bytes
}
/// Deserialize
fn deserialize(data: &[u8]) -> HashTrie {
// ... implementation
}
}
Trade-off Analysis
| Aspect | Traditional Array Trie | Hash Trie |
|---|---|---|
| Space | O(charset × nodes) | O(nodes) |
| Lookup | O(1) average | O(k/b) where k=bucket size |
| Serialization | Complex | Simple linear layout |
| Cache locality | Good | Depends on hash distribution |
When to Use Hash Trie
Suitable scenarios:
- Large character sets (e.g., Unicode)
- Sparse string collections
- Need for persistent storage
- Memory-constrained environments
Not suitable:
- Extremely latency-sensitive lookups
- Small character sets (e.g., 26 letters)
Summary
Industrial Hash Trie design is full of trade-offs:
- Hash buckets vs. arrays: Save space but increase collision probability
- Parent-child pointers vs. child pointers: Support backtracking but add indirection
- Iterator design: Supports forward traversal but no random access
This design represents industrial systems' extreme pursuit of space—minimizing memory overhead while ensuring functionality. Serialization support makes it an ideal string indexing structure in distributed systems.