← Back to Blog

The Power of Strings: Design Trade-offs in Industrial Hash Functions

String hashing is one of the most fundamental operations in computer science. From compiler symbol tables to database indexes, from cache keys to spell checkers, almost every software system uses hash functions behind the scenes. However, choosing the right hash function is not a simple decision—it involves trade-offs between speed, distribution quality, collision probability, and security. Today we'll analyze a string hash implementation from an industrial C++ library.

The Core Problem: Challenges of String Hashing

The fundamental challenges in string hashing:

  1. Speed vs. Quality: Simple hashes are fast but have poor distribution, complex hashes have better quality but are slower
  2. Case Sensitivity: Some scenarios require case-insensitive hashing
  3. Security: Hash flooding attacks (HashDoS) are real threats

The industrial solution is configurable hash strategies:

  • Simple polynomial hash: For internal data structures
  • FNV-1a: For scenarios requiring better distribution
  • SipHash: For scenarios requiring security

Core Design in Industrial Implementation

In an industrial C++ library, I found a hash implementation for case-insensitive strings:

size_t TCiString::hashVal(const char* s, size_t len, const CodePage& cp) {
    size_t h = len;
    for (; /* (*s) && */ len--; ++s)
        h = 5 * h + cp.ToLower(*s);
    return h;
}

Choice: Using polynomial hash with base 5

Trade-off considerations:

  • Pros: Extremely simple implementation, CPU cache-friendly
  • Cons: Higher collision probability
  • Use cases: Internal data structures, high error tolerance

Clean-room Reimplementation: Rust Implementation

To demonstrate the design thinking, I reimplemented the core logic in Rust:

/// Simple polynomial hash (similar to industrial implementation)
/// h = len + 5 * h + c
fn polynomial_hash(s: &str) -> u64 {
    let mut h = s.len() as u64;
    for byte in s.bytes() {
        h = 5u64.wrapping_mul(h).wrapping_add(byte as u64);
    }
    h
}

/// Case-insensitive polynomial hash
fn ci_polynomial_hash(s: &str) -> u64 {
    let mut h = s.len() as u64;
    for byte in s.bytes() {
        let lower = if byte >= b'A' && byte <= b'Z' {
            byte + 32 // Convert to lowercase
        } else {
            byte
        };
        h = 5u64.wrapping_mul(h).wrapping_add(lower as u64);
    }
    h
}

/// FNV-1a hash
fn fnv1a_hash(s: &str) -> u64 {
    const FNV_OFFSET: u64 = 14695981039346656037;
    const FNV_PRIME: u64 = 1099511628211;
    
    let mut hash = FNV_OFFSET;
    for byte in s.bytes() {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

Output:

=== Hash Collision Analysis ===

--- Polynomial Hash ---
abc -> 3389 | aBc -> 3229
test -> 20216 | TEST -> 15224

--- Case-Insensitive Polynomial Hash ---
abc -> 3389 | aBc -> 3389 | COLLISION!
test -> 20216 | TEST -> 20216 | COLLISION!

--- FNV-1a Hash ---
abc -> 16654208175385433931 | aBc -> 16623808877894711403
test -> 18007334074686647077 | TEST -> 1069789605790756389

Comparison of Different Hash Functions

Feature Polynomial (base 5) FNV-1a SipHash
Speed Fastest Medium Slowest
Distribution Average Excellent Excellent
Collision Higher Low Low
Security None None Yes
Use Case Internal cache General External input

When to Use Different Hash Functions

Simple polynomial hash:

  • Internal data structures
  • High error tolerance scenarios
  • Scenarios requiring extreme performance

FNV-1a:

  • Scenarios requiring good distribution
  • General string keys
  • Scenarios not requiring cryptographic security

SipHash:

  • Processing external input
  • Preventing HashDoS attacks
  • Scenarios where hash tables may be controlled by attackers

Summary

Industrial string hash function design is full of trade-offs:

  • Simplicity vs. distribution quality: Polynomial hash is simple but has higher collision probability
  • Speed vs. security: SipHash is most secure but slowest
  • Case handling: Depends on business requirements

In Rust, we can implement these hash functions more safely (using wrapping_mul to prevent overflow), but the core trade-offs are the same — there's no perfect hash function, only choices that fit the scenario.

Understanding these trade-offs is crucial for building efficient and reliable software systems.