Extreme Read Performance: The Append-Only SkipList Pattern

In concurrent data structure design, we often fall into a trap: trying to build an omnipotent container that supports full CRUD operations while being thread-safe. This versatility comes at a heavy cost—complex locking mechanisms, expensive atomic loops (CAS), and the hair-pulling nightmare of memory reclamation (think Hazard Pointers or Epoch-based Reclamation).

But in real-world industrial systems, the problem space is often more specific. If we are dealing with an append-only scenario—such as log storage, the MemTable of a time-series database, or an index for immutable data—can we shed those heavy burdens?

The answer is yes. By sacrificing the "delete" operation, we gain extreme read performance: wait-free reads with absolutely no lock contention.

Core Design Philosophy

The core philosophy of an Append-only SkipList is simple: Once a node is linked into the list, it exists forever, and its content is immutable.

This simple constraint brings massive benefits:

  1. No Memory Reclamation Required: Readers never have to worry about a node being freed while they are reading it.
  2. No Complex Locks: Since the position of a node never changes (no nodes are removed or moved), we only need to ensure that the "insertion of new nodes" is atomic for readers.

Structure Breakdown

In a standard SkipList, nodes contain multiple levels of index pointers. In the append-only design, we keep this structure but redefine pointer operations:

  • Node: Contains a Key, Value, and an array of Atomic Pointers. Key and Value are immutable after initialization.
  • Next Pointers: Stored using atomic variables. Written with Release semantics, read with Acquire semantics.

Concurrency Model: Single-Writer Multi-Reader

To simplify implementation and guarantee correctness, a Single-Writer Multi-Reader (SWMR) model is often adopted, or a mutex is used on the writer side. You might ask: "Isn't locking the writer too slow?"

In high-throughput scenarios, writes can often be batched or buffered via a Write Ahead Log (WAL). Reads, on the other hand, are often massive and random. As long as reads are fast enough, a little waiting on the writer side is usually acceptable.
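
Before diving into the skiplist itself, the SWMR publish pattern can be shown in a self-contained sketch (all names here are illustrative, not part of the skiplist code below): one writer thread prepends nodes to an atomic linked list while reader threads traverse it concurrently without locks. Prepending keeps the sketch short; the Release/Acquire reasoning is identical to the skiplist's level links.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::{ptr, thread};

struct Node {
    val: u64,
    next: AtomicPtr<Node>,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

// Lock-free traversal: Acquire loads pair with the writer's Release stores.
fn count() -> u64 {
    let mut len = 0;
    let mut p = HEAD.load(Ordering::Acquire);
    while !p.is_null() {
        len += 1;
        p = unsafe { (*p).next.load(Ordering::Acquire) };
    }
    len
}

fn run() -> u64 {
    let baseline = count();
    let writer = thread::spawn(|| {
        for v in 0..1000u64 {
            let head = HEAD.load(Ordering::Relaxed); // single writer: Relaxed is fine
            let n = Box::into_raw(Box::new(Node { val: v, next: AtomicPtr::new(head) }));
            HEAD.store(n, Ordering::Release); // publish the fully initialized node
        }
    });

    // Readers run while the writer appends; each sees a consistent prefix.
    let readers: Vec<_> = (0..4).map(|_| thread::spawn(count)).collect();
    for r in readers {
        let n = r.join().unwrap();
        assert!(n >= baseline && n <= baseline + 1000);
    }
    writer.join().unwrap();
    count() // nodes are deliberately leaked, mirroring "no reclamation"
}

fn main() {
    println!("final length = {}", run());
}
```

Because nodes are never freed, the sketch leaks them on purpose; that is exactly the trade the rest of this post is about.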

Writer Workflow

The writer's workflow is as follows:

  1. Create New Node: Allocate memory, initialize Key and Value, compute random height.
  2. Find Insertion Point: Traverse from high to low levels, finding the predecessor at each level.
  3. Atomic Linking: Link the new node into the list starting from the lowest level (Level 0) upwards.

Key Point: Why bottom-up? Since there are no deletions, the linking order isn't a correctness hazard, but bottom-up is the natural choice: once Level 0 is linked, the node logically exists in the list. Higher-level index links only accelerate searches, so a reader that hasn't seen them yet merely traverses a little slower. The crucial detail is that each link must be published with a Release store, so that a reader who observes the pointer also observes the fully initialized node.

Reader Workflow

The reader's job is incredibly easy:

  1. No locks.
  2. No Epoch Guard protection.
  3. Just use Load Acquire to read pointers.

Since nodes never disappear, pointers held by readers always point to valid memory (until the entire structure is destroyed).

Deep Dive: Rust Implementation

To demonstrate this concept, we'll build a streamlined model in Rust. We'll use AtomicPtr to manage connections between nodes and leverage Rust's type system to ensure memory safety (mostly).

Note: To keep the code focused on the algorithm, the following implementation uses unsafe blocks for raw pointer manipulation, which is standard when building low-level concurrency primitives.

use std::sync::atomic::{AtomicPtr, Ordering};
use rand::Rng;

// Max height, typically set based on estimated data volume
const MAX_HEIGHT: usize = 12;

struct Node<K, V> {
    key: K,
    value: V,
    // Atomic pointer array for concurrent reads
    next: [AtomicPtr<Node<K, V>>; MAX_HEIGHT],
}

impl<K, V> Node<K, V> {
    fn new(key: K, value: V) -> *mut Self {
        let node = Box::new(Node {
            key,
            value,
            next: Default::default(), // Initialize as null pointers
        });
        Box::into_raw(node)
    }
}

pub struct AppendOnlySkipList<K, V> {
    head: *mut Node<K, V>,
}

impl<K, V> AppendOnlySkipList<K, V>
where
    K: Ord,
{
    pub fn new() -> Self {
        // Sentinel node whose key/value are never read during traversal.
        // Warning: mem::zeroed() is UB for types with no all-zero
        // representation (e.g. String, Box); a production version should
        // wrap the sentinel's fields in MaybeUninit or Option.
        unsafe {
            let head = Node::new(std::mem::zeroed(), std::mem::zeroed());
            AppendOnlySkipList { head }
        }
    }

    // Write operation: requires &mut self, meaning only one writer at a time
    // or protected by an external Mutex
    pub fn insert(&mut self, key: K, value: V) {
        let mut prevs = [self.head; MAX_HEIGHT];
        let mut curr = self.head;

        // 1. Locate insertion point
        for i in (0..MAX_HEIGHT).rev() {
            loop {
                // Readers use Acquire; the writer can load with Relaxed here
                // because &mut self guarantees there is no concurrent writer.
                let next = unsafe { (*curr).next[i].load(Ordering::Relaxed) };
                
                // Find predecessor of first node >= key
                if next.is_null() || unsafe { &(*next).key >= &key } {
                    prevs[i] = curr;
                    break;
                }
                curr = next;
            }
        }

        // 2. Create new node
        let height = self.random_height();
        let new_node = Node::new(key, value);

        // 3. Link node
        for i in 0..height {
            let prev = prevs[i];
            let next = unsafe { (*prev).next[i].load(Ordering::Relaxed) };
            
            unsafe {
                // Set next pointer of new node
                (*new_node).next[i].store(next, Ordering::Relaxed);
                
                // Crucial step: Point prev's next to the new node
                // Use Release semantics to ensure that when reader sees the pointer,
                // new_node's content (key, value, next) is fully initialized
                (*prev).next[i].store(new_node, Ordering::Release);
            }
        }
    }

    // Read operation: &self, supports concurrent calls
    pub fn get(&self, key: &K) -> Option<&V> {
        let mut curr = self.head;
        for i in (0..MAX_HEIGHT).rev() {
            loop {
                // Crucial step: Acquire pairs with the writer's Release store,
                // so if we observe a non-null pointer, we also observe the
                // node's fully initialized key, value, and next pointers.
                let next = unsafe { (*curr).next[i].load(Ordering::Acquire) };
                
                if next.is_null() {
                    break;
                }
                
                let next_ref = unsafe { &*next };
                if &next_ref.key > key {
                    break;
                }
                if &next_ref.key == key {
                    return Some(&next_ref.value);
                }
                curr = next;
            }
        }
        None
    }

    fn random_height(&self) -> usize {
        let mut h = 1;
        let mut rng = rand::thread_rng();
        while h < MAX_HEIGHT && rng.gen_bool(0.25) {
            h += 1;
        }
        h
    }
}

// Two pieces the demo still needs for real use: raw pointers make the type
// !Send/!Sync by default, so we opt in explicitly (sound because published
// nodes are immutable and never freed early), and a Drop impl frees all nodes.
unsafe impl<K: Send + Sync, V: Send + Sync> Send for AppendOnlySkipList<K, V> {}
unsafe impl<K: Send + Sync, V: Send + Sync> Sync for AppendOnlySkipList<K, V> {}

impl<K, V> Drop for AppendOnlySkipList<K, V> {
    fn drop(&mut self) {
        unsafe {
            // Level 0 links every node, so walking it frees everything.
            let mut curr = (*self.head).next[0].load(Ordering::Relaxed);
            while !curr.is_null() {
                let next = (*curr).next[0].load(Ordering::Relaxed);
                drop(Box::from_raw(curr));
                curr = next;
            }
            // The sentinel's key/value came from mem::zeroed(); deallocate it
            // without running their destructors.
            drop(Box::from_raw(self.head as *mut std::mem::ManuallyDrop<Node<K, V>>));
        }
    }
}

The Magic of Memory Ordering

The most intriguing part of the code is the pairing of Ordering::Release and Ordering::Acquire:

  1. Writer (Release): When the writer executes (*prev).next[i].store(new_node, Release), it guarantees that all memory writes before this point (i.e., new_node's Key, Value, and its own next pointer initialization) are visible to any thread that observes this store.
  2. Reader (Acquire): When the reader executes load(Acquire) and reads the address of new_node, it guarantees visibility of all writes that happened before the Release.

This establishes a Happens-Before relationship. Without this synchronization, a reader might read a non-null pointer, follow it, but find garbage in the Key, or an uninitialized Value.
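
The pairing can be isolated in a minimal, self-contained sketch (the names here are hypothetical, not from the skiplist above): a writer initializes a struct with plain stores, then publishes its address with a Release store; a reader that observes the non-null pointer with Acquire is guaranteed to see the initialized fields, never garbage.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::{ptr, thread};

struct Payload {
    a: u64,
    b: u64,
}

static SLOT: AtomicPtr<Payload> = AtomicPtr::new(ptr::null_mut());

fn publish_and_read() -> (u64, u64) {
    let writer = thread::spawn(|| {
        // Plain initialization, then publish with Release: everything written
        // to *p happens-before any Acquire load that observes this pointer.
        let p = Box::into_raw(Box::new(Payload { a: 1, b: 2 }));
        SLOT.store(p, Ordering::Release);
    });

    let reader = thread::spawn(|| loop {
        let p = SLOT.load(Ordering::Acquire);
        if !p.is_null() {
            let v = unsafe { &*p };
            return (v.a, v.b); // never garbage: Acquire saw the Release
        }
        std::hint::spin_loop();
    });

    writer.join().unwrap();
    reader.join().unwrap()
}

fn main() {
    assert_eq!(publish_and_read(), (1, 2));
    println!("ok");
}
```

Had the writer used a Relaxed store to SLOT, the reader could legally observe the pointer before the field writes, which is exactly the "non-null pointer, garbage contents" failure described above.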

Limitations of this Design

Every technical choice is a trade-off. The cost of an Append-only SkipList is obvious:

  1. Memory Bloat: Since deletions are impossible, all historical data resides in memory. This limits its use cases—it's not suitable as a general-purpose cache, but fits perfectly as an LSM Tree MemTable (flushed to disk periodically) or a short-lived streaming index.
  2. Specific Write Patterns: It assumes append-only writes; duplicate Keys are allowed unless the writer checks for them explicitly. If Value updates are needed, they are usually implemented by inserting a newer version (MVCC-style) rather than by in-place modification.
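
As a sketch of that versioned-insert idea (BTreeMap stands in for the skiplist here, since both are ordered maps; all names are illustrative): a composite (key, version) key turns every "update" into a fresh insert, and a reader takes the highest version at or below its snapshot.

```rust
use std::collections::BTreeMap;

// Read the newest value for `key` whose version is <= `snapshot`.
fn latest(index: &BTreeMap<(u64, u64), &'static str>, key: u64, snapshot: u64) -> Option<&'static str> {
    index
        .range((key, 0)..=(key, snapshot))
        .next_back()
        .map(|(_, v)| *v)
}

fn main() {
    let mut index: BTreeMap<(u64, u64), &'static str> = BTreeMap::new();
    index.insert((1, 1), "alice");
    index.insert((1, 2), "alice-v2"); // "update" = append a newer version
    index.insert((2, 1), "bob");

    assert_eq!(latest(&index, 1, u64::MAX), Some("alice-v2"));
    assert_eq!(latest(&index, 1, 1), Some("alice")); // read at an older snapshot
    assert_eq!(latest(&index, 3, u64::MAX), None);
    println!("ok");
}
```

Older versions stay in memory forever, which is why this structure pairs naturally with periodic flush-and-discard, as in an LSM MemTable.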

Conclusion

The Append-only SkipList is a prime example of "subtractive" design. By removing the "delete" feature, it eliminates the thorniest memory reclamation issues in concurrent programming, simplifying complex lock-free algorithms into a few atomic pointer operations. When building high-performance logging systems, message queues, or database engines, this specialized data structure often delivers lower latency and higher throughput than a generic ConcurrentHashMap.

Sometimes, to run faster, you just need to drop some unnecessary luggage.