Memory Indexing and Atomic Banning Strategy in Real-Time Search

When building high-throughput real-time search systems, the Memory Indexer serves as the critical bridge between the ingestion stream and persistent storage. Its core challenge lies in maintaining high-concurrency writes while efficiently handling immediate document invalidation (banning) and status queries.

This article explores a document management strategy based on atomic operations, demonstrating its core logic using Rust.

Context: Read-Write Conflicts and State Management

In typical search architectures, documents are not immediately flushed to disk upon arrival. Instead, they reside in memory structures to accumulate. During this phase, the system must maintain a DocInfo list to track the status of each document (e.g., whether it is alive or deleted).

If we simply use a global lock to protect the entire list, read/write performance degrades sharply as concurrency increases. A superior strategy involves using Atomic Operations to manage the state of individual documents, thereby minimizing lock granularity and enabling lock-free paths for certain read operations.

Implementation: Atomic State Management

We will implement a simplified memory document manager. Each document contains an atomic counter (for reference counting or versioning) and a status flag.

Data Structure Definition

Leveraging Rust's std::sync::atomic, we can build a thread-safe document status structure:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

// Simulating document metadata
struct DocInfo {
    // Global unique ID for the document
    doc_id: u64,
    // Atomic boolean to mark if the document is valid (not banned)
    is_valid: AtomicBool,
    // Atomic counter, useful for versioning or reference counting
    ref_count: AtomicU64,
}

impl DocInfo {
    fn new(doc_id: u64) -> Self {
        Self {
            doc_id,
            is_valid: AtomicBool::new(true),
            ref_count: AtomicU64::new(0),
        }
    }

    // Fast status check, no mutex required
    fn is_active(&self) -> bool {
        self.is_valid.load(Ordering::Acquire)
    }

    // Atomic operation: mark document as invalid (Ban)
    fn ban(&self) {
        self.is_valid.store(false, Ordering::Release);
    }
}

Container and Concurrency Control

To manage these documents, we need a container. While the state of a single document is atomic, the expansion of the container (push) still requires protection. We use RwLock, but the key insight is: We do not need a write lock to update a document's status, nor do we need a read lock if we already hold the Arc pointer.

struct MemoryIndexer {
    // Use RwLock to protect structural changes to the Vec (like pushing)
    // Storing Arc<DocInfo> ensures stable addresses and independent sharing
    docs: RwLock<Vec<Arc<DocInfo>>>,
}

impl MemoryIndexer {
    fn new() -> Self {
        Self {
            docs: RwLock::new(Vec::new()),
        }
    }

    // Write new document: requires write lock
    fn add_document(&self, doc_id: u64) {
        let doc = Arc::new(DocInfo::new(doc_id));
        let mut w_guard = self.docs.write().unwrap();
        w_guard.push(doc);
    }

    // Mark as deleted: lookup via index only requires a read lock
    // The actual modification is atomic
    fn ban_document(&self, index: usize) -> bool {
        let r_guard = self.docs.read().unwrap();
        if let Some(doc) = r_guard.get(index) {
            // This is the essence of the design:
            // We hold a read lock on the container but can modify the document's state
            // because the state possesses Interior Mutability.
            doc.ban();
            return true;
        }
        false
    }

    // Status snapshot: high frequency, low lock contention
    fn get_active_count(&self) -> usize {
        let r_guard = self.docs.read().unwrap();
        r_guard.iter()
            .filter(|doc| doc.is_active())
            .count()
    }
}

Advantages of This Pattern

Extreme Read-Write Separation: Since AtomicBool provides interior mutability, we only need the container's "read lock" when modifying a document's logical state (banning). This means multiple threads can ban different documents simultaneously without blocking each other. A write lock is only necessary when adding new documents (changing container size).
CPU Cache Friendliness: Compared to locking a massive structure, atomic operations only invalidate the relevant cache lines, reducing bus traffic between cores.
Avoiding Shift Overhead: In memory indexing, "deletion" is typically just a marking operation. Physical deletion is deferred until the flush or merge phase. This "soft delete" strategy avoids the memory move overhead of Vec::remove and guarantees index stability.

Summary

In Rust, combining RwLock for structural protection with Atomic for state protection allows for building extremely high-performance memory indexing components. This design pattern is not only applicable to search engines but is also widely useful for any system requiring frequent state updates and concurrent queries. It demonstrates how fine-grained concurrency control can approach the performance limits of manual C++ memory management while maintaining Rust's memory safety guarantees.