Lazy Deletion and Space Compaction in Search Systems
When building high-performance search systems, we often face a dilemma: how do we handle deletion requests quickly while keeping the index compact and queries efficient?
Deleting documents directly from an inverted index on disk is prohibitively expensive. Because inverted indexes are typically highly compressed and stored contiguously, deleting a single document ID might require rewriting an entire index segment.
Therefore, most modern search systems (like Lucene or the architecture discussed today) adopt a Lazy Deletion strategy. A deletion operation merely marks the document as "deleted" in memory or in a specific "tombstone" file. The actual physical deletion and space reclamation occur during a background process known as Segment Merge.
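Conceptually, the tombstone mechanism boils down to an append-only record of dead doc IDs. Here is a minimal sketch of a hypothetical in-memory deletion log (illustrative only, not Lucene's actual on-disk format):

```zig
const std = @import("std");

// Hypothetical sketch: a deletion is recorded by appending the doc ID to a
// tombstone log; the index segment itself is never touched.
pub const TombstoneLog = struct {
    ids: std.ArrayList(u32),

    pub fn init(allocator: std.mem.Allocator) TombstoneLog {
        return .{ .ids = std.ArrayList(u32).init(allocator) };
    }

    pub fn deinit(self: *TombstoneLog) void {
        self.ids.deinit();
    }

    /// Amortized O(1) append -- no random IO against the index files.
    pub fn markDeleted(self: *TombstoneLog, doc_id: u32) !void {
        try self.ids.append(doc_id);
    }
};
```

In a real system this log would also be persisted (e.g. flushed to a tombstone file) so deletions survive restarts, but the key property is the same: deleting is a cheap append, not a rewrite.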
Today, we will demonstrate the core logic of this process using the Zig programming language: specifically, how to filter out deleted documents using a bitmap during the merger of multiple index segments to achieve space compaction.
Core Design: Bitmap Filtering and Streaming Merge
The merge process is essentially a multi-way merge sort, but with the introduction of a "filter" during the merging phase.
- Input: Multiple old index segments, each containing a sequence of sorted document IDs.
- Deletion Set: A global or per-segment bitmap indicating which IDs are "dead".
- Output: A new, more compact index segment containing only "live" documents.
This design defers expensive IO operations to the merge phase, which is typically sequential read/write and thus very disk-friendly.
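The deletion set in this design is usually an actual bitmap rather than a hash set (the full example later in this article substitutes a hash map for simplicity). As a sketch, Zig's standard library `std.DynamicBitSet` can play that role, assuming doc IDs are dense enough to map one-to-one onto bit indices:

```zig
const std = @import("std");

// Sketch: a per-segment deletion bitmap. Doc IDs are assumed dense,
// so each ID maps directly to one bit.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // One bit per document in the segment (here: a 16-doc toy segment).
    var dead = try std.DynamicBitSet.initEmpty(allocator, 16);
    defer dead.deinit();

    dead.set(3); // mark doc 3 as deleted
    dead.set(9); // mark doc 9 as deleted

    // At merge (or query) time, a liveness check is a single bit test.
    std.debug.assert(!dead.isSet(1));
    std.debug.assert(dead.isSet(3));
    std.debug.print("doc 3 dead: {}\n", .{dead.isSet(3)});
}
```

The bitmap costs one bit per document regardless of how many deletions exist, which is why production systems prefer it over a hash set once segments grow large.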
Zig Implementation: Clean Room Reconstruction
We will simulate a simplified scenario: merging two lists of integers (representing document IDs) and filtering out IDs present in a blocklist. To showcase Zig's explicit memory management, we will pass allocators around explicitly rather than rely on a global one.
const std = @import("std");
const Allocator = std.mem.Allocator;
/// Simulating a simple index segment iterator
const SegmentIterator = struct {
docs: []const u32,
current: usize = 0,
pub fn next(self: *SegmentIterator) ?u32 {
if (self.current >= self.docs.len) return null;
const doc = self.docs[self.current];
self.current += 1;
return doc;
}
pub fn peek(self: *SegmentIterator) ?u32 {
if (self.current >= self.docs.len) return null;
return self.docs[self.current];
}
};
/// Simulating a deletion set (using a HashMap instead of a Bitmap for simplicity)
const DeletionSet = std.AutoHashMap(u32, void);
/// Merger: Performs multi-way merge and filtering
const IndexMerger = struct {
allocator: Allocator,
deletion_set: DeletionSet,
pub fn init(allocator: Allocator, deletions: DeletionSet) IndexMerger {
return IndexMerger{
.allocator = allocator,
.deletion_set = deletions,
};
}
/// Executes the merge, returning a new compact list of documents
pub fn merge(self: *IndexMerger, segment_a: []const u32, segment_b: []const u32) ![]u32 {
var list = std.ArrayList(u32).init(self.allocator);
errdefer list.deinit();
var iter_a = SegmentIterator{ .docs = segment_a };
var iter_b = SegmentIterator{ .docs = segment_b };
// Classic two-pointer merge algorithm, with an added "is_alive" check
while (true) {
const val_a = iter_a.peek();
const val_b = iter_b.peek();
var candidate: u32 = 0;
if (val_a != null and val_b != null) {
if (val_a.? < val_b.?) {
candidate = iter_a.next().?;
} else if (val_a.? > val_b.?) {
candidate = iter_b.next().?;
} else {
// Same ID (shouldn't happen in sharded indexes, but handled for robustness)
candidate = iter_a.next().?;
_ = iter_b.next();
}
} else if (val_a != null) {
candidate = iter_a.next().?;
} else if (val_b != null) {
candidate = iter_b.next().?;
} else {
break; // Both empty, merge finished
}
// Key Step: Filter out deleted documents
if (!self.deletion_set.contains(candidate)) {
try list.append(candidate);
}
}
return list.toOwnedSlice();
}
};
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// 1. Prepare simulated data
const seg1 = [_]u32{ 1, 3, 5, 7, 9 };
const seg2 = [_]u32{ 2, 4, 6, 8, 10 };
// 2. Mark IDs 3, 4, 9 as deleted
var deletions = DeletionSet.init(allocator);
defer deletions.deinit();
try deletions.put(3, {});
try deletions.put(4, {});
try deletions.put(9, {});
// 3. Execute merge
var merger = IndexMerger.init(allocator, deletions);
const result = try merger.merge(&seg1, &seg2);
defer allocator.free(result);
// 4. Verify output: 3, 4, 9 should be gone
std.debug.print("Merged Docs: {any}\n", .{result});
// Expected output: { 1, 2, 5, 6, 7, 8, 10 }
}
Trade-off Analysis
In the code above, we see the core logic of "Merge-Time Deletion".
Advantages:
- High Write Throughput: Deletions become simple append-only writes (to a deletion log), requiring no random IO.
- Complete Space Reclamation: The new merged segment not only drops dead data but is typically re-compressed with delta encoding, saving additional space.
Disadvantages:
- Read Amplification: Until a merge runs, every query must consult both the index and the deletion bitmap to filter out dead documents, adding a small computational cost at query time.
- Temporary Space Waste: Deleted data continues to occupy disk space until the merge runs.
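The delta encoding mentioned under "Advantages" works because the merged doc IDs are sorted: storing the gap between consecutive IDs yields small integers that a varint or bit-packing scheme can then compress well. A minimal sketch (gap computation only, no byte-level packing):

```zig
const std = @import("std");

// Sketch: delta-encode a sorted doc ID list into gaps between neighbors.
// Small gaps can then be stored in fewer bytes (e.g. with varint coding).
fn deltaEncode(allocator: std.mem.Allocator, sorted_ids: []const u32) ![]u32 {
    const gaps = try allocator.alloc(u32, sorted_ids.len);
    var prev: u32 = 0;
    for (sorted_ids, 0..) |id, i| {
        gaps[i] = id - prev;
        prev = id;
    }
    return gaps;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // The compact output of the merge example above.
    const merged = [_]u32{ 1, 2, 5, 6, 7, 8, 10 };
    const gaps = try deltaEncode(allocator, &merged);
    defer allocator.free(gaps);

    std.debug.print("Gaps: {any}\n", .{gaps}); // 1, 1, 3, 1, 1, 1, 2
}
```

Note how compaction and compression reinforce each other: removing dead documents can actually shrink some gaps' entropy is low either way, but a freshly merged, densely packed segment gives the encoder the best possible input.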
Conclusion
Lazy deletion is a classic application of the "space-time tradeoff" in storage system design. By deferring physical deletion to the merge phase, we ensure high concurrent write performance while ultimately achieving data consistency and storage compactness. Zig's explicit memory control and error handling mechanisms are well-suited for writing such resource-sensitive low-level system logic.