Cache Line Padding and Type Dispatch in Chunked Queues
In high-concurrency systems, queues are the backbone of producer-consumer patterns. But behind the seemingly simple implementation lie sophisticated design decisions. Today we dive into an industrial-grade chunked queue implementation, exploring its cache line padding and type dispatch mechanisms.
The Root Problem: False Sharing
On multi-core processors, each CPU core has its own L1 cache. When two threads operate on different data that happens to share a cache line, the cache coherence protocol forces the whole line to bounce between cores even though there is no actual data contention. This is False Sharing.
For queues, producers writing and consumers reading are high-frequency operations. If writer and reader state happen to land on the same cache line, severe performance degradation occurs.
Solution: Cache Line Padding
Industrial code uses explicit padding to ensure each state field occupies an independent cache line:
// Cache line padding in industrial code
template <typename T, size_t PadSize = PLATFORM_CACHE_LINE>
struct TPadded: public T {
    char Pad[PadSize - sizeof(T) % PadSize]; // round size up to a PadSize multiple
};
// Note: sizeof(*this) is unavailable at class scope; the size check must live
// where the type is complete, e.g.:
// static_assert(sizeof(TPadded<TWriterState>) % PLATFORM_CACHE_LINE == 0, "padding does not work");
Then wrap queue read/write states:
TPadded<TWriterState> Writer;
TPadded<TReaderState> Reader;
With this wrapping, writer and reader state are guaranteed to land on different cache lines, so even under frequent updates the producer never invalidates the consumer's cached line, and vice versa.
Trade-off Analysis
Space Cost: Each state field carries PadSize - sizeof(T) % PadSize bytes of padding. For a typical 64-byte cache line and sizeof(T) = 8 (a pointer), that is 56 bytes of padding per field, i.e. 87.5% overhead.
Time Benefit: In high-frequency update scenarios, the gain from avoiding false sharing far exceeds the space cost. Industrial code typically trades space for time here.
Type Dispatch: POD vs Non-POD
Queues need to store arbitrary types. For POD (Plain Old Data) types (int, float, pointers), direct memcpy works. For non-POD types (classes with destructors), placement new construction and explicit destruction are required.
Industrial code uses Type Traits for automatic dispatch:
// POD type handler
template <typename T>
struct TPodTypeHelper {
    template <typename TT>
    static void Write(T* ptr, TT&& value) {
        *ptr = value;
    }
    static T Read(T* ptr) {
        return *ptr;
    }
    static void Destroy(T* ptr) {
        Y_UNUSED(ptr); // No-op
    }
};
// Non-POD type handler
template <typename T>
struct TNonPodTypeHelper {
    template <typename TT>
    static void Write(T* ptr, TT&& value) {
        new (ptr) T(std::forward<TT>(value)); // placement new
    }
    static T Read(T* ptr) {
        return std::move(*ptr);
    }
    static void Destroy(T* ptr) {
        ptr->~T(); // explicit destruction
    }
};
// Choose based on type traits
template <typename T>
using TTypeHelper = std::conditional_t<
    TTypeTraits<T>::IsPod,
    TPodTypeHelper<T>,
    TNonPodTypeHelper<T>>;
Trade-off Analysis
POD Path: The compiler can often inline reads and writes down to single move instructions, maximizing performance. But it never invokes user-defined constructors or destructors, so it is only correct for trivially copyable types.
Non-POD Path: Requires explicit construction and destruction calls with some overhead. But correctly handles resource lifecycle.
Selection Basis: Types with resource management needs (memory, file handles) must use the non-POD path. Industrial code typically lets the compiler choose automatically based on type traits.
Chunking: Reducing Syscalls
Beyond cache line padding, chunked queues use a chunking strategy: instead of a separate allocation per enqueue, they pre-allocate a 4KB (page-sized) chunk:
// ChunkSize and TChunkHeader (next-chunk pointer + element count)
// come from the enclosing queue template
template <typename T, size_t ChunkSize>
struct TChunk {
    static constexpr size_t MaxCount = (ChunkSize - sizeof(TChunkHeader)) / sizeof(T);
    char Entries[MaxCount * sizeof(T)]; // raw storage, constructed in place
};
When a chunk fills up, allocate a new chunk and link via pointers, forming a linked list.
Trade-off Analysis
Batch Allocation: One page-sized allocation serves many enqueues, cutting the number of malloc/new calls and, with them, the page-level syscalls (e.g. mmap) the allocator occasionally makes.
Cache Locality: Elements within the same chunk are contiguous in memory, improving cache hit rate.
Memory Fragmentation: If chunk size doesn't match page size, memory fragmentation may occur.
Clean Room Reimplementation in Zig
Here's the same design philosophy demonstrated in Zig (compilable code):
// Chunk queue clean room demo in Zig
// Demonstrates: cache line padding + chunk management
const std = @import("std");

pub const CacheLine = 64;

/// Cache line padding structure
fn Padded(comptime T: type) type {
    return struct {
        value: T,
        padding: [CacheLine - @sizeOf(T)]u8,

        pub fn init(initial: T) @This() {
            return .{
                .value = initial,
                .padding = [_]u8{0} ** (CacheLine - @sizeOf(T)),
            };
        }
    };
}

/// Calculate max elements per chunk
fn calcMaxCount(comptime T: type, comptime chunk_size: usize) usize {
    const header_size = @sizeOf(usize) * 2; // count + next pointer
    const element_size = @sizeOf(T);
    return (chunk_size - header_size) / element_size;
}

/// Single producer single consumer chunked queue
pub fn OneOneQueue(comptime T: type, comptime chunk_size: usize) type {
    const max_count = calcMaxCount(T, chunk_size);
    return struct {
        const Self = @This();
        const MaxElements = max_count;

        // Queue state with cache line padding.
        // Key design: writer_state and reader_state each occupy one cache line.
        // This prevents false sharing between producer and consumer threads.
        writer_state: Padded(*anyopaque),
        reader_state: Padded(*anyopaque),
        reader_index: usize,

        pub fn init() Self {
            return .{
                .writer_state = Padded(*anyopaque).init(undefined),
                .reader_state = Padded(*anyopaque).init(undefined),
                .reader_index = 0,
            };
        }
    };
}

pub fn main() void {
    // Test cache line padding
    const PaddedUsize = Padded(usize);
    std.debug.print("Padded<usize> size: {} bytes\n", .{
        @sizeOf(PaddedUsize),
    });
    std.debug.print("CacheLine constant: {} bytes\n\n", .{CacheLine});

    // Test chunk size calculation
    std.debug.print("For i32 (4 bytes), chunk_size=4096:\n", .{});
    std.debug.print("  Max elements per chunk: {}\n", .{calcMaxCount(i32, 4096)});
}
The output validates the design:
Padded<usize> size: 64 bytes
CacheLine constant: 64 bytes
For i32 (4 bytes), chunk_size=4096:
Max elements per chunk: 1020
Summary
Industrial-grade chunked queue design embodies rich engineering trade-offs:
Cache Line Padding: Trades space for time, eliminating false sharing. In high-concurrency scenarios, this is essential.
Type Dispatch: Automatically chooses optimal paths based on type traits. POD types use the fast path; non-POD types use the safe path.
Chunking: Batch allocation reduces syscalls and improves cache locality.
These designs aren't silver bullets—they're contextual trade-offs. Understanding these trade-offs is what enables truly efficient systems-level code.