The Art of Sync: Conflict Resolution in Industrial File Queues
In distributed file sync systems, how do we ensure files reliably sync from A to B? When sync fails, do we discard or retry? If retrying, how do we avoid duplicate processing? These questions form the core challenges of file sync queues. Today we'll analyze an industrial file sync queue implementation to see how it solves complex distributed problems with elegant design.
The Core Problem: Reliable Delivery and Conflict Handling
The fundamental challenges in file sync:
- Reliability: Ensure files eventually sync successfully
- Duplicate detection: Avoid side effects from duplicate processing
- Conflict resolution: Decide the outcome when multiple nodes process the same file simultaneously
The industrial solution combines a database-backed queue with an attempt counter:
- Use database for persistent storage to prevent data loss
- Use attempt counter to track retry count
- Use "return" mechanism to handle failures instead of discarding
Core Design in Industrial Implementation
In an industrial cloud storage system, I found a file sync queue module refined over years. Its design choices are remarkably pragmatic:
Design One: Database-Backed Queue
class TDatasyncQueueClient: public TDBEntities<TDatasyncQueueEntry> {
    TMap<TString, TDatasyncQueueEntry> Get(ui64 amount, ...) const;
    bool Delete(const TString& assignmentId, ...) const;
    bool Push(const TString& assignmentId, ...) const;
    bool Return(TDatasyncQueueEntry assignment, ...) const;
};
Choice: Use database as the persistent layer for the queue
Trade-off considerations:
- Pros: No data loss, supports transactions, cross-process sharing
- Cons: Higher latency than in-memory queues, database bottleneck possible
- Good for: Production environments requiring high reliability
Design Two: Attempt Counter
bool TDatasyncQueueClient::Return(TDatasyncQueueEntry assignment, ...) {
    assignment.SetProcessAt(TInstant::Now());
    assignment.SetAttempt(assignment.GetAttempt() + 1);
    return Upsert(assignment, session);
}
Choice: Increment Attempt counter and re-queue on failure
Trade-off considerations:
- Pros: Simple to implement, provides retry mechanism, can track failure count
- Cons: Only guarantees "at-least-once", not "exactly-once"
- Good for: Idempotent operations or repeatable tasks
Design Three: Session Mechanism
NDrive::TEntitySession TDatasyncQueueClient::BuildSession(bool readonly) const {
    return NDrive::TEntitySession(Database->CreateTransaction(readonly));
}
Choice: Use session to encapsulate database transactions
Trade-off considerations:
- Pros: Unified transaction management, simplified error handling
- Cons: Increases call complexity, requires proper lifecycle management
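The session idea translates directly: every queue operation runs inside a session that owns exactly one transaction, and a deferred Close rolls back anything that was never committed. A minimal in-memory sketch (the type and method names are illustrative; a real version would wrap a database transaction the way the C++ wraps Database->CreateTransaction):

```go
package main

import "fmt"

// Session owns one logical transaction. Illustrative stand-in: a real
// implementation would hold an open database transaction handle.
type Session struct {
	readonly  bool
	committed bool
	closed    bool
}

func BuildSession(readonly bool) *Session {
	// In a real client this would begin a database transaction here.
	return &Session{readonly: readonly}
}

func (s *Session) Commit() error {
	if s.closed {
		return fmt.Errorf("session already closed")
	}
	s.committed = true
	s.closed = true
	return nil
}

// Close rolls back if Commit was never called — the defer-based
// lifecycle that keeps error handling in one place.
func (s *Session) Close() {
	if !s.closed {
		s.closed = true // rollback path
	}
}

func processOne() (err error) {
	s := BuildSession(false)
	defer s.Close() // no-op after a successful Commit
	// ... Get/Delete/Return would all write through s here ...
	return s.Commit()
}

func main() {
	fmt.Println(processOne())
}
```

The pattern's payoff is that any early return or panic between BuildSession and Commit automatically rolls back, so a crashed worker never leaves a half-processed entry committed.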
Clean-Room Reimplementation in Go
To demonstrate the design thinking, I reimplemented the core logic in Go:
package main

import (
	"fmt"
	"sync"
	"time"
)

type QueueEntry struct {
	AssignmentID string
	CreatedAt    time.Time
	ProcessAt    time.Time
	Attempt      int
}

// QueueClient is an in-memory stand-in for the database-backed queue;
// the mutex plays the role the database transaction plays in the original.
type QueueClient struct {
	mu   sync.Mutex
	data map[string]*QueueEntry
}

func NewQueueClient() *QueueClient {
	return &QueueClient{data: make(map[string]*QueueEntry)}
}

// Get returns up to amount entries that are due for processing
// (ProcessAt <= now). Map iteration order is random, so the pick
// among due entries is arbitrary.
func (c *QueueClient) Get(amount int) map[string]*QueueEntry {
	c.mu.Lock()
	defer c.mu.Unlock()
	result := make(map[string]*QueueEntry)
	now := time.Now()
	for id, entry := range c.data {
		if len(result) >= amount {
			break
		}
		if !entry.ProcessAt.After(now) {
			result[id] = entry
		}
	}
	return result
}

// Delete removes an entry after successful processing.
func (c *QueueClient) Delete(assignmentID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.data[assignmentID]; ok {
		delete(c.data, assignmentID)
		return true
	}
	return false
}

// Push enqueues a new entry, ready for immediate processing.
func (c *QueueClient) Push(assignmentID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	c.data[assignmentID] = &QueueEntry{
		AssignmentID: assignmentID,
		CreatedAt:    now,
		ProcessAt:    now,
		Attempt:      1,
	}
	return true
}

// Return is the key to conflict resolution: re-queue on failure
func (c *QueueClient) Return(entry QueueEntry) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	entry.ProcessAt = time.Now()
	entry.Attempt++ // Key: increment attempt counter
	c.data[entry.AssignmentID] = &entry
	return true
}

func (c *QueueClient) Size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.data)
}

func main() {
	client := NewQueueClient()

	// Push entries
	client.Push("file-001")
	client.Push("file-002")
	fmt.Printf("Queue size: %d\n", client.Size())

	// Get and process
	entries := client.Get(2)
	for id, entry := range entries {
		fmt.Printf("Processing: %s (attempt %d)\n", id, entry.Attempt)
		client.Delete(id)
	}

	// Simulate failure - use Return
	failedEntry := QueueEntry{AssignmentID: "file-002", Attempt: 1}
	client.Return(failedEntry)
	fmt.Printf("After Return, queue size: %d\n", client.Size())

	// Verify attempt counter increased
	entries = client.Get(10)
	for id, entry := range entries {
		fmt.Printf("Entry: %s, Attempt: %d\n", id, entry.Attempt)
	}
}
Output (the two "Processing" lines may appear in either order, since Go randomizes map iteration):
Queue size: 2
Processing: file-001 (attempt 1)
Processing: file-002 (attempt 1)
After Return, queue size: 1
Entry: file-002, Attempt: 2
When to Use Database-Backed Queues
Good fit:
- High reliability requirements, cannot lose data
- Queue shared across multiple processes or services
- Need persistent queue state for recovery
Poor fit:
- Ultra-low latency requirements
- Small systems where in-memory queue is sufficient
- Already have message queue infrastructure
Summary
The design of industrial file sync queues is full of trade-offs:
- Database-backed vs. in-memory queue: reliability vs. performance
- At-least-once vs. exactly-once: simplicity vs. accuracy
- Attempt counter vs. deduplication table: lightweight vs. precise
In Go, we can implement similar designs more concisely, but the core trade-offs remain the same — there's no perfect solution, only choices that fit the scenario.