The Art of Sync: Conflict Resolution in Industrial File Queues
In distributed file sync systems, how do we ensure files reliably sync from A to B? When sync fails, do we discard or retry? If retrying, how do we avoid duplicate processing? These questions form the core challenges of file sync queues. Today we'll analyze an industrial file sync queue implementation to see how it solves complex distributed problems with elegant design.
The Core Problem: Reliable Delivery and Conflict Handling
The fundamental challenges in file sync:
- Reliability: Ensure files eventually sync successfully
- Duplicate detection: Avoid side effects from duplicate processing
- Conflict resolution: Decide the outcome when multiple nodes process the same file simultaneously
The industrial solution combines a database-backed queue with an attempt counter:
- Use database for persistent storage to prevent data loss
- Use attempt counter to track retry count
- Use "return" mechanism to handle failures instead of discarding
Core Design in Industrial Implementation
In an industrial cloud storage system, I found a file sync queue module refined over years. Its design choices are remarkably pragmatic:
Design One: Database-Backed Queue
class TDatasyncQueueClient: public TDBEntities<TDatasyncQueueEntry> {
    TMap<TString, TDatasyncQueueEntry> Get(ui64 amount, ...) const;
    bool Delete(const TString& assignmentId, ...) const;
    bool Push(const TString& assignmentId, ...) const;
    bool Return(TDatasyncQueueEntry assignment, ...) const;
};
Choice: Use database as the persistent layer for the queue
Trade-off considerations:
- Pros: No data loss, supports transactions, cross-process sharing
- Cons: Higher latency than in-memory queues, database bottleneck possible
- Good for: Production environments requiring high reliability
Design Two: Attempt Counter
bool TDatasyncQueueClient::Return(TDatasyncQueueEntry assignment, ...) {
    assignment.SetProcessAt(TInstant::Now());
    assignment.SetAttempt(assignment.GetAttempt() + 1);
    return Upsert(assignment, session);
}
Choice: Increment Attempt counter and re-queue on failure
Trade-off considerations:
- Pros: Simple to implement, provides retry mechanism, can track failure count
- Cons: Only guarantees "at-least-once", not "exactly-once"
- Good for: Idempotent operations or repeatable tasks
Design Three: Session Mechanism
NDrive::TEntitySession TDatasyncQueueClient::BuildSession(bool readonly) const {
    return NDrive::TEntitySession(Database->CreateTransaction(readonly));
}
Choice: Use session to encapsulate database transactions
Trade-off considerations:
- Pros: Unified transaction management, simplified error handling
- Cons: Increases call complexity, requires proper lifecycle management
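The session idea translates directly: every queue operation runs inside a session that owns exactly one transaction, and a deferred Close rolls back anything that was never committed. A minimal in-memory sketch (the type and method names are illustrative; a real version would wrap a database transaction the way the C++ wraps Database->CreateTransaction):

```go
package main

import "fmt"

// Session owns one logical transaction. Illustrative stand-in: a real
// implementation would hold an open database transaction handle.
type Session struct {
	readonly  bool
	committed bool
	closed    bool
}

func BuildSession(readonly bool) *Session {
	// In a real client this would begin a database transaction here.
	return &Session{readonly: readonly}
}

func (s *Session) Commit() error {
	if s.closed {
		return fmt.Errorf("session already closed")
	}
	s.committed = true
	s.closed = true
	return nil
}

// Close rolls back if Commit was never called — the defer-based
// lifecycle that keeps error handling in one place.
func (s *Session) Close() {
	if !s.closed {
		s.closed = true // rollback path
	}
}

func processOne() (err error) {
	s := BuildSession(false)
	defer s.Close() // no-op after a successful Commit
	// ... Get/Delete/Return would all write through s here ...
	return s.Commit()
}

func main() {
	fmt.Println(processOne())
}
```

The pattern's payoff is that any early return or panic between BuildSession and Commit automatically rolls back, so a crashed worker never leaves a half-processed entry committed.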
Clean-Room Reimplementation in Go
To demonstrate the design thinking, I reimplemented the core logic in Go:
package main

import (
	"fmt"
	"sync"
	"time"
)

type QueueEntry struct {
	AssignmentID string
	CreatedAt    time.Time
	ProcessAt    time.Time
	Attempt      int
}

// QueueClient is an in-memory stand-in for the database-backed queue;
// the mutex plays the role the database transaction plays in the original.
type QueueClient struct {
	mu   sync.Mutex
	data map[string]*QueueEntry
}

func NewQueueClient() *QueueClient {
	return &QueueClient{data: make(map[string]*QueueEntry)}
}

// Get returns up to amount entries that are due for processing
// (ProcessAt <= now). Map iteration order is random, so the pick
// among due entries is arbitrary.
func (c *QueueClient) Get(amount int) map[string]*QueueEntry {
	c.mu.Lock()
	defer c.mu.Unlock()
	result := make(map[string]*QueueEntry)
	now := time.Now()
	for id, entry := range c.data {
		if len(result) >= amount {
			break
		}
		if !entry.ProcessAt.After(now) {
			result[id] = entry
		}
	}
	return result
}

// Delete removes an entry after successful processing.
func (c *QueueClient) Delete(assignmentID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.data[assignmentID]; ok {
		delete(c.data, assignmentID)
		return true
	}
	return false
}

// Push enqueues a new entry, ready for immediate processing.
func (c *QueueClient) Push(assignmentID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	c.data[assignmentID] = &QueueEntry{
		AssignmentID: assignmentID,
		CreatedAt:    now,
		ProcessAt:    now,
		Attempt:      1,
	}
	return true
}

// Return is the key to conflict resolution: re-queue on failure
func (c *QueueClient) Return(entry QueueEntry) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	entry.ProcessAt = time.Now()
	entry.Attempt++ // Key: increment attempt counter
	c.data[entry.AssignmentID] = &entry
	return true
}

func (c *QueueClient) Size() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.data)
}

func main() {
	client := NewQueueClient()

	// Push entries
	client.Push("file-001")
	client.Push("file-002")
	fmt.Printf("Queue size: %d\n", client.Size())

	// Get and process
	entries := client.Get(2)
	for id, entry := range entries {
		fmt.Printf("Processing: %s (attempt %d)\n", id, entry.Attempt)
		client.Delete(id)
	}

	// Simulate failure - use Return
	failedEntry := QueueEntry{AssignmentID: "file-002", Attempt: 1}
	client.Return(failedEntry)
	fmt.Printf("After Return, queue size: %d\n", client.Size())

	// Verify attempt counter increased
	entries = client.Get(10)
	for id, entry := range entries {
		fmt.Printf("Entry: %s, Attempt: %d\n", id, entry.Attempt)
	}
}
Output (the two "Processing" lines may appear in either order, since Go randomizes map iteration):
Queue size: 2
Processing: file-001 (attempt 1)
Processing: file-002 (attempt 1)
After Return, queue size: 1
Entry: file-002, Attempt: 2
When to Use Database-Backed Queues
Good fit:
- High reliability requirements, cannot lose data
- Queue shared across multiple processes or services
- Need persistent queue state for recovery
Poor fit:
- Ultra-low latency requirements
- Small systems where in-memory queue is sufficient
- Already have message queue infrastructure
Summary
The design of industrial file sync queues is full of trade-offs:
- Database-backed vs. in-memory queue: reliability vs. performance
- At-least-once vs. exactly-once: simplicity vs. accuracy
- Attempt counter vs. deduplication table: lightweight vs. precise
In Go, we can implement similar designs more concisely, but the core trade-offs remain the same — there's no perfect solution, only choices that fit the scenario.