The Guardian of Recommendation Systems: Industrial Quality Scoring Architecture
In recommendation systems, the quality of the recommendation pool directly determines user experience. When recommendation algorithms are updated, how do we ensure the new recommendations aren't worse than the old ones? This is a problem every industrial system must face. Today we'll analyze a real video recommendation quality checking module to see how it uses multi-dimensional scoring to safeguard recommendation quality.
The Core Problem: Guarding Recommendation Pool Quality
The fundamental challenges in recommendation systems:
- Algorithm iteration: Each algorithm update can affect recommendation quality
- Silent degradation: Some metrics may decline unnoticed
- Multi-dimensional evaluation: Quality can't be judged by a single metric alone
The solution is multi-dimensional quality checkers:
- Loss detection: Retention rate of critical query-rel-url combinations
- Distribution detection: Whether quality factor distributions are reasonable
- Percentage detection: Proportion of high-relevance content
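Before diving into the real module, the loss-detection idea can be illustrated with plain set arithmetic: treat each record as a (query, relevance, url) tuple and measure how many tuples from the old pool survive into the new one. (All values here are illustrative.)

```python
# Minimal sketch of loss detection: retention of (query, relevance, url) tuples.
old = {("python", 0.9, "url1"), ("python", 0.8, "url2"), ("python", 0.7, "url3")}
new = {("python", 0.9, "url1"), ("python", 0.8, "url2")}

lost = old - new                        # tuples that disappeared after the update
loss_ratio = len(lost) / len(old)       # 1 / 3
print(f"loss ratio: {loss_ratio:.2%}")  # loss ratio: 33.33%
```

The other two dimensions follow the same pattern: compute a summary statistic on both pools, then compare.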
Core Design in Industrial Implementation
In an industrial-grade video search system, I found a quality checking module that has been refined over years. Its design choices are remarkably pragmatic:
Design One: Interface Abstraction
typedef IDataChangeChecker<TPoolRecords> IPoolChangeChecker;
Choice: Use interface abstraction to define checkers
Trade-off considerations:
- Standardized interface makes it easy to extend new checkers
- Composition pattern allows stacking multiple checkers
- But increases code complexity
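The interface-plus-composition idea can be sketched in a few lines of Python (hypothetical names, not the original C++ types): one abstract checker interface, with concrete checkers stacked inside a composite.

```python
from abc import ABC, abstractmethod

class IChecker(ABC):
    """Standardized interface: every checker answers pass/fail for a pool pair."""
    @abstractmethod
    def check(self, old_pool: list, new_pool: list) -> bool: ...

class NonEmptyChecker(IChecker):
    def check(self, old_pool, new_pool):
        return len(new_pool) > 0

class NoShrinkChecker(IChecker):
    def check(self, old_pool, new_pool):
        return len(new_pool) >= len(old_pool) * 0.9  # allow up to 10% shrinkage

class CompositeChecker(IChecker):
    """Composition: a checker built by stacking other checkers."""
    def __init__(self, checkers):
        self.checkers = checkers
    def check(self, old_pool, new_pool):
        return all(c.check(old_pool, new_pool) for c in self.checkers)

gate = CompositeChecker([NonEmptyChecker(), NoShrinkChecker()])
print(gate.check([1, 2, 3], [1, 2]))  # False: the pool shrank by a third
```

Adding a new quality dimension is then just one more class implementing the interface; nothing else changes.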
Design Two: Probability Threshold
static constexpr double MIN_DEVIATION_PROBABILITY = 0.05;
Choice: Use 5% probability threshold for anomaly detection
Trade-off considerations:
- Hard thresholds are too rigid; probability thresholds are more flexible
- Requires maintaining historical data to calculate probabilities
- But can better adapt to data distribution changes
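One way to read "probability threshold" (my interpretation, not the original implementation): instead of a fixed cutoff on the metric itself, estimate how unlikely the new value is given historical observations, and alert when that empirical probability drops below 5%. A sketch, with illustrative numbers:

```python
MIN_DEVIATION_PROBABILITY = 0.05

def deviation_probability(history: list, new_value: float) -> float:
    """Fraction of historical values at least as far from the mean as new_value."""
    mean = sum(history) / len(history)
    new_dev = abs(new_value - mean)
    at_least_as_extreme = sum(1 for h in history if abs(h - mean) >= new_dev)
    return at_least_as_extreme / len(history)

# Ten past observations of some quality metric, hovering around 0.70.
history = [0.70, 0.72, 0.69, 0.71, 0.70, 0.68, 0.71, 0.70, 0.69, 0.72]

p = deviation_probability(history, 0.55)  # far outside the historical range
print(p <= MIN_DEVIATION_PROBABILITY)     # True: flag as anomalous
```

A value of 0.55 would trip the alarm here, while 0.70 would not, because the threshold adapts to however tightly the metric historically clusters.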
Design Three: Historical Comparison
TCheckResult Check(..., const TReports& oldReports, ...);
Choice: Use historical reports to detect distribution drift
Trade-off considerations:
- Can detect gradual degradation
- Requires additional storage for historical data
- Unfriendly to cold starts, when no history exists yet
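A minimal sketch of historical comparison (hypothetical structure, not the original `TReports` type): keep a bounded window of past reports and compare each new report against the window average, so a slow drift that never trips a single-step threshold still gets caught. Cold start is handled by skipping the check until enough history exists.

```python
from collections import deque

class DriftDetector:
    """Compare a new metric report against a bounded history window."""
    def __init__(self, window: int = 5, min_history: int = 3, tolerance: float = 0.05):
        self.reports = deque(maxlen=window)  # bounded storage cost
        self.min_history = min_history       # cold-start guard
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Return True if value drifted beyond tolerance from the window mean."""
        if len(self.reports) < self.min_history:
            self.reports.append(value)
            return False                     # not enough history yet
        baseline = sum(self.reports) / len(self.reports)
        drifted = abs(value - baseline) > self.tolerance
        self.reports.append(value)
        return drifted

d = DriftDetector()
for v in [0.70, 0.71, 0.70]:
    d.observe(v)          # warm-up observations, never alert
print(d.observe(0.62))    # True: ~0.08 drop from the ~0.703 baseline
```

The `deque(maxlen=...)` caps the storage overhead the bullet above warns about, and `min_history` is the cold-start concession.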
Clean-room Reimplementation in Python
To demonstrate the design thinking, I reimplemented the core logic in Python:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolRecord:
    query: str
    relevance: float
    url: str
    factors: List[float]


@dataclass
class CheckResult:
    passed: bool
    message: str
    severity: str  # "ok" | "warning" | "error"


class IPoolChangeChecker(ABC):
    """Standardized checker interface: new checkers only need these two methods."""

    @abstractmethod
    def get_name(self) -> str:
        pass

    @abstractmethod
    def check(self, old_pool: List[PoolRecord], new_pool: List[PoolRecord],
              old_reports: Optional[list] = None) -> CheckResult:
        pass


class QueryRelUrlLossChecker(IPoolChangeChecker):
    """Loss detection: how many (query, relevance, url) combinations disappeared."""
    OK_LOSS_RATIO = 0.05
    WARN_LOSS_RATIO = 0.10

    def get_name(self) -> str:
        return "Query-Rel-Url Loss Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        # Round relevance so float noise does not count as a "lost" record.
        old_set = {(r.query, round(r.relevance, 2), r.url) for r in old_pool}
        new_set = {(r.query, round(r.relevance, 2), r.url) for r in new_pool}
        if not old_set:
            return CheckResult(True, "No old pool data", "ok")
        loss_ratio = len(old_set - new_set) / len(old_set)
        if loss_ratio <= self.OK_LOSS_RATIO:
            return CheckResult(True, f"Loss ratio {loss_ratio:.2%} OK", "ok")
        elif loss_ratio <= self.WARN_LOSS_RATIO:
            return CheckResult(False, f"Loss ratio {loss_ratio:.2%} warning", "warning")
        else:
            return CheckResult(False, f"Loss ratio {loss_ratio:.2%} critical!", "error")


class FactorsDistributionChecker(IPoolChangeChecker):
    """Distribution detection: mean relative shift of each quality factor."""
    MIN_DEVIATION_PROBABILITY = 0.05  # applied here as a relative-deviation threshold

    def get_name(self) -> str:
        return "Factor Distribution Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        if not old_pool or not new_pool:
            return CheckResult(True, "Insufficient data", "ok")
        factor_count = len(old_pool[0].factors)
        deviations = []
        for factor_idx in range(factor_count):
            old_mean = sum(r.factors[factor_idx] for r in old_pool) / len(old_pool)
            new_mean = sum(r.factors[factor_idx] for r in new_pool) / len(new_pool)
            if old_mean > 0:
                deviations.append(abs(new_mean - old_mean) / old_mean)
        avg_deviation = sum(deviations) / len(deviations) if deviations else 0
        if avg_deviation < self.MIN_DEVIATION_PROBABILITY:
            return CheckResult(True, f"Deviation {avg_deviation:.2%} OK", "ok")
        else:
            return CheckResult(False, f"Deviation {avg_deviation:.2%} exceeds threshold", "warning")


class RelPercentChecker(IPoolChangeChecker):
    """Percentage detection: share of high-relevance records (relevance >= 0.7)."""
    MIN_PERCENT_PROBABILITY = 0.05

    def get_name(self) -> str:
        return "Relevance Percent Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        if not old_pool:
            return CheckResult(True, "No old pool data", "ok")
        old_percent = sum(1 for r in old_pool if r.relevance >= 0.7) / len(old_pool)
        new_percent = sum(1 for r in new_pool if r.relevance >= 0.7) / len(new_pool) if new_pool else 0
        change = abs(new_percent - old_percent)
        if change < self.MIN_PERCENT_PROBABILITY:
            return CheckResult(True, f"Distribution stable: {new_percent:.2%}", "ok")
        else:
            return CheckResult(False, f"Distribution changed: {new_percent:.2%} vs {old_percent:.2%}", "warning")


class PoolQualityChecker:
    """Composition: run every registered checker and report each result."""

    def __init__(self):
        self.checkers = [
            QueryRelUrlLossChecker(),
            FactorsDistributionChecker(),
            RelPercentChecker(),
        ]

    def check(self, old_pool, new_pool, old_reports=None):
        for checker in self.checkers:
            result = checker.check(old_pool, new_pool, old_reports)
            print(f"[{checker.get_name()}] {result.severity}: {result.message}")


# Demo
old_pool = [
    PoolRecord("python", 0.9, "url1", [0.8, 0.7, 0.9]),
    PoolRecord("python", 0.8, "url2", [0.7, 0.6, 0.8]),
    PoolRecord("python", 0.7, "url3", [0.6, 0.5, 0.7]),
]
new_pool = [
    PoolRecord("python", 0.9, "url1", [0.8, 0.7, 0.9]),
    PoolRecord("python", 0.8, "url2", [0.7, 0.6, 0.8]),
]

checker = PoolQualityChecker()
checker.check(old_pool, new_pool)
Output:
[Query-Rel-Url Loss Check] error: Loss ratio 33.33% critical!
[Factor Distribution Check] warning: Deviation 7.24% exceeds threshold
[Relevance Percent Check] ok: Distribution stable: 100.00%
One of three records was dropped (33.33% loss, well past the 10% critical line), the factor means shifted by about 7% on average, and the high-relevance share stayed at 100% in both pools.
When to Use Multi-Dimensional Quality Checking
Good fit:
- Recommendation systems requiring multi-dimensional evaluation
- Need to ensure quality doesn't degrade during algorithm iterations
- Have historical data for comparison
Poor fit:
- Simple scenarios with single metric
- Real-time critical scenarios
- Cold start phase
Summary
Quality guardianship in industrial recommendation systems is full of trade-offs:
- Interface abstraction vs. code complexity: standardization enables extension
- Probability threshold vs. hard threshold: flexibility vs. simplicity
- Historical comparison vs. real-time calculation: accuracy vs. performance
In Python, we can implement similar designs more concisely, but the core trade-offs remain the same — there's no perfect solution, only choices that fit the scenario.