The Guardian of Recommendation Systems: Industrial Quality Scoring Architecture
In recommendation systems, the quality of the recommendation pool directly determines user experience. When recommendation algorithms are updated, how do we ensure the new recommendations aren't worse than the old ones? This is a problem every industrial system must face. Today we'll analyze a real video recommendation quality checking module to see how it uses multi-dimensional scoring to safeguard recommendation quality.
The Core Problem: Guarding Recommendation Pool Quality
The fundamental challenges in recommendation systems:
- Algorithm iteration: Each algorithm update can affect recommendation quality
- Silent degradation: Some metrics may decline unnoticed
- Multi-dimensional evaluation: Quality can't be judged by a single metric alone
The solution is multi-dimensional quality checkers:
- Loss detection: Retention rate of critical query-rel-url combinations
- Distribution detection: Whether quality factor distributions are reasonable
- Percentage detection: Proportion of high-relevance content
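Before diving into the real module, the loss-detection idea can be illustrated with plain set arithmetic: treat each record as a (query, relevance, url) tuple and measure how many tuples from the old pool survive into the new one. (All values here are illustrative.)

```python
# Minimal sketch of loss detection: retention of (query, relevance, url) tuples.
old = {("python", 0.9, "url1"), ("python", 0.8, "url2"), ("python", 0.7, "url3")}
new = {("python", 0.9, "url1"), ("python", 0.8, "url2")}

lost = old - new                        # tuples that disappeared after the update
loss_ratio = len(lost) / len(old)       # 1 / 3
print(f"loss ratio: {loss_ratio:.2%}")  # loss ratio: 33.33%
```

The other two dimensions follow the same pattern: compute a summary statistic on both pools, then compare.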
Core Design in Industrial Implementation
In an industrial-grade video search system, I found a quality checking module that has been refined over years. Its design choices are remarkably pragmatic:
Design One: Interface Abstraction
typedef IDataChangeChecker<TPoolRecords> IPoolChangeChecker;
Choice: Use interface abstraction to define checkers
Trade-off considerations:
- Standardized interface makes it easy to extend new checkers
- Composition pattern allows stacking multiple checkers
- But increases code complexity
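The interface-plus-composition idea can be sketched in a few lines of Python (hypothetical names, not the original C++ types): one abstract checker interface, with concrete checkers stacked inside a composite.

```python
from abc import ABC, abstractmethod

class IChecker(ABC):
    """Standardized interface: every checker answers pass/fail for a pool pair."""
    @abstractmethod
    def check(self, old_pool: list, new_pool: list) -> bool: ...

class NonEmptyChecker(IChecker):
    def check(self, old_pool, new_pool):
        return len(new_pool) > 0

class NoShrinkChecker(IChecker):
    def check(self, old_pool, new_pool):
        return len(new_pool) >= len(old_pool) * 0.9  # allow up to 10% shrinkage

class CompositeChecker(IChecker):
    """Composition: a checker built by stacking other checkers."""
    def __init__(self, checkers):
        self.checkers = checkers
    def check(self, old_pool, new_pool):
        return all(c.check(old_pool, new_pool) for c in self.checkers)

gate = CompositeChecker([NonEmptyChecker(), NoShrinkChecker()])
print(gate.check([1, 2, 3], [1, 2]))  # False: the pool shrank by a third
```

Adding a new quality dimension is then just one more class implementing the interface; nothing else changes.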
Design Two: Probability Threshold
static constexpr double MIN_DEVIATION_PROBABILITY = 0.05;
Choice: Use 5% probability threshold for anomaly detection
Trade-off considerations:
- Hard thresholds are too rigid; probability thresholds are more flexible
- Requires maintaining historical data to calculate probabilities
- But can better adapt to data distribution changes
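One way to read "probability threshold" (my interpretation, not the original implementation): instead of a fixed cutoff on the metric itself, estimate how unlikely the new value is given historical observations, and alert when that empirical probability drops below 5%. A sketch, with illustrative numbers:

```python
MIN_DEVIATION_PROBABILITY = 0.05

def deviation_probability(history: list, new_value: float) -> float:
    """Fraction of historical values at least as far from the mean as new_value."""
    mean = sum(history) / len(history)
    new_dev = abs(new_value - mean)
    at_least_as_extreme = sum(1 for h in history if abs(h - mean) >= new_dev)
    return at_least_as_extreme / len(history)

# Ten past observations of some quality metric, hovering around 0.70.
history = [0.70, 0.72, 0.69, 0.71, 0.70, 0.68, 0.71, 0.70, 0.69, 0.72]

p = deviation_probability(history, 0.55)  # far outside the historical range
print(p <= MIN_DEVIATION_PROBABILITY)     # True: flag as anomalous
```

A value of 0.55 would trip the alarm here, while 0.70 would not, because the threshold adapts to however tightly the metric historically clusters.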
Design Three: Historical Comparison
TCheckResult Check(..., const TReports& oldReports, ...);
Choice: Use historical reports to detect distribution drift
Trade-off considerations:
- Can detect gradual degradation
- Requires additional storage for historical data
- Unfriendly to cold starts, when no history exists yet
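A minimal sketch of historical comparison (hypothetical structure, not the original `TReports` type): keep a bounded window of past reports and compare each new report against the window average, so a slow drift that never trips a single-step threshold still gets caught. Cold start is handled by skipping the check until enough history exists.

```python
from collections import deque

class DriftDetector:
    """Compare a new metric report against a bounded history window."""
    def __init__(self, window: int = 5, min_history: int = 3, tolerance: float = 0.05):
        self.reports = deque(maxlen=window)  # bounded storage cost
        self.min_history = min_history       # cold-start guard
        self.tolerance = tolerance

    def observe(self, value: float) -> bool:
        """Return True if value drifted beyond tolerance from the window mean."""
        if len(self.reports) < self.min_history:
            self.reports.append(value)
            return False                     # not enough history yet
        baseline = sum(self.reports) / len(self.reports)
        drifted = abs(value - baseline) > self.tolerance
        self.reports.append(value)
        return drifted

d = DriftDetector()
for v in [0.70, 0.71, 0.70]:
    d.observe(v)          # warm-up observations, never alert
print(d.observe(0.62))    # True: ~0.08 drop from the ~0.703 baseline
```

The `deque(maxlen=...)` caps the storage overhead the bullet above warns about, and `min_history` is the cold-start concession.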
Clean-room Reimplementation in Python
To demonstrate the design thinking, I reimplemented the core logic in Python:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolRecord:
    query: str
    relevance: float
    url: str
    factors: List[float]


@dataclass
class CheckResult:
    passed: bool
    message: str
    severity: str  # "ok" | "warning" | "error"


class IPoolChangeChecker(ABC):
    """Standardized checker interface: new checkers only need these two methods."""

    @abstractmethod
    def get_name(self) -> str:
        pass

    @abstractmethod
    def check(self, old_pool: List[PoolRecord], new_pool: List[PoolRecord],
              old_reports: Optional[list] = None) -> CheckResult:
        pass


class QueryRelUrlLossChecker(IPoolChangeChecker):
    """Loss detection: how many (query, relevance, url) combinations disappeared."""
    OK_LOSS_RATIO = 0.05
    WARN_LOSS_RATIO = 0.10

    def get_name(self) -> str:
        return "Query-Rel-Url Loss Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        # Round relevance so float noise does not count as a "lost" record.
        old_set = {(r.query, round(r.relevance, 2), r.url) for r in old_pool}
        new_set = {(r.query, round(r.relevance, 2), r.url) for r in new_pool}
        if not old_set:
            return CheckResult(True, "No old pool data", "ok")
        loss_ratio = len(old_set - new_set) / len(old_set)
        if loss_ratio <= self.OK_LOSS_RATIO:
            return CheckResult(True, f"Loss ratio {loss_ratio:.2%} OK", "ok")
        elif loss_ratio <= self.WARN_LOSS_RATIO:
            return CheckResult(False, f"Loss ratio {loss_ratio:.2%} warning", "warning")
        else:
            return CheckResult(False, f"Loss ratio {loss_ratio:.2%} critical!", "error")


class FactorsDistributionChecker(IPoolChangeChecker):
    """Distribution detection: mean relative shift of each quality factor."""
    MIN_DEVIATION_PROBABILITY = 0.05  # applied here as a relative-deviation threshold

    def get_name(self) -> str:
        return "Factor Distribution Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        if not old_pool or not new_pool:
            return CheckResult(True, "Insufficient data", "ok")
        factor_count = len(old_pool[0].factors)
        deviations = []
        for factor_idx in range(factor_count):
            old_mean = sum(r.factors[factor_idx] for r in old_pool) / len(old_pool)
            new_mean = sum(r.factors[factor_idx] for r in new_pool) / len(new_pool)
            if old_mean > 0:
                deviations.append(abs(new_mean - old_mean) / old_mean)
        avg_deviation = sum(deviations) / len(deviations) if deviations else 0
        if avg_deviation < self.MIN_DEVIATION_PROBABILITY:
            return CheckResult(True, f"Deviation {avg_deviation:.2%} OK", "ok")
        else:
            return CheckResult(False, f"Deviation {avg_deviation:.2%} exceeds threshold", "warning")


class RelPercentChecker(IPoolChangeChecker):
    """Percentage detection: share of high-relevance records (relevance >= 0.7)."""
    MIN_PERCENT_PROBABILITY = 0.05

    def get_name(self) -> str:
        return "Relevance Percent Check"

    def check(self, old_pool, new_pool, old_reports=None) -> CheckResult:
        if not old_pool:
            return CheckResult(True, "No old pool data", "ok")
        old_percent = sum(1 for r in old_pool if r.relevance >= 0.7) / len(old_pool)
        new_percent = sum(1 for r in new_pool if r.relevance >= 0.7) / len(new_pool) if new_pool else 0
        change = abs(new_percent - old_percent)
        if change < self.MIN_PERCENT_PROBABILITY:
            return CheckResult(True, f"Distribution stable: {new_percent:.2%}", "ok")
        else:
            return CheckResult(False, f"Distribution changed: {new_percent:.2%} vs {old_percent:.2%}", "warning")


class PoolQualityChecker:
    """Composition: run every registered checker and report each result."""

    def __init__(self):
        self.checkers = [
            QueryRelUrlLossChecker(),
            FactorsDistributionChecker(),
            RelPercentChecker(),
        ]

    def check(self, old_pool, new_pool, old_reports=None):
        for checker in self.checkers:
            result = checker.check(old_pool, new_pool, old_reports)
            print(f"[{checker.get_name()}] {result.severity}: {result.message}")


# Demo
old_pool = [
    PoolRecord("python", 0.9, "url1", [0.8, 0.7, 0.9]),
    PoolRecord("python", 0.8, "url2", [0.7, 0.6, 0.8]),
    PoolRecord("python", 0.7, "url3", [0.6, 0.5, 0.7]),
]
new_pool = [
    PoolRecord("python", 0.9, "url1", [0.8, 0.7, 0.9]),
    PoolRecord("python", 0.8, "url2", [0.7, 0.6, 0.8]),
]

checker = PoolQualityChecker()
checker.check(old_pool, new_pool)
Output:
[Query-Rel-Url Loss Check] error: Loss ratio 33.33% critical!
[Factor Distribution Check] warning: Deviation 7.24% exceeds threshold
[Relevance Percent Check] ok: Distribution stable: 100.00%
One of three records was dropped (33.33% loss, well past the 10% critical line), the factor means shifted by about 7% on average, and the high-relevance share stayed at 100% in both pools.
When to Use Multi-Dimensional Quality Checking
Good fit:
- Recommendation systems requiring multi-dimensional evaluation
- Need to ensure quality doesn't degrade during algorithm iterations
- Have historical data for comparison
Poor fit:
- Simple scenarios with single metric
- Real-time critical scenarios
- Cold start phase
Summary
Quality guardianship in industrial recommendation systems is full of trade-offs:
- Interface abstraction vs. code complexity: standardization enables extension
- Probability threshold vs. hard threshold: flexibility vs. simplicity
- Historical comparison vs. real-time calculation: accuracy vs. performance
In Python, we can implement similar designs more concisely, but the core trade-offs remain the same — there's no perfect solution, only choices that fit the scenario.