Host-based Regex Optimization in Crawlers
When building large-scale web crawlers, URL filtering is an often-underestimated performance bottleneck. At tens of thousands of URLs per second, simply calling Regex::is_match for every link can quickly exhaust CPU resources.
This is especially true in scenarios involving host-specific rule matching—such as "only crawl the /blog/ directory on example.com" or "ignore all links ending in .jpg." Traditional regex matching strategies often fall short here.
This article explores how to leverage features of the Rust regex crate—specifically RegexSet—to build a high-performance domain matching system.
Why Single Regex Matching Isn't Enough
The most intuitive implementation is to maintain a list of rules for each domain:
```rust
use regex::Regex;

struct HostRules {
    allow_patterns: Vec<Regex>,
    deny_patterns: Vec<Regex>,
}

impl HostRules {
    fn is_allowed(&self, path: &str) -> bool {
        // Check deny rules first
        for pattern in &self.deny_patterns {
            if pattern.is_match(path) {
                return false;
            }
        }
        // Then check allow rules
        for pattern in &self.allow_patterns {
            if pattern.is_match(path) {
                return true;
            }
        }
        false
    }
}
```
This is $O(N \times M)$ work per URL (where N is the number of rules and M is the path length), so the cost grows linearly with the rule count. If a site has 50 filtering rules, running the regex engine 50 times for every URL is unacceptable at scale.
The Solution: The Power of RegexSet
Rust's regex library provides an extremely powerful tool: RegexSet. It allows us to compile multiple regular expressions into a single automaton. This means that whether you have 10 rules or 1000, the overhead of scanning text once remains $O(M)$, largely independent of the number of rules.
Designing the HostMatcher
We can redesign our data structure to pre-compile rules of the same type (Allow/Deny) together.
```rust
use regex::RegexSet;

pub struct HostMatcher {
    allow_set: RegexSet,
    deny_set: RegexSet,
}

impl HostMatcher {
    pub fn new(allow: &[&str], deny: &[&str]) -> Result<Self, regex::Error> {
        Ok(Self {
            allow_set: RegexSet::new(allow)?,
            deny_set: RegexSet::new(deny)?,
        })
    }

    pub fn check(&self, path: &str) -> Action {
        // Check Deny first, as rejection rules usually have higher priority for safety
        if self.deny_set.is_match(path) {
            return Action::Deny;
        }
        if self.allow_set.is_match(path) {
            return Action::Allow;
        }
        Action::Pass // Or default deny, depending on policy
    }
}

#[derive(Debug, PartialEq)]
pub enum Action {
    Allow,
    Deny,
    Pass,
}
```
Performance Comparisons and Considerations
The core advantage of RegexSet is that it avoids running the matching machinery N separate times for N rules. Rust's regex engine is based on finite automata, guaranteeing linear-time matching and preventing the "catastrophic backtracking" found in some other engines.
In practice, the advantage of RegexSet becomes apparent when the number of rules exceeds 5-10. For complex site configurations containing dozens of path rules, throughput improvements are typically several-fold.
Further Optimization: Pre-filtering with Aho-Corasick
While regex is powerful, if the rules are simple string matches (like prefixes or suffixes), a regex engine is still overkill.
For pure text matching, the Aho-Corasick algorithm (AC automaton) is a better choice. Rust's aho-corasick crate works well alongside regex:
- Fast Path: Check all plain text rules using the AC automaton first.
- Slow Path: If no match is found, fall back to RegexSet for complex patterns (e.g., ^/article/\d{4}/.*).
Conclusion
In system design, don't blindly trust "standard practices." Looping through regex matches works in scripts but is poison for performance in high-concurrency crawlers. By leveraging Rust's strong type system and high-performance libraries, we can achieve enterprise-grade matching efficiency with very little code.
Remember: The fastest code is the code that never runs.