
The Beauty of Simplicity: Lightweight Lemmatizer Interface Design

In natural language processing, lemmatization reduces word forms to their base forms (lemmas). Compared with a full-featured morphological analyzer such as MyStem, lightweight lemmatizers aim for lower resource consumption and faster processing. How do industrial systems stay lightweight without giving up the functionality that matters? Let's analyze a lightweight lemmatizer interface design.

The Core Problem: Balancing Lightweight and Functionality

The fundamental challenges in lightweight lemmatization:

  1. Resource constraints: Need to run in memory-limited environments
  2. Performance requirements: Fast processing is essential
  3. Simplified functionality: Full morphological analysis not needed

The solution is simplified interface + key comparison:

  • Remove complex morphological analysis
  • Use Keys for form comparison
  • Maintain C interface for cross-language calling

Core Design in Industrial Implementation

In an industrial lemmatization library, I found a lightweight C interface design. Its design choices are remarkably pragmatic:

Design One: Simplified Interface

typedef void LemmerHandle;  /* opaque handle: callers only ever hold a pointer */
typedef void Keys;          /* opaque key set built from a word form */

LemmerHandle* LemmerCreate(const char* data, size_t length);
bool LemmerCompare(LemmerHandle* lemmer, const Symbol* word1, const Symbol* word2);

Choice: Minimal create and compare interface

Trade-off considerations:

  • Pros: Easy to use, clear API
  • Cons: Limited functionality, no complex morphological analysis

Design Two: Key Comparison Mechanism

Keys* LemmerCreateKeysForForm(LemmerHandle* lemmer, const Symbol* word);
bool LemmerCompareKeysAndForm(LemmerHandle* lemmer, const Keys* keys, const Symbol* word);

Choice: Use keys for form matching

Trade-off considerations:

  • Pros: Faster than full analysis, suitable for simple matching
  • Cons: Cannot handle complex morphological variations

Design Three: Unified Symbol Type

typedef unsigned short Symbol;

Choice: Use 16-bit symbol type

Trade-off considerations:

  • Pros: 16 bits cover the entire Basic Multilingual Plane and halve per-symbol memory versus 32-bit code points
  • Cons: characters outside the BMP (emoji, rare CJK) require surrogate pairs; less flexible than a 32-bit type

Clean-room Reimplementation in Rust

To demonstrate the design thinking, I reimplemented the core logic in Rust:

pub struct LemmerHandle {
    // Raw dictionary blob; kept so a fuller implementation can parse it.
    dictionary: Vec<u8>,
}

pub struct Keys {
    forms: Vec<String>,
}

#[derive(Debug)]
pub struct CompareResult {
    pub equal: bool,
    pub similarity: f32,
}

impl LemmerHandle {
    pub fn create(data: &[u8]) -> Result<Self, String> {
        if data.is_empty() {
            return Err("Empty dictionary".to_string());
        }
        Ok(Self {
            dictionary: data.to_vec(),
        })
    }
    
    pub fn compare(&self, word1: &[u16], word2: &[u16]) -> CompareResult {
        // Decode as UTF-16 (handles surrogate pairs, unlike treating
        // each u16 as a standalone code point).
        let s1 = String::from_utf16_lossy(word1);
        let s2 = String::from_utf16_lossy(word2);

        if s1 == s2 {
            CompareResult { equal: true, similarity: 1.0 }
        } else if s1.starts_with(&s2) || s2.starts_with(&s1) {
            // Score by character count, not byte length, so multi-byte
            // scripts (e.g. Cyrillic) are weighted correctly.
            let n1 = s1.chars().count() as f32;
            let n2 = s2.chars().count() as f32;
            CompareResult { equal: false, similarity: n1.min(n2) / n1.max(n2) }
        } else {
            CompareResult { equal: false, similarity: 0.0 }
        }
    }
}

impl Keys {
    pub fn create_for_form(_lemmer: &LemmerHandle, word: &[u16]) -> Self {
        let s = String::from_utf16_lossy(word);
        let chars: Vec<char> = s.chars().collect();

        // Index prefixes of up to four characters. Slicing by char
        // (not by byte) avoids panics on multi-byte UTF-8 boundaries.
        let mut forms = Vec::new();
        for len in 1..=chars.len().min(4) {
            forms.push(chars[..len].iter().collect());
        }

        Self { forms }
    }

    pub fn compare_with_form(&self, _lemmer: &LemmerHandle, word: &[u16]) -> bool {
        let s = String::from_utf16_lossy(word);
        self.forms.contains(&s)
    }
}

fn main() {
    let dictionary = b"dictionary_data";
    let lemmer = LemmerHandle::create(dictionary)
        .expect("Failed to create lemmer");
    
    let word1: Vec<u16> = "running".encode_utf16().collect();
    let word2: Vec<u16> = "run".encode_utf16().collect();
    
    let result = lemmer.compare(&word1, &word2);
    println!("Comparing: {} vs {}", 
        String::from_utf16_lossy(&word1),
        String::from_utf16_lossy(&word2));
    println!("Result: equal={}, similarity={:.2}", 
        result.equal, result.similarity);
    
    let keys = Keys::create_for_form(&lemmer, &word1);
    let word3: Vec<u16> = "run".encode_utf16().collect();
    let matches = keys.compare_with_form(&lemmer, &word3);
    println!("Key matches 'run': {}", matches);
}

Output:

Comparing: running vs run
Result: equal=false, similarity=0.43
Key matches 'run': true

When to Use Lightweight Lemmatization

Good fit:

  • Resource-constrained environments
  • Simple form matching only
  • Need to process large amounts of text quickly

Poor fit:

  • Need full morphological analysis
  • Complex language processing
  • High precision requirements

Summary

Design in lightweight lemmatizers is full of trade-offs:

  • Simplified interface vs. rich functionality: easy to use vs. feature-complete
  • Key comparison vs. full analysis: fast vs. accurate
  • 16-bit symbols vs. 32-bit: memory savings vs. broader range

In Rust, we can implement similar designs more safely, but the core trade-offs remain the same — every design choice has a cost.