# OCR-StringDist
A Python library for fast string distance calculations that account for common OCR (optical character recognition) errors.
- Repository:
- Current version: 0.1.0
## Motivation
Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like O vs 0 are far more common than, say, O vs X.
OCR-StringDist uses a weighted Levenshtein distance, assigning lower costs to common OCR errors.
Example: Matching against the correct word CODE:
- Standard Levenshtein:
\(d(\text{"C0DE"}, \text{"CODE"}) = 1\) (0 → O)
\(d(\text{"CXDE"}, \text{"CODE"}) = 1\) (X → O)
Result: both candidates appear equally plausible.
- OCR-StringDist (Weighted):
\(d(\text{"C0DE"}, \text{"CODE"}) \approx 0.1\) (common error, low cost)
\(d(\text{"CXDE"}, \text{"CODE"}) = 1.0\) (unlikely error, high cost)
Result: Correctly identifies C0DE as a much closer match.
This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes, database entries).
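To make the weighting concrete, here is a minimal, self-contained sketch of a weighted Levenshtein distance in plain Python. This illustrates the idea only; it is not the library's implementation, and the 0.1 cost is an example value:

```python
# Illustrative sketch of a weighted Levenshtein distance.
# The cost map and the 0.1 substitution cost are example values,
# not the library's defaults.

def weighted_levenshtein(s1: str, s2: str, sub_costs: dict[tuple[str, str], float]) -> float:
    """Levenshtein distance where listed substitution pairs carry custom costs."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # deletions, unit cost
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # insertions, unit cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s1[i - 1], s2[j - 1]
            if a == b:
                sub = 0.0
            else:
                # Look up the pair in both directions; default to unit cost.
                sub = sub_costs.get((a, b), sub_costs.get((b, a), 1.0))
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dp[m][n]

costs = {("0", "O"): 0.1}  # common OCR confusion gets a low cost
print(weighted_levenshtein("C0DE", "CODE", costs))  # 0.1
print(weighted_levenshtein("CXDE", "CODE", costs))  # 1.0
```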
## Features
- **Weighted Levenshtein Distance**: calculates the Levenshtein distance with customizable costs for substitutions, insertions, and deletions. Includes an efficient batch version (`batch_weighted_levenshtein_distance`) for comparing one string against many candidates; see the first sketch after this list.
- **Substitution of Multiple Characters**: not just character pairs but string pairs may be substituted, for example the Korean syllable “이” for the two letters “OI”.
- **Pre-defined OCR Distance Map**: a built-in distance map for common OCR confusions (e.g., “0” vs. “O”, “1” vs. “l”, “5” vs. “S”).
- **Unicode Support**: works with arbitrary Unicode strings.
- **Best Match Finder**: includes a utility function, `find_best_candidate`, to efficiently find the best match from a list based on _any_ distance function; see the second sketch below.
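A usage sketch of the batch comparison. The function name is taken from the feature list above, but the argument order and the `cost_map` keyword are assumptions about the API, not a confirmed signature:

```python
# Hypothetical usage sketch: batch_weighted_levenshtein_distance is named
# in the feature list, but the argument order and the cost_map keyword
# are assumptions.
from ocr_stringdist import batch_weighted_levenshtein_distance

# Custom costs; string pairs (not just single characters) are allowed,
# e.g. the Korean syllable "이" versus the two letters "OI".
cost_map = {
    ("0", "O"): 0.1,
    ("이", "OI"): 0.2,
}

candidates = ["CODE", "MODE", "C0DE"]
distances = batch_weighted_levenshtein_distance("C0DE", candidates, cost_map=cost_map)

for candidate, distance in zip(candidates, distances):
    print(f"{candidate}: {distance}")
```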
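And a sketch of the best-match utility. `find_best_candidate` is named above, but its exact signature and return value are assumptions; the point is that any callable taking two strings and returning a float can serve as the distance function:

```python
# Hypothetical sketch: the (query, candidates, distance function) signature
# and the (best_match, distance) return value are assumptions.
from ocr_stringdist import find_best_candidate

def toy_distance(s1: str, s2: str) -> float:
    """Any distance function can be plugged in; this one just counts
    mismatched positions plus the length difference."""
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))

best_match, distance = find_best_candidate("C0DE", ["CODE", "MODE", "NODE"], toy_distance)
print(best_match, distance)
```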