# OCR-StringDist
A Python library for fast string distance calculations that account for common OCR (optical character recognition) errors.
- Repository:
- Current version: 0.1.0
## Motivation
Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like O vs 0 are far more common than, say, O vs X.
OCR-StringDist uses a weighted Levenshtein distance, assigning lower costs to common OCR errors.
Example: Matching against the correct word CODE:
- Standard Levenshtein:
\(d(\text{"C0DE"}, \text{"CODE"}) = 1\) (0 → O)
\(d(\text{"CXDE"}, \text{"CODE"}) = 1\) (X → O)
Result: both candidates appear equally plausible.
- OCR-StringDist (Weighted):
\(d(\text{"C0DE"}, \text{"CODE"}) \approx 0.1\) (common error, low cost)
\(d(\text{"CXDE"}, \text{"CODE"}) = 1.0\) (unlikely error, high cost)
Result: Correctly identifies C0DE as a much closer match.
This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes, database entries).
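To make the weighting concrete, here is a minimal, self-contained sketch of a weighted Levenshtein distance in plain Python. This illustrates the idea only; it is not the library's implementation, and the 0.1 cost is an example value:

```python
# Illustrative sketch of a weighted Levenshtein distance.
# The cost map and the 0.1 substitution cost are example values,
# not the library's defaults.

def weighted_levenshtein(s1: str, s2: str, sub_costs: dict[tuple[str, str], float]) -> float:
    """Levenshtein distance where listed substitution pairs carry custom costs."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = float(i)  # deletions, unit cost
    for j in range(1, n + 1):
        dp[0][j] = float(j)  # insertions, unit cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = s1[i - 1], s2[j - 1]
            if a == b:
                sub = 0.0
            else:
                # Look up the pair in both directions; default to unit cost.
                sub = sub_costs.get((a, b), sub_costs.get((b, a), 1.0))
            dp[i][j] = min(
                dp[i - 1][j] + 1.0,      # deletion
                dp[i][j - 1] + 1.0,      # insertion
                dp[i - 1][j - 1] + sub,  # substitution (or match)
            )
    return dp[m][n]

costs = {("0", "O"): 0.1}  # common OCR confusion gets a low cost
print(weighted_levenshtein("C0DE", "CODE", costs))  # 0.1
print(weighted_levenshtein("CXDE", "CODE", costs))  # 1.0
```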
## Features
- **Weighted Levenshtein Distance**: calculates the Levenshtein distance with customizable costs for substitutions, insertions, and deletions. Includes an efficient batch version (`batch_weighted_levenshtein_distance`) for comparing one string against many candidates; see the first sketch after this list.
- **Substitution of Multiple Characters**: not just character pairs but string pairs may be substituted, for example the Korean syllable “이” for the two letters “OI”.
- **Pre-defined OCR Distance Map**: a built-in distance map for common OCR confusions (e.g., “0” vs. “O”, “1” vs. “l”, “5” vs. “S”).
- **Unicode Support**: works with arbitrary Unicode strings.
- **Best Match Finder**: includes a utility function, `find_best_candidate`, to efficiently find the best match from a list based on _any_ distance function; see the second sketch below.
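A usage sketch of the batch comparison. The function name is taken from the feature list above, but the argument order and the `cost_map` keyword are assumptions about the API, not a confirmed signature:

```python
# Hypothetical usage sketch: batch_weighted_levenshtein_distance is named
# in the feature list, but the argument order and the cost_map keyword
# are assumptions.
from ocr_stringdist import batch_weighted_levenshtein_distance

# Custom costs; string pairs (not just single characters) are allowed,
# e.g. the Korean syllable "이" versus the two letters "OI".
cost_map = {
    ("0", "O"): 0.1,
    ("이", "OI"): 0.2,
}

candidates = ["CODE", "MODE", "C0DE"]
distances = batch_weighted_levenshtein_distance("C0DE", candidates, cost_map=cost_map)

for candidate, distance in zip(candidates, distances):
    print(f"{candidate}: {distance}")
```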
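And a sketch of the best-match utility. `find_best_candidate` is named above, but its exact signature and return value are assumptions; the point is that any callable taking two strings and returning a float can serve as the distance function:

```python
# Hypothetical sketch: the (query, candidates, distance function) signature
# and the (best_match, distance) return value are assumptions.
from ocr_stringdist import find_best_candidate

def toy_distance(s1: str, s2: str) -> float:
    """Any distance function can be plugged in; this one just counts
    mismatched positions plus the length difference."""
    return sum(a != b for a, b in zip(s1, s2)) + abs(len(s1) - len(s2))

best_match, distance = find_best_candidate("C0DE", ["CODE", "MODE", "NODE"], toy_distance)
print(best_match, distance)
```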