OCR-StringDist

A Python library to learn, model, explain and correct OCR errors using a fast string distance engine.

Repository:

https://github.com/NiklasvonM/ocr-stringdist

Current version:

1.0.1


Motivation

Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like O vs 0 are far more common than, say, O vs X.

OCR-StringDist provides a learnable weighted Levenshtein distance, implementing part of the Noisy Channel model.

Example: Matching against the correct word CODE:

  • Standard Levenshtein:
    • \(d(\text{"C0DE"}, \text{"CODE"}) = 1\) (0 → O)

    • \(d(\text{"CXDE"}, \text{"CODE"}) = 1\) (X → O)

    • Result: Both appear equally likely/distant.

  • OCR-StringDist (Weighted):
    • \(d(\text{"C0DE"}, \text{"CODE"}) \approx 0.1\) (common error, low cost)

    • \(d(\text{"CXDE"}, \text{"CODE"}) = 1.0\) (unlikely error, high cost)

    • Result: Correctly identifies C0DE as a much closer match.
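The comparison above can be reproduced with a minimal pure-Python sketch of a weighted Levenshtein distance. This is an illustration only, not the library's Rust implementation or its API: the function name `weighted_levenshtein`, the symmetric lookup in the cost map, and the fixed cost of 1.0 for insertions and deletions are all assumptions made for this example.

```python
def weighted_levenshtein(source, target, substitution_costs, default_cost=1.0):
    """Levenshtein distance in which selected substitutions have custom costs.

    Hypothetical helper for illustration; insertions and deletions
    always cost `default_cost` here.
    """
    def sub_cost(a, b):
        if a == b:
            return 0.0
        # Assumption: the cost map is treated as symmetric.
        return substitution_costs.get(
            (a, b), substitution_costs.get((b, a), default_cost)
        )

    # previous[j] = cost of turning the processed prefix of source into target[:j]
    previous = [j * default_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * default_cost]
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + default_cost,                  # delete s_char
                current[j - 1] + default_cost,               # insert t_char
                previous[j - 1] + sub_cost(s_char, t_char),  # substitute or match
            ))
        previous = current
    return previous[-1]


costs = {("0", "O"): 0.1}  # common OCR confusion, low cost
close = weighted_levenshtein("C0DE", "CODE", costs)
far = weighted_levenshtein("CXDE", "CODE", costs)
print(close, far)
```

With an empty cost map the function reduces to the standard Levenshtein distance, so both inputs would again be one edit away from CODE.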

This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes). By combining this channel model with a source model (e.g., product code frequencies), you can build a complete and robust OCR correction system.
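As a rough sketch of how a channel model and a source model might be combined, the snippet below ranks candidate product codes by adding the channel cost to the negative log of each candidate's relative frequency. It assumes that the weighted distance can be read as a negative log-likelihood, which is a modelling choice for this example. The function `rank_candidates`, the distances, and the frequency counts are hypothetical and do not come from the library.

```python
import math


def rank_candidates(channel_costs, frequencies):
    """Rank candidates by channel cost plus negative log prior (lower is better).

    channel_costs: candidate -> weighted distance to the OCR output
    frequencies:   candidate -> how often the candidate occurs (source model)
    """
    total = sum(frequencies.values())
    scored = []
    for candidate, cost in channel_costs.items():
        prior = frequencies[candidate] / total
        # Assumption: the distance acts like -log P(ocr_output | candidate).
        scored.append((cost - math.log(prior), candidate))
    return [candidate for _, candidate in sorted(scored)]


# Hypothetical distances from the OCR output "C0DE" to three known codes.
channel_costs = {"CODE": 0.1, "CORE": 1.1, "CODA": 1.1}
# Hypothetical frequencies of those codes in past data.
frequencies = {"CODE": 50, "CORE": 40, "CODA": 10}

print(rank_candidates(channel_costs, frequencies))
```

Here CORE and CODA are equally distant from the OCR output, so the source model decides between them: the more frequent code ranks higher.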

Features

  • Learnable Costs: Automatically learn substitution, insertion, and deletion costs from a dataset of (OCR string, ground truth string) pairs.

  • Weighted Levenshtein Distance: Models OCR error patterns by assigning custom costs to specific edit operations.

  • High Performance: The core logic is implemented in Rust, and a batch_distance function efficiently compares one string against thousands of candidates.

  • Substitution of Multiple Characters: Substitutions are not limited to single characters. Whole strings may be substituted for one another, for example the Korean syllable “이” for the two letters “OI”.

  • Explainable Edit Path: Returns the optimal sequence of edit operations (substitutions, insertions, and deletions) used to transform one string into another.

  • Pre-defined OCR Distance Map: A built-in distance map for common OCR confusions (e.g., “0” vs “O”, “1” vs “l”, “5” vs “S”).

  • Full Unicode Support: Works with arbitrary Unicode strings.
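To make the idea of an explainable edit path concrete, here is a minimal pure-Python sketch that fills the standard dynamic-programming table and then backtracks through it to recover the operations. It uses unit costs for simplicity, and the function name `edit_path` and the tuple format of the operations are assumptions for this example, not the library's API or output format.

```python
def edit_path(source, target):
    """Return (distance, operations) for unit-cost Levenshtein.

    Each operation is a tuple (kind, source_char, target_char).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # deletion
                dist[i][j - 1] + 1,        # insertion
                dist[i - 1][j - 1] + sub,  # substitution or match
            )

    # Walk back from the bottom-right cell to recover one optimal path.
    ops = []
    i, j = len(source), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if source[i - 1] == target[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + sub:
                if sub:
                    ops.append(("substitute", source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[len(source)][len(target)], ops


print(edit_path("C0DE", "CODE"))
print(edit_path("CDE", "CODE"))
```

Matches are omitted from the returned list, so only the operations that contribute to the distance are reported.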

Contents