OCR-StringDist

A Python library for fast string distance calculations that account for common OCR (optical character recognition) errors.

Repository:

https://niklasvonm.github.io/ocr-stringdist/

Current version:

0.1.0

Motivation

Standard string distances (like Levenshtein) treat all character substitutions equally. This is suboptimal for text read from images via OCR, where errors like O vs 0 are far more common than, say, O vs X.

OCR-StringDist uses a weighted Levenshtein distance, assigning lower costs to common OCR errors.

Example: Matching against the correct word CODE:

  • Standard Levenshtein:
    • \(d(\text{"C0DE"}, \text{"CODE"}) = 1\) (0 → O)

    • \(d(\text{"CXDE"}, \text{"CODE"}) = 1\) (X → O)

    • Result: Both candidates are equally distant from CODE, so neither is preferred.

  • OCR-StringDist (Weighted):
    • \(d(\text{"C0DE"}, \text{"CODE"}) \approx 0.1\) (common error, low cost)

    • \(d(\text{"CXDE"}, \text{"CODE"}) = 1.0\) (unlikely error, high cost)

    • Result: Correctly identifies C0DE as a much closer match.

This makes it ideal for matching potentially incorrect OCR output against known values (e.g., product codes, database entries).
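
The comparison above can be reproduced with a minimal pure-Python sketch of a weighted Levenshtein distance. This is not the library's implementation or API: the function name weighted_levenshtein, the sub_costs dictionary, and the 0.1 cost for 0/O are assumptions chosen to mirror the example, and insertions and deletions are given a flat cost for simplicity.

```python
from typing import Dict, Tuple


def weighted_levenshtein(
    s1: str,
    s2: str,
    sub_costs: Dict[Tuple[str, str], float],
    default_cost: float = 1.0,
) -> float:
    """Levenshtein distance with per-pair substitution costs (illustrative sketch)."""
    # prev[j] holds the cost of turning the processed prefix of s1 into s2[:j].
    prev = [j * default_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * default_cost]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                sub = 0.0
            else:
                # Look the pair up in both directions; fall back to the default cost.
                sub = sub_costs.get((c1, c2), sub_costs.get((c2, c1), default_cost))
            curr.append(
                min(
                    prev[j] + default_cost,      # delete c1
                    curr[j - 1] + default_cost,  # insert c2
                    prev[j - 1] + sub,           # substitute c1 -> c2 (or match)
                )
            )
        prev = curr
    return prev[-1]


# Hypothetical cost map: confusing "0" with "O" is cheap, everything else costs 1.0.
ocr_costs = {("0", "O"): 0.1}

close = weighted_levenshtein("C0DE", "CODE", ocr_costs)
far = weighted_levenshtein("CXDE", "CODE", ocr_costs)
print(close, far)
```

With an empty cost map the function reduces to the standard Levenshtein distance, so both candidates would again score 1.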

Features

  • Weighted Levenshtein Distance: Calculates Levenshtein distance with customizable costs for substitutions, insertions, and deletions. Includes an efficient batch version (batch_weighted_levenshtein_distance) for comparing one string against many candidates.

  • Substitution of Multiple Characters: Substitution costs can be defined for pairs of strings, not just pairs of single characters. For example, the Korean syllable “이” can be treated as a cheap substitution for the two letters “OI”.

  • Pre-defined OCR Distance Map: A built-in distance map for common OCR confusions (e.g., “0” vs “O”, “1” vs “l”, “5” vs “S”).

  • Unicode Support: Works with arbitrary Unicode strings.

  • Best Match Finder: Includes a utility function find_best_candidate to efficiently find the best match from a list based on any distance function.
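
To make the multi-character substitution and best-match ideas concrete, here is a small self-contained sketch. It does not use the library: multi_char_distance and best_candidate are hypothetical helpers, the 0.2 cost for “이”/“OI” is an assumed value, and the signatures of the library's own batch_weighted_levenshtein_distance and find_best_candidate are not shown in this section and may differ.

```python
from typing import Callable, Dict, Iterable, Tuple


def multi_char_distance(
    s1: str,
    s2: str,
    costs: Dict[Tuple[str, str], float],
    default_cost: float = 1.0,
) -> float:
    """Edit distance where whole substrings may be swapped at a custom cost (sketch)."""
    # Treat every entry as symmetric: (a, b) also allows (b, a) at the same cost.
    sym = dict(costs)
    for (a, b), c in costs.items():
        sym.setdefault((b, a), c)

    n, m = len(s1), len(s2)
    # dp[i][j] = cheapest way to turn s1[:i] into s2[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = float("inf")
            if i > 0:
                best = min(best, dp[i - 1][j] + default_cost)  # deletion
            if j > 0:
                best = min(best, dp[i][j - 1] + default_cost)  # insertion
            if i > 0 and j > 0:
                step = 0.0 if s1[i - 1] == s2[j - 1] else default_cost
                best = min(best, dp[i - 1][j - 1] + step)  # match or plain substitution
            for (a, b), c in sym.items():
                if not a or not b:
                    continue  # this sketch only supports non-empty substitution strings
                # Apply the entry if s1[:i] ends with a and s2[:j] ends with b.
                if len(a) <= i and len(b) <= j and s1.endswith(a, 0, i) and s2.endswith(b, 0, j):
                    best = min(best, dp[i - len(a)][j - len(b)] + c)
            dp[i][j] = best
    return dp[n][m]


def best_candidate(
    query: str,
    candidates: Iterable[str],
    distance: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Return the candidate closest to query, with its distance, under any distance function."""
    # min() raises ValueError if candidates is empty.
    return min(((c, distance(query, c)) for c in candidates), key=lambda pair: pair[1])


# Assumed cost: reading the letters "OI" as the syllable "이" is a cheap error.
costs = {("이", "OI"): 0.2}
match, dist = best_candidate(
    "이L",
    ["OIL", "AIL", "OWL"],
    lambda a, b: multi_char_distance(a, b, costs),
)
print(match, dist)
```

Passing the distance as a callable keeps the search independent of the metric, which is the property the Best Match Finder bullet describes.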

Contents