Usage Examples
Basic Distance Calculation
Using the default pre-defined map for common OCR errors:
from ocr_stringdist import WeightedLevenshtein
# Compare "OCR5" and "OCRS"
# The default ocr_distance_map gives 'S' <-> '5' a cost of 0.3
distance: float = WeightedLevenshtein().distance("OCR5", "OCRS")
print(f"Distance between 'OCR5' and 'OCRS' (default map): {distance}")
# Output: Distance between 'OCR5' and 'OCRS' (default map): 0.3
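For character pairs that the default map does not cover, the standard substitution cost of 1.0 applies. A minimal sketch, assuming 'X' and '5' are not paired in the default map:
from ocr_stringdist import WeightedLevenshtein
# 'X' <-> '5' is assumed not to be in the default map, so the full cost applies
distance = WeightedLevenshtein().distance("OCR5", "OCRX")
print(f"Distance between 'OCR5' and 'OCRX' (default map): {distance}")
# Output: Distance between 'OCR5' and 'OCRX' (default map): 1.0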
Using Custom Costs
Define your own substitution costs:
from ocr_stringdist import WeightedLevenshtein
# Define a custom cost for substituting "rn" with "m"
wl = WeightedLevenshtein(substitution_costs={("rn", "m"): 0.5})
distance = wl.distance("Churn Bucket", "Chum Bucket")
print(f"Distance using custom map: {distance}") # 0.5
Explaining Edit Operations
You can get a detailed list of edit operations needed to transform one string into another.
from ocr_stringdist import WeightedLevenshtein
wl = WeightedLevenshtein(substitution_costs={("日月", "明"): 0.4, ("末", "未"): 0.3})
s1 = "末日月" # mò rì yuè
s2 = "未明" # wèi míng
operations = wl.explain(s1, s2)
print(operations)
# Output:
# [
# EditOperation(op_type='substitute', source_token='末', target_token='未', cost=0.3),
# EditOperation(op_type='substitute', source_token='日月', target_token='明', cost=0.4)
# ]
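Each EditOperation carries the cost of its step, and the costs of an optimal edit sequence add up to the overall distance. A small cross-check (a sketch, assuming explain returns such an optimal sequence):
import math
total = sum(op.cost for op in operations)
assert math.isclose(total, wl.distance(s1, s2))  # 0.3 + 0.4 == 0.7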
Learning Costs from Data
Substitution costs can also be learned from a dataset of (OCR output, ground truth) pairs.
from ocr_stringdist import WeightedLevenshtein
training_data = [
("Hallo", "Hello"),
("Hello", "Hello"), # Include correct pairs too
("W0rld", "World"),
]
# Learn costs from the dataset
learned_wl = WeightedLevenshtein.learn_from(training_data)
# Use the learned costs for distance calculation
distance = learned_wl.distance("Hay", "Hey")
print(f"Distance with learned costs: {distance}") # < 1.0
Note that, by default, only character-level edits are learned. If multi-character tokens should be taken into account, an initial_model that is already configured with the specific multi-character edits needs to be provided, as sketched below.
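A minimal sketch of that setup, reusing the ("rn", "m") pair from the custom-cost example above (the initial_model argument is the one mentioned in the note; the exact learn_from signature may differ):
from ocr_stringdist import WeightedLevenshtein
# Training pairs containing the multi-character confusion,
# OCR output first and ground truth second as in the training example above
training_data = [
    ("Churn Bucket", "Chum Bucket"),
    ("Chum Bucket", "Chum Bucket"),  # include correct pairs too
]
# Start from a model that already knows the ("rn", "m") token pair so that
# learning can refine its cost (assumes learn_from accepts an initial_model)
initial = WeightedLevenshtein(substitution_costs={("rn", "m"): 1.0})
learned_wl = WeightedLevenshtein.learn_from(training_data, initial_model=initial)
print(learned_wl.distance("Churn", "Chum"))  # expected to be below 1.0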