API Reference
- class ocr_stringdist.WeightedLevenshtein(substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0)[source]
Calculates Levenshtein distance with custom, configurable costs.
This class is initialized with cost dictionaries and settings that define how the distance is measured. Once created, its methods can be used to efficiently compute distances and explain the edit operations.
- Parameters:
substitution_costs – Maps (str, str) tuples to their substitution cost. Defaults to costs based on common OCR errors.
insertion_costs – Maps a string to its insertion cost.
deletion_costs – Maps a string to its deletion cost.
symmetric_substitution – If True, a cost defined for, e.g., ('0', 'O') will automatically apply to ('O', '0'). If False, both must be defined explicitly.
default_substitution_cost – Default cost for single-char substitutions not in the map.
default_insertion_cost – Default cost for single-char insertions not in the map.
default_deletion_cost – Default cost for single-char deletions not in the map.
- Raises:
TypeError, ValueError – If the provided arguments are invalid.
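For illustration, a minimal construction sketch based on the parameters above (the cost values are made up):
from ocr_stringdist import WeightedLevenshtein

# Substituting '0' for 'O' is cheap; deleting a stray '.' is cheap as well.
wl = WeightedLevenshtein(
    substitution_costs={("0", "O"): 0.1},
    deletion_costs={".": 0.3},
    symmetric_substitution=True,  # ("O", "0") gets the same cost as ("0", "O")
)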
- batch_distance(s: str, candidates: list[str]) → list[float] [source]
Calculates distances between a string and a list of candidates.
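For instance, a short sketch (costs are illustrative; the result presumably follows candidate order):
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
wl.batch_distance("R0AD", ["ROAD", "R0AD", "READ"])
# roughly [0.1, 0.0, 1.0] - one distance per candidate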
- distance(s1: str, s2: str) → float [source]
Calculates the weighted Levenshtein distance between two strings.
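A quick sketch with illustrative costs:
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
wl.distance("C0DE", "CODE")  # 0.1 - cheap, OCR-typical substitution
wl.distance("C0DE", "CADE")  # 1.0 - default substitution cost applies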
- explain(s1: str, s2: str, filter_matches: bool = True) → list[EditOperation] [source]
Returns the list of edit operations to transform s1 into s2.
- Parameters:
s1 – First string (interpreted as the string read via OCR)
s2 – Second string (interpreted as the target string)
filter_matches – If True, ‘match’ operations are excluded from the result.
- Returns:
List of EditOperation instances.
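A sketch of inspecting the operations (the exact ordering and representation of the returned objects may differ):
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
for op in wl.explain("C0DE.", "CODE"):
    print(op.op_type, op.source_token, op.target_token, op.cost)
# e.g. substitute 0 O 0.1
#      delete . None 1.0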
- classmethod from_dict(data: dict[str, Any]) → WeightedLevenshtein [source]
Deserialize from a dictionary.
For the counterpart, see WeightedLevenshtein.to_dict().
- Parameters:
data – A dictionary with (not necessarily all of) the following keys:
"substitution_costs": {"from": str, "to": str, "cost": float}
"insertion_costs": dict[str, float]
"deletion_costs": dict[str, float]
"symmetric_substitution": bool
"default_substitution_cost": float
"default_insertion_cost": float
"default_deletion_cost": float
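A hedged sketch using only the dict-valued and scalar keys listed above:
from ocr_stringdist import WeightedLevenshtein

config = {
    "deletion_costs": {".": 0.3},
    "default_substitution_cost": 1.0,
    "symmetric_substitution": True,
}
wl = WeightedLevenshtein.from_dict(config)  # keys not present fall back to defaults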
- classmethod learn_from(pairs: Iterable[tuple[str, str]]) → WeightedLevenshtein [source]
Creates an instance by learning costs from a dataset of (OCR, ground truth) string pairs.
For more advanced learning configuration, see the ocr_stringdist.learner.CostLearner class.
- Parameters:
pairs – An iterable of (ocr_string, ground_truth_string) tuples. Correct pairs should not be filtered out; they are needed to learn well-calibrated costs.
- Returns:
A new WeightedLevenshtein instance with the learned costs.
Example:
from ocr_stringdist import WeightedLevenshtein

training_data = [
    ("8N234", "BN234"),  # read '8' instead of 'B'
    ("BJK18", "BJK18"),  # correct
    ("ABC0.", "ABC0"),   # extra '.'
]
wl = WeightedLevenshtein.learn_from(training_data)
print(wl.substitution_costs)  # learned cost for substituting '8' with 'B'
print(wl.deletion_costs)      # learned cost for deleting '.'
- to_dict() → dict[str, Any] [source]
Serializes the instance’s configuration to a dictionary.
The result can be written to, say, JSON.
For the counterpart, see WeightedLevenshtein.from_dict().
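For example, a round-trip through JSON (a sketch; the dictionary layout is described under from_dict()):
import json
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
payload = json.dumps(wl.to_dict())                             # persist the configuration
restored = WeightedLevenshtein.from_dict(json.loads(payload))  # reconstruct it later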
- classmethod unweighted() → WeightedLevenshtein [source]
Creates an instance in which every edit operation has a cost of 1.0.
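In other words, it reproduces the classic Levenshtein distance:
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein.unweighted()
wl.distance("kitten", "sitting")  # 3.0 - three unit-cost edits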
- class ocr_stringdist.learner.CostLearner[source]
Configures and executes the process of learning Levenshtein costs from data.
This class uses a builder pattern, allowing configuration methods to be chained before running the final calculation with .fit().
Example:
from ocr_stringdist import CostLearner

data = [
    ("Hell0", "Hello"),
]
learner = CostLearner().with_smoothing(1.0)
wl = learner.fit(data)  # Substitution 0 -> o learned with cost < 1.0
- fit(pairs: Iterable[tuple[str, str]], *, initial_model: Aligner | None = None, calculate_for_unseen: bool = False) → WeightedLevenshtein [source]
Fits the costs of a WeightedLevenshtein instance to the provided data.
Note that learning multi-character tokens is only supported if an initial alignment model capable of handling them is provided.
This method analyzes pairs of strings to learn the costs of edit operations based on their observed frequencies. The underlying model calculates costs based on the principle of relative information cost.
For a detailed explanation of the methodology, please see the Cost Learning Model documentation page.
- Parameters:
pairs – An iterable of (ocr_string, ground_truth_string) tuples.
initial_model – Optional initial model used to align OCR outputs and ground truth strings. By default, an unweighted Levenshtein distance is used.
calculate_for_unseen – If True (and k > 0), pre-calculates costs for all possible edit operations based on the vocabulary. If False (default), only calculates costs for operations observed in the data.
- Returns:
A WeightedLevenshtein instance with the learned costs.
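A sketch of a configured fit (the pairs are made up; calculate_for_unseen behaves as described above):
from ocr_stringdist import CostLearner

pairs = [("Hell0", "Hello"), ("W0rld", "World")]
wl = (
    CostLearner()
    .with_smoothing(1.0)
    .fit(pairs, calculate_for_unseen=True)  # also pre-calculate costs for unseen vocabulary pairs
)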
- with_smoothing(k: float) → CostLearner [source]
Sets the smoothing parameter k.
This parameter controls how strongly the model defaults to a uniform probability distribution by adding a “pseudo-count” of k to every possible event.
- Parameters:
k – The smoothing factor, which must be a non-negative number.
- Returns:
The CostLearner instance for method chaining.
- Raises:
ValueError – If k < 0.
Notes
This parameter allows for a continuous transition between two modes:
k > 0 (recommended): This enables additive smoothing, with k = 1.0 being Laplace smoothing. It regularizes the model by assuming no event is impossible. The final costs are a measure of “relative surprisal,” normalized by the vocabulary size.
k = 0: This corresponds to a normalized Maximum Likelihood Estimation. Probabilities are derived from the raw observed frequencies. The final costs are normalized using the same logic as the k > 0 case, making k=0 the continuous limit of the smoothed model. In this mode, costs can only be calculated for events observed in the training data. Unseen events will receive the default cost, regardless of the value of calculate_for_unseen in fit().
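A brief sketch contrasting the two modes (the data is made up):
from ocr_stringdist import CostLearner

pairs = [("Hell0", "Hello")]
wl_mle = CostLearner().with_smoothing(0.0).fit(pairs)       # k = 0: raw frequencies; unseen events keep the default cost
wl_smoothed = CostLearner().with_smoothing(1.0).fit(pairs)  # k = 1: Laplace smoothing; no event treated as impossible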
- class ocr_stringdist.edit_operation.EditOperation(op_type: Literal['substitute', 'insert', 'delete', 'match'], source_token: str | None, target_token: str | None, cost: float)[source]
Represents a single edit operation (substitution, insertion, deletion or match).
- ocr_stringdist.matching.find_best_candidate(s: str, candidates: Iterable[str], distance_fun: Callable[[str, str], float], *, minimize: bool = True, early_return_value: float | None = None) → tuple[str, float] [source]
Finds the best matching string from a collection of candidates based on a distance function.
Compares a given string against each string in the ‘candidates’ iterable using the provided ‘distance_fun’. It identifies the candidate that yields the minimum (or maximum, if minimize=False) distance.
- Parameters:
s (str) – The reference string to compare against.
candidates (Iterable[str]) – An iterable of candidate strings to compare with ‘s’.
distance_fun (Callable[[str, str], float]) – A function that takes two strings (s, candidate) and returns a float representing their distance or similarity.
minimize (bool) – If True (default), finds the candidate with the minimum distance. If False, finds the candidate with the maximum distance (useful for similarity scores).
early_return_value (Optional[float]) – If provided, the function will return immediately if a distance is found that is less than or equal to this value (if minimize=True) or greater than or equal to this value (if minimize=False). If None (default), all candidates are checked.
- Raises:
ValueError – If the ‘candidates’ iterable is empty.
- Returns:
A tuple containing the best matching candidate string and its calculated distance/score.
- Return type:
tuple[str, float]
Example:
from ocr_stringdist import find_best_candidate, WeightedLevenshtein

wl = WeightedLevenshtein({("l", "I"): 0.1})
find_best_candidate("apple", ["apply", "apples", "orange", "appIe"], wl.distance)
# ('appIe', 0.1)
- ocr_stringdist.default_ocr_distances.ocr_distance_map
Pre-defined distance map between characters, considering common OCR errors. The distances are between 0 and 1. This map is intended to be used with symmetric_substitution=True.
ocr_distance_map: dict[tuple[str, str], float] = {
("O", "0"): 0.1,
("l", "1"): 0.1,
("I", "1"): 0.15,
("o", "0"): 0.2,
("B", "8"): 0.25,
("S", "5"): 0.3,
("G", "6"): 0.3,
("Z", "2"): 0.3,
("C", "c"): 0.3,
("é", "e"): 0.3,
("Ä", "A"): 0.4,
("Ö", "O"): 0.4,
("Ü", "U"): 0.4,
("c", "e"): 0.4,
("a", "o"): 0.4,
("u", "v"): 0.4,
("i", "l"): 0.4,
("s", "5"): 0.4,
("m", "n"): 0.5,
("f", "s"): 0.5,
(".", ","): 0.5,
("2", "Z"): 0.5,
("t", "f"): 0.6,
("r", "n"): 0.6,
("-", "_"): 0.6,
("ß", "B"): 0.6,
("h", "b"): 0.7,
("v", "y"): 0.7,
("i", "j"): 0.7,
("é", "á"): 0.7,
("E", "F"): 0.8,
}
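A sketch of passing this map explicitly as substitution costs:
from ocr_stringdist import WeightedLevenshtein
from ocr_stringdist.default_ocr_distances import ocr_distance_map

wl = WeightedLevenshtein(
    substitution_costs=ocr_distance_map,
    symmetric_substitution=True,  # so ("O", "0") also covers ("0", "O")
)
wl.distance("O0PS", "OOPS")  # 0.1 via the ("O", "0") entry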