API Reference
- class ocr_stringdist.WeightedLevenshtein(substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0)
Calculates Levenshtein distance with custom, configurable costs.
This class is initialized with cost dictionaries and settings that define how the distance is measured. Once created, its methods can be used to efficiently compute distances and explain the edit operations.
- Parameters:
substitution_costs – Maps (char, char) tuples to their substitution cost. Defaults to costs based on common OCR errors.
insertion_costs – Maps a character to its insertion cost.
deletion_costs – Maps a character to its deletion cost.
symmetric_substitution – If True, substitution costs are bidirectional.
default_substitution_cost – Default cost for substitutions not in the map.
default_insertion_cost – Default cost for insertions not in the map.
default_deletion_cost – Default cost for deletions not in the map.
- batch_distance(s: str, candidates: list[str]) -> list[float]
Calculates distances between a string and a list of candidates.
- distance(s1: str, s2: str) -> float
Calculates the weighted Levenshtein distance between two strings.
- explain(s1: str, s2: str) -> list[EditOperation]
Returns the list of edit operations to transform s1 into s2.
- classmethod unweighted() -> WeightedLevenshtein
Creates an instance with all operations having equal cost of 1.0.
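The interplay of substitution_costs, symmetric_substitution and default_substitution_cost can be sketched as a plain lookup. This is only an illustration of the documented semantics, not the library's implementation; the helper name resolve_substitution_cost is hypothetical, and treating identical tokens as free is an assumption borrowed from the standard Levenshtein definition.

```python
def resolve_substitution_cost(a, b, costs, *, symmetric=True, default=1.0):
    """Illustrative lookup of the cost of substituting token `a` with `b`.

    Hypothetical helper, not part of ocr_stringdist.
    """
    if a == b:
        return 0.0  # assumption: identical tokens cost nothing
    if (a, b) in costs:
        return costs[(a, b)]  # explicitly configured direction
    if symmetric and (b, a) in costs:
        return costs[(b, a)]  # reverse direction is reused when symmetric
    return default  # pair not in the map


costs = {("O", "0"): 0.1}
forward = resolve_substitution_cost("O", "0", costs)
backward = resolve_substitution_cost("0", "O", costs)
one_way = resolve_substitution_cost("0", "O", costs, symmetric=False)
```

Here forward and backward both resolve to the configured 0.1, while one_way falls back to the default because only the ("O", "0") direction was configured.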
- ocr_stringdist.weighted_levenshtein_distance(s1: str, s2: str, /, substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0) -> float
Levenshtein distance with custom substitution, insertion and deletion costs.
See also WeightedLevenshtein.distance().
The default substitution_costs accounts for common OCR errors; see ocr_stringdist.default_ocr_distances.ocr_distance_map.
- Parameters:
s1 – First string (interpreted as the string read via OCR)
s2 – Second string
substitution_costs – Dictionary mapping tuples of strings (“substitution tokens”) to their substitution costs. Only one direction needs to be configured unless symmetric_substitution is False. Note that the runtime scales with the length of the longest substitution token. Defaults to ocr_stringdist.ocr_distance_map.
insertion_costs – Dictionary mapping strings to their insertion costs.
deletion_costs – Dictionary mapping strings to their deletion costs.
symmetric_substitution – Whether the keys of substitution_costs are treated as symmetric. Defaults to True.
default_substitution_cost – The default substitution cost for character pairs not found in substitution_costs.
default_insertion_cost – The default insertion cost for characters not found in insertion_costs.
default_deletion_cost – The default deletion cost for characters not found in deletion_costs.
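As a rough illustration of how these parameters enter the computation, the following is a minimal pure-Python sketch of the classic dynamic-programming recurrence with per-character costs. It is not the library's implementation: the function name weighted_levenshtein_sketch is hypothetical, it handles single-character tokens only (no multi-character substitution tokens), and it treats substitution_costs=None as an empty map rather than the OCR default.

```python
def weighted_levenshtein_sketch(
    s1,
    s2,
    substitution_costs=None,
    insertion_costs=None,
    deletion_costs=None,
    *,
    symmetric_substitution=True,
    default_substitution_cost=1.0,
    default_insertion_cost=1.0,
    default_deletion_cost=1.0,
):
    """Cost of transforming s1 into s2 (single-character tokens only)."""
    sub = dict(substitution_costs or {})  # assumption: None means "no custom costs"
    if symmetric_substitution:
        for (a, b), cost in list(sub.items()):
            sub.setdefault((b, a), cost)  # fill in the reverse direction
    ins = insertion_costs or {}
    dele = deletion_costs or {}

    # prev[j]: cost of turning the processed prefix of s1 into s2[:j]
    prev = [0.0]
    for ch in s2:
        prev.append(prev[-1] + ins.get(ch, default_insertion_cost))

    for c1 in s1:
        del_cost = dele.get(c1, default_deletion_cost)
        cur = [prev[0] + del_cost]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                sub_cost = 0.0  # matching characters are free
            else:
                sub_cost = sub.get((c1, c2), default_substitution_cost)
            cur.append(
                min(
                    prev[j] + del_cost,  # delete c1
                    cur[j - 1] + ins.get(c2, default_insertion_cost),  # insert c2
                    prev[j - 1] + sub_cost,  # substitute or match
                )
            )
        prev = cur
    return prev[-1]


cheap = weighted_levenshtein_sketch("apple", "appIe", {("l", "I"): 0.1})
plain = weighted_levenshtein_sketch("kitten", "sitting")
```

With the custom ("l", "I") cost, cheap reflects a single low-cost substitution, while plain is the ordinary unit-cost edit distance.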
- ocr_stringdist.batch_weighted_levenshtein_distance(s: str, candidates: list[str], /, substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0) -> list[float]
Calculate weighted Levenshtein distances between a string and multiple candidates.
See also WeightedLevenshtein.batch_distance().
This is more efficient than calling weighted_levenshtein_distance() multiple times.
- Parameters:
s – The string to compare (interpreted as the string read via OCR)
candidates – List of candidate strings to compare against
substitution_costs – Dictionary mapping tuples of strings (“substitution tokens”) to their substitution costs. Only one direction needs to be configured unless symmetric_substitution is False. Note that the runtime scales with the length of the longest substitution token. Defaults to ocr_stringdist.ocr_distance_map.
insertion_costs – Dictionary mapping strings to their insertion costs.
deletion_costs – Dictionary mapping strings to their deletion costs.
symmetric_substitution – Whether the keys of substitution_costs are treated as symmetric. Defaults to True.
default_substitution_cost – The default substitution cost for character pairs not found in substitution_costs.
default_insertion_cost – The default insertion cost for characters not found in insertion_costs.
default_deletion_cost – The default deletion cost for characters not found in deletion_costs.
- Returns:
A list of distances corresponding to each candidate
- ocr_stringdist.explain_weighted_levenshtein(s1: str, s2: str, /, substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0) -> list[EditOperation]
Computes the path of operations associated with the custom Levenshtein distance.
See also WeightedLevenshtein.explain().
The default substitution_costs accounts for common OCR errors; see ocr_stringdist.default_ocr_distances.ocr_distance_map.
- Parameters:
s1 – First string (interpreted as the string read via OCR)
s2 – Second string
substitution_costs – Dictionary mapping tuples of strings (“substitution tokens”) to their substitution costs. Only one direction needs to be configured unless symmetric_substitution is False. Note that the runtime scales with the length of the longest substitution token. Defaults to ocr_stringdist.ocr_distance_map.
insertion_costs – Dictionary mapping strings to their insertion costs.
deletion_costs – Dictionary mapping strings to their deletion costs.
symmetric_substitution – Whether the keys of substitution_costs are treated as symmetric. Defaults to True.
default_substitution_cost – The default substitution cost for character pairs not found in substitution_costs.
default_insertion_cost – The default insertion cost for characters not found in insertion_costs.
default_deletion_cost – The default deletion cost for characters not found in deletion_costs.
- Returns:
List of EditOperation instances.
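One common way to recover such a path of operations is to fill the full dynamic-programming table and then walk back from the bottom-right cell. The sketch below illustrates that idea under simplifying assumptions: single-character tokens, uniform insertion and deletion costs, and symmetric substitution costs. The Op tuple is a hypothetical stand-in for EditOperation, whose actual fields are not described here, and whether the library also reports matching characters is not assumed (this sketch omits them).

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    # Hypothetical stand-in for EditOperation; the real class may differ.
    kind: str
    source: Optional[str]
    target: Optional[str]
    cost: float


def explain_sketch(s1, s2, substitution_costs=None, default_cost=1.0):
    """Return non-match edit operations turning s1 into s2 (illustrative)."""
    sub = dict(substitution_costs or {})
    for (a, b), cost in list(sub.items()):
        sub.setdefault((b, a), cost)  # assumption: symmetric substitution costs

    def sub_cost(a, b):
        return 0.0 if a == b else sub.get((a, b), default_cost)

    n, m = len(s1), len(s2)
    # d[i][j]: cost of transforming s1[:i] into s2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + default_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + default_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + default_cost,  # delete s1[i-1]
                d[i][j - 1] + default_cost,  # insert s2[j-1]
                d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )

    # Walk back from d[n][m], recording which move produced each cell.
    ops = []
    i, j = n, m
    eps = 1e-9
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            d[i][j] - (d[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]))
        ) < eps:
            a, b = s1[i - 1], s2[j - 1]
            if a != b:
                ops.append(Op("substitute", a, b, sub_cost(a, b)))
            i, j = i - 1, j - 1
        elif i > 0 and abs(d[i][j] - (d[i - 1][j] + default_cost)) < eps:
            ops.append(Op("delete", s1[i - 1], None, default_cost))
            i -= 1
        else:
            ops.append(Op("insert", None, s2[j - 1], default_cost))
            j -= 1
    ops.reverse()  # operations were collected back to front
    return ops


path = explain_sketch("apple", "appIe", {("l", "I"): 0.1})
```

The costs of the returned operations sum to the distance computed by the same recurrence, which is a convenient sanity check when experimenting with cost maps.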
- ocr_stringdist.matching.find_best_candidate(s: str, candidates: Iterable[str], distance_fun: Callable[[str, str], float], *, minimize: bool = True, early_return_value: float | None = None) -> tuple[str, float]
Finds the best matching string from a collection of candidates based on a distance function.
Compares a given string against each string in the ‘candidates’ iterable using the provided ‘distance_fun’. It identifies the candidate that yields the minimum (or maximum, if minimize=False) distance.
- Parameters:
s (str) – The reference string to compare against.
candidates (Iterable[str]) – An iterable of candidate strings to compare with ‘s’.
distance_fun (Callable[[str, str], float]) – A function that takes two strings (s, candidate) and returns a float representing their distance or similarity.
minimize (bool) – If True (default), finds the candidate with the minimum distance. If False, finds the candidate with the maximum distance (useful for similarity scores).
early_return_value (Optional[float]) – If provided, the function will return immediately if a distance is found that is less than or equal to this value (if minimize=True) or greater than or equal to this value (if minimize=False). If None (default), all candidates are checked.
- Raises:
ValueError – If the ‘candidates’ iterable is empty.
- Returns:
A tuple containing the best matching candidate string and its calculated distance/score.
- Return type:
tuple[str, float]
- Example:
>>> from ocr_stringdist import weighted_levenshtein_distance as distance
>>> from ocr_stringdist.matching import find_best_candidate
>>> s = "apple"
>>> candidates = ["apply", "apples", "orange", "appIe"]
>>> find_best_candidate(s, candidates, lambda s1, s2: distance(s1, s2, {("l", "I"): 0.1}))
('appIe', 0.1)
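The selection and early-return behavior described above can be sketched in a few lines of plain Python. This is an illustrative reading of the documented semantics, not the library's source; the names find_best_candidate_sketch and mismatch_count are hypothetical, and the toy distance function is only there to keep the example self-contained.

```python
def find_best_candidate_sketch(
    s, candidates, distance_fun, *, minimize=True, early_return_value=None
):
    """Illustrative re-implementation of the documented behavior."""
    best = None
    best_score = None
    for candidate in candidates:
        score = distance_fun(s, candidate)
        if best_score is None:
            is_better = True
        elif minimize:
            is_better = score < best_score
        else:
            is_better = score > best_score
        if is_better:
            best, best_score = candidate, score
        if early_return_value is not None:
            # Stop as soon as a candidate is "good enough".
            if minimize and score <= early_return_value:
                return candidate, score
            if not minimize and score >= early_return_value:
                return candidate, score
    if best_score is None:
        raise ValueError("candidates must not be empty")
    return best, best_score


def mismatch_count(a, b):
    # Toy distance: positional mismatches plus the length difference.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


words = ["orange", "apples", "apply"]
closest = find_best_candidate_sketch("apple", words, mismatch_count)
farthest = find_best_candidate_sketch("apple", words, mismatch_count, minimize=False)
```

In this sketch, ties keep the first candidate encountered; whether the library resolves ties the same way is not stated in the reference above.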
- ocr_stringdist.default_ocr_distances.ocr_distance_map
Pre-defined distance map between characters, considering common OCR errors. The distances are between 0 and 1. This map is intended to be used with symmetric_substitution=True.
ocr_distance_map: dict[tuple[str, str], float] = {
("O", "0"): 0.1,
("l", "1"): 0.1,
("I", "1"): 0.15,
("o", "0"): 0.2,
("B", "8"): 0.25,
("S", "5"): 0.3,
("G", "6"): 0.3,
("Z", "2"): 0.3,
("C", "c"): 0.3,
("é", "e"): 0.3,
("Ä", "A"): 0.4,
("Ö", "O"): 0.4,
("Ü", "U"): 0.4,
("c", "e"): 0.4,
("a", "o"): 0.4,
("u", "v"): 0.4,
("i", "l"): 0.4,
("s", "5"): 0.4,
("m", "n"): 0.5,
("f", "s"): 0.5,
(".", ","): 0.5,
("2", "Z"): 0.5,
("t", "f"): 0.6,
("r", "n"): 0.6,
("-", "_"): 0.6,
("ß", "B"): 0.6,
("h", "b"): 0.7,
("v", "y"): 0.7,
("i", "j"): 0.7,
("é", "á"): 0.7,
("E", "F"): 0.8,
}