API Reference
- ocr_stringdist.levenshtein.batch_weighted_levenshtein_distance(s: str, candidates: list[str], /, substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0) -> list[float]
Calculate weighted Levenshtein distances between a string and multiple candidates.
This is more efficient than calling weighted_levenshtein_distance() multiple times.
- Parameters:
s – The string to compare (interpreted as the string read via OCR)
candidates – List of candidate strings to compare against
substitution_costs – Dictionary mapping tuples of strings (“substitution tokens”) to their substitution costs. Only one direction needs to be configured unless symmetric_substitution is False. Note that the runtime scales with the length of the longest substitution token. Defaults to ocr_stringdist.ocr_distance_map.
insertion_costs – Dictionary mapping strings to their insertion costs.
deletion_costs – Dictionary mapping strings to their deletion costs.
symmetric_substitution – Should the keys of substitution_costs be considered to be symmetric? Defaults to True.
default_substitution_cost – The default substitution cost for character pairs not found in substitution_costs.
default_insertion_cost – The default insertion cost for characters not found in insertion_costs.
default_deletion_cost – The default deletion cost for characters not found in deletion_costs.
- Returns:
A list of distances corresponding to each candidate
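To illustrate how the returned list lines up with candidates, here is a minimal sketch. It does not call the library: unit_levenshtein and batch_distances_sketch are hypothetical stand-ins that use plain unit costs, assuming only that the i-th distance belongs to the i-th candidate, as the Returns description suggests.

```python
def unit_levenshtein(a: str, b: str) -> float:
    """Plain unit-cost Levenshtein distance (hypothetical stand-in, not the library)."""
    previous = [float(j) for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, start=1):
        current = [float(i)]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1.0,                 # delete ch_a
                current[j - 1] + 1.0,              # insert ch_b
                previous[j - 1] + (ch_a != ch_b),  # substitute (cost 1) or match (cost 0)
            ))
        previous = current
    return previous[-1]


def batch_distances_sketch(s: str, candidates: list[str]) -> list[float]:
    """One distance per candidate, in the same order as the input list."""
    return [unit_levenshtein(s, candidate) for candidate in candidates]


candidates = ["sitting", "kitten", "mitten"]
distances = batch_distances_sketch("kitten", candidates)

# Pair each candidate with its distance and pick the closest one.
best_candidate, best_distance = min(zip(candidates, distances), key=lambda pair: pair[1])
```

Because the result is positional, zipping it with the original candidates list is enough to recover which string each distance refers to.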
- ocr_stringdist.levenshtein.weighted_levenshtein_distance(s1: str, s2: str, /, substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0) -> float
Levenshtein distance with custom substitution, insertion and deletion costs.
The default substitution_costs considers common OCR errors; see ocr_stringdist.default_ocr_distances.ocr_distance_map.
- Parameters:
s1 – First string (interpreted as the string read via OCR)
s2 – Second string
substitution_costs – Dictionary mapping tuples of strings (“substitution tokens”) to their substitution costs. Only one direction needs to be configured unless symmetric_substitution is False. Note that the runtime scales with the length of the longest substitution token. Defaults to ocr_stringdist.ocr_distance_map.
insertion_costs – Dictionary mapping strings to their insertion costs.
deletion_costs – Dictionary mapping strings to their deletion costs.
symmetric_substitution – Should the keys of substitution_costs be considered to be symmetric? Defaults to True.
default_substitution_cost – The default substitution cost for character pairs not found in substitution_costs.
default_insertion_cost – The default insertion cost for characters not found in insertion_costs.
default_deletion_cost – The default deletion cost for characters not found in deletion_costs.
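The way these parameters interact can be sketched with a small pure-Python dynamic program. This is not the library's implementation. It is a simplified illustration that assumes single-character tokens only, treats a missing substitution_costs as an empty map rather than the OCR default, and assumes that deletions remove characters of s1 while insertions add characters of s2. The name weighted_levenshtein_sketch is hypothetical.

```python
from typing import Optional

CostMap = dict[tuple[str, str], float]


def weighted_levenshtein_sketch(
    s1: str,
    s2: str,
    substitution_costs: Optional[CostMap] = None,
    insertion_costs: Optional[dict[str, float]] = None,
    deletion_costs: Optional[dict[str, float]] = None,
    *,
    symmetric_substitution: bool = True,
    default_substitution_cost: float = 1.0,
    default_insertion_cost: float = 1.0,
    default_deletion_cost: float = 1.0,
) -> float:
    # Assumption: None means "no custom costs" here, unlike the real default.
    sub = substitution_costs or {}
    ins = insertion_costs or {}
    dele = deletion_costs or {}

    def sub_cost(a: str, b: str) -> float:
        if a == b:
            return 0.0
        if (a, b) in sub:
            return sub[(a, b)]
        # With symmetric substitution, the reversed key is also accepted.
        if symmetric_substitution and (b, a) in sub:
            return sub[(b, a)]
        return default_substitution_cost

    rows, cols = len(s1) + 1, len(s2) + 1
    # dp[i][j] is the cheapest way to turn s1[:i] into s2[:j].
    dp = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + dele.get(s1[i - 1], default_deletion_cost)
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + ins.get(s2[j - 1], default_insertion_cost)
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j] + dele.get(s1[i - 1], default_deletion_cost),
                dp[i][j - 1] + ins.get(s2[j - 1], default_insertion_cost),
                dp[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return dp[-1][-1]


cheap = weighted_levenshtein_sketch("appIe", "apple", {("l", "I"): 0.1})
strict = weighted_levenshtein_sketch(
    "appIe", "apple", {("l", "I"): 0.1}, symmetric_substitution=False
)
```

In this sketch, cheap uses the configured cost because the reversed key ("l", "I") is accepted for the pair ("I", "l"), while strict falls back to default_substitution_cost because only the exact key direction counts.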
- ocr_stringdist.matching.find_best_candidate(s: str, candidates: Iterable[str], distance_fun: Callable[[str, str], float], *, minimize: bool = True, early_return_value: float | None = None) -> tuple[str, float]
Finds the best matching string from a collection of candidates based on a distance function.
Compares a given string against each string in the ‘candidates’ iterable using the provided ‘distance_fun’. It identifies the candidate that yields the minimum (or maximum, if minimize=False) distance.
- Parameters:
s (str) – The reference string to compare against.
candidates (Iterable[str]) – An iterable of candidate strings to compare with ‘s’.
distance_fun (Callable[[str, str], float]) – A function that takes two strings (s, candidate) and returns a float representing their distance or similarity.
minimize (bool) – If True (default), finds the candidate with the minimum distance. If False, finds the candidate with the maximum distance (useful for similarity scores).
early_return_value (Optional[float]) – If provided, the function will return immediately if a distance is found that is less than or equal to this value (if minimize=True) or greater than or equal to this value (if minimize=False). If None (default), all candidates are checked.
- Raises:
ValueError – If the ‘candidates’ iterable is empty.
- Returns:
A tuple containing the best matching candidate string and its calculated distance/score.
- Return type:
tuple[str, float]
- Example:
>>> from ocr_stringdist import weighted_levenshtein_distance as distance
>>> from ocr_stringdist.matching import find_best_candidate
>>> s = "apple"
>>> candidates = ["apply", "apples", "orange", "appIe"]
>>> find_best_candidate(s, candidates, lambda s1, s2: distance(s1, s2, {("l", "I"): 0.1}))
('appIe', 0.1)
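The selection logic described above, including minimize and early_return_value, can be sketched roughly as follows. This is an illustrative reimplementation, not the library's code: find_best_candidate_sketch and mismatch_count are hypothetical names, and keeping the first candidate on ties is an assumption.

```python
from collections.abc import Callable, Iterable
from typing import Optional


def find_best_candidate_sketch(
    s: str,
    candidates: Iterable[str],
    distance_fun: Callable[[str, str], float],
    *,
    minimize: bool = True,
    early_return_value: Optional[float] = None,
) -> tuple[str, float]:
    best: Optional[str] = None
    best_score: Optional[float] = None
    for candidate in candidates:
        score = distance_fun(s, candidate)
        # Stop as soon as a score is "good enough" for the chosen direction.
        if early_return_value is not None:
            if minimize and score <= early_return_value:
                return candidate, score
            if not minimize and score >= early_return_value:
                return candidate, score
        # Assumption: strict comparison, so the first candidate wins ties.
        if best_score is None or (score < best_score if minimize else score > best_score):
            best, best_score = candidate, score
    if best is None or best_score is None:
        raise ValueError("candidates must not be empty")
    return best, best_score


def mismatch_count(a: str, b: str) -> float:
    """Toy distance: differing positions plus the length difference."""
    return float(sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b)))


closest = find_best_candidate_sketch("apple", ["apply", "apples", "apple"], mismatch_count)
farthest = find_best_candidate_sketch(
    "apple", ["apply", "orange", "apple"], mismatch_count, minimize=False
)
early = find_best_candidate_sketch(
    "apple", ["apply", "apple"], mismatch_count, early_return_value=1.0
)
```

Here early stops at "apply" because its score already meets the threshold, even though a later candidate would have scored better. That trade-off is the point of early_return_value.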
- ocr_stringdist.default_ocr_distances.ocr_distance_map
Pre-defined distance map between characters, considering common OCR errors. The distances are between 0 and 1. This map is intended to be used with symmetric_substitution=True.
ocr_distance_map: dict[tuple[str, str], float] = {
    ("O", "0"): 0.1,
    ("l", "1"): 0.1,
    ("I", "1"): 0.15,
    ("o", "0"): 0.2,
    ("B", "8"): 0.25,
    ("S", "5"): 0.3,
    ("G", "6"): 0.3,
    ("Z", "2"): 0.3,
    ("C", "c"): 0.3,
    ("é", "e"): 0.3,
    ("Ä", "A"): 0.4,
    ("Ö", "O"): 0.4,
    ("Ü", "U"): 0.4,
    ("c", "e"): 0.4,
    ("a", "o"): 0.4,
    ("u", "v"): 0.4,
    ("i", "l"): 0.4,
    ("s", "5"): 0.4,
    ("m", "n"): 0.5,
    ("f", "s"): 0.5,
    (".", ","): 0.5,
    ("2", "Z"): 0.5,
    ("t", "f"): 0.6,
    ("r", "n"): 0.6,
    ("-", "_"): 0.6,
    ("ß", "B"): 0.6,
    ("h", "b"): 0.7,
    ("v", "y"): 0.7,
    ("i", "j"): 0.7,
    ("é", "á"): 0.7,
    ("E", "F"): 0.8,
}
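As a rough illustration of what using such a map symmetrically means, the sketch below looks up a pair in either direction and falls back to a default otherwise. The helper lookup_substitution_cost and the three-entry excerpt are hypothetical. The library's actual lookup order is not documented here, so treat this as one plausible reading.

```python
# Small excerpt of the map above, copied for a self-contained example.
OCR_COSTS_EXCERPT: dict[tuple[str, str], float] = {
    ("O", "0"): 0.1,
    ("l", "1"): 0.1,
    ("B", "8"): 0.25,
}


def lookup_substitution_cost(
    a: str,
    b: str,
    costs: dict[tuple[str, str], float],
    *,
    symmetric: bool = True,
    default: float = 1.0,
) -> float:
    """Cost of substituting a with b under a (possibly symmetric) cost map."""
    if a == b:
        return 0.0
    if (a, b) in costs:
        return costs[(a, b)]
    # Assumption: in symmetric mode the reversed key is an acceptable match.
    if symmetric and (b, a) in costs:
        return costs[(b, a)]
    return default


forward = lookup_substitution_cost("O", "0", OCR_COSTS_EXCERPT)
reverse = lookup_substitution_cost("0", "O", OCR_COSTS_EXCERPT)
one_way = lookup_substitution_cost("0", "O", OCR_COSTS_EXCERPT, symmetric=False)
unknown = lookup_substitution_cost("x", "y", OCR_COSTS_EXCERPT)
```

With a symmetric lookup, each confusable pair only needs to be listed once, which matches the note that only one direction has to be configured.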