API Reference
- class ocr_stringdist.WeightedLevenshtein(substitution_costs: dict[tuple[str, str], float] | None = None, insertion_costs: dict[str, float] | None = None, deletion_costs: dict[str, float] | None = None, *, symmetric_substitution: bool = True, default_substitution_cost: float = 1.0, default_insertion_cost: float = 1.0, default_deletion_cost: float = 1.0)[source]
Calculates Levenshtein distance with custom, configurable costs.
This class is initialized with cost dictionaries and settings that define how the distance is measured. Once created, its methods can be used to efficiently compute distances and explain the edit operations.
- Parameters:
substitution_costs – Maps (str, str) tuples to their substitution cost. Defaults to costs based on common OCR errors.
insertion_costs – Maps a string to its insertion cost.
deletion_costs – Maps a string to its deletion cost.
symmetric_substitution – If True, a cost defined for, e.g., ('0', 'O') will automatically apply to ('O', '0'). If False, both must be defined explicitly.
default_substitution_cost – Default cost for single-char substitutions not in the map.
default_insertion_cost – Default cost for single-char insertions not in the map.
default_deletion_cost – Default cost for single-char deletions not in the map.
- Raises:
TypeError, ValueError – If the provided arguments are invalid.
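For illustration, a minimal construction sketch based on the parameters above (the cost values are made up):
from ocr_stringdist import WeightedLevenshtein

# Substituting '0' for 'O' is cheap; deleting a stray '.' is cheap as well.
wl = WeightedLevenshtein(
    substitution_costs={("0", "O"): 0.1},
    deletion_costs={".": 0.3},
    symmetric_substitution=True,  # ("O", "0") gets the same cost as ("0", "O")
)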
- batch_distance(s: str, candidates: list[str]) → list[float] [source]
Calculates distances between a string and a list of candidates.
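For instance, a short sketch (costs are illustrative; the result presumably follows candidate order):
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
wl.batch_distance("R0AD", ["ROAD", "R0AD", "READ"])
# roughly [0.1, 0.0, 1.0] - one distance per candidate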
- distance(s1: str, s2: str) → float [source]
Calculates the weighted Levenshtein distance between two strings.
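A quick sketch with illustrative costs:
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
wl.distance("C0DE", "CODE")  # 0.1 - cheap, OCR-typical substitution
wl.distance("C0DE", "CADE")  # 1.0 - default substitution cost applies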
- explain(s1: str, s2: str, filter_matches: bool = True) → list[EditOperation] [source]
Returns the list of edit operations to transform s1 into s2.
- Parameters:
s1 – First string (interpreted as the string read via OCR)
s2 – Second string (interpreted as the target string)
filter_matches – If True, ‘match’ operations are excluded from the result.
- Returns:
List of EditOperation instances.
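A sketch of inspecting the operations (the exact ordering and representation of the returned objects may differ):
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
for op in wl.explain("C0DE.", "CODE"):
    print(op.op_type, op.source_token, op.target_token, op.cost)
# e.g. substitute 0 O 0.1
#      delete . None 1.0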
- classmethod from_dict(data: dict[str, Any]) → WeightedLevenshtein [source]
Deserialize from a dictionary.
For the counterpart, see WeightedLevenshtein.to_dict().
- Parameters:
data – A dictionary with (not necessarily all of) the following keys:
"substitution_costs": {"from": str, "to": str, "cost": float}
"insertion_costs": dict[str, float]
"deletion_costs": dict[str, float]
"symmetric_substitution": bool
"default_substitution_cost": float
"default_insertion_cost": float
"default_deletion_cost": float
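A hedged sketch using only the dict-valued and scalar keys listed above:
from ocr_stringdist import WeightedLevenshtein

config = {
    "deletion_costs": {".": 0.3},
    "default_substitution_cost": 1.0,
    "symmetric_substitution": True,
}
wl = WeightedLevenshtein.from_dict(config)  # keys not present fall back to defaults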
- classmethod learn_from(pairs: Iterable[tuple[str, str]]) → WeightedLevenshtein [source]
Creates an instance by learning costs from a dataset of (OCR, ground truth) string pairs.
For more advanced learning configuration, see the ocr_stringdist.learner.CostLearner class.
- Parameters:
pairs – An iterable of (ocr_string, ground_truth_string) tuples. Correct pairs should not be filtered out; they are needed to learn well-calibrated costs.
- Returns:
A new WeightedLevenshtein instance with the learned costs.
Example:
from ocr_stringdist import WeightedLevenshtein

training_data = [
    ("8N234", "BN234"),  # read '8' instead of 'B'
    ("BJK18", "BJK18"),  # correct
    ("ABC0.", "ABC0"),   # extra '.'
]
wl = WeightedLevenshtein.learn_from(training_data)
print(wl.substitution_costs)  # learned cost for substituting '8' with 'B'
print(wl.deletion_costs)      # learned cost for deleting '.'
- to_dict() → dict[str, Any] [source]
Serializes the instance’s configuration to a dictionary.
The result can be written to, say, JSON.
For the counterpart, see WeightedLevenshtein.from_dict().
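For example, a round-trip through JSON (a sketch; the dictionary layout is described under from_dict()):
import json
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein({("0", "O"): 0.1})
payload = json.dumps(wl.to_dict())                             # persist the configuration
restored = WeightedLevenshtein.from_dict(json.loads(payload))  # reconstruct it later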
- classmethod unweighted() → WeightedLevenshtein [source]
Creates an instance in which every edit operation has a cost of 1.0.
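In other words, it reproduces the classic Levenshtein distance:
from ocr_stringdist import WeightedLevenshtein

wl = WeightedLevenshtein.unweighted()
wl.distance("kitten", "sitting")  # 3.0 - three unit-cost edits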
- class ocr_stringdist.learner.CostLearner[source]
Configures and executes the process of learning Levenshtein costs from data.
This class uses a builder pattern, allowing configuration methods to be chained before running the final calculation with .fit().
Example:
from ocr_stringdist import CostLearner

data = [
    ("Hell0", "Hello"),
]
learner = CostLearner().with_smoothing(1.0)
wl = learner.fit(data)  # Substitution 0 -> o learned with cost < 1.0
- fit(pairs: Iterable[tuple[str, str]], *, initial_model: Aligner | None = None, calculate_for_unseen: bool = False) → WeightedLevenshtein [source]
Fits the costs of a WeightedLevenshtein instance to the provided data.
Note that learning multi-character tokens is only supported if an initial alignment model capable of handling them is provided.
This method analyzes pairs of strings to learn the costs of edit operations based on their observed frequencies. The underlying model calculates costs based on the principle of relative information cost.
For a detailed explanation of the methodology, please see the Cost Learning Model documentation page.
- Parameters:
pairs – An iterable of (ocr_string, ground_truth_string) tuples.
initial_model – Optional initial model used to align OCR outputs and ground truth strings. By default, an unweighted Levenshtein distance is used.
calculate_for_unseen – If True (and k > 0), pre-calculates costs for all possible edit operations based on the vocabulary. If False (default), only calculates costs for operations observed in the data.
- Returns:
A WeightedLevenshtein instance with the learned costs.
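A sketch of a configured fit (the pairs are made up; calculate_for_unseen behaves as described above):
from ocr_stringdist import CostLearner

pairs = [("Hell0", "Hello"), ("W0rld", "World")]
wl = (
    CostLearner()
    .with_smoothing(1.0)
    .fit(pairs, calculate_for_unseen=True)  # also pre-calculate costs for unseen vocabulary pairs
)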
- with_smoothing(k: float) → CostLearner [source]
Sets the smoothing parameter k.
This parameter controls how strongly the model defaults to a uniform probability distribution by adding a “pseudo-count” of k to every possible event.
- Parameters:
k – The smoothing factor, which must be a non-negative number.
- Returns:
The CostLearner instance for method chaining.
- Raises:
ValueError – If k < 0.
Notes
This parameter allows for a continuous transition between two modes:
k > 0 (recommended): This enables additive smoothing, with k = 1.0 being Laplace smoothing. It regularizes the model by assuming no event is impossible. The final costs are a measure of “relative surprisal,” normalized by the vocabulary size.
k = 0: This corresponds to a normalized Maximum Likelihood Estimation. Probabilities are derived from the raw observed frequencies. The final costs are normalized using the same logic as the k > 0 case, making k=0 the continuous limit of the smoothed model. In this mode, costs can only be calculated for events observed in the training data. Unseen events will receive the default cost, regardless of the value of calculate_for_unseen in fit().
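A brief sketch contrasting the two modes (the data is made up):
from ocr_stringdist import CostLearner

pairs = [("Hell0", "Hello")]
wl_mle = CostLearner().with_smoothing(0.0).fit(pairs)       # k = 0: raw frequencies; unseen events keep the default cost
wl_smoothed = CostLearner().with_smoothing(1.0).fit(pairs)  # k = 1: Laplace smoothing; no event treated as impossible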
- class ocr_stringdist.edit_operation.EditOperation(op_type: Literal['substitute', 'insert', 'delete', 'match'], source_token: str | None, target_token: str | None, cost: float)[source]
Represents a single edit operation (substitution, insertion, deletion or match).
- ocr_stringdist.matching.find_best_candidate(s: str, candidates: Iterable[str], distance_fun: Callable[[str, str], float], *, minimize: bool = True, early_return_value: float | None = None) → tuple[str, float] [source]
Finds the best matching string from a collection of candidates based on a distance function.
Compares a given string against each string in the ‘candidates’ iterable using the provided ‘distance_fun’. It identifies the candidate that yields the minimum (or maximum, if minimize=False) distance.
- Parameters:
s (str) – The reference string to compare against.
candidates (Iterable[str]) – An iterable of candidate strings to compare with ‘s’.
distance_fun (Callable[[str, str], float]) – A function that takes two strings (s, candidate) and returns a float representing their distance or similarity.
minimize (bool) – If True (default), finds the candidate with the minimum distance. If False, finds the candidate with the maximum distance (useful for similarity scores).
early_return_value (Optional[float]) – If provided, the function will return immediately if a distance is found that is less than or equal to this value (if minimize=True) or greater than or equal to this value (if minimize=False). If None (default), all candidates are checked.
- Raises:
ValueError – If the ‘candidates’ iterable is empty.
- Returns:
A tuple containing the best matching candidate string and its calculated distance/score.
- Return type:
tuple[str, float]
Example:
from ocr_stringdist import find_best_candidate, WeightedLevenshtein

wl = WeightedLevenshtein({("l", "I"): 0.1})
find_best_candidate("apple", ["apply", "apples", "orange", "appIe"], wl.distance)
# ('appIe', 0.1)
- ocr_stringdist.default_ocr_distances.ocr_distance_map
Pre-defined distance map between characters, considering common OCR errors. The distances are between 0 and 1. This map is intended to be used with symmetric_substitution=True.
ocr_distance_map: dict[tuple[str, str], float] = {
("O", "0"): 0.1,
("l", "1"): 0.1,
("I", "1"): 0.15,
("o", "0"): 0.2,
("B", "8"): 0.25,
("S", "5"): 0.3,
("G", "6"): 0.3,
("Z", "2"): 0.3,
("C", "c"): 0.3,
("é", "e"): 0.3,
("Ä", "A"): 0.4,
("Ö", "O"): 0.4,
("Ü", "U"): 0.4,
("c", "e"): 0.4,
("a", "o"): 0.4,
("u", "v"): 0.4,
("i", "l"): 0.4,
("s", "5"): 0.4,
("m", "n"): 0.5,
("f", "s"): 0.5,
(".", ","): 0.5,
("2", "Z"): 0.5,
("t", "f"): 0.6,
("r", "n"): 0.6,
("-", "_"): 0.6,
("ß", "B"): 0.6,
("h", "b"): 0.7,
("v", "y"): 0.7,
("i", "j"): 0.7,
("é", "á"): 0.7,
("E", "F"): 0.8,
}
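A sketch of passing this map explicitly as substitution costs:
from ocr_stringdist import WeightedLevenshtein
from ocr_stringdist.default_ocr_distances import ocr_distance_map

wl = WeightedLevenshtein(
    substitution_costs=ocr_distance_map,
    symmetric_substitution=True,  # so ("O", "0") also covers ("0", "O")
)
wl.distance("O0PS", "OOPS")  # 0.1 via the ("O", "0") entry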