Cost Learning Model
The CostLearner class calculates edit costs using a probabilistic model. The cost of an edit operation is defined by its surprisal: a measure of how unlikely that event is based on the training data. This value, the negative log-probability \(-\log(P(e))\), quantifies the information contained in observing an event \(e\).
A common, high-probability error will have low surprisal and thus a low cost. A rare, low-probability error will have high surprisal and a high cost.
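For example, using natural logarithms and purely illustrative probabilities, an error observed 10% of the time carries far less information than one observed 0.1% of the time:
\[
-\log(0.1) \approx 2.30, \qquad -\log(0.001) \approx 6.91
\]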
Probabilistic Model
The model estimates the probabilities of edit operations and transforms them into normalized, comparable costs. The smoothing parameter \(k\) (set via with_smoothing()) allows a continuous transition between a pure maximum-likelihood estimate and a smoothed Bayesian model.
General Notation
\(c(e)\): The observed count of a specific event \(e\). For example, \(c(s \to t)\) is the count of source character \(s\) being substituted by target character \(t\).
\(C(x)\): The total count of a specific context character \(x\). For example, \(C(s)\) is the total number of times the source character \(s\) appeared in the OCR outputs.
\(V\): The total number of unique characters in the vocabulary.
Probability of an Edit Operation
The model treats all edit operations within the same probabilistic framework. An insertion is modeled as a substitution from a ground-truth character to an “empty” character, and a deletion is a substitution from an OCR character to an empty character.
This means that for any given character (either from the source or the target), there are \(V+1\) possible outcomes: a transformation into any of the \(V\) vocabulary characters or a transformation into an empty character.
The smoothed conditional probability for any edit event \(e\) given a context character \(x\) (where \(x\) is a source character for substitutions/deletions or a target character for insertions) is:
\[
P(e \mid x) = \frac{c(e) + k}{C(x) + k\,(V + 1)}
\]
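As a minimal sketch (in Python, not the library's implementation; the function name, counts, and vocabulary size below are illustrative), the smoothed estimate can be computed directly from raw counts:

```python
def smoothed_probability(event_count: int, context_count: int,
                         vocab_size: int, k: float) -> float:
    """Add-k smoothed estimate of P(e | x) over the V + 1 possible outcomes."""
    return (event_count + k) / (context_count + k * (vocab_size + 1))

# Illustrative counts: the substitution 'l' -> '1' was observed 40 times
# among 1000 occurrences of 'l', with a vocabulary of 100 characters.
p = smoothed_probability(event_count=40, context_count=1000, vocab_size=100, k=1.0)
print(p)  # ~0.0372 with k = 1.0; exactly 0.04 with k = 0 (pure maximum likelihood)
```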
Bayesian Interpretation
When \(k > 0\), the parameter acts as the concentration parameter of a symmetric Dirichlet prior distribution. This represents a prior belief that every possible error is equally likely and has a “pseudo-count” of \(k\).
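One consequence of this prior, under the smoothed estimate above, is that an error never seen in training still receives a small non-zero probability (and therefore a finite cost):
\[
c(e) = 0 \;\Rightarrow\; P(e \mid x) = \frac{k}{C(x) + k\,(V+1)} > 0
\]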
Normalization
The costs are normalized by a ceiling \(Z\) that depends on the size of the unified outcome space. It is the a priori surprisal of any single event, assuming a uniform probability distribution over all \(V+1\) possible outcomes:
\[
Z = -\log\!\left(\frac{1}{V+1}\right) = \log(V+1)
\]
This normalization contextualizes the cost relative to the complexity of the character set.
Final Cost
The final cost \(w(e)\) is the base surprisal scaled by the normalization ceiling:
\[
w(e) = \frac{-\log P(e \mid x)}{Z} = \frac{-\log P(e \mid x)}{\log(V+1)}
\]
This cost is a relative measure. Costs greater than 1.0 indicate that the observed event was less probable than it would be under the uniform a priori assumption.
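A short sketch (again in Python with illustrative numbers, not the library's code) ties the pieces together by computing the normalized cost from counts:

```python
import math

def edit_cost(event_count: int, context_count: int,
              vocab_size: int, k: float) -> float:
    """Surprisal of the smoothed probability, normalized by the ceiling Z = log(V + 1)."""
    p = (event_count + k) / (context_count + k * (vocab_size + 1))
    z = math.log(vocab_size + 1)  # a priori surprisal of a uniform outcome
    return -math.log(p) / z

# Illustrative numbers: a frequent confusion is cheap, a rare one is expensive.
print(edit_cost(40, 1000, 100, k=1.0))  # ≈ 0.71, more probable than uniform
print(edit_cost(1, 1000, 100, k=1.0))   # ≈ 1.37, less probable than uniform, so cost > 1.0
```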
Asymptotic Properties
As the amount of training data grows, the learned cost for an operation with a stable frequency (“share”) converges to a fixed value that is independent of \(k\):
\[
w(e) \;\to\; \frac{-\log p}{\log(V+1)}, \qquad \text{where } p = \frac{c(e)}{C(x)} \text{ is the stable share of the event } e.
\]
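This convergence can be checked numerically with the same illustrative cost function used above (a sketch, not the library's code); holding the share \(c(e)/C(x)\) fixed while scaling the counts makes the influence of the pseudo-counts vanish:

```python
import math

def edit_cost(event_count, context_count, vocab_size, k):
    p = (event_count + k) / (context_count + k * (vocab_size + 1))
    return -math.log(p) / math.log(vocab_size + 1)

# Hold the share at 4% and scale the data; costs for k = 0 and k = 1
# both approach the limit -log(0.04) / log(101) ≈ 0.697.
for scale in (1, 10, 100, 1000):
    print(scale,
          round(edit_cost(4 * scale, 100 * scale, 100, k=0.0), 4),
          round(edit_cost(4 * scale, 100 * scale, 100, k=1.0), 4))
```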