Cost Learning Model

The CostLearner class calculates edit costs using a probabilistic model. The cost of an edit operation is defined by its surprisal: a measure of how unlikely that operation is given the training data. This value, the negative log-probability \(-\log(P(e))\), quantifies the information content of observing an event \(e\).

A common, high-probability error will have low surprisal and thus a low cost. A rare, low-probability error will have high surprisal and a high cost.
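
As a minimal, self-contained illustration (not the CostLearner API; the probabilities are made up for the example), surprisal follows directly from an event's probability:

```python
import math

def surprisal(probability: float) -> float:
    """Information content (in nats) of observing an event with the given probability."""
    return -math.log(probability)

# A frequent confusion such as 'l' -> '1' might have probability 0.05,
# while a rare confusion such as 'x' -> 'q' might have probability 0.0001.
print(surprisal(0.05))    # ~3.0  -> low cost
print(surprisal(0.0001))  # ~9.2  -> high cost
```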

Probabilistic Model

The model estimates the probabilities of edit operations and transforms them into normalized, comparable costs. The smoothing parameter \(k\) (set via with_smoothing()) allows a continuous transition between a pure maximum likelihood estimate (\(k = 0\)) and a smoothed Bayesian model (\(k > 0\)).

General Notation

  • \(c(e)\): The observed count of a specific event \(e\). For example, \(c(s \to t)\) is the count of source character \(s\) being substituted by target character \(t\).

  • \(C(x)\): The total count of a specific context character \(x\). For example, \(C(s)\) is the total number of times the source character \(s\) appeared in the OCR outputs.

  • \(V\): The total number of unique characters in the vocabulary.

Probability of an Edit Operation

The model treats all edit operations within the same probabilistic framework. An insertion is modeled as a substitution from a ground-truth character to an “empty” character, and a deletion is a substitution from an OCR character to an empty character.

This means that for any given character (either from the source or the target), there are \(V+1\) possible outcomes: a transformation into any of the \(V\) vocabulary characters or a transformation into an empty character.
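
For example, with a vocabulary \(\{a, b, c\}\) (\(V = 3\)), the source character \(a\) has four possible outcomes: \(a \to a\) (identity), \(a \to b\), \(a \to c\), and \(a \to \varepsilon\) (a deletion).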

The smoothed conditional probability for any edit event \(e\) given a context character \(x\) (where \(x\) is a source character for substitutions/deletions or a target character for insertions) is:

\[P(e|x) = \frac{c(e) + k}{C(x) + k \cdot (V+1)}\]
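
A minimal sketch of this estimator (a hypothetical helper, not the CostLearner API), assuming per-context counts are already available:

```python
def smoothed_probability(count_event: int, count_context: int,
                         vocab_size: int, k: float) -> float:
    """P(e | x) with additive smoothing over the V + 1 possible outcomes."""
    return (count_event + k) / (count_context + k * (vocab_size + 1))

# Example: the source character 's' occurs 200 times and is substituted by 't' 10 times,
# with a vocabulary of 100 characters and k = 1.
p = smoothed_probability(count_event=10, count_context=200, vocab_size=100, k=1.0)
print(p)  # (10 + 1) / (200 + 1 * 101) ≈ 0.0365
```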

Bayesian Interpretation

When \(k > 0\), the parameter acts as the concentration parameter of a symmetric Dirichlet prior distribution. This represents a prior belief that every possible error is equally likely and has a “pseudo-count” of \(k\).
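
Concretely, the smoothed estimate is the posterior mean of the outcome probabilities under a symmetric Dirichlet prior \(\mathrm{Dir}(k, \ldots, k)\) over the \(V+1\) outcomes:

\[\hat{p}(e|x) = \frac{c(e) + k}{C(x) + k \cdot (V+1)}\]

Setting \(k = 0\) recovers the maximum likelihood estimate \(c(e)/C(x)\), while larger values of \(k\) pull every estimate toward the uniform distribution \(1/(V+1)\).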

Normalization

The costs are normalized by a ceiling \(Z\) that depends on the size of the unified outcome space. It is the a priori surprisal of any single event, assuming a uniform probability distribution over all \(V+1\) possible outcomes.

\[Z = -\log(\frac{1}{V+1}) = \log(V+1)\]

This normalization contextualizes the cost relative to the complexity of the character set.
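For example, with natural logarithms, a vocabulary of \(V = 100\) characters gives \(Z = \log(101) \approx 4.615\).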

Final Cost

The final cost \(w(e)\) is the surprisal of the event divided by the normalization ceiling:

\[w(e) = \frac{-\log(P(e|x))}{Z}\]

This cost is a relative measure. Costs can exceed 1.0, which indicates that the observed event was less probable than the uniform a priori assumption.
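
A sketch of the complete computation (hypothetical helpers, not the CostLearner API), using natural logarithms and the same counts as above:

```python
import math

def edit_cost(count_event: int, count_context: int,
              vocab_size: int, k: float) -> float:
    """Normalized surprisal w(e) = -log(P(e|x)) / log(V + 1)."""
    p = (count_event + k) / (count_context + k * (vocab_size + 1))
    return -math.log(p) / math.log(vocab_size + 1)

# A commonly observed confusion: 10 occurrences out of 200 for this context character.
print(edit_cost(10, 200, vocab_size=100, k=1.0))  # ≈ 0.72 (more probable than uniform)

# A never-observed confusion: only the pseudo-count k supports it.
print(edit_cost(0, 200, vocab_size=100, k=1.0))   # ≈ 1.24 (> 1.0, less probable than uniform)
```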

Asymptotic Properties

As the amount of training data grows, the learned cost for an operation with a stable relative frequency \(\text{share} = c(e)/C(x)\) converges to a fixed value that is independent of \(k\):

\[w(e) \approx \frac{-\log(\text{share})}{\log(V+1)}\]
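
The sketch below (reusing the same hypothetical edit_cost helper as above) illustrates this convergence: with the relative frequency held at 5% and \(V = 100\), every choice of \(k\) approaches \(-\log(0.05)/\log(101) \approx 0.649\) as the counts grow.

```python
import math

def edit_cost(count_event: int, count_context: int, vocab_size: int, k: float) -> float:
    p = (count_event + k) / (count_context + k * (vocab_size + 1))
    return -math.log(p) / math.log(vocab_size + 1)

share, vocab_size = 0.05, 100
for total in (100, 10_000, 1_000_000):
    costs = [edit_cost(int(share * total), total, vocab_size, k) for k in (0.1, 1.0, 10.0)]
    print(total, [round(c, 3) for c in costs])
# 100     [0.666, 0.761, 0.933]
# 10000   [0.649, 0.651, 0.666]
# 1000000 [0.649, 0.649, 0.649]
```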