=====================
Cost Learning Model
=====================

The ``CostLearner`` class calculates edit costs using a probabilistic model. The cost
of an edit operation is defined by its **surprisal**: a measure of how unlikely that
event is based on the training data. This value, derived from the negative
log-likelihood :math:`-\log(P(e))`, quantifies the information contained in observing
an event :math:`e`. A common, high-probability error will have low surprisal and thus
a low cost. A rare, low-probability error will have high surprisal and a high cost.

-------------------
Probabilistic Model
-------------------

The model estimates the probability of edit operations and transforms them into
normalized, comparable costs. The smoothing parameter :math:`k` (set via
``with_smoothing()``) allows for a continuous transition between a Maximum Likelihood
Estimation and a smoothed Bayesian model.

General Notation
~~~~~~~~~~~~~~~~

- :math:`c(e)`: The observed count of a specific event :math:`e`. For example,
  :math:`c(s \to t)` is the count of source character :math:`s` being substituted by
  target character :math:`t`.
- :math:`C(x)`: The total count of a specific context character :math:`x`. For
  example, :math:`C(s)` is the total number of times the source character :math:`s`
  appeared in the OCR outputs.
- :math:`V`: The total number of unique characters in the vocabulary.

Probability of an Edit Operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The model treats all edit operations within the same probabilistic framework. An
insertion is modeled as a substitution from a ground-truth character to an "empty"
character, and a deletion as a substitution from an OCR character to an empty
character. This means that for any given character (either from the source or the
target), there are :math:`V+1` possible outcomes: a transformation into any of the
:math:`V` vocabulary characters, or a transformation into the empty character.

The smoothed conditional probability of an edit event :math:`e` given a context
character :math:`x` (where :math:`x` is a source character for
substitutions/deletions or a target character for insertions) is:

.. math::

   P(e|x) = \frac{c(e) + k}{C(x) + k \cdot (V+1)}

Bayesian Interpretation
~~~~~~~~~~~~~~~~~~~~~~~

When :math:`k > 0`, the parameter acts as the concentration parameter of a
**symmetric Dirichlet prior distribution**. This represents a prior belief that every
possible error is equally likely, each with a "pseudo-count" of :math:`k`.

Normalization
~~~~~~~~~~~~~

The costs are normalized by a ceiling :math:`Z` that depends on the size of the
unified outcome space. It is the a priori surprisal of any single event under a
uniform probability distribution over all :math:`V+1` possible outcomes:

.. math::

   Z = -\log\left(\frac{1}{V+1}\right) = \log(V+1)

This normalization contextualizes the cost relative to the complexity of the
character set.

Final Cost
~~~~~~~~~~

The final cost :math:`w(e)` is the surprisal of the event divided by the
normalization ceiling:

.. math::

   w(e) = \frac{-\log(P(e|x))}{Z}

This cost is a relative measure. Costs can be greater than 1.0, which indicates that
the observed event was less probable than the uniform a priori assumption.

Asymptotic Properties
~~~~~~~~~~~~~~~~~~~~~

As the amount of training data grows, the learned cost for an operation with a stable
relative frequency ("share") converges to a fixed value that is independent of
:math:`k`:

.. math::

   w(e) \approx \frac{-\log(\text{share})}{\log(V+1)}
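
The full cost computation can be reproduced in a few lines. The sketch below is
illustrative only: the function name ``edit_cost``, its parameters, and the example
counts are assumptions made for the demonstration and are not part of the
``CostLearner`` API.

.. code-block:: python

   import math

   def edit_cost(count_event, count_context, vocab_size, k=1.0):
       """Normalized surprisal cost of a single edit event (illustrative sketch).

       count_event   -- c(e): observed count of this specific edit
       count_context -- C(x): total count of the context character x
       vocab_size    -- V: number of unique characters in the vocabulary
       k             -- smoothing parameter (Dirichlet pseudo-count)
       """
       # Smoothed conditional probability P(e|x) over the V+1 possible outcomes
       p = (count_event + k) / (count_context + k * (vocab_size + 1))
       # Normalization ceiling Z = log(V+1): surprisal of a uniform outcome
       z = math.log(vocab_size + 1)
       # Final cost w(e) = -log(P(e|x)) / Z; values above 1.0 mean the event
       # was rarer than the uniform a priori assumption
       return -math.log(p) / z

   # Hypothetical numbers: a substitution observed 40 times among 1000
   # occurrences of the context character, with a 60-character vocabulary.
   print(edit_cost(40, 1000, 60))    # common error -> cost below 1.0
   print(edit_cost(0, 1000, 60))     # unseen error -> cost above 1.0

   # Asymptotic behaviour: scaling both counts by 10 keeps the share at 4%,
   # and the cost approaches -log(0.04) / log(61) regardless of k.
   print(edit_cost(400, 10000, 60))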