End-to-End Workflow: Training Costs and Matching Products
This guide demonstrates a complete, production-oriented workflow for correcting OCR data. The process is split into two phases:
Phase 1 (One-Time Setup): We learn the specific OCR error costs from a sample of our own data. These learned costs are then saved, for example to a JSON file.
Phase 2 (Runtime Application): In our main application, we load the pre-trained costs and use an optimized batch process to quickly and accurately match new OCR scans against our product database.
Phase 1 (One-Time Setup): Learning Costs
First, we collect a representative set of (OCR string, correct ground truth) pairs. Using this data, we can train a WeightedLevenshtein instance to learn the probabilities of our specific OCR engine’s error patterns.
import json
from ocr_stringdist import WeightedLevenshtein
# A sample of observed OCR results and their correct counterparts.
training_data = [
    ("SKU-B0O-BTR", "SKU-800-BTR"),  # B -> 8, O -> 0
    ("SKU-5A1-HIX", "SKU-5A1-MIX"),  # H -> M
    ("SKU-B01-SGR", "SKU-B01-SGR"),  # Include correct examples as well
    # ... add more data for better results
]
wl_trained = WeightedLevenshtein.learn_from(training_data)
# Insertion and deletion costs are handled similarly
learned_costs = wl_trained.substitution_costs
print(f"Learned cost for ('B', '8'): {learned_costs.get(('B', '8')):.4f}")
print(f"Learned cost for ('O', '0'): {learned_costs.get(('O', '0')):.4f}")
# Save the learned costs to a file for later use in our application.
with open("ocr_costs.json", "w") as f:
json.dump(wl_trained.to_dict(), f, indent=2)
The saved ocr_costs.json file can then be deployed with your application, possibly after a manual review of the learned costs.
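Before deploying the file, it is worth reloading it and spot-checking the model on a few held-out pairs that were not used for training. The snippet below is a minimal sketch: the held-out pair is illustrative, and it only reuses the from_dict and batch_distance calls shown elsewhere in this guide.
import json
from ocr_stringdist import WeightedLevenshtein

# Reload the costs exactly as the application will at startup.
with open("ocr_costs.json") as f:
    wl_check = WeightedLevenshtein.from_dict(json.load(f))

# A hypothetical held-out pair: an OCR reading and its known ground truth.
held_out = [("SKU-8O0-BTR", "SKU-800-BTR")]
for ocr_string, truth in held_out:
    distance = wl_check.batch_distance(ocr_string, candidates=[truth])[0]
    print(f"{ocr_string!r} vs. {truth!r}: learned distance = {distance:.4f}")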
Phase 2 (Runtime Application): Finding the Best Match
In the live application, we load the pre-computed costs once at startup. When a new scan comes in, the optimized batch_distance method scores it against every candidate code in a single pass.
The Scenario
We use the same product database as before and receive a new, imperfect OCR scan.
Product Code | Description | Price | Sales Rank
---|---|---|---
SKU-800-BTR | 800W Power Blender | 119.95 | 1
SKU-B01-SGR | Cold Press Juicer | 149.50 | 3
SKU-5A1-MIX | 5-Speed Hand Mixer | 49.99 | 2
Scanned Code: “SKU-B0O-BTR” (Errors: B instead of 8, O instead of 0)
Scanned Price: 119.95
import json
import math
from dataclasses import dataclass
from ocr_stringdist import WeightedLevenshtein
# Setup: Load Data and Pre-trained Costs
@dataclass
class Product:
    code: str
    description: str
    price: float
    sales_rank: int
db_products = [
    Product(code="SKU-800-BTR", description="800W Power Blender", price=119.95, sales_rank=1),
    Product(code="SKU-B01-SGR", description="Cold Press Juicer", price=149.50, sales_rank=3),
    Product(code="SKU-5A1-MIX", description="5-Speed Hand Mixer", price=49.99, sales_rank=2),
]
# Load configuration
with open("ocr_costs.json") as f:
wl = WeightedLevenshtein.from_dict(json.load(f))
# Correction Logic for a New Scan
ocr_code = "SKU-B0O-BTR"
ocr_price = 119.95
# Calculate all string distances in a single, optimized batch operation.
string_distances = wl.batch_distance(ocr_code, candidates=[p.code for p in db_products])
# Calculate other costs, like a price mismatch penalty.
price_penalties = [0.0 if p.price == ocr_price else 1.0 for p in db_products]
# Our source model: rarely sold products are a priori less likely matches
source_costs = [math.log(p.sales_rank) for p in db_products]
# Combine costs to get a final score for each candidate.
total_costs = [d + p + s for d, p, s in zip(string_distances, price_penalties, source_costs)]
# Find the candidate with the minimum total cost.
min_cost = min(total_costs)
best_product = db_products[total_costs.index(min_cost)]
print(f"OCR Scan (Code): '{ocr_code}', (Price): {ocr_price}\n")
print(f"Best Match Found: {best_product}")
print(f"Confidence Score (Lower is Better): {min_cost:.2f}")
This workflow is efficient and robust: the heavy lifting of cost learning happens offline, while runtime matching uses an optimized batch process that combines multiple sources of evidence (string similarity, price, and sales rank) into a single score.
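If this matching step is needed in several places, it can be packaged as a small helper. The sketch below simply wraps the runtime logic from the example above into one function; the name best_match, the equal weighting of the three cost terms, and the reuse of the Product dataclass and variables defined earlier are illustrative choices, not part of ocr_stringdist.
import math
from ocr_stringdist import WeightedLevenshtein

def best_match(
    wl: WeightedLevenshtein,
    ocr_code: str,
    ocr_price: float,
    products: list[Product],
) -> tuple[Product, float]:
    """Return the lowest-cost product and its combined cost."""
    string_distances = wl.batch_distance(ocr_code, candidates=[p.code for p in products])
    price_penalties = [0.0 if p.price == ocr_price else 1.0 for p in products]
    source_costs = [math.log(p.sales_rank) for p in products]
    totals = [d + pp + sc for d, pp, sc in zip(string_distances, price_penalties, source_costs)]
    best_index = min(range(len(totals)), key=totals.__getitem__)
    return products[best_index], totals[best_index]

product, cost = best_match(wl, ocr_code, ocr_price, db_products)
print(f"Best match: {product.code} (total cost {cost:.2f})")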