
OCR receipt parser: from photo to structured data

How I built a pipeline that takes a phone photo of any receipt and returns clean JSON — combining classical OCR with an LLM for extraction.

Receipts are a nightmare for computers: every store uses a different format, the fonts vary, the paper warps. Classical OCR gets you the text; the hard part is extracting structured meaning from it.

The pipeline

  • Image preprocessing: deskew, contrast enhancement, binarization
  • Tesseract OCR for raw text extraction
  • LangChain + GPT-4o-mini to extract structured fields
  • JSON schema validation and normalization
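The preprocessing step matters more than it looks, since Tesseract degrades sharply on low-contrast phone photos. As a minimal sketch of the binarization step, here is Otsu's threshold in plain NumPy — the actual pipeline may well use OpenCV's built-in version; this is just to illustrate the idea:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    weights1 = np.cumsum(hist)          # pixel count at or below each threshold
    weights2 = total - weights1         # pixel count above each threshold
    means = np.cumsum(hist * np.arange(256))
    mean1 = np.divide(means, weights1, out=np.zeros(256), where=weights1 > 0)
    mean2 = np.divide(means[-1] - means, weights2,
                      out=np.zeros(256), where=weights2 > 0)
    variance = weights1 * weights2 * (mean1 - mean2) ** 2
    return int(np.argmax(variance))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a black-and-white image: text pixels 0, paper background 255."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

Otsu works well on receipts because the histogram is strongly bimodal: dark ink against bright paper.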

The LLM step is the key insight: instead of writing brittle regex patterns for every receipt format, you describe the schema you want and let the model figure out where the fields are. The prompt includes examples and the raw OCR text.

```python
import json

# `chain` is the LangChain chain (prompt | model | parser) built earlier.
schema = {
    "merchant": "string",
    "date": "ISO 8601 date",
    "total": "number",
    "items": [{"name": "string", "price": "number"}],
}

result = chain.invoke({
    "ocr_text": raw_text,
    "schema": json.dumps(schema),
})
```
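The final validation step guards against model drift: the LLM's JSON is checked against the schema and its numeric fields coerced into proper types before anything is stored. A minimal hand-rolled normalizer — the field names mirror the schema above, and a real pipeline might reach for `pydantic` instead:

```python
from datetime import date

def normalize_receipt(data: dict) -> dict:
    """Validate and coerce the LLM's output into clean, typed fields."""
    out = {
        "merchant": str(data["merchant"]).strip(),
        "date": date.fromisoformat(data["date"]).isoformat(),  # raises on bad dates
        "total": round(float(data["total"]), 2),
        "items": [
            {"name": str(it["name"]).strip(), "price": round(float(it["price"]), 2)}
            for it in data["items"]
        ],
    }
    # Sanity check: line items should roughly sum to the printed total.
    items_sum = sum(it["price"] for it in out["items"])
    out["items_match_total"] = abs(items_sum - out["total"]) < 0.01
    return out
```

A `ValueError` here (bad date, non-numeric price) is a useful signal to re-prompt the model or flag the receipt for review.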

Accuracy and edge cases

On a test set of 200 real receipts, field extraction accuracy was 94% for merchant name, 97% for total, and 89% for individual line items. The failures were mostly handwritten receipts and severely wrinkled paper — areas where the preprocessing step still needs work.
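Per-field numbers like these can come from a simple exact-match comparison against hand-labeled ground truth. A sketch of such a scorer — the case-insensitive exact-match rule here is my assumption, not necessarily what was used:

```python
def field_accuracy(predictions: list[dict], labels: list[dict], field: str) -> float:
    """Fraction of receipts where the predicted field matches the label exactly
    (case-insensitive, whitespace-trimmed)."""
    matches = sum(
        1 for pred, gold in zip(predictions, labels)
        if str(pred.get(field, "")).strip().lower() == str(gold[field]).strip().lower()
    )
    return matches / len(labels)
```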

The combination of deterministic preprocessing and probabilistic extraction is more robust than either approach alone.
