
OCR receipt parser: from photo to structured data

How I built a pipeline that takes a phone photo of any receipt and returns clean JSON — combining classical OCR with an LLM for extraction.

Receipts are a nightmare for computers: every store uses a different format, the fonts vary, the paper warps. Classical OCR gets you the text; the hard part is extracting structured meaning from it.

The pipeline

  • Image preprocessing: deskew, contrast enhancement, binarization
  • Tesseract OCR for raw text extraction
  • LangChain + GPT-4o-mini to extract structured fields
  • JSON schema validation and normalization
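The preprocessing step matters more than it looks, since Tesseract degrades sharply on low-contrast phone photos. As a minimal sketch of the binarization step, here is Otsu's threshold in plain NumPy — the actual pipeline may well use OpenCV's built-in version; this is just to illustrate the idea:

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    weights1 = np.cumsum(hist)          # pixel count at or below each threshold
    weights2 = total - weights1         # pixel count above each threshold
    means = np.cumsum(hist * np.arange(256))
    mean1 = np.divide(means, weights1, out=np.zeros(256), where=weights1 > 0)
    mean2 = np.divide(means[-1] - means, weights2,
                      out=np.zeros(256), where=weights2 > 0)
    variance = weights1 * weights2 * (mean1 - mean2) ** 2
    return int(np.argmax(variance))

def binarize(gray: np.ndarray) -> np.ndarray:
    """Return a black-and-white image: text pixels 0, paper background 255."""
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

Otsu works well on receipts because the histogram is strongly bimodal: dark ink against bright paper.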

The LLM step is the key insight: instead of writing brittle regex patterns for every receipt format, you describe the schema you want and let the model figure out where the fields are. The prompt includes examples and the raw OCR text.

```python
import json

# `chain` is the LangChain chain (prompt | model | parser) built earlier.
schema = {
    "merchant": "string",
    "date": "ISO 8601 date",
    "total": "number",
    "items": [{"name": "string", "price": "number"}],
}

result = chain.invoke({
    "ocr_text": raw_text,
    "schema": json.dumps(schema),
})
```
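The final validation step guards against model drift: the LLM's JSON is checked against the schema and its numeric fields coerced into proper types before anything is stored. A minimal hand-rolled normalizer — the field names mirror the schema above, and a real pipeline might reach for `pydantic` instead:

```python
from datetime import date

def normalize_receipt(data: dict) -> dict:
    """Validate and coerce the LLM's output into clean, typed fields."""
    out = {
        "merchant": str(data["merchant"]).strip(),
        "date": date.fromisoformat(data["date"]).isoformat(),  # raises on bad dates
        "total": round(float(data["total"]), 2),
        "items": [
            {"name": str(it["name"]).strip(), "price": round(float(it["price"]), 2)}
            for it in data["items"]
        ],
    }
    # Sanity check: line items should roughly sum to the printed total.
    items_sum = sum(it["price"] for it in out["items"])
    out["items_match_total"] = abs(items_sum - out["total"]) < 0.01
    return out
```

A `ValueError` here (bad date, non-numeric price) is a useful signal to re-prompt the model or flag the receipt for review.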

Accuracy and edge cases

On a test set of 200 real receipts, field extraction accuracy was 94% for merchant name, 97% for total, and 89% for individual line items. The failures were mostly handwritten receipts and severely wrinkled paper — areas where the preprocessing step still needs work.
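Per-field numbers like these can come from a simple exact-match comparison against hand-labeled ground truth. A sketch of such a scorer — the case-insensitive exact-match rule here is my assumption, not necessarily what was used:

```python
def field_accuracy(predictions: list[dict], labels: list[dict], field: str) -> float:
    """Fraction of receipts where the predicted field matches the label exactly
    (case-insensitive, whitespace-trimmed)."""
    matches = sum(
        1 for pred, gold in zip(predictions, labels)
        if str(pred.get(field, "")).strip().lower() == str(gold[field]).strip().lower()
    )
    return matches / len(labels)
```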

The combination of deterministic preprocessing and probabilistic extraction is more robust than either approach alone.
