Fully local AI pipeline that extracts 29 invoice fields (including line items) from PDFs and scanned images — zero cloud, zero data leakage.
Indian SMBs process hundreds of invoices monthly — from digital PDFs to WhatsApp-photographed paper bills. Cloud OCR solutions (AWS Textract, Google Document AI) send financial documents to third-party servers. For businesses with confidential vendor relationships, GSTIN data, and payment terms, this is a non-starter. Local alternatives either lack vision OCR for scanned documents or require complex infrastructure.
Built a dual-path extraction pipeline with intelligent automatic routing. Text-extractable PDFs go through a fast text LLM path (~2s). Scanned PDFs and images fall back to a vision LLM using a robust Two-Pass approach: Pass 1 captures headers and totals, Pass 2 captures complex line items and tax summaries. This bypasses the context timeouts typical of local vision models on dense documents.
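The routing and two-pass merge described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `choose_path` and `merge_passes`, the field names, and the "empty values never overwrite" merge rule are all assumptions. The 120-character default mirrors the threshold stated for extractor.py.

```python
from typing import Any

# Assumed default: the character threshold stated for extractor.py.
MIN_TEXT_CHARS = 120


def choose_path(extracted_text: str, min_chars: int = MIN_TEXT_CHARS) -> str:
    """Return 'text' when the PDF yielded enough embedded text, else 'vision'."""
    return "text" if len(extracted_text.strip()) >= min_chars else "vision"


def merge_passes(pass1: dict[str, Any], pass2: dict[str, Any]) -> dict[str, Any]:
    """Combine pass 1 (headers, totals) with pass 2 (line items, tax summary).

    Hypothetical rule: an empty value (None, "", []) from pass 2 never
    overwrites a value that pass 1 already populated.
    """
    merged = dict(pass1)
    for key, value in pass2.items():
        if value not in (None, "", []):
            merged[key] = value
        else:
            merged.setdefault(key, value)
    return merged
```

Keeping the routing decision as a pure function of the extracted text makes it easy to test without loading a PDF or a model.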
extractor.py: PyMuPDF attempts text extraction first; a document counts as text-extractable when it yields at least 120 characters. The text path calls Qwen2.5:1.5b via the Ollama generate API. The vision path calls Qwen2.5-VL:3b via the Ollama chat API, running two distinct prompts over up to 3 pages and keeping the highest-confidence merged result. A field validation layer applies regex validation of the 15-character GSTIN, ISO 8601 date normalisation (DD/MM/YYYY → YYYY-MM-DD), and amount cleanup. Each result carries a confidence score (0.0–1.0) weighted by field importance. Results are stored in SQLite under a unique index, with line_items serialised as a JSON string. api.py: FastAPI 2.0 REST layer with single extract, batch (up to 20 files), paginated list, CSV export (UTF-8 BOM), and stats endpoints. app.py: 4-tab Gradio UI with Extract (live JSON + line items table), History, Stats, and About tabs.
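A field validation layer of this kind might look like the sketch below. The helper names are hypothetical, and the GSTIN pattern is the commonly used structural check (2-digit state code, 10-character PAN, entity code, literal Z, check character). It verifies format only and does not compute the checksum; the project's real regex may differ.

```python
import re
from datetime import datetime
from typing import Optional

# Structural GSTIN check only (assumed pattern); the checksum is not verified.
GSTIN_RE = re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]")
# First numeric run in the string, allowing thousands separators.
AMOUNT_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def is_valid_gstin(value: str) -> bool:
    """True when the value has the 15-character GSTIN layout."""
    return GSTIN_RE.fullmatch(value.strip().upper()) is not None


def normalise_date(value: str) -> Optional[str]:
    """Convert DD/MM/YYYY (or an already ISO date) to YYYY-MM-DD, else None."""
    for fmt in ("%d/%m/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_amount(value: str) -> Optional[float]:
    """Strip currency markers and separators, e.g. Indian-style 1,23,456.78."""
    match = AMOUNT_RE.search(value)
    return float(match.group().replace(",", "")) if match else None
```

Returning `None` instead of raising lets the caller treat an unparseable field as missing, which fits naturally with a weighted confidence score.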
29 structured fields extracted per invoice (vs 10 in v2). Successfully isolates line-item rows (product, HSN/SAC, quantity, unit price, discounts, individual tax slabs). Multi-page two-pass vision OCR increases extraction depth without timing out, scaling to dense industrial invoices. Confidence scoring lets downstream systems flag low-confidence extractions for review. Zero bytes of sensitive data leave the machine.
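As an illustration of how an importance-weighted confidence score could drive downstream flagging, here is a small sketch. The field names, the weights, and the 0.75 review threshold are all hypothetical; the write-up states only that the score ranges from 0.0 to 1.0 and is weighted by field importance.

```python
from typing import Any, Mapping

# Hypothetical weights: more important fields contribute more to the score.
FIELD_WEIGHTS: dict[str, float] = {
    "invoice_number": 3.0,
    "total_amount": 3.0,
    "vendor_gstin": 2.0,
    "invoice_date": 1.0,
    "line_items": 1.0,
}


def confidence_score(
    fields: Mapping[str, Any],
    weights: Mapping[str, float] = FIELD_WEIGHTS,
) -> float:
    """Weighted share of populated fields, in the range 0.0 to 1.0."""
    total = sum(weights.values())
    found = sum(
        weight
        for name, weight in weights.items()
        if fields.get(name) not in (None, "", [])
    )
    return round(found / total, 2)


def needs_review(fields: Mapping[str, Any], threshold: float = 0.75) -> bool:
    """Flag an extraction whose score falls below the (assumed) threshold."""
    return confidence_score(fields) < threshold
```

A downstream system could route flagged invoices to a manual check while accepting the rest automatically.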
