⚡ Architecture 04 · Intermediate

Document Processing

Automated document ingestion and structured extraction at scale. Parse PDFs, images, and raw text into clean data, then use LLMs to summarize, classify, and extract structured entities — producing reliable JSON output ready for downstream databases and APIs.

PDF & OCR Parsing · Structured Output · Multimodal Extraction · Batch Processing
1

Architecture Overview

The Document Processing architecture automates the extraction of structured information from unstructured documents. It combines traditional parsing tools (PDF libraries, OCR engines) with LLM-powered extraction to convert messy real-world documents into clean, typed data structures.

This pattern is essential for any business that processes invoices, contracts, resumes, medical records, legal filings, or research papers at scale. The key insight is that LLMs excel at understanding document semantics, while traditional tools handle document mechanics (rendering, OCR, layout).

When to Use

  • Invoice and receipt processing (extract line items, totals, vendor info)
  • Contract analysis (identify clauses, obligations, dates, parties)
  • Resume parsing (extract skills, experience, education into structured profiles)
  • Medical record extraction (diagnoses, medications, lab results)
  • Compliance document review (flag missing sections, policy violations)
  • Batch processing of document backlogs for data migration

Complexity Level

Moderate to High. The parsing layer requires handling diverse document formats, OCR quality issues, and layout variations. The LLM extraction layer requires careful schema design and output validation. Error handling is critical because real-world documents are messy.

Tip

For documents with complex tables and layouts, consider multimodal models (send page images directly) rather than trying to extract text first. Vision models often handle formatting that text extraction completely mangles.

2

Architecture Diagram

Documents (PDF, Images, TXT / DOCX) → Parser (OCR / PyPDF, Unstructured, Document AI layout detection) → Chunker (section splitter) → Summarizer (LLM summary) / Classifier (LLM categorize) / Extractor (LLM entities) → Structured JSON → Database (SQL / NoSQL)

Architecture diagram — Document Processing: parse, chunk, extract, and store structured data

3

Components Deep Dive

Document Parsing Libraries

Library | Format | Strengths | Limitations
PyPDF2 / pypdf | PDF (text) | Fast, lightweight, no dependencies | No OCR, poor with scanned PDFs
pdfplumber | PDF (tables) | Excellent table extraction | Slower, text-layer PDFs only
Unstructured | All formats | Unified API, layout detection | Heavy dependency tree
Tesseract | Images/scans | Free, widely supported OCR | Quality varies, needs preprocessing
Google Document AI | All formats | Best-in-class OCR + layout | Cloud-only, costs per page
Amazon Textract | PDF, images | Table + form extraction | AWS-only, pricing per page

PDF Parsing

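Start with pypdf (the maintained successor to PyPDF2) for text-layer PDFs. Use pdfplumber for table-heavy docs. Fall back to OCR (Tesseract or Document AI) for scanned documents. Always check whether the PDF has a text layer before choosing a route, because running OCR on a PDF that already contains text adds cost and can introduce recognition errors.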
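
The text-layer check can be sketched as a simple heuristic over the text extracted from each page. This is a minimal sketch, not a definitive test: the function names and both thresholds are illustrative assumptions, and `page_texts` is assumed to hold one string per page (for example the result of `page.extract_text() or ""` in pdfplumber).

```python
def has_text_layer(
    page_texts: list[str],
    min_chars_per_page: int = 20,   # assumed threshold, tune on real documents
    min_page_ratio: float = 0.5,    # assumed threshold, tune on real documents
) -> bool:
    """Heuristic: treat the PDF as text-based if enough pages yield real text."""
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_ratio


def choose_parser(page_texts: list[str]) -> str:
    """Return the parsing route for this document: 'text' or 'ocr'."""
    return "text" if has_text_layer(page_texts) else "ocr"
```

A document whose pages return only empty strings or stray characters is routed to OCR; mixed documents may need a per-page decision instead of a per-document one.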


Multimodal Extraction

Send page images directly to vision-capable models (GPT-4V, Claude 3) for complex layouts. Bypasses OCR entirely. Especially powerful for forms, receipts, and handwritten documents.


Structured Output

Use JSON mode or Pydantic models to enforce schema. Define expected fields, types, and validation rules upfront. Retry with error feedback when output doesn't match schema.
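
Retrying with error feedback can look roughly like the sketch below. Both `call_llm` (a function that takes a prompt and returns the model's reply text) and `validate` (a function that raises `ValueError` on bad data) are hypothetical placeholders; in this architecture `validate` could be a Pydantic method such as `Invoice.model_validate`, assuming its validation error is a `ValueError` subclass as in Pydantic v2.

```python
import json
from typing import Any, Callable, Optional


def extract_with_retry(
    call_llm: Callable[[str], str],      # hypothetical: prompt in, reply text out
    validate: Callable[[dict], Any],     # hypothetical: raises ValueError if invalid
    prompt: str,
    max_attempts: int = 3,
) -> Any:
    """Call the model, validate its JSON, and retry with the error appended."""
    current_prompt = prompt
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            # json.JSONDecodeError is a subclass of ValueError
            return validate(json.loads(raw))
        except (ValueError, TypeError) as exc:
            last_error = exc
            # Feed the failure back so the next attempt can correct it
            current_prompt = (
                f"{prompt}\n\nYour previous reply was rejected: {exc}\n"
                "Return corrected JSON only."
            )
    raise RuntimeError(f"No valid output after {max_attempts} attempts") from last_error
```

Capping the number of attempts keeps cost bounded; documents that still fail are candidates for the human review queue.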

Batch Processing

Process documents in parallel with async/await or job queues (Celery, Bull). Implement progress tracking, error recovery, and partial result storage for large batches.
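
Progress tracking, error recovery, and partial result storage can be sketched with an append-only JSON Lines file. This is an illustration under assumptions: `process` is a hypothetical per-document function returning JSON-serializable data, the file layout is invented for the example, and the loop is sequential to keep the sketch short.

```python
import json
from pathlib import Path
from typing import Callable


def load_done(results_path: Path) -> set[str]:
    """Return the file paths that already have a stored result."""
    if not results_path.exists():
        return set()
    with results_path.open(encoding="utf-8") as f:
        return {json.loads(line)["file_path"] for line in f if line.strip()}


def run_batch(
    file_paths: list[str],
    process: Callable[[str], dict],   # hypothetical per-document function
    results_path: Path,
) -> int:
    """Process pending files, appending one JSON line per file as it finishes."""
    done = load_done(results_path)
    pending = [p for p in file_paths if p not in done]
    with results_path.open("a", encoding="utf-8") as out:
        for i, path in enumerate(pending, start=1):
            try:
                record = {"file_path": path, "status": "success", "data": process(path)}
            except Exception as exc:  # record the failure and keep going
                record = {"file_path": path, "status": "error", "error": str(exc)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep partial results if the job is interrupted
            print(f"[{i}/{len(pending)}] {path}: {record['status']}")
    return len(pending)
```

Rerunning the same batch skips every file that already has a record, including failed ones; retrying failures would need a separate pass that filters on `status`.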

Error Handling

Documents fail in surprising ways: corrupted PDFs, password-protected files, unsupported encodings, empty pages. Build robust fallback chains and quarantine problematic files.
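
A fallback chain with quarantine might be sketched as follows. The extractor callables, the quarantine directory, and the `.errors.txt` note file are all assumptions made for illustration; real extractors would wrap the parsing libraries listed above.

```python
import shutil
from pathlib import Path
from typing import Callable

Extractor = Callable[[Path], str]  # hypothetical: takes a file, returns its text


def extract_with_fallbacks(
    path: Path,
    extractors: list[tuple[str, Extractor]],
    quarantine_dir: Path,
) -> tuple[str, str]:
    """Try each extractor in order; quarantine the file if all of them fail.

    Returns (extractor_name, text) from the first extractor that yields text.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
            if text.strip():
                return name, text
            errors.append(f"{name}: empty output")
        except Exception as exc:  # corrupted, encrypted, bad encoding, ...
            errors.append(f"{name}: {exc}")

    # Nothing worked: move the file aside with a note explaining why
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    target = quarantine_dir / path.name
    shutil.move(str(path), str(target))
    note = target.with_name(target.name + ".errors.txt")
    note.write_text("\n".join(errors), encoding="utf-8")
    raise RuntimeError(f"All extractors failed for {path.name}; moved to {target}")
```

Treating empty output as a failure matters here: a scanned PDF often parses without raising an error but yields no text, and that case should still fall through to OCR.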


Data Validation

Validate extracted data against business rules: required fields present, dates are valid, amounts sum correctly, cross-reference entities. Flag low-confidence extractions for human review.
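
The business-rule checks listed above could be sketched like this. It assumes the extracted invoice is a plain dict with the field names used in this architecture's invoice example and ISO 8601 date strings; the `check_invoice` name and the rounding tolerance are illustrative choices.

```python
from datetime import date
from decimal import Decimal


def check_invoice(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of business-rule violations (empty means the invoice passed)."""
    problems = []

    for field in ("invoice_number", "vendor_name", "invoice_date", "total"):
        if invoice.get(field) in (None, ""):
            problems.append(f"missing required field: {field}")
    if problems:
        return problems  # the checks below need these fields

    # Convert via str() so float inputs do not carry binary rounding noise
    items_sum = sum(
        (Decimal(str(item["total"])) for item in invoice.get("line_items", [])),
        Decimal("0"),
    )
    subtotal = Decimal(str(invoice.get("subtotal", items_sum)))
    tax = Decimal(str(invoice.get("tax", 0)))
    total = Decimal(str(invoice["total"]))

    if abs(items_sum - subtotal) > tolerance:
        problems.append(f"line items sum to {items_sum}, subtotal is {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        problems.append(f"subtotal + tax = {subtotal + tax}, total is {total}")

    try:
        invoice_date = date.fromisoformat(invoice["invoice_date"])
        due = invoice.get("due_date")
        if due and date.fromisoformat(due) < invoice_date:
            problems.append("due_date is earlier than invoice_date")
    except ValueError as exc:
        problems.append(f"invalid date: {exc}")
    return problems
```

Any non-empty result is a reasonable trigger for the human review queue, independent of the confidence the model reported.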

Multimodal vs. OCR Pipeline

For high-volume, simple documents (receipts, invoices), OCR + text extraction is cheaper. For complex, variable layouts (contracts, medical forms), multimodal models are more accurate and require less engineering. Calculate cost per page for your volume.
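
The cost-per-page comparison can be sketched as simple arithmetic. The model below is an assumption, not a pricing formula from any provider: it adds the expected cost of human review (review rate times review cost) to the processing cost, and every number passed in should come from your own measurements and your vendor's current price list.

```python
def effective_cost_per_page(
    processing_cost: float, review_rate: float, review_cost: float
) -> float:
    """Processing cost plus the expected cost of human review, per page."""
    return processing_cost + review_rate * review_cost


def cheapest_pipeline(
    pages_per_month: int, pipelines: dict[str, dict[str, float]]
) -> tuple[str, float]:
    """Return (name, monthly cost) of the cheapest pipeline at this volume."""
    monthly = {
        name: pages_per_month * effective_cost_per_page(**params)
        for name, params in pipelines.items()
    }
    best = min(monthly, key=monthly.get)
    return best, monthly[best]
```

Including review cost is what can make a more expensive but more accurate pipeline the cheaper option overall: a route that sends fewer pages to human review may win even when its per-page processing price is higher.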

4

Implementation

Step 1: Parse a PDF Document

import pdfplumber
from pathlib import Path

def parse_pdf(file_path: str) -> list[dict]:
    """Extract text and tables from a PDF, page by page."""
    pages = []
    with pdfplumber.open(file_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            tables = page.extract_tables() or []

            # Convert tables to structured format
            parsed_tables = []
            for table in tables:
                if table and len(table) > 1:
                    headers = [h.strip() if h else "" for h in table[0]]
                    rows = [
                        dict(zip(headers, row))
                        for row in table[1:]
                    ]
                    parsed_tables.append(rows)

            pages.append({
                "page": i + 1,
                "text": text,
                "tables": parsed_tables,
                "has_content": bool(text.strip()),
            })
    return pages

Step 2: Define Extraction Schema with Pydantic

from pydantic import BaseModel, Field
from typing import Optional
from datetime import date

class LineItem(BaseModel):
    description: str = Field(..., description="Item description")
    quantity: float = Field(..., gt=0)  # float so fractional quantities (e.g. 1.5 hours) validate
    unit_price: float = Field(..., ge=0)
    total: float = Field(..., ge=0)

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: date
    due_date: Optional[date] = None
    line_items: list[LineItem]
    subtotal: float
    tax: float = 0.0
    total: float
    currency: str = "USD"
    confidence: float = Field(..., ge=0, le=1, description="Extraction confidence")

Step 3: LLM Extraction with Structured Output

import anthropic
import json

client = anthropic.Anthropic()

def extract_invoice(document_text: str) -> Invoice:
    """Extract structured invoice data using LLM."""
    schema_json = json.dumps(Invoice.model_json_schema(), indent=2)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=f"""You are a document extraction specialist.
Extract invoice data from the provided text and return valid JSON
matching this exact schema:

{schema_json}

Rules:
- Extract ALL line items found in the document
- Use ISO 8601 date format (YYYY-MM-DD)
- Set confidence between 0 and 1 based on extraction certainty
- If a field is unclear, make your best guess and lower confidence
- Return ONLY valid JSON, no markdown or explanation""",
        messages=[{
            "role": "user",
            "content": f"Extract invoice data:\n\n{document_text}"
        }],
        temperature=0.0,
    )

    # Parse and validate with Pydantic
    raw = json.loads(response.content[0].text)
    return Invoice.model_validate(raw)
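
Even when told to return only JSON, a model may occasionally wrap its reply in a Markdown code fence, which makes `json.loads` fail. A small tolerant parser, sketched here under the hypothetical name `parse_json_reply`, could be applied to the reply text before validation.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker, built this way to keep the source readable
_FENCED = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$", re.DOTALL
)


def parse_json_reply(reply: str) -> dict:
    """Parse a model reply as JSON, tolerating a surrounding Markdown code fence."""
    text = reply.strip()
    match = _FENCED.match(text)
    if match:
        text = match.group(1)  # keep only what is inside the fence
    return json.loads(text)
```

This only handles a single fence around the whole reply; replies with explanatory prose around the JSON still raise an error and are better handled by retrying.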

Step 4: Multimodal Extraction (Send Image Directly)

import base64

def extract_from_image(image_path: str) -> Invoice:
    """Extract invoice data from an image using vision model."""
    with open(image_path, "rb") as f:
        img_b64 = base64.standard_b64encode(f.read()).decode()

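    ext = Path(image_path).suffix.lower()
    media_types = {
        ".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".gif": "image/gif", ".webp": "image/webp",
    }
    if ext not in media_types:
        # Fail early instead of mislabeling an unknown file as PNG
        raise ValueError(f"Unsupported image type: {ext}")
    media_type = media_types[ext]

    # Include the schema so the reply can be validated against Invoice
    schema_json = json.dumps(Invoice.model_json_schema(), indent=2)

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=f"Extract invoice data as JSON matching this schema:\n{schema_json}\nReturn only valid JSON.",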
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": media_type,
                    "data": img_b64,
                }},
                {"type": "text", "text": "Extract all invoice data from this image."}
            ]
        }],
        temperature=0.0,
    )

    raw = json.loads(response.content[0].text)
    return Invoice.model_validate(raw)

Step 5: Batch Processing Pipeline

import asyncio
from dataclasses import dataclass

@dataclass
class ProcessingResult:
    file_path: str
    status: str  # "success" | "error" | "low_confidence"
    data: Optional[Invoice] = None
    error: Optional[str] = None

async def process_batch(file_paths: list[str], concurrency: int = 5) -> list[ProcessingResult]:
    """Process multiple documents with controlled concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process_one(path: str) -> ProcessingResult:
        async with semaphore:
            try:
                # parse_pdf and extract_invoice are blocking calls, so run them
                # in worker threads to keep the event loop free for other tasks
                pages = await asyncio.to_thread(parse_pdf, path)
                text = "\n".join(p["text"] for p in pages if p["has_content"])
                invoice = await asyncio.to_thread(extract_invoice, text)

                status = "success" if invoice.confidence >= 0.8 else "low_confidence"
                return ProcessingResult(path, status, data=invoice)
            except Exception as e:
                return ProcessingResult(path, "error", error=str(e))

    # gather returns results in the same order as file_paths
    return await asyncio.gather(*(process_one(p) for p in file_paths))
5

Data Flow

Step-by-step flow of a document through the processing pipeline:

  • 1. Document ingestion — File uploaded via API, watched folder, or S3 event trigger
  • 2. Format detection — Identify file type (PDF, image, DOCX) and choose appropriate parser
  • 3. Text extraction — Parse document using best available method (text layer, OCR, or multimodal)
  • 4. Preprocessing — Clean text, normalize whitespace, detect language, split into sections
  • 5. LLM extraction — Send cleaned text to LLM with structured output schema (parallel tasks possible)
  • 6. Validation — Validate extracted data against Pydantic schema and business rules
  • 7. Confidence routing — High confidence → auto-approve; Low confidence → human review queue
  • 8. Storage — Write structured JSON to database, archive original document
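
The format detection step in the flow above can be sketched by inspecting the first bytes of the file rather than trusting its extension. The signatures used are the standard ones for PDF, PNG, JPEG, and ZIP containers; the parser names in the routing table are hypothetical labels, not real functions.

```python
def detect_format(header: bytes) -> str:
    """Guess the file type from its first bytes (magic numbers)."""
    if header.startswith(b"%PDF-"):
        return "pdf"
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if header.startswith(b"PK\x03\x04"):
        # DOCX is a ZIP container; other ZIP-based files share this signature
        return "zip"
    return "unknown"


# Hypothetical routing table: format -> name of the parser to run
PARSER_FOR_FORMAT = {
    "pdf": "pdf_parser",
    "png": "ocr_or_multimodal",
    "jpeg": "ocr_or_multimodal",
    "zip": "docx_parser",
}


def parser_for_header(header: bytes) -> str:
    """Pick a parser name, sending unrecognized files to quarantine."""
    return PARSER_FOR_FORMAT.get(detect_format(header), "quarantine")
```

Reading the first 8 bytes of the file (`open(path, "rb").read(8)`) is enough for these four signatures. A ZIP match needs a second check before it is treated as DOCX, since the signature alone does not distinguish it from other archives.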
6

Trade-offs & Considerations

Advantage | Limitation
Handles diverse, unstructured document formats | OCR quality varies significantly with document quality
LLMs understand document semantics beyond keyword matching | Cost per document can be high for multimodal processing
Structured output enforces consistent data schemas | Complex tables and nested layouts still challenge LLMs
Scales with batch processing and job queues | Latency per document (seconds to minutes with OCR + LLM)
Confidence scoring enables human-in-the-loop review | Sensitive documents (medical, legal) need compliance review

Parsing Approach Comparison

Approach | Cost | Accuracy | Speed | Best For
Text extraction only | Free | Medium | Fast | Clean, text-layer PDFs
OCR + Text LLM | Low | High | Medium | Scanned docs, standard layouts
Multimodal (image to LLM) | High | Highest | Slow | Complex layouts, handwriting
Document AI (managed) | Medium | High | Medium | High volume, standard forms
When to upgrade

If you need to answer questions over processed documents, feed the extracted data into Architecture 03 (RAG Pipeline). If processing requires multiple tool calls and decision-making, consider Architecture 06 (Agentic Tool Use).

7

Production Checklist

  • Build fallback chain: text extraction → OCR → multimodal for each document
  • Validate Pydantic schemas with edge cases (empty fields, unusual formats)
  • Implement confidence thresholds and human review queue for low-confidence extractions
  • Set up dead letter queue for persistently failing documents
  • Monitor extraction accuracy with labeled test sets (precision, recall per field)
  • Handle PII: mask or encrypt sensitive fields (SSN, credit cards) before storage
  • Implement idempotent processing (reprocessing same document produces same result)
  • Add progress tracking and estimated completion time for batch jobs
  • Archive original documents alongside extracted data for audit trails
  • Set cost alerts for multimodal processing (track per-page and per-document costs)
  • Test with adversarial documents (rotated pages, mixed languages, watermarks)
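
One way to approach the idempotent-processing item in the checklist above is to key stored results on a hash of the document bytes plus a schema version. This is a minimal sketch: the `store` dict stands in for a real database, and `extract` and `schema_version` are hypothetical names.

```python
import hashlib
from typing import Callable


def document_key(content: bytes, schema_version: str) -> str:
    """Stable key: same bytes and same schema version always give the same key."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{schema_version}:{digest}"


def process_once(
    content: bytes,
    schema_version: str,
    extract: Callable[[bytes], dict],   # hypothetical extraction function
    store: dict[str, dict],             # stand-in for a database table
) -> dict:
    """Run extraction only if this exact document has not been processed yet."""
    key = document_key(content, schema_version)
    if key not in store:
        store[key] = extract(content)
    return store[key]
```

Reprocessing the same file returns the stored result instead of calling the model again, which also avoids paying twice. Bumping the schema version forces a fresh extraction when the schema changes.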