Brings local document extraction into Claude through a two-tier OCR pipeline (PaddleOCR/EasyOCR on CPU, GOT-OCR2.0/VLM on GPU for low-confidence fallback) paired with local LLMs via vLLM or Ollama. You define extraction schemas as Pydantic models or use eight built-in ones (invoices, receipts, Korean tax forms, bills of lading), and it returns validated JSON with confidence scores and cross-field checks like checkdigit verification and sum totals. The stdio transport exposes extract, ocr, validate, and batch commands. Reach for this when you need structured data from scanned documents without cloud APIs, or when you need custom validation rules like container number checksums or business registration verification baked into the extraction flow.
Document in, Structured JSON out. Locally. With your schema.
docpick is a lightweight, schema-driven document extraction pipeline that combines local OCR engines with local LLMs to extract structured JSON from any document — invoices, receipts, bills of lading, tax forms, and more.
pip install docpick # core (LLM extraction only)
pip install docpick[paddle] # + PaddleOCR (recommended)
pip install docpick[easyocr] # + EasyOCR (Korean-optimized)
pip install docpick[got] # + GOT-OCR2.0 (GPU, vision-language)
pip install docpick[all] # all OCR backends
Requirements: Python 3.11+ / LLM endpoint (vLLM, Ollama, or OpenAI-compatible)
from docpick import DocpickPipeline
from docpick.schemas import InvoiceSchema
pipeline = DocpickPipeline()
result = pipeline.extract("invoice.pdf", schema=InvoiceSchema)
print(result.data) # Structured dict matching schema
print(result.validation) # Validation errors/warnings
print(result.confidence) # Per-field confidence scores
# Extract structured data
docpick extract invoice.pdf --schema invoice --output result.json
# OCR only (no LLM)
docpick ocr document.png --lang ko,en
# Validate extracted JSON
docpick validate result.json --schema invoice
# Batch process a directory
docpick batch ./documents/ --schema invoice --output ./results/ --concurrency 4
# List available schemas
docpick schemas list
# Show schema details
docpick schemas show invoice
| Schema | Document Type | Key Validations |
|---|---|---|
| invoice | Commercial invoices | Line item sums, tax ID checkdigit, date order |
| receipt | Retail/restaurant receipts | Total = subtotal + tax + tip |
| bill_of_lading | Ocean/air B/L | Container weight sums, ISO 6346, HS code format |
| purchase_order | Purchase orders | PO total = line items, delivery date order |
| kr_tax_invoice | Korean e-tax invoice (세금계산서) | Business number checkdigit (x2), supply/tax/total sums |
| bank_statement | Bank statements | IBAN mod97, period date order |
| id_document | Passport/ID (ICAO 9303) | MRZ, ISO 3166 country codes, date ranges |
| certificate_of_origin | Certificate of Origin | ISO 3166 alpha-2 country codes |
Define your own schema with Pydantic:
from pydantic import BaseModel
from docpick import DocpickPipeline
from docpick.validation.rules import SumEqualsRule, RequiredFieldRule
class MyDocument(BaseModel):
    """Custom document schema."""
    company_name: str | None = None
    total_amount: float | None = None
    tax_amount: float | None = None
    net_amount: float | None = None
    items: list[dict] | None = None

    class ValidationRules:
        rules = [
            RequiredFieldRule("company_name"),
            SumEqualsRule(["net_amount", "tax_amount"], "total_amount"),
        ]
pipeline = DocpickPipeline()
result = pipeline.extract("my_document.pdf", schema=MyDocument)
Or use a JSON Schema file:
docpick extract document.pdf --schema my_schema.json
| Algorithm | Use Case |
|---|---|
| kr_business_number | Korean business registration number (10 digits) |
| luhn | Credit card numbers |
| iso_6346 | Shipping container numbers |
| iban_mod97 | International bank account numbers |
| awb_mod7 | Air waybill numbers |
| mrz | Machine Readable Zone (passport/ID) |
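To make the table above concrete, here is a minimal standalone sketch of five of these check-digit algorithms, written from their publicly documented definitions. The function names are hypothetical and this is not docpick's internal implementation; it only illustrates what each check computes.

```python
import string


def luhn_valid(number: str) -> bool:
    """Luhn: double every second digit from the right, then test the sum mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def _iso6346_letter_values() -> dict[str, int]:
    """A=10, B=12, ... Z=38: values run upward but skip multiples of 11."""
    values, v = {}, 10
    for letter in string.ascii_uppercase:
        if v % 11 == 0:
            v += 1
        values[letter] = v
        v += 1
    return values


_ISO6346 = _iso6346_letter_values()


def iso6346_valid(container: str) -> bool:
    """ISO 6346: weight position i by 2**i, take the sum mod 11, then mod 10."""
    code = container.replace(" ", "").replace("-", "").upper()
    if len(code) != 11 or not code.isascii():
        return False
    if not code[:4].isalpha() or not code[4:].isdigit():
        return False
    total = sum(
        (_ISO6346[c] if c.isalpha() else int(c)) * 2**i
        for i, c in enumerate(code[:10])
    )
    return total % 11 % 10 == int(code[10])


def iban_valid(iban: str) -> bool:
    """IBAN mod 97: move the first four characters to the end, map A-Z to 10-35."""
    code = iban.replace(" ", "").upper()
    if len(code) < 5 or not code.isascii() or not code.isalnum():
        return False
    rearranged = code[4:] + code[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


def kr_business_number_valid(number: str) -> bool:
    """Korean business number: weights 1,3,7,1,3,7,1,3,5 plus a carry on digit 9."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) != 10:
        return False
    weights = (1, 3, 7, 1, 3, 7, 1, 3, 5)
    total = sum(d * w for d, w in zip(digits, weights))
    total += (digits[8] * 5) // 10
    return (10 - total % 10) % 10 == digits[9]


def mrz_check_digit(field: str) -> int:
    """ICAO 9303: repeating weights 7,3,1; '<' counts as 0, A-Z as 10-35.

    Raises ValueError on characters outside 0-9, A-Z and '<'.
    """
    weights = (7, 3, 1)
    total = 0
    for i, c in enumerate(field.upper()):
        value = 0 if c == "<" else int(c, 36)
        total += value * weights[i % 3]
    return total % 10
```

Each validator accepts the raw string as it might come out of OCR (spaces and hyphens are tolerated where that is common) and returns a plain boolean, which keeps it easy to plug into a field-level rule.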
| Rule | Description |
|---|---|
| SumEqualsRule | Sum of fields equals target (with tolerance) |
| DateBeforeRule | Date A must precede Date B |
| RequiredFieldRule | Field must be non-null and non-empty |
| FieldEqualsRule | Two fields must be equal |
| RangeRule | Numeric field within min/max bounds |
| RegexRule | Field matches regex pattern |
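As an illustration of what two of these rules check, the sketch below reimplements the sum and date-order comparisons as plain functions over an extracted dict. The function names, the `tolerance` default of 0.01, and the choice to fail when a field is missing are all assumptions made for this example, not docpick's actual behaviour.

```python
from datetime import date


def sum_equals(data: dict, fields: list[str], target: str,
               tolerance: float = 0.01) -> bool:
    """True when the listed fields add up to the target within the tolerance."""
    parts = [data.get(name) for name in fields]
    expected = data.get(target)
    if expected is None or any(p is None for p in parts):
        return False  # assumption: a missing operand counts as a failure
    return abs(sum(parts) - expected) <= tolerance


def date_before(data: dict, earlier: str, later: str) -> bool:
    """True when the ISO-formatted date in `earlier` strictly precedes `later`."""
    a, b = data.get(earlier), data.get(later)
    if a is None or b is None:
        return False
    return date.fromisoformat(a) < date.fromisoformat(b)


extracted = {
    "net_amount": 100.0,
    "tax_amount": 10.0,
    "total_amount": 110.0,
    "issue_date": "2024-03-01",
    "due_date": "2024-03-31",
}
print(sum_equals(extracted, ["net_amount", "tax_amount"], "total_amount"))
print(date_before(extracted, "issue_date", "due_date"))
```

The tolerance matters because amounts read by OCR and parsed as floats rarely add up exactly; comparing with a small absolute margin avoids false alarms from rounding.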
Validate consistency across related documents (e.g., Invoice + B/L + Packing List):
from docpick.validation.cross_document import create_trade_document_validator
validator = create_trade_document_validator()
result = validator.validate({
    "invoice": invoice_data,
    "bl": bl_data,
    "packing_list": packing_list_data,
    "certificate": certificate_data,
})
print(result.is_valid)
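The idea behind cross-document validation can be sketched in a few lines: collect the same field from every document that carries it and flag any disagreement. The helper below is a hypothetical stand-in written for illustration; the field names are invented, and the rules that `create_trade_document_validator` actually applies are not shown in this README.

```python
def find_mismatches(documents: dict[str, dict], fields: list[str]) -> list[str]:
    """Report each field whose non-null scalar values differ between documents."""
    problems = []
    for field in fields:
        # Keep only the documents that actually carry this field.
        seen = {
            name: doc[field]
            for name, doc in documents.items()
            if doc.get(field) is not None
        }
        if len(set(seen.values())) > 1:
            detail = ", ".join(f"{name}={value!r}" for name, value in sorted(seen.items()))
            problems.append(f"{field}: {detail}")
    return problems


docs = {
    "invoice": {"container_number": "CSQU3054383", "total_amount": 1000.0},
    "bl": {"container_number": "CSQU3054383"},
    "packing_list": {"container_number": "CSQU3054380"},
}
for problem in find_mismatches(docs, ["container_number", "total_amount"]):
    print(problem)
```

Fields that appear in only one document are not treated as conflicts here, since a packing list, for example, would not normally state an invoice total.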
| Engine | Type | GPU | Languages | Best For |
|---|---|---|---|---|
| PaddleOCR | Traditional OCR | Optional | 111 | General documents (default) |
| EasyOCR | Traditional OCR | Optional | 80+ | Korean text |
| GOT-OCR2.0 | Vision-Language | Required | Multi | Complex layouts |
| VLM | Vision-Language | Required | Multi | Direct image → JSON |
The default auto engine uses confidence-based fallback: if the average Tier 1 confidence falls below the threshold (default 0.7), the pipeline automatically escalates to Tier 2.
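The escalation logic can be sketched as follows. This is an illustrative outline only: the two engines are passed in as plain callables returning text plus per-line confidences, which is an assumed interface rather than docpick's real engine API.

```python
from statistics import mean
from typing import Callable

# Assumed engine interface: path -> (recognized text, per-line confidences)
OcrEngine = Callable[[str], tuple[str, list[float]]]


def ocr_with_fallback(path: str, tier1: OcrEngine, tier2: OcrEngine,
                      threshold: float = 0.7) -> tuple[str, str]:
    """Run the CPU engine first; escalate to the GPU engine on low confidence."""
    text, confidences = tier1(path)
    # An empty result is treated as zero confidence so that it also escalates.
    average = mean(confidences) if confidences else 0.0
    if average >= threshold:
        return text, "tier1"
    text, _ = tier2(path)
    return text, "tier2"


def clean_scan(path: str) -> tuple[str, list[float]]:
    return "INVOICE 2024-001", [0.95, 0.91]


def blurry_scan(path: str) -> tuple[str, list[float]]:
    return "1NV0ICE 2O24", [0.55, 0.62]


def vision_model(path: str) -> tuple[str, list[float]]:
    return "INVOICE 2024-001", [0.90]


print(ocr_with_fallback("a.png", clean_scan, vision_model))
print(ocr_with_fallback("b.png", blurry_scan, vision_model))
```

Averaging over the whole page keeps the decision cheap, at the cost of letting one badly read region hide behind many well-read ones; a stricter variant could escalate on the minimum confidence instead.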
| Provider | Endpoint | Default Model |
|---|---|---|
| vLLM | http://localhost:8000/v1 | Qwen/Qwen3.5-32B-AWQ |
| Ollama | http://localhost:11434 | qwen3.5:7b |
Configure via CLI or YAML:
docpick config set llm.provider ollama
docpick config set llm.base_url http://localhost:11434
docpick config set llm.model qwen3.5:7b
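The pipeline is designed to be resilient. Problems encountered during processing are collected in result.errors, and whatever could be extracted remains available in result.data:
result = pipeline.extract("damaged.pdf", schema=InvoiceSchema)
if result.errors:
    print("Pipeline warnings:", result.errors)
if result.data:
    print("Partial extraction:", result.data)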
Process entire directories with parallel workers:
from docpick.batch import BatchProcessor
from docpick.schemas import InvoiceSchema
processor = BatchProcessor(concurrency=4)
result = processor.process_directory(
    "./invoices/",
    schema=InvoiceSchema,
    recursive=True,
)
print(f"Processed {result.succeeded}/{result.total} files")
for path, extraction in result.results.items():
    print(f"{path}: {extraction.data.get('total_amount')}")
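For readers curious how directory-level parallelism of this kind can be arranged, here is a minimal sketch using a thread pool from the standard library. The `process_directory` function below and its `extract_one` callback are hypothetical stand-ins, not the `BatchProcessor` implementation; the suffix filter is likewise an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def process_directory(
    directory: str,
    extract_one: Callable[[Path], dict],
    concurrency: int = 4,
    recursive: bool = True,
    suffixes: tuple[str, ...] = (".pdf", ".png", ".jpg"),
) -> tuple[dict[Path, dict], dict[Path, str]]:
    """Run extract_one over every matching file; return (results, failures)."""
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    files = sorted(
        p for p in candidates if p.is_file() and p.suffix.lower() in suffixes
    )

    results: dict[Path, dict] = {}
    failures: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(extract_one, path): path for path in files}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                # One unreadable file should not abort the whole batch.
                failures[path] = str(exc)
    return results, failures
```

Threads are a reasonable fit when each worker spends most of its time waiting on an LLM endpoint over HTTP; CPU-bound OCR in the same process would benefit more from a process pool.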
flowchart TD
A["📄 Document\n(PDF / Image)"] --> B["DocumentLoader\n(pypdfium2)"]
B --> C["Tier 1: OCR\n(PaddleOCR / EasyOCR)\nCPU"]
C --> D{"Confidence\n≥ threshold?"}
D -->|"yes"| F["LLM Extractor\n(vLLM / Ollama)\nSchema prompt"]
D -->|"no"| E["Tier 2: VLM\n(GOT / VLM)\nGPU"]
E --> F
F --> G["Pydantic Validation"]
G --> H["✅ ExtractionResult"]
Apache 2.0 — all dependencies are Apache 2.0 or MIT licensed.
Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.