Built by the Sunholo team behind AILANG, this parser extracts structured content from Office documents and PDFs. Its deterministic XML approach captures track changes, interleaved comments, headers, footers, and merged cells that most parsers miss. Office formats run locally with zero AI. PDFs default to the local pdftotext backend, while scanned PDFs and images delegate to whatever model you configure (Gemini, Claude, local Ollama). It outputs JSON and markdown and runs via stdio or HTTP. The team benchmarked it against Pandoc, Docling, and six others on 69 files across 11 formats and scored 93.9% composite. Reach for this when you need redlining metadata, speaker notes from PPTX, or multi-sheet XLSX data without fighting raw OOXML yourself.
Public tool metadata for what this MCP can expose to an agent.
- parse_search: Find brands, organic AI prompts, citation sources, and market niches for marketer research. Use this first when the user names a brand, category, source, or AI visibility question. Parameters (3): limit (number), query (string), types (array).
- parse_get_brand: Fetch a concise public marketing brief for one brand, including Parse score, strengths, weak spots, top prompts, citation sources, related brands, and next research questions. Parameters (1): slug_or_id (string).
- parse_get_prompt: Fetch one public organic prompt by slug when the user wants to inspect the exact AI-search question behind a result. Parameters (1): slug (string).
- parse_get_stats: Explain the public Parse index scale and freshness: tracked brands, organic prompts, and citation observations. No parameter schema in public metadata yet.
- search: Compatibility alias for parse_search. Use for clients that expect a generic search tool. Parameters (2): limit (number), query (string).
- fetch: Compatibility alias that resolves fetch IDs like brand:stripe or prompt:best-crm into JSON-text results with human-readable text. Parameters (1): id (string).

Universal document parsing in AILANG. Extracts structured content from DOCX, PPTX, XLSX, PDF, and image files into JSON and markdown.
Office formats (DOCX, PPTX, XLSX) use deterministic XML parsing — no AI, no cloud, instant results. PDFs default to the deterministic pdftotext backend (poppler) — also no AI, no cloud — with docling and liteparse as local alternatives and pluggable AI (Gemini, Claude, local Ollama) for scanned/image-only pages via --pdf-backend ai. Images delegate to whatever AI model you plug in. AILANG Parse is AI-agnostic: swap --pdf-backend/--ai to change the backend, zero code changes.
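The routing described above can be sketched as a small helper that builds the `docparse` command line for a given file. This is only an illustration of the decision logic, not part of AILANG Parse: `build_docparse_command` and its `scanned` parameter are hypothetical names, and the only flags used are the `--pdf-backend ai` and `--ai` options mentioned in this document.

```python
from pathlib import Path

# Extension groups taken from the description above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def build_docparse_command(path, scanned=False, model=None):
    """Return an argv list for docparse (hypothetical helper, not a real API).

    Office files and text PDFs need no extra flags. Scanned PDFs opt in to
    the AI backend. `model` is only appended when the file needs AI.
    """
    ext = Path(path).suffix.lower()
    cmd = ["docparse", str(path)]
    scanned_pdf = ext == ".pdf" and scanned
    if scanned_pdf:
        # Scanned or image-only PDFs have no text layer for pdftotext.
        cmd += ["--pdf-backend", "ai"]
    if model and (scanned_pdf or ext in IMAGE_EXTS):
        cmd += ["--ai", model]
    return cmd


if __name__ == "__main__":
    for name, is_scan in [("report.docx", False), ("contract.pdf", False), ("scan.pdf", True)]:
        print(" ".join(build_docparse_command(name, is_scan, "gemini-2.5-flash")))
```

Keeping the decision in one pure function makes it easy to check that deterministic formats never receive AI flags by accident.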
Requires AILANG CLI.
# Clone and symlink
git clone https://github.com/sunholo-data/ailang-parse.git
ln -s "$(pwd)/ailang-parse/bin/docparse" /usr/local/bin/docparse
Use AILANG Parse from your language of choice:
pip install ailang-parse # Python
npm install @ailang/parse # JavaScript/TypeScript
go get github.com/sunholo-data/ailang-parse-go # Go
# Office documents (deterministic, no AI needed)
docparse report.docx
docparse slides.pptx
docparse spreadsheet.xlsx
# PDF (deterministic pdftotext by default — no AI); images (AI auto-enabled)
docparse document.pdf
docparse photo.png
# Options
docparse report.docx describe # AI image descriptions
docparse report.docx summarize # AI document summary
docparse contract.pdf # PDF: deterministic pdftotext (default)
docparse scan.pdf --pdf-backend ai --ai gemini-2.5-flash # Scanned PDF needs AI
# Format conversion
docparse report.docx --convert output.html
docparse data.csv --convert report.docx
docparse notes.md --convert slides.pptx
# AI document generation
ailang run --entry main --caps IO,FS,Env,AI --ai gemini-2.5-flash \
docparse/main.ail --generate report.docx --prompt "Q1 sales report with tables"
Every run produces:
- docparse/data/output.json: Structured JSON with typed blocks
- docparse/data/output.md: LLM-ready markdown

| Feature | DOCX | PPTX | XLSX | Best Competitor |
|---|---|---|---|---|
| Tables with merged cells | Yes | Yes | Yes | Raw OOXML only |
| Track changes (redlining) | Yes | — | — | Pandoc (3/3) |
| Comments (interleaved) | Yes | — | — | Raw OOXML (2/2) |
| Headers/footers | Yes | — | — | Kreuzberg (2/3) |
| Text boxes / VML shapes | Yes | Yes | — | Raw OOXML (1/2) |
| Equations (§22.1) | Yes | — | — | None |
| Field codes (§17.16) | Yes | — | — | Kreuzberg, OOXML |
| Speaker notes | — | Yes | — | None |
| Multi-sheet extraction | — | — | Yes | Kreuzberg |
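To make the track-changes row concrete: WordprocessingML records revisions as `w:ins` and `w:del` elements inside `word/document.xml`, so they can be read deterministically with an XML parser. The sketch below is a minimal Python illustration of that idea using only the standard library. It is not the AILANG implementation, and `extract_revisions` and `SAMPLE` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (ECMA-376).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Revision element -> (kind, element holding the revised text).
REVISION_TAGS = {
    f"{{{W}}}ins": ("insert", f"{{{W}}}t"),
    f"{{{W}}}del": ("delete", f"{{{W}}}delText"),
}


def extract_revisions(document_xml):
    """Return tracked insertions and deletions in document order."""
    root = ET.fromstring(document_xml)
    revisions = []
    for node in root.iter():
        if node.tag not in REVISION_TAGS:
            continue
        kind, text_tag = REVISION_TAGS[node.tag]
        text = "".join(t.text or "" for t in node.iter(text_tag))
        revisions.append({
            "kind": kind,
            "author": node.get(f"{{{W}}}author"),
            "date": node.get(f"{{{W}}}date"),
            "text": text,
        })
    return revisions


# Hand-written fragment: "30" was deleted and "45" inserted by Alice.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:t>Payment due in </w:t></w:r>'
    '<w:del w:id="1" w:author="Alice" w:date="2024-01-02T10:00:00Z">'
    '<w:r><w:delText>30</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="Alice" w:date="2024-01-02T10:00:00Z">'
    '<w:r><w:t>45</w:t></w:r></w:ins>'
    '<w:r><w:t> days.</w:t></w:r>'
    '</w:p></w:body></w:document>'
)

if __name__ == "__main__":
    for rev in extract_revisions(SAMPLE):
        print(rev)
```

For a real file, the same function could be fed the bytes returned by `zipfile.ZipFile(path).read("word/document.xml")`.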
OfficeDocBench (69 files, 11 formats, 7 metrics): AILANG Parse 93.9% composite with 100% coverage vs nearest competitor 68.0% coverage-adjusted. 8 parsers compared including Raw OOXML, Pandoc, Kreuzberg, MarkItDown, Unstructured, Docling. Scores include aspirational ECMA-376 spec targets that intentionally lower our score.
Parsing (16 formats): DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, Markdown, CSV, EPUB, EML, MBOX, TEX, RTF, PDF, images (JPG/PNG)
Generation (9 formats): DOCX, PPTX, XLSX, ODT, ODP, ODS, HTML, Markdown, QMD (Quarto)
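The two lists above imply a simple pre-flight check for `--convert`: the source must be a parseable format and the target a generatable one. The sketch below encodes that check in Python. The file extensions are assumptions derived from the format names (for example `.md` for Markdown), `can_convert` is a hypothetical helper, and the document does not state whether every source and target pair is actually supported.

```python
from pathlib import Path

# Assumed extensions for the 16 parsing formats listed above.
PARSE_EXTS = {
    ".docx", ".pptx", ".xlsx", ".odt", ".odp", ".ods", ".html", ".md",
    ".csv", ".epub", ".eml", ".mbox", ".tex", ".rtf", ".pdf", ".jpg", ".png",
}

# Assumed extensions for the 9 generation formats listed above.
GENERATE_EXTS = {
    ".docx", ".pptx", ".xlsx", ".odt", ".odp", ".ods", ".html", ".md", ".qmd",
}


def can_convert(src, dst):
    """True if src looks parseable and dst looks generatable by extension."""
    src_ok = Path(src).suffix.lower() in PARSE_EXTS
    dst_ok = Path(dst).suffix.lower() in GENERATE_EXTS
    return src_ok and dst_ok


if __name__ == "__main__":
    print(can_convert("notes.md", "slides.pptx"))
    print(can_convert("report.docx", "report.pdf"))
```

Under these assumptions the conversion examples shown earlier (DOCX to HTML, CSV to DOCX, Markdown to PPTX) all pass, while a PDF target is rejected because PDF appears only in the parsing list.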
docparse/
├── types/document.ail # Block ADT (9 variants)
├── services/
│ ├── format_router.ail # Format detection (36 inline tests)
│ ├── zip_extract.ail # ZIP layer (9 inline tests)
│ ├── docx_parser.ail # DOCX XML → Blocks (6 inline tests)
│ ├── pptx_parser.ail # PPTX slides → Blocks
│ ├── xlsx_parser.ail # XLSX worksheets → Blocks
│ ├── direct_ai_parser.ail # PDF/image → Blocks (AI)
│ ├── layout_ai.ail # AI self-healing (optional)
│ ├── output_formatter.ail # JSON + markdown output
│ └── docparse_browser.ail # WASM browser adapter
└── main.ail # CLI entry point
28+ contracts, 50+ inline tests.
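The first two stages of the layout above (format detection, then the ZIP layer) can be illustrated for the Office formats, which are ZIP archives with a conventional main part. The Python sketch below shows one way to detect them from bytes alone. It is an illustration, not a port of `format_router.ail`. The part paths are the conventional locations written by common producers, while a strict reader would resolve the main part through `_rels/.rels` instead. `detect_ooxml_format` and `make_zip` are names invented for this example.

```python
import io
import zipfile

# Conventional main-part paths for each OOXML package type.
MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def detect_ooxml_format(data):
    """Return 'docx', 'pptx', 'xlsx', or None for anything else."""
    # ZIP archives with at least one entry begin with a local file header.
    if not data.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return None
    for marker, fmt in MARKERS.items():
        if marker in names:
            return fmt
    return None


def make_zip(names):
    """Build a tiny in-memory ZIP with the given entry names (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return buf.getvalue()


if __name__ == "__main__":
    print(detect_ooxml_format(make_zip(["[Content_Types].xml", "word/document.xml"])))
    print(detect_ooxml_format(b"%PDF-1.7"))
```

Checking content rather than the file extension means a renamed or extension-less upload is still routed to the right parser.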
AILANG Parse uses AILANG's AI effect — any model AILANG supports works:
docparse scan.pdf --ai gemini-2.5-flash # Google (default; fast)
docparse scan.pdf --ai gemini-3-flash-preview # Google (slower; thinking model)
docparse scan.pdf --ai granite-docling # Local Ollama (free)
docparse scan.pdf --ai claude-haiku-4-5 # Anthropic
AI usage is bounded by capability budgets (AI @limit=30), so costs are predictable.
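The idea of a bounded budget can be pictured with a small wrapper that refuses further calls once a limit is reached. This is a Python analogy only: AILANG enforces its limit through its own capability system, and this sketch assumes the limit counts calls, which the document does not spell out. `CallBudget` and `BudgetExceeded` are hypothetical names.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a call would exceed the configured budget."""


class CallBudget:
    """Allow at most `limit` calls through this wrapper (illustrative only)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def call(self, fn, *args, **kwargs):
        # Check before calling so a rejected call never reaches the model.
        if self.used >= self.limit:
            raise BudgetExceeded(f"budget of {self.limit} calls exhausted")
        self.used += 1
        return fn(*args, **kwargs)


if __name__ == "__main__":
    budget = CallBudget(limit=30)  # mirrors the limit value quoted above

    def fake_model(prompt):
        # Stand-in for a real AI call.
        return f"described: {prompt}"

    print(budget.call(fake_model, "page 1"))
    print(budget.used, "of", budget.limit, "calls used")
```

A hard ceiling of this kind puts an upper bound on the worst-case number of model calls per run.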
docparse --check # Type-check all modules
docparse --test # Run inline tests
docparse --prove # Static Z3 contract verification
uv run benchmarks/run_benchmarks.py --suite office # Structural (no API, instant)
uv run benchmarks/run_benchmarks.py --suite pdf # PDF extraction (needs AI)
uv run benchmarks/run_benchmarks.py --competitors # Compare to Docling etc.
See benchmarks/ for details.
Apache 2.0
DOCPARSE_API_KEY (secret): AILANG Parse API key (dp_...). Optional: the bridge auto-loads keys saved at ~/.config/ailang-parse/credentials.json. Get one from https://www.sunholo.com/ailang-parse/
AILANG_PARSE_MCP_URL: Override the hosted MCP endpoint. Defaults to https://docparse.ailang.sunholo.com/mcp/
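A client could resolve these two settings roughly as follows. This Python sketch rests on several assumptions that the document does not confirm: that an explicit environment variable takes precedence over the saved credentials file, and that the file is a JSON object with an `api_key` field (a hypothetical field name). `resolve_mcp_url` and `resolve_api_key` are likewise invented for illustration.

```python
import json
import os
from pathlib import Path

# Default endpoint and credentials path as stated above.
DEFAULT_MCP_URL = "https://docparse.ailang.sunholo.com/mcp/"
CREDENTIALS_PATH = Path.home() / ".config" / "ailang-parse" / "credentials.json"


def resolve_mcp_url(env=os.environ):
    """Use AILANG_PARSE_MCP_URL if set, otherwise the hosted default."""
    return env.get("AILANG_PARSE_MCP_URL") or DEFAULT_MCP_URL


def resolve_api_key(env=os.environ, credentials_path=CREDENTIALS_PATH):
    """Return an API key from the environment or the saved file, else None."""
    key = env.get("DOCPARSE_API_KEY")
    if key:
        return key  # assumption: the environment wins over the saved file
    try:
        data = json.loads(Path(credentials_path).read_text())
    except (OSError, ValueError):
        return None  # missing or unreadable file: the key is optional
    # "api_key" is a hypothetical field name for this sketch.
    return data.get("api_key") if isinstance(data, dict) else None


if __name__ == "__main__":
    print(resolve_mcp_url())
    print("key found" if resolve_api_key() else "no key configured")
```

Passing `env` and `credentials_path` as parameters keeps the lookup testable without touching the real environment or home directory.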
csoai-org/pdf-document-mcp
xt765/mcp-document-converter
io.github.xjtlumedia/markdown-formatter
io.github.ai-aviate/better-notion
suekou/mcp-notion-server
meterlong/mcp-doc