macOS Vision MCP

Transport: STDIO · Registry: active

Summary

Wraps Apple's Vision Framework to run OCR, face detection, barcode reading, document-corner detection, and image classification entirely on your Mac. Instead of sending a 44-page PDF to Claude as roughly 73,500 tokens of page images, it extracts structured text locally first (paragraphs, bounding boxes, reading order), which costs around 2,400 tokens. Works with any MCP client over stdio. Exposes six tools: ocr_image for text extraction, detect_faces, detect_barcodes, detect_document, classify_image, and analyze_document for full pipelines that return JSON ready for the model to rebuild as Markdown or HTML. Requires macOS 13.0+ and runs offline after install. Useful when you're processing contracts, invoices, or medical records and want the file to stay local while still getting structured output the LLM can reason over.

macos-vision-mcp

Cut document token costs by ~97% with local, private, offline OCR for any MCP client — no API keys, no uploads.

Pre-extracts text and image data locally before your AI ever sees it — cutting token usage by ~97% on real documents and returning structured paragraphs, lines, and bounding boxes so the model can reconstruct the document into Markdown, HTML, DOCX, or any other format. Files never leave your Mac: no cloud API, no API keys, no network requests.

How the ~97% is measured: a 44-page scanned PDF sent as page images costs ~73,500 tokens; the same file run through analyze_document returns ~2,400 tokens of extracted text and structure (raw page-image tokens vs. extracted-text tokens). Your numbers vary with page density and tokenizer, so treat 97% as the order of magnitude, not a guarantee.
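
The arithmetic behind the headline figure can be checked from the two token counts quoted above. This is only a sketch; the helper name reductionPercent is illustrative and not part of the package.

```typescript
// Token counts quoted above for the 44-page scanned PDF.
const PAGE_IMAGE_TOKENS = 73_500; // file sent as raw page images
const EXTRACTED_TOKENS = 2_400;   // analyze_document output for the same file
const PAGES = 44;

/** Percentage of tokens saved when `after` replaces `before`. */
function reductionPercent(before: number, after: number): number {
  if (before <= 0) throw new RangeError("before must be positive");
  return (1 - after / before) * 100;
}

const saved = reductionPercent(PAGE_IMAGE_TOKENS, EXTRACTED_TOKENS);
console.log(`${saved.toFixed(1)}% fewer tokens`);                               // 96.7% fewer tokens
console.log(`${Math.round(PAGE_IMAGE_TOKENS / PAGES)} tokens/page as images`); // 1670 tokens/page as images
console.log(`${Math.round(EXTRACTED_TOKENS / PAGES)} tokens/page as text`);    // 55 tokens/page as text
```

The exact result is about 96.7%, which the README rounds to ~97%.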

Contents: Quick Start · What you get · Why it's different · Available Tools · Usage · Example workflows · Configuration · Privacy layer

What you get

OCR for images and PDFs (JPG, PNG, HEIC, TIFF, multi-page PDF) via Apple Vision Framework.
~97% token reduction: a 44-page PDF costs ~2,400 tokens instead of ~73,500.
Reading-order paragraphs + raw text blocks with bounding boxes — rich structure for the model to reconstruct the document into any output format (Markdown, HTML, DOCX, JSON), not a lossy plain-text dump.
Face detection, barcode/QR reading, and image classification — all on-device.
Full document pipeline: OCR + faces + barcodes + rectangles in a single tool call.
Works with Claude Code, Claude Desktop, and Cursor — any MCP-compatible client.
No files uploaded to any server — processing stays entirely on your Mac.
100% offline after the first run, which fetches the package and its helper binaries. Powered by Apple Vision Framework, the same engine as Live Text in Photos.app.

❌ Without / ✅ With

❌ Without macos-vision-mcp:

Sending a 44-page PDF costs ~73,500 tokens
Every image, invoice, or contract goes through a cloud API
Sensitive documents leave your machine on every request

✅ With macos-vision-mcp:

Local Apple Vision pre-extracts text before Claude ever sees it
~2,400 tokens for the same 44-page PDF — 97% fewer
Files never leave your Mac

Why it's different

Most OCR options for LLMs either ship your documents to a cloud vision API or make you stand up and tune your own engine. This runs on Apple's on-device Vision framework, the same engine behind Live Text in Photos.app, so extraction is free, private, and needs no network round trip.

| | macos-vision-mcp | Cloud vision OCR (GPT-4o, Google Vision, Mistral OCR) | Tesseract-based MCP |
|---|---|---|---|
| Cost | $0, no per-page or per-token fees | Per-call / per-page billing | $0, but self-hosted |
| Offline | Yes, after install | No, every page hits the network | Yes |
| Privacy | Files never leave your Mac | Documents uploaded to a third party | Local |
| Setup | One command, no keys | API key + billing account | Install + language data + tuning |
| Quality | Apple Vision (strong on clean scans, receipts, screenshots) | Generally high | Varies; weaker on poor scans |

The trade-off is honest: it's macOS-only, and on heavily skewed or low-contrast scans a cloud model may still read more. For the common case (invoices, contracts, receipts, screenshots, clean PDFs) you get cloud-grade extraction at zero cost, with one-command setup and nothing leaving your machine.

Privacy layer

macos-vision-mcp acts as a local pre-processing layer between your documents and the cloud. Useful for:

Legal documents, contracts, NDAs
Financial reports, invoices, internal spreadsheets
Medical records or any GDPR-sensitive content
Any situation where you want to extract structured data locally before deciding what (if anything) to send upstream

Instead of sending the raw document to your AI, you extract the text and structure locally first. The model then works only with the extracted text — never the original file.

Quick Start

Add to your MCP client (example for Claude Code):

claude mcp add macos-vision-mcp -- npx -y macos-vision-mcp

Using Claude Desktop or Cursor? Jump to Configuration ↓

Restart your client. npx fetches the package on first run, caches it, and the tools appear automatically — no separate install step. This is the convention used by most MCP servers and recommended by Anthropic, Cursor, and other clients.

Note: On first run, the package downloads prebuilt Swift helper binaries (vision-helper, pdf-helper) from its GitHub Releases (~300 KB, ~1–2s). Subsequent invocations hit the npx cache and start instantly. Xcode Command Line Tools are only required as a fallback when the download can't reach the network — set MACOS_VISION_SKIP_DOWNLOAD=1 to force local compilation with swiftc.

Prefer instant cold-starts (no npx cache lookup)? Install globally with npm install -g macos-vision-mcp and use the alternative config shown at the bottom of Configuration.

Available Tools

| Tool | What it does | Example prompt |
|---|---|---|
| `ocr_image` | Extract text from an image or PDF (JPG, PNG, HEIC, TIFF, PDF). Returns plain text, or per-page paragraphs + text blocks with `lineId` / `paragraphId` and bounding boxes. Accepts `start_page` / `max_pages` for partial PDF OCR. | "Read the text from ~/Desktop/screenshot.png" |
| `detect_faces` | Detect human faces and return their count and positions. | "How many people are in this photo?" |
| `detect_barcodes` | Read QR codes, EAN, UPC, Code128, PDF417, Aztec, and other 1D/2D codes. | "What does the QR code in /tmp/qr.jpg say?" |
| `detect_document` | Detect the four corner points of a document in a photo (paper, receipt, ID). Useful as a crop / deskew hint before OCR. | "Find the document corners in ~/Desktop/receipt.jpg" |
| `classify_image` | Classify image content into 1000+ categories with confidence scores. | "What is in this image?" |
| `analyze_document` | Returns structured JSON with reading-order paragraphs, raw text blocks (bbox / confidence), faces, barcodes, and rectangles, ready for the model to reconstruct into Markdown, HTML, or anything else. Also accepts `start_page` / `max_pages` for long PDFs. | "Reconstruct ~/Desktop/scan.pdf as clean Markdown" |
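
For long PDFs, the `start_page` / `max_pages` parameters let a client OCR a file in windows. The sketch below builds the JSON-RPC `tools/call` requests an MCP client would send for each window. The argument name `path`, the zero-based page numbering, the sample file path, and the helper names `pageWindows` and `ocrPageWindow` are assumptions for illustration. Check the tool's published input schema (for example via `tools/list`) for the real argument names.

```typescript
// JSON-RPC 2.0 request shape that MCP clients use to invoke a tool.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface PageWindow {
  startPage: number;
  maxPages: number;
}

/** Split a PDF of `pageCount` pages into windows of at most `windowSize` pages. */
function pageWindows(pageCount: number, windowSize: number, firstPage = 0): PageWindow[] {
  if (windowSize < 1) throw new RangeError("windowSize must be >= 1");
  const windows: PageWindow[] = [];
  for (let offset = 0; offset < pageCount; offset += windowSize) {
    windows.push({
      startPage: firstPage + offset, // assumes zero-based pages
      maxPages: Math.min(windowSize, pageCount - offset),
    });
  }
  return windows;
}

/** Build a tools/call request that runs ocr_image on one page window. */
function ocrPageWindow(id: number, path: string, w: PageWindow): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "ocr_image",
      // `path` is an assumed argument name; start_page / max_pages come from the table above.
      arguments: { path, start_page: w.startPage, max_pages: w.maxPages },
    },
  };
}

// Hypothetical 44-page file, processed ten pages at a time.
const requests = pageWindows(44, 10).map((w, i) =>
  ocrPageWindow(i + 1, "/Users/me/Desktop/scan.pdf", w),
);
console.log(requests.length); // 5
console.log(JSON.stringify(requests[4].params.arguments));
```

Windowing keeps each tool result small, which helps when a client caps the size of a single tool response.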

Usage

Name the tool explicitly in your prompt so the model routes the file through local processing:

Extract text from an image or PDF:

Use ocr_image to extract text from ~/Desktop/invoice.pdf

Detect faces in a photo:

Use detect_faces on ~/Photos/team.jpg and tell me how many people are in it

Classify image content:

Use classify_image on ~/Downloads/unknown.jpg

Full document analysis + reconstruction:

Use analyze_document on ~/Desktop/report.pdf and reconstruct it as clean Markdown

The tool returns structured JSON; the model picks the output format you ask for (Markdown, HTML, DOCX outline, etc.) without any extra dependencies — no Ollama, no cloud LLM, no extra tooling.

Example workflows

Real-world combinations that work out of the box once the server is connected:

"Convert PDF → clean Markdown for LLM" — analyze_document returns reading-order paragraphs and bounding boxes; the model renders Markdown ready to drop into a docs site, knowledge base, or RAG pipeline.
"Extract invoice data locally before sending to GPT" — pull line items, totals, vendor, and dates from the PDF locally with analyze_document, then send only the structured JSON upstream. The original document never leaves your Mac.
"Scan receipts → JSON → expense tracker" — ocr_image on a phone photo, the model normalizes amount / date / merchant, and pipes the result straight into your expense tool's API.
"Decode a QR code from a screenshot" — detect_barcodes returns the decoded value plus symbology in one round trip.
"Crop a photo of a paper form before OCR" — detect_document returns the four corner points so you (or a downstream tool) can deskew and crop the image before reading the text.

Output schema (analyze_document)

{
  "source": { "path": "...", "pageCount": 1, "isPdf": false },
  "pages": [
    {
      "page": 0,
      // primary surface for reconstruction — reading-order paragraphs joined with "\n"
      "paragraphs": [
        { "paragraphId": 0, "lineIds": [0], "text": "ACME COFFEE" },
        { "paragraphId": 1, "lineIds": [1, 2], "text": "12 Main St\nPortland, OR" },
      ],
      // spatial fallback — raw blocks with page-local 0–1 bbox, confidence, line/paragraph membership
      "textBlocks": [
        {
          "text": "ACME COFFEE",
          "lineId": 0,
          "paragraphId": 0,
          "confidence": 0.99,
          "bbox": { "x": 0.21, "y": 0.04, "width": 0.58, "height": 0.06 },
        },
      ],
      "faces": [],
      "barcodes": [],
      "rectangles": [],
    },
  ],
  "summary": {
    "totalTextBlocks": 8,
    "totalParagraphs": 2,
    "totalFaces": 0,
    "totalBarcodes": 0,
    "totalRectangles": 0,
  },
}

Use paragraphs[].text for the 95% case (rebuild Markdown/HTML/plain text directly). Reach for textBlocks[] when you need spatial context — multi-column layouts, tables, forms, IDs.
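
When a script rather than the model consumes the output, the paragraphs array is usually enough. A minimal sketch, assuming the field names shown in the schema above. The function names and the `---` page separator are illustrative choices, not part of the package.

```typescript
// Only the fields this sketch needs; names follow the schema above.
interface Paragraph {
  paragraphId: number;
  lineIds: number[];
  text: string;
}
interface Page {
  page: number;
  paragraphs: Paragraph[];
}
interface AnalyzedDocument {
  pages: Page[];
}

/** Join one page's reading-order paragraphs, separated by a blank line. */
function pageToText(page: Page): string {
  return page.paragraphs
    .map((p) => p.text.trim())
    .filter((t) => t.length > 0) // drop empty paragraphs
    .join("\n\n");
}

/** Join all pages, separated by a Markdown thematic break. */
function documentToText(doc: AnalyzedDocument): string {
  return doc.pages.map(pageToText).join("\n\n---\n\n");
}

// Sample data copied from the schema above.
const sample: AnalyzedDocument = {
  pages: [
    {
      page: 0,
      paragraphs: [
        { paragraphId: 0, lineIds: [0], text: "ACME COFFEE" },
        { paragraphId: 1, lineIds: [1, 2], text: "12 Main St\nPortland, OR" },
      ],
    },
  ],
};

console.log(documentToText(sample));
// ACME COFFEE
//
// 12 Main St
// Portland, OR
```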

Notes:

ocr_image in blocks mode returns the same per-page shape minus the detection sections: { pages: [{ page, paragraphs, textBlocks }] }.
PDFs are processed page by page. All coordinates are page-local (0–1), and paragraphId / lineId reset on every page.
Face, barcode, and rectangle detection on PDFs is best-effort — the underlying binary analyzes the file as a whole rather than per page, so any detections returned are attached to page 0 only.
Paragraph grouping uses spatial heuristics. For multi-column layouts (magazine spreads, wiki pages with side panels) the heuristic can collapse the whole page into a single paragraph. When that happens, fall back to textBlocks[] and reconstruct from the bounding boxes.
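
The fallback in the last note can be sketched for the simplest case, a two-column page. This assumes bbox.y grows downward from the top of the page (consistent with the sample above, where the header sits at y = 0.04, but not confirmed) and that the gutter sits near the horizontal midpoint. `twoColumnOrder` and `splitX` are hypothetical names.

```typescript
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}
interface TextBlock {
  text: string;
  bbox: BBox;
}

/**
 * Rebuild reading order for a two-column page:
 * left column top to bottom, then right column top to bottom.
 * Assumes page-local 0-1 coordinates with y increasing downward.
 */
function twoColumnOrder(blocks: TextBlock[], splitX = 0.5): string {
  const centerX = (b: TextBlock): number => b.bbox.x + b.bbox.width / 2;
  const byY = (a: TextBlock, b: TextBlock): number => a.bbox.y - b.bbox.y;
  // filter() returns new arrays, so sorting does not mutate the input.
  const left = blocks.filter((b) => centerX(b) < splitX).sort(byY);
  const right = blocks.filter((b) => centerX(b) >= splitX).sort(byY);
  return [...left, ...right].map((b) => b.text).join("\n");
}

// Blocks as they might arrive when the page was read row by row.
const blocks: TextBlock[] = [
  { text: "Left top", bbox: { x: 0.05, y: 0.1, width: 0.4, height: 0.05 } },
  { text: "Right top", bbox: { x: 0.55, y: 0.1, width: 0.4, height: 0.05 } },
  { text: "Left bottom", bbox: { x: 0.05, y: 0.2, width: 0.4, height: 0.05 } },
  { text: "Right bottom", bbox: { x: 0.55, y: 0.2, width: 0.4, height: 0.05 } },
];

console.log(twoColumnOrder(blocks));
// Left top
// Left bottom
// Right top
// Right bottom
```

Full-width elements such as a headline spanning both columns land in whichever column their center falls in, so treat this as a starting point rather than a general layout solver. If the coordinates turn out to use a bottom-left origin, reverse the sort.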

Configuration

All examples below use npx -y — the recommended default. No prior npm install needed; the package is fetched and cached on first run, and updates pick up automatically when the npx cache rolls over.

Claude Code

claude mcp add macos-vision-mcp -- npx -y macos-vision-mcp

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "macos-vision-mcp": {
      "command": "npx",
      "args": ["-y", "macos-vision-mcp"]
    }
  }
}

Cursor

Add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "macos-vision-mcp": {
      "command": "npx",
      "args": ["-y", "macos-vision-mcp"]
    }
  }
}

Alternative: global install

If you'd rather skip the npx cache lookup on cold starts, or you want to pin a specific version (append @<version> to the package name), install once:

npm install -g macos-vision-mcp

…then use "command": "macos-vision-mcp" (no args) in any of the JSON configs above, or claude mcp add macos-vision-mcp -- macos-vision-mcp for Claude Code. Note that global installs can break when switching Node versions with nvm / asdf / volta — re-run npm install -g after switching.
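
The npx and global-install setups differ only in the command and args fields of the server entry. The helper below is a sketch that builds either variant. `serverEntry` is a hypothetical name, and passing MACOS_VISION_SKIP_DOWNLOAD through an `env` key assumes your client's config format supports per-server environment variables.

```typescript
type LaunchMode = "npx" | "global";

interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

/** Build the mcpServers entry for either launch style described above. */
function serverEntry(mode: LaunchMode, env?: Record<string, string>): McpServerEntry {
  const entry: McpServerEntry =
    mode === "npx"
      ? { command: "npx", args: ["-y", "macos-vision-mcp"] }
      : { command: "macos-vision-mcp" }; // global install: no args
  // Attach env only when at least one variable is given (assumed `env` key).
  if (env && Object.keys(env).length > 0) entry.env = env;
  return entry;
}

// Global-install variant of the Claude Desktop / Cursor config.
const globalConfig = {
  mcpServers: { "macos-vision-mcp": serverEntry("global") },
};
console.log(JSON.stringify(globalConfig, null, 2));

// npx variant that forces local compilation of the Swift helpers.
const localBuild = serverEntry("npx", { MACOS_VISION_SKIP_DOWNLOAD: "1" });
console.log(JSON.stringify(localBuild));
```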

Support

If macos-vision-mcp saved you tokens or kept a document on your Mac, consider starring the repo — it helps others find it.

Contributing

Contributions are welcome. Please follow Conventional Commits for commit messages — this project uses release-it with @release-it/conventional-changelog to automate releases.

git clone <repo>
cd macos-vision-mcp
npm install
npm run dev   # watch mode

License

MIT — Adrian Wolczuk

Registry: active

Package: macos-vision-mcp

Transport: STDIO

Updated: Apr 10, 2026
