Mineru Extract

170 installs432 stars

Summary

MinerU is a parsing API that converts URLs into clean Markdown with structured outputs. Point it at WeChat articles, PDFs, Office docs, or images and it handles layout, tables, formulas, and OCR better than basic web scrapers. You get a zip with extracted Markdown plus assets. The skill wraps the official mineru.net API with two scripts: an MCP-style batch parser that outputs JSON contracts, and a low-level single-URL extractor. It auto-picks between the HTML-optimized model and the general pipeline based on file extension. Useful when web_fetch gives you garbage or when you need high-fidelity extraction from paywalled or anti-bot pages, though it won't magically bypass authentication.

Install to Claude Code

npx -y skills add blessonism/openclaw-search-skills --skill mineru-extract --agent claude-code

Installs into .claude/skills of the current project.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Files

SKILL.mdView on GitHub

MinerU Extract (official API)

Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.

Quick start (MCP-aligned)

We align to the MinerU MCP mental model, but we do not run an MCP server.

Primary script (MCP-style): scripts/mineru_parse_documents.py
- Input: --file-sources (comma/newline-separated)
- Output: JSON contract on stdout: { ok, items, errors }
Low-level script (single URL): scripts/mineru_extract.py

Auth:

Set MINERU_TOKEN (Bearer token from mineru.net)

Default model heuristic:

URLs ending with .pdf/.doc/.ppt/.png/.jpg → pipeline
Otherwise → MinerU-HTML (best for HTML pages like WeChat articles)

1) Configure token (skill-local)

Put secrets in skill root .env (do not paste into chat outputs):

# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net

2) Parse URL(s) → Markdown (recommended)

MCP-style wrapper (returns JSON, optionally includes markdown text):

python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>\n<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML

If you want the markdown content inline in the JSON (can be large):

python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000

Low-level (single URL, print markdown to stdout):

python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md

Output

The script always downloads + extracts the MinerU result zip to:

~/.openclaw/workspace/mineru/<task_id>/

It writes:

result.zip
extracted files (Markdown + JSON + assets)

It prints a JSON summary to stderr with paths:

task_id, full_zip_url, out_dir, markdown_path

Parameters (common)

--model: pipeline | vlm | MinerU-HTML (HTML requires MinerU-HTML)
--ocr/--no-ocr: enable OCR (effective for pipeline/vlm)
--table/--no-table: table recognition
--formula/--no-formula: formula recognition
--language ch|en|...
--page-ranges "2,4-6" (non-HTML)
--timeout 600 / --poll-interval 2

Failure modes & fallbacks

MinerU may fail to fetch some URLs (anti-bot / geo / login).
- Fallback: provide an HTML file or a PDF/long screenshot; then implement “upload + parse” flow with MinerU batch upload endpoints.
- Always report the failing URL + MinerU err_msg and keep an original-source link in outputs.

References

MinerU API docs: https://mineru.net/apiManage/docs
MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

First SeenJun 3, 2026

View on GitHub

Quick start (MCP-aligned)

We align to the MinerU MCP mental model, but we do not run an MCP server.

Primary script (MCP-style): scripts/mineru_parse_documents.py

Input: --file-sources (comma/newline-separated)
Output: JSON contract on stdout: { ok, items, errors }

Low-level script (single URL): scripts/mineru_extract.py

Auth:

Set MINERU_TOKEN (Bearer token from mineru.net)

Default model heuristic:

URLs ending with .pdf/.doc/.ppt/.png/.jpg → pipeline

Otherwise → MinerU-HTML (best for HTML pages like WeChat articles)

1) Configure token (skill-local)

Put secrets in skill root .env (do not paste into chat outputs):

# In the mineru-extract skill directory: .env MINERU_TOKEN=your_token_here MINERU_API_BASE=https://mineru.net

2) Parse URL(s) → Markdown (recommended)

MCP-style wrapper (returns JSON, optionally includes markdown text):

python3 mineru-extract/scripts/mineru_parse_documents.py \ --file-sources "<URL1>\n<URL2>" \ --language ch \ --enable-ocr \ --model-version MinerU-HTML

If you want the markdown content inline in the JSON (can be large):

python3 mineru-extract/scripts/mineru_parse_documents.py \ --file-sources "<URL>" \ --model-version MinerU-HTML \ --emit-markdown --max-chars 20000

Low-level (single URL, print markdown to stdout):

python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md

Mineru Extract

Install to Claude Code

MinerU Extract (official API)

Quick start (MCP-aligned)

1) Configure token (skill-local)

2) Parse URL(s) → Markdown (recommended)

Output

Parameters (common)

Failure modes & fallbacks

References

Mineru Extract

Install to Claude Code

MinerU Extract (official API)

Quick start (MCP-aligned)

1) Configure token (skill-local)

2) Parse URL(s) → Markdown (recommended)

Output

Parameters (common)

Failure modes & fallbacks

References

Recommended

Recommended