MinerU is a parsing API that converts URLs into clean Markdown with structured outputs. Point it at WeChat articles, PDFs, Office docs, or images and it handles layout, tables, formulas, and OCR better than basic web scrapers. You get a zip with extracted Markdown plus assets. The skill wraps the official mineru.net API with two scripts: an MCP-style batch parser that outputs JSON contracts, and a low-level single-URL extractor. It auto-picks between the HTML-optimized model and the general pipeline based on file extension. Useful when web_fetch gives you garbage or when you need high-fidelity extraction from paywalled or anti-bot pages, though it won't magically bypass authentication.
npx -y skills add blessonism/openclaw-search-skills --skill mineru-extract --agent claude-codeInstalls into .claude/skills of the current project.
Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
We align to the MinerU MCP mental model, but we do not run an MCP server.
scripts/mineru_parse_documents.py
--file-sources (comma/newline-separated){ ok, items, errors }scripts/mineru_extract.pyAuth:
MINERU_TOKEN (Bearer token from mineru.net)Default model heuristic:
.pdf/.doc/.ppt/.png/.jpg → pipelineMinerU-HTML (best for HTML pages like WeChat articles)Put secrets in skill root .env (do not paste into chat outputs):
# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
MCP-style wrapper (returns JSON, optionally includes markdown text):
python3 mineru-extract/scripts/mineru_parse_documents.py \
--file-sources "<URL1>\n<URL2>" \
--language ch \
--enable-ocr \
--model-version MinerU-HTML
If you want the markdown content inline in the JSON (can be large):
python3 mineru-extract/scripts/mineru_parse_documents.py \
--file-sources "<URL>" \
--model-version MinerU-HTML \
--emit-markdown --max-chars 20000
Low-level (single URL, print markdown to stdout):
python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
The script always downloads + extracts the MinerU result zip to:
~/.openclaw/workspace/mineru/<task_id>/
It writes:
result.zipIt prints a JSON summary to stderr with paths:
task_id, full_zip_url, out_dir, markdown_path--model: pipeline | vlm | MinerU-HTML (HTML requires MinerU-HTML)--ocr/--no-ocr: enable OCR (effective for pipeline/vlm)--table/--no-table: table recognition--formula/--no-formula: formula recognition--language ch|en|...--page-ranges "2,4-6" (non-HTML)--timeout 600 / --poll-interval 2err_msg and keep an original-source link in outputs.juliusbrussee/caveman
mattpocock/skills
shadcn/improve
obra/superpowers
forrestchang/andrej-karpathy-skills
vercel-labs/skills