This is a smart URL-to-Markdown converter that knows when to take shortcuts and when to bring out the heavy machinery. It tries a cheap web_fetch first, and if that hits anti-scraping walls or returns garbage (common with WeChat articles on mp.weixin.qq.com), it falls back to the MinerU API for high-fidelity extraction. The output is a structured JSON contract with source URLs, the cleaned Markdown, and file paths so you can always trace where content came from. Built for workflows where you need reliable extraction without babysitting failed requests or dealing with "please open in WeChat client" dead ends. The domain whitelist for known problem sites is a nice touch.
npx -y skills add blessonism/openclaw-search-skills --skill content-extract --agent claude-code
Installs into .claude/skills of the current project.
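Goal: turn "give me a URL → produce readable Markdown plus a traceable entry point" into a single unified entry point that every downstream business skill (github-explorer, writing skills, daily reports, and so on) can reuse.
Core principles (inspired by the Excel Skill teardown article you shared):
Input: url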
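Domain whitelist: check the URL's domain against references/domain-whitelist.md; on a hit, use MinerU with model_version=MinerU-HTML.
Probe: fetch the page with web_fetch(url).
Fallback check: the conditions that trigger the MinerU fallback are listed in references/heuristics.md.
MinerU: call skills/mineru-extract/scripts/mineru_parse_documents.py with model_version=MinerU-HTML.
Whether the probe or MinerU is used, the same structure is returned: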
{
  "ok": true,
  "source_url": "...",
  "engine": "web_fetch",
  "markdown": "...",
  "artifacts": {
    "out_dir": "...",
    "markdown_path": "...",
    "zip_path": "..."
  },
  "sources": [
    "original URL",
    "(if MinerU was used) MinerU full_zip_url",
    "(if MinerU was used) local markdown_path"
  ],
  "notes": ["any important limitation / failure reason / suggested next step"]
}
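As an illustration of the contract above, here is a minimal Python sketch that assembles the same structure for either engine. The function name build_result and its parameters are hypothetical, and filling unused artifact fields with None is an assumption; the skill itself does not say what the web_fetch path puts there.

```python
import json
from typing import List, Optional


def build_result(
    source_url: str,
    engine: str,
    markdown: str,
    out_dir: Optional[str] = None,
    markdown_path: Optional[str] = None,
    zip_path: Optional[str] = None,
    full_zip_url: Optional[str] = None,
    notes: Optional[List[str]] = None,
    ok: bool = True,
) -> dict:
    """Assemble the output contract (hypothetical helper, not part of the skill)."""
    if engine not in ("web_fetch", "mineru"):
        raise ValueError(f"unknown engine: {engine!r}")

    # The original URL always comes first; MinerU entries are added only
    # when that engine actually produced them.
    sources = [source_url]
    if engine == "mineru":
        sources += [s for s in (full_zip_url, markdown_path) if s]

    return {
        "ok": ok,
        "source_url": source_url,
        "engine": engine,
        "markdown": markdown,
        # Assumption: fields that do not apply are left as None.
        "artifacts": {
            "out_dir": out_dir,
            "markdown_path": markdown_path,
            "zip_path": zip_path,
        },
        "sources": sources,
        "notes": list(notes or []),
    }


if __name__ == "__main__":
    result = build_result("https://example.com/post", "web_fetch", "# Title")
    print(json.dumps(result, ensure_ascii=False, indent=2))
```

Building the result in one place keeps the two engines from drifting apart, so downstream skills can read sources and notes without checking which path produced them.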
Notes:
engine may be web_fetch or mineru.
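How the engine value ends up as one or the other can be sketched roughly as follows. Everything here is an assumption for illustration: the real whitelist lives in references/domain-whitelist.md and the real triggers in references/heuristics.md, neither of which is reproduced in this document. The names WHITELIST, BLOCK_MARKERS, MIN_CHARS and choose_engine are hypothetical, and the sketch assumes a whitelist hit goes straight to MinerU.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical stand-ins for references/domain-whitelist.md and
# references/heuristics.md; the actual entries and thresholds may differ.
WHITELIST = {"mp.weixin.qq.com"}
BLOCK_MARKERS = ("please open in wechat client",)
MIN_CHARS = 200


def is_whitelisted(url: str) -> bool:
    """True if the URL's host is a whitelisted domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in WHITELIST)


def probe_looks_bad(text: str) -> bool:
    """Crude check: the probe result is too short or contains a blocker phrase."""
    lowered = text.lower()
    too_short = len(text.strip()) < MIN_CHARS
    blocked = any(marker in lowered for marker in BLOCK_MARKERS)
    return too_short or blocked


def choose_engine(url: str, probe_text: Optional[str]) -> str:
    """Return "mineru" or "web_fetch" for the contract's engine field."""
    if is_whitelisted(url):
        return "mineru"
    # probe_text is None when the cheap fetch failed outright.
    if probe_text is None or probe_looks_bad(probe_text):
        return "mineru"
    return "web_fetch"
```

The probe text is passed in rather than fetched inside the function, because web_fetch is an agent tool and not a Python call.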
When MinerU is needed, use this command (it returns JSON and can inline the markdown into that JSON, which makes downstream summarization easier):
python3 mineru-extract/scripts/mineru_parse_documents.py \
--file-sources "<URL>" \
--model-version MinerU-HTML \
--emit-markdown --max-chars 20000
Path note: the command above assumes you run it from the skills installation root. If mineru-extract is installed elsewhere, substitute the actual path.
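For callers that drive the script from Python, a minimal wrapper might look like the sketch below. It uses only the flags shown in the command above; the function names, the DEFAULT_SCRIPT constant, and the assumption that the script prints its JSON to stdout are illustrative, not confirmed by the skill.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import List, Union

# Relative path taken from the command above; valid only when running from
# the skills installation root. Pass a different path otherwise.
DEFAULT_SCRIPT = Path("mineru-extract/scripts/mineru_parse_documents.py")


def build_mineru_command(
    url: str,
    script: Union[str, Path] = DEFAULT_SCRIPT,
    max_chars: int = 20000,
) -> List[str]:
    """Build the argv list for the MinerU parse script."""
    return [
        sys.executable,
        str(script),
        "--file-sources", url,
        "--model-version", "MinerU-HTML",
        "--emit-markdown",
        "--max-chars", str(max_chars),
    ]


def run_mineru(
    url: str,
    script: Union[str, Path] = DEFAULT_SCRIPT,
    max_chars: int = 20000,
) -> dict:
    """Run the script and parse its output (assumes JSON on stdout)."""
    if not Path(script).is_file():
        raise FileNotFoundError(f"MinerU script not found: {script}")
    proc = subprocess.run(
        build_mineru_command(url, script, max_chars),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.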
sources holds the original entry point plus the parsed-artifact entry points.
Write markdown_path (the local path) into sources so the result is easy to re-check later.