This is a practical implementation for crawling documentation sites and blogs while respecting robots.txt and rate limits. It combines trafilatura for clean content extraction with BeautifulSoup for structure parsing, then converts the result to markdown for RAG ingestion. The code handles common doc frameworks such as Docusaurus and Sphinx, extracts sidebar navigation and prev/next links, and includes sitemap discovery. It also tracks visited URLs and content hashes to support incremental updates, which matters when you're maintaining a knowledge base that needs to stay current. A robots.txt checker and configurable rate limiting (1 second between requests by default) keep the crawler from being a bad citizen.
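The politeness and incremental-update ideas described above can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's actual code: the class name `PoliteCrawlState`, its methods, and the `site-crawler-sketch` user agent are hypothetical, and the robots.txt text is passed in as a string so that the example runs without network access.

```python
import hashlib
import time
from urllib.robotparser import RobotFileParser


class PoliteCrawlState:
    """Hypothetical helper: robots.txt rules, rate limiting, change detection."""

    def __init__(self, robots_txt, user_agent="site-crawler-sketch", delay=1.0):
        self.user_agent = user_agent
        self.delay = delay  # seconds between requests (1 second by default)
        self.robots = RobotFileParser()
        # Parse rules from text; a real crawler would fetch /robots.txt first.
        self.robots.parse(robots_txt.splitlines())
        self.content_hashes = {}  # url -> SHA-256 of the last extracted text
        self._last_request = None

    def allowed(self, url):
        """Return True if robots.txt permits this user agent to fetch url."""
        return self.robots.can_fetch(self.user_agent, url)

    def wait(self):
        """Sleep until at least `delay` seconds have passed since the last call."""
        if self._last_request is not None:
            remaining = self.delay - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()

    def changed(self, url, text):
        """Record the content hash; return True if the page is new or modified."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.content_hashes.get(url) == digest:
            return False
        self.content_hashes[url] = digest
        return True


state = PoliteCrawlState("User-agent: *\nDisallow: /private/\n", delay=0.2)
for url in ("https://example.com/docs/intro", "https://example.com/private/x"):
    if state.allowed(url):
        state.wait()  # call before each real HTTP request
        print("would fetch", url)
    else:
        print("skipped by robots.txt", url)
```

In a real crawl, `changed()` would be called on the extracted text of each page so that only new or modified pages are re-ingested, and the hash table would be persisted between runs.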
npx -y skills add mindmorass/reflex --skill site-crawler --agent claude-code
Installs into .claude/skills of the current project.