A straightforward Python scraper that saves web pages as HTML or Markdown with downloaded images, no browser automation required. Just requests and BeautifulSoup4. Single page mode is fast for archiving articles or documentation. Recursive mode follows links with sensible guardrails: depth limits, rate limiting, robots.txt compliance by default. Images get saved to a local folder automatically, which is nice for offline viewing. The recursive crawling is sequential rather than parallel, so it's polite but not blazing fast. Good for documentation snapshots or building local archives when you need actual files on disk, not just viewing in a browser.
npx -y skills add agentbay-ai/agentbay-skills --skill web-scraper --agent claude-code
Installs into .claude/skills of the current project.
Fetch web page content (text + images) and save as HTML or Markdown locally.
Minimal dependencies: Only requires requests and beautifulsoup4 - no browser automation.
Default behavior: Downloads images to local images/ directory automatically.
{baseDir}/scripts/scrape.py --url "https://example.com" --format html --output /tmp/page.html
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --output /tmp/page.md
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --max-depth 2 --output ~/Downloads/docs-archive
Requires Python 3.8+ and minimal dependencies:
cd {baseDir}
pip install -r requirements.txt
Or install manually:
pip install requests beautifulsoup4
Note: No browser or driver needed - uses pure HTTP requests.
Format: html or md (default: html)
Images are downloaded by default (--no-download-images to disable)
Output: an html or md file, plus an images/ folder
Save location: a path of your choice (e.g. /tmp/ or ~/Downloads/)
{baseDir}/scripts/scrape.py --url "https://docs.openclaw.ai/start/quickstart" --format html --output ~/Downloads/openclaw-quickstart.html
{baseDir}/scripts/scrape.py --url "https://en.wikipedia.org/wiki/Web_scraping" --format md --output ~/Documents/web-scraping.md
Result: Creates web-scraping.md + images/ folder with all downloaded images (text + images).
{baseDir}/scripts/scrape.py --url "https://example.com" --format md --no-download-images
Result: Only text + image URLs (not downloaded locally).
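Whether images are downloaded or left as remote URLs, the page's image references first have to be found and resolved against the page URL. The following is a minimal sketch of that step, not the code in scrape.py. It uses the standard library's html.parser instead of BeautifulSoup so it runs without dependencies, and the `ImageCollector` class and the `local_image_name` naming rule are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class ImageCollector(HTMLParser):
    """Collect absolute image URLs from <img src=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Skip missing sources and inline data: URIs.
        if src and not src.startswith("data:"):
            absolute = urljoin(self.base_url, src)
            if absolute not in self.urls:
                self.urls.append(absolute)


def local_image_name(url):
    """Hypothetical naming rule: keep the last path segment of the URL."""
    name = posixpath.basename(urlparse(url).path)
    return "images/" + (name or "image")


html = '<p><img src="/static/logo.png"><img src="diagram.svg"></p>'
collector = ImageCollector("https://example.com/docs/page.html")
collector.feed(html)
for url in collector.urls:
    print(url, "->", local_image_name(url))
```

With --no-download-images, only the resolved URLs would be kept in the output; otherwise each URL would be fetched and the reference rewritten to the local path.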
{baseDir}/scripts/scrape.py --url "https://example.com" --format html
# Saves to: example-com-{timestamp}.html
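When --output is omitted, the file name is derived from the URL, following the example-com-{timestamp}.html pattern in the comment above. A minimal sketch of one way to build such a name follows. The timestamp format and the slug rule are assumptions, since the source shows only the pattern, and `default_output_name` is a hypothetical helper.

```python
import re
import time
from urllib.parse import urlparse


def default_output_name(url, fmt, timestamp=None):
    """Build '<host-slug>-<timestamp>.<fmt>' from a URL (assumed scheme)."""
    host = urlparse(url).netloc
    # Replace every run of non-alphanumeric characters with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", host.lower()).strip("-")
    # Assumed timestamp format; the real script may use a different one.
    stamp = timestamp if timestamp is not None else time.strftime("%Y%m%d-%H%M%S")
    return f"{slug}-{stamp}.{fmt}"


print(default_output_name("https://example.com", "html", timestamp="20240101-120000"))
```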
{baseDir}/scripts/scrape.py --url "https://docs.example.com" --format md --recursive --output ~/Downloads/docs-archive
Output structure (text + images for all pages):
docs-archive/
├── index.md
├── getting-started.md
├── api/
│   ├── authentication.md
│   └── endpoints.md
└── images/              # Shared images from all pages
    ├── logo.png
    └── diagram.svg
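The tree above suggests that each page's URL path is mirrored under the output directory, with the site root saved as the index file. A minimal sketch of such a mapping is shown here. The `page_path` helper and its rules (root becomes index, a trailing .html or .htm is dropped) are assumptions for illustration, not the script's confirmed behavior.

```python
from urllib.parse import urlparse


def page_path(url, fmt="md"):
    """Map a page URL to a relative output path (assumed rules)."""
    path = urlparse(url).path.strip("/")
    if not path:
        # The site root is stored as the index file.
        return f"index.{fmt}"
    # Drop an existing HTML extension before adding the output one.
    for ext in (".html", ".htm"):
        if path.endswith(ext):
            path = path[: -len(ext)]
            break
    return f"{path}.{fmt}"


for url in (
    "https://docs.example.com/",
    "https://docs.example.com/getting-started.html",
    "https://docs.example.com/api/authentication",
):
    print(url, "->", page_path(url))
```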
{baseDir}/scripts/scrape.py \
--url "https://blog.example.com" \
--format html \
--recursive \
--max-depth 3 \
--max-pages 100 \
--output ~/Archives/blog-backup
{baseDir}/scripts/scrape.py \
--url "https://example.com" \
--format md \
--recursive \
--no-respect-robots \
--rate-limit 1.0
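The scraper honors robots.txt by default, and --no-respect-robots turns that check off. A minimal sketch of what such a check can look like with the standard library's urllib.robotparser follows. Whether scrape.py uses this module is an assumption, and the rules and user agent string are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real crawler would download /robots.txt from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, user_agent="web-scraper-sketch"):
    """Return True if the parsed rules permit fetching this URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/docs/intro"))
print(allowed("https://example.com/private/notes"))
```

A crawler that respects robots.txt would skip any URL for which `allowed` returns False.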
{baseDir}/scripts/scrape.py \
--url "https://yoursite.com" \
--format md \
--recursive \
--rate-limit 0.2
Images are downloaded to the local images/ directory (default)
Use the --no-download-images flag to keep original URLs only
Recursive mode (--recursive):
--max-depth limits how many levels deep to crawl (default: 2)
--max-pages caps total pages to prevent runaway crawls (default: 50)
--same-domain keeps crawl within starting domain (default: on)
--rate-limit adds delay between requests (default: 0.5s)
Test with --max-depth 1 --max-pages 10 first
Use --no-respect-robots only for your own sites
Keep --same-domain enabled
Use the --timeout flag for slow-loading pages (value in seconds)
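The recursive-mode guardrails (depth limit, page cap, same-domain filter, delay between requests) fit naturally into a sequential breadth-first crawl. The following is a minimal sketch under stated assumptions, not the implementation in scrape.py. The `crawl` function is hypothetical, `get_links` stands in for fetching a page and extracting its links, and treating the start page as depth 0 is an assumption.

```python
import time
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_depth=2, max_pages=50,
          same_domain=True, rate_limit=0.5, sleep=time.sleep):
    """Visit pages breadth-first, one at a time, within the given limits."""
    start_host = urlparse(start_url).netloc
    queue = deque([(start_url, 0)])   # (url, depth); start page is depth 0
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)         # delay between consecutive requests
        visited.append(url)
        if depth >= max_depth:
            continue                  # do not follow links past the limit
        for link in get_links(url):
            if link in seen:
                continue
            if same_domain and urlparse(link).netloc != start_host:
                continue              # stay on the starting domain
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# A tiny in-memory "site" so the sketch runs without network access.
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b",
                             "https://other.org/x"],
    "https://example.com/a": ["https://example.com/a/deep"],
    "https://example.com/a/deep": ["https://example.com/a/deeper"],
}
pages = crawl("https://example.com/", lambda u: site.get(u, []),
              max_depth=1, max_pages=10, rate_limit=0)
print(pages)
```

Because pages are visited one after another with a pause in between, total crawl time grows roughly linearly with the page count, which is the trade-off for being gentle on the target server.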