MarkGrab

STDIO · registry active

Summary

This server wraps the markgrab Python library to turn arbitrary URLs into clean markdown that Claude can work with directly. It handles HTML articles (with content density filtering), YouTube transcripts with timestamps, PDFs, and DOCX files. Extraction is async-first: it tries httpx for static pages and falls back to Playwright when JavaScript rendering is needed. Reach for it when you want Claude to read web content without copying and pasting by hand, or when you need to process documents linked in a conversation. The CLI and Python API expose options such as a maximum character limit, forced browser rendering, and an opt-in stealth mode that reduces bot detection.

MarkGrab

Korean documentation · llms.txt

Universal web content extraction — any URL to LLM-ready markdown.

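import asyncio
from markgrab import extract

async def main():
    # extract() is a coroutine, so it has to be awaited inside an event loop
    result = await extract("https://example.com/article")
    print(result.markdown)    # clean markdown
    print(result.title)       # "Article Title"
    print(result.word_count)  # 1234
    print(result.language)    # "en"

asyncio.run(main())
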

Features

HTML — BeautifulSoup + content density filtering (removes nav, sidebar, ads)
YouTube — transcript extraction with timestamps
PDF — text extraction with page structure
DOCX — paragraph and heading extraction
Auto-fallback — tries lightweight httpx first, falls back to Playwright for JS-heavy pages
Async-first — built on httpx and Playwright async APIs

Install

pip install markgrab

Optional extras for specific content types:

pip install "markgrab[browser]"    # Playwright for JS-rendered pages
pip install "markgrab[youtube]"    # YouTube transcript extraction
pip install "markgrab[pdf]"       # PDF text extraction
pip install "markgrab[docx]"      # DOCX text extraction
pip install "markgrab[all]"       # everything
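
Before relying on a content type, it can help to check which optional dependencies are importable in the current environment. The sketch below is not part of MarkGrab: the mapping from extras to import names is an assumption based on the libraries this README says the project is built with, and `available_extras` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed mapping from markgrab extras to the import name of the library
# each extra most likely installs. This is a guess, not taken from the package.
EXTRA_MODULES = {
    "browser": "playwright",
    "youtube": "youtube_transcript_api",
    "pdf": "pdfplumber",
    "docx": "docx",  # python-docx is imported as "docx"
}


def available_extras() -> dict[str, bool]:
    """Report which optional dependencies can be imported here."""
    return {
        extra: find_spec(module) is not None
        for extra, module in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        hint = "" if ok else f'  (pip install "markgrab[{extra}]")'
        print(f"{extra:8} {'installed' if ok else 'missing'}{hint}")
```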

Usage

Python API

import asyncio
from markgrab import extract

async def main():
    # HTML (auto-detects content type)
    result = await extract("https://example.com/article")

    # YouTube transcript
    result = await extract("https://youtube.com/watch?v=dQw4w9WgXcQ")

    # PDF
    result = await extract("https://arxiv.org/pdf/1706.03762")

    # Options
    result = await extract(
        "https://example.com",
        max_chars=30_000,       # limit output length (default: 50K)
        use_browser=True,       # force Playwright rendering
        stealth=True,           # anti-bot stealth scripts (opt-in)
        timeout=60.0,           # request timeout in seconds
        proxy="http://proxy:8080",
    )

asyncio.run(main())
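
Because `extract` is a coroutine, several URLs can be processed concurrently. The following is a minimal sketch, not a MarkGrab feature: `extract_many` is a hypothetical helper that takes the extractor as an argument, caps concurrency with a semaphore, and records per-URL failures instead of aborting the batch.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def extract_many(
    urls: list[str],
    extract: Callable[..., Awaitable[Any]],
    concurrency: int = 4,
    **options: Any,
) -> dict[str, Any]:
    """Run `extract` over many URLs, at most `concurrency` at a time.

    Returns a dict mapping each URL to its result, or to the exception
    raised for that URL, so one failure does not cancel the others.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def one(url: str) -> Any:
        async with semaphore:
            return await extract(url, **options)

    # return_exceptions=True keeps failed URLs in the output as exceptions
    outcomes = await asyncio.gather(
        *(one(url) for url in urls), return_exceptions=True
    )
    return dict(zip(urls, outcomes))
```

With MarkGrab installed, a call might look like `asyncio.run(extract_many(urls, extract, max_chars=30_000))` after `from markgrab import extract`. Keep the concurrency low when the URLs share a host, since MarkGrab does no throttling of its own.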

CLI

markgrab https://example.com                     # markdown output
markgrab https://example.com -f text             # plain text
markgrab https://example.com -f json             # structured JSON
markgrab https://example.com --browser           # force browser rendering
markgrab https://example.com --max-chars 10000   # limit output

ExtractResult

result.title        # page title
result.text         # plain text
result.markdown     # LLM-ready markdown
result.word_count   # word count
result.language     # detected language ("en", "ko", ...)
result.content_type # "article", "video", "pdf", "docx"
result.source_url   # final URL (after redirects)
result.metadata     # extra metadata (video_id, page_count, etc.)
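
One way to use these fields is to put a short provenance header in front of the markdown before handing it to a model. This is an illustrative sketch only: `ResultStub` is a hypothetical stand-in that carries the documented field names so the example runs without MarkGrab, and `to_prompt_block` is not part of the library.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResultStub:
    """Hypothetical stand-in with the fields documented for ExtractResult."""

    title: str
    text: str
    markdown: str
    word_count: int
    language: str
    content_type: str
    source_url: str
    metadata: dict[str, Any] = field(default_factory=dict)


def to_prompt_block(result: Any, max_words: int = 2000) -> str:
    """Render a result as a provenance header followed by its markdown."""
    header = (
        f"Title: {result.title}\n"
        f"Source: {result.source_url}\n"
        f"Type: {result.content_type} | Language: {result.language} | "
        f"Words: {result.word_count}"
    )
    # Flag long documents so the caller can decide whether to trim them
    note = ""
    if result.word_count > max_words:
        note = "\n(Note: document exceeds the word budget.)"
    return f"{header}{note}\n\n{result.markdown}"
```

Any object with the same attribute names works, so a real result returned by `extract` could be passed in directly.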

How it works

flowchart TD
    A["🔗 URL Input"] --> B{"Content\nType?"}
    B -->|"HTML"| C["HTTP fetch\n(httpx)"]
    C --> D{"JS\nrequired?"}
    D -->|"no"| E["HTML Parser\n→ clean markdown"]
    D -->|"yes"| F["Playwright\nfallback"]
    F --> E
    B -->|"YouTube"| G["Transcript API\n→ timestamped markdown"]
    B -->|"PDF"| H["PDF Parser\n→ structured markdown"]
    B -->|"DOCX"| I["DOCX Parser\n→ markdown"]
    E --> J["✅ LLM-ready\nMarkdown"]
    G --> J
    H --> J
    I --> J

For HTML pages, if the initial httpx fetch yields fewer than 50 words, MarkGrab automatically retries with Playwright to handle JavaScript-rendered content.
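
The retry rule can be sketched as follows. This is an approximation under stated assumptions, not MarkGrab's code: the two fetchers are passed in as hypothetical callables, and counting words with a `\w+` regular expression is a guess at how "words" are measured.

```python
import asyncio
import re
from typing import Awaitable, Callable

MIN_WORDS = 50  # threshold stated in the README

Fetcher = Callable[[str], Awaitable[str]]


def count_words(text: str) -> int:
    """Count runs of word characters (an assumed definition of a word)."""
    return len(re.findall(r"\w+", text))


async def fetch_with_fallback(
    url: str, fetch_static: Fetcher, fetch_rendered: Fetcher
) -> tuple[str, str]:
    """Return (text, method): static fetch first, browser if the text is thin."""
    text = await fetch_static(url)
    if count_words(text) >= MIN_WORDS:
        return text, "static"
    # Too little text usually means the page builds its content with JavaScript
    return await fetch_rendered(url), "browser"
```

The design keeps the cheap path as the default: the browser is launched only when the static response looks empty.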

Disclaimer

This software is provided for legitimate purposes only. By using MarkGrab, you agree to the following:

robots.txt: MarkGrab does not check or enforce robots.txt. Users are solely responsible for checking and respecting robots.txt directives and the terms of service of any website they access.
Rate limiting: MarkGrab does not include built-in rate limiting or request throttling. Users must implement their own rate limiting to avoid overloading target servers. Abusive request patterns may violate applicable laws and website terms of service.
YouTube transcripts: YouTube transcript extraction relies on the third-party youtube-transcript-api library, which uses YouTube's internal (unofficial) caption API. This may not comply with YouTube's Terms of Service. Use at your own discretion and risk.
Stealth mode: The optional stealth=True feature modifies browser fingerprinting signals to reduce bot detection. This feature is intended for legitimate use cases such as testing, research, and accessing content that is publicly available to regular browser users. Users are responsible for ensuring their use complies with applicable laws and the terms of service of target websites.
Legal compliance: Users are responsible for ensuring that their use of MarkGrab complies with all applicable laws, including but not limited to the Computer Fraud and Abuse Act (CFAA), the Digital Millennium Copyright Act (DMCA), GDPR, and equivalent legislation in their jurisdiction.
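
Since robots.txt checks and throttling are left to the caller, a small wrapper around your own calls can cover both. The sketch below uses only the Python standard library. `USER_AGENT`, `allowed_by_robots`, and `HostThrottle` are hypothetical names, and the robots.txt text is assumed to have been fetched already.

```python
import asyncio
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # hypothetical user-agent token


def allowed_by_robots(url: str, robots_txt: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against the text of an already-fetched robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._next_slot: dict[str, float] = {}

    async def wait(self, url: str) -> None:
        host = urlsplit(url).netloc
        now = time.monotonic()
        # No await between the read and the write, so reserving a slot is
        # atomic within a single event loop.
        slot = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = slot + self.min_interval
        await asyncio.sleep(slot - now)
```

A caller would check `allowed_by_robots(...)` and then `await throttle.wait(url)` before each `extract(url)` call. The one-second default is an arbitrary starting point, not a recommendation from the project.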

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND. See the LICENSE file for the full MIT license text.

Acknowledgments

MarkGrab builds on excellent open-source work and well-established techniques:

puppeteer-extra-plugin-stealth — stealth evasion patterns (webdriver removal, plugin mocking, WebGL spoofing) that inspired the opt-in anti_bot/stealth.py module
Mozilla Readability — content area detection priority (article > main > body) and link density filtering concepts used in the density filter
Boilerpipe (Kohlschütter et al., 2010) — the academic origin of link density ratio algorithms for boilerplate removal
Jina Reader — validated the market need for URL-to-markdown extraction; MarkGrab aims to be a lightweight, self-hosted alternative
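
To make the link density idea concrete: a block whose text is mostly anchor text is likely navigation, while article prose has a low ratio. The sketch below computes that ratio with the standard library's `html.parser` rather than BeautifulSoup, so it is a simplified illustration of the concept and not MarkGrab's density filter. `link_density` is a hypothetical name, and any cutoff applied to the ratio would be a tuning choice.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Count text characters overall and inside <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Fraction of a fragment's text characters that sit inside links."""
    counter = _TextCounter()
    counter.feed(html_fragment)
    counter.close()
    if counter.total_chars == 0:
        return 0.0
    return counter.link_chars / counter.total_chars
```

A menu made only of links scores 1.0, while a paragraph with a single short link scores well under 0.2, which is why a ratio threshold separates the two reasonably well.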

Built with httpx, BeautifulSoup, markdownify, Playwright, youtube-transcript-api, pdfplumber, and python-docx.

Used in

newswatch — RSS news monitoring pipeline (feedkit → markgrab → embgrep → diffgrab)
watchdeck — Web page monitoring with visual diffs and safety guards

License

MIT

Part of the QuartzUnit ecosystem — composable Python libraries for data collection, extraction, search, and AI agent safety.
