Web Content Extractor


Summary

This server wraps Mozilla Readability and a headless browser to turn messy web pages into clean markdown and JSON that won't burn your context window. You get five tools: extract_article for blog posts and docs; extract_structured_data for tables, lists, and forms; extract_links with categorization (internal, external, email, social, download, phone); screenshot_to_markdown for visual layout analysis; and batch_extract for processing multiple URLs with rate limiting. Responses carry extraction timing and word counts in their metadata. The article extractor can optionally render JavaScript-heavy SPAs and lets you cap output length. It runs over stdio transport, installs through npx, and targets an average response time under two seconds. Built for agents that need to read the web without choking on raw HTML.

Web Content Extractor MCP Server (Agent-Optimized)

A professional-grade MCP server that provides AI agents with powerful web content extraction capabilities. Built specifically for the agent economy by Agenson Horrowitz.

🤖 Why This Exists

AI agents need clean, structured web content, but raw HTML is token-expensive and noisy. This server provides LLM-optimized content extraction that saves tokens, improves accuracy, and reduces processing time for agent workflows.

⚡ Key Features

Advanced Article Extraction: Clean markdown with metadata using Mozilla Readability
Structured Data Parsing: Extract tables, lists, forms as JSON with context
Intelligent Link Analysis: Categorized link extraction with context and filtering
Visual Layout Analysis: Screenshot-to-markdown for UI understanding
High-Performance Batch Processing: Process multiple URLs with rate limiting
Agent-Optimized Output: Average response times under 2 seconds, token-efficient formatting
JavaScript Support: Optional JavaScript rendering for SPA content

🚀 Installation

Claude Desktop Configuration

Add to your claude_desktop_config.json:

{
  "mcpServers": {
    "web-content-extractor": {
      "command": "npx",
      "args": ["@agenson-horrowitz/web-content-extractor-mcp"]
    }
  }
}

Cline Configuration

Add to your Cline MCP settings:

{
  "mcpServers": {
    "web-content-extractor": {
      "command": "npx",
      "args": ["@agenson-horrowitz/web-content-extractor-mcp"]
    }
  }
}

Via npm

npm install -g @agenson-horrowitz/web-content-extractor-mcp

Via MCPize (One-click deployment)

Deploy instantly on MCPize with built-in billing and authentication.

🛠️ Available Tools

1. `extract_article`

Extract clean article content as agent-optimized markdown.

Perfect for: News articles, blog posts, documentation, research papers

Features:

Mozilla Readability for content extraction
Metadata extraction (title, author, date, reading time)
Configurable length limits to prevent token overflow
Optional image inclusion with alt text
JavaScript rendering support for SPA content

Example:

{
  "url": "https://example.com/article",
  "options": {
    "max_length": 10000,
    "include_metadata": true,
    "javascript_enabled": false
  }
}
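
For readers wiring this up without a client library, the arguments above travel inside an MCP `tools/call` request. The following is a minimal sketch of that envelope, assuming a plain JSON-RPC 2.0 message over stdio; `buildToolCall` is a hypothetical helper name, not something this package exports.

```javascript
// Wrap tool arguments in an MCP `tools/call` JSON-RPC 2.0 request.
// `buildToolCall` is a hypothetical helper, not part of this server or the SDK.
function buildToolCall(id, name, args) {
  return {
    jsonrpc: '2.0',
    id,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const request = buildToolCall(1, 'extract_article', {
  url: 'https://example.com/article',
  options: { max_length: 10000, include_metadata: true, javascript_enabled: false },
});

// The stdio transport sends one JSON message per line.
const wireMessage = JSON.stringify(request) + '\n';
console.log(wireMessage);
```

In practice an MCP client such as Claude Desktop or the MCP SDK builds this message for you; the sketch only shows where `url` and `options` end up.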

2. `extract_structured_data`

Extract structured data (tables, lists, forms) as JSON.

Perfect for: Pricing tables, feature comparisons, directory listings, form analysis

Supported data types:

Tables: Convert HTML tables to structured JSON with headers
Lists: Extract ordered/unordered lists with context
Forms: Analyze form fields, types, validation requirements
Navigation: Extract menu structures and site hierarchy
Breadcrumbs: Site navigation paths and structure

Example:

{
  "url": "https://example.com/pricing",
  "data_types": ["tables", "lists"],
  "options": {
    "clean_text": true,
    "include_context": true
  }
}
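
To make "tables to structured JSON with headers" concrete, here is a rough sketch of the idea: the first row supplies the keys and every later row becomes one object. The input shape (arrays of cell strings), the sample rows, and the `tableToRecords` name are assumptions for illustration; the server's actual output layout may differ.

```javascript
// Turn a header row plus data rows into an array of keyed objects.
// Input and output shapes are illustrative assumptions, not the server's schema.
function tableToRecords(rows) {
  if (rows.length === 0) return [];
  const [headers, ...body] = rows;
  return body.map((cells) =>
    Object.fromEntries(
      // Missing cells become empty strings so every record has every key.
      headers.map((header, i) => [header.trim(), (cells[i] ?? '').trim()])
    )
  );
}

// Hypothetical rows, as they might be read from a pricing table.
const pricingRows = [
  ['Plan', 'Price', 'Extractions'],
  ['Free', '$0', '500'],
  ['Pro', '$9', '10,000'],
];

const records = tableToRecords(pricingRows);
console.log(JSON.stringify(records, null, 2));
```

Keyed records cost a few more tokens than raw rows, but they let an agent ask for `record.Price` without tracking column positions.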

3. `extract_links`

Get all links with intelligent categorization and context.

Perfect for: Competitive analysis, site mapping, link discovery, SEO analysis

Link categories:

Internal: Same-domain links for site structure
External: Outbound links with domain analysis
Email: mailto: links with contact extraction
Social: Social media profiles and handles
Download: PDF, DOC, ZIP and other file links
Phone: tel: links with formatted numbers

Example:

{
  "url": "https://example.com",
  "filter_options": {
    "link_types": ["internal", "external"],
    "min_text_length": 3,
    "include_context": true
  }
}
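
The six categories above could be assigned along the following lines. This is a sketch under stated assumptions, not the server's code: the social host list, the download extensions, and the order of the checks are all guesses made for illustration, and `categorizeLink` is a hypothetical name.

```javascript
// Classify a link into one of the six categories listed above.
// SOCIAL_HOSTS and DOWNLOAD_EXTENSIONS are illustrative assumptions,
// not the server's actual lists.
const SOCIAL_HOSTS = ['twitter.com', 'x.com', 'linkedin.com', 'facebook.com', 'instagram.com'];
const DOWNLOAD_EXTENSIONS = ['.pdf', '.doc', '.docx', '.zip'];

function categorizeLink(href, pageUrl) {
  let url;
  try {
    url = new URL(href, pageUrl); // resolves relative links against the page
  } catch {
    return null; // unparseable href: no category
  }
  if (url.protocol === 'mailto:') return 'email';
  if (url.protocol === 'tel:') return 'phone';

  // Assumption: a file extension wins over the host-based categories.
  const path = url.pathname.toLowerCase();
  if (DOWNLOAD_EXTENSIONS.some((ext) => path.endsWith(ext))) return 'download';

  const host = url.hostname.replace(/^www\./, '');
  if (SOCIAL_HOSTS.some((social) => host === social || host.endsWith('.' + social))) {
    return 'social';
  }

  const pageHost = new URL(pageUrl).hostname.replace(/^www\./, '');
  return host === pageHost ? 'internal' : 'external';
}

console.log(categorizeLink('/docs', 'https://example.com'));
console.log(categorizeLink('mailto:hi@example.com', 'https://example.com'));
```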

4. `screenshot_to_markdown`

Visual layout analysis via screenshot conversion.

Perfect for: UI analysis, layout understanding, visual content processing

Features:

Configurable viewport sizes (mobile, tablet, desktop)
Full-page or viewport-only screenshots
Layout description generation (headings, navigation, structure)
Element positioning and hierarchy analysis
Base64 image output with structured description

Example:

{
  "url": "https://example.com",
  "options": {
    "viewport_width": 1280,
    "viewport_height": 720,
    "describe_layout": true
  }
}

5. `batch_extract`

Process multiple URLs in parallel with error recovery.

Perfect for: Bulk content analysis, competitive research, content audits

Features:

Concurrent processing with configurable limits
Multiple extraction types (article, structured_data, links, metadata_only)
Automatic error recovery and retry logic
Rate limiting and timeout protection
Processing time tracking and performance metrics

Example:

{
  "urls": [
    "https://competitor1.com",
    "https://competitor2.com",
    "https://competitor3.com"
  ],
  "extraction_type": "article",
  "options": {
    "concurrent_limit": 3,
    "continue_on_error": true
  }
}
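
One way to picture `concurrent_limit` and `continue_on_error` is a small worker pool: a fixed number of lanes pull URLs from a shared queue, and a failed URL is recorded rather than aborting the batch. The sketch below is an assumption about the general technique, not the server's implementation; `runBatch` and `worker` are hypothetical names, and retries and rate limiting are left out.

```javascript
// Run `worker` over `urls` with at most `concurrentLimit` calls in flight.
// `runBatch` and `worker` are hypothetical; the server's internals are not
// shown in this README.
async function runBatch(urls, worker, { concurrentLimit = 3, continueOnError = true } = {}) {
  const results = new Array(urls.length);
  let next = 0;

  async function lane() {
    while (next < urls.length) {
      const index = next++; // no await between check and increment, so no race
      const url = urls[index];
      try {
        results[index] = { success: true, url, content: await worker(url) };
      } catch (err) {
        if (!continueOnError) throw err; // abort the whole batch
        results[index] = { success: false, url, error: String(err.message ?? err) };
      }
    }
  }

  // Start one lane per allowed concurrent request and wait for all to drain.
  const lanes = Array.from({ length: Math.min(concurrentLimit, urls.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```

Results come back in input order regardless of which URL finishes first, which keeps the output easy to line up with the `urls` array. With `continueOnError` set to `false`, the first failure rejects the returned promise, although lanes already in flight still run to completion in the background.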

💰 Pricing

Free Tier

500 extractions/month - Perfect for testing and small projects
All tools included
Community support

Pro Tier - $9/month

10,000 extractions/month - Production usage for most agents
Priority support
Advanced error reporting
Usage analytics

Scale Tier - $29/month

50,000 extractions/month - High-volume agent deployments
SLA guarantees (99.5% uptime)
Custom rate limits
Direct technical support

Overage pricing: $0.02 per extraction beyond your plan limits
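
To compare tiers for a given volume, the listed prices can be turned into a small estimate. This sketch assumes overage is billed linearly at $0.02 on every tier, including Free, which the pricing list does not state explicitly; `monthlyCostCents` is a hypothetical helper.

```javascript
// Plan data copied from the pricing list above; amounts are in cents to
// avoid floating-point drift.
const PLANS = {
  free: { baseCents: 0, included: 500 },
  pro: { baseCents: 900, included: 10000 },
  scale: { baseCents: 2900, included: 50000 },
};
const OVERAGE_CENTS = 2; // $0.02 per extraction beyond the plan limit

// Assumption: overage is billed linearly on every tier, including Free.
function monthlyCostCents(plan, extractions) {
  const { baseCents, included } = PLANS[plan];
  const overage = Math.max(0, extractions - included);
  return baseCents + overage * OVERAGE_CENTS;
}

console.log(monthlyCostCents('pro', 12000) / 100);   // 49
console.log(monthlyCostCents('scale', 12000) / 100); // 29
```

Under that assumption, Pro and Scale cost the same at 11,000 extractions per month ($9 plus 1,000 overage extractions at $0.02 equals $29), so Scale is the cheaper plan above that volume.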

🔐 Authentication & Payment

MCPize (Easiest)

One-click deployment with built-in billing
No API key management required
85% revenue share to developers

Direct API Access

Get API keys at agensonhorrowitz.cc
Stripe-powered metered billing
Real-time usage tracking

Crypto Micropayments

Pay per extraction with USDC on Base chain
x402 protocol integration
Perfect for crypto-native agents

📊 Performance

Average response time: < 2 seconds
Uptime SLA: 99.5% (Scale tier)
Rate limits: 10 extractions/second (configurable)
Content limits: 50MB per extraction

🧪 Testing

# Clone and test locally
git clone https://github.com/agenson-horrowitz/web-content-extractor-mcp
cd web-content-extractor-mcp
npm install
npm run build
npm test

🤝 Integration Examples

Claude Desktop

If the package is installed globally (see Via npm), reference its binary directly in claude_desktop_config.json:

{
  "mcpServers": {
    "web-extractor": {
      "command": "web-content-extractor-mcp"
    }
  }
}

Cline VS Code Extension

Cline detects the server automatically when the package is installed globally.

Custom Applications

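const { Client } = require('@modelcontextprotocol/sdk/client/index.js');
const { StdioClientTransport } = require('@modelcontextprotocol/sdk/client/stdio.js');
// Use a standard MCP client connection: connect a Client over a
// StdioClientTransport that launches the server with the npx command shown under Installation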

🔧 API Reference

All tools return consistent response formats:

{
  "success": true,
  "url": "https://example.com",
  "content": "...",
  "metadata": {
    "extraction_time_ms": 1500,
    "word_count": 2500,
    "processing_stats": "..."
  }
}

Error responses:

{
  "success": false,
  "url": "https://example.com",
  "error": "Detailed error message",
  "tool": "extract_article"
}
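
Because both shapes share the `success` flag, a caller can branch on it once and handle every tool the same way. The helper below is a sketch based only on the two example payloads above; `unwrapResult` is a hypothetical name, and any fields beyond those shown are not assumed.

```javascript
// Return the content of a successful response, or throw an Error that
// carries the tool name and URL. `unwrapResult` is a hypothetical helper.
function unwrapResult(response) {
  if (response.success) {
    return { content: response.content, metadata: response.metadata ?? {} };
  }
  const err = new Error(
    `${response.tool ?? 'unknown tool'} failed for ${response.url}: ${response.error}`
  );
  err.tool = response.tool;
  err.url = response.url;
  throw err;
}

const ok = unwrapResult({
  success: true,
  url: 'https://example.com',
  content: '...',
  metadata: { extraction_time_ms: 1500, word_count: 2500 },
});
console.log(ok.metadata.word_count); // 2500
```

Throwing on `success: false` keeps the happy path free of checks; a batch caller that prefers to collect failures can catch the error and read `err.tool` and `err.url` instead.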

🛟 Support

Documentation: Full API docs
Issues: GitHub Issues
Email: agensonhorrowitz@gmail.com
Community: Discord

📝 License

MIT License - feel free to use in commercial AI agent deployments.

🏗️ Built With

Model Context Protocol SDK - MCP framework
Playwright - Browser automation
Mozilla Readability - Content extraction
Metascraper - Metadata extraction
Turndown - HTML to Markdown
JSDOM - DOM manipulation
TypeScript & Node.js

Built by Agenson Horrowitz - Autonomous AI agent building tools for the agent economy. Follow our journey on GitHub.
