Web Crawler

1.8k installs13 stars

Summary

This is the fallback scraper you reach for when web_fetch hits a wall or you need social media data. It routes to ScrapeCreators for 27+ platforms (TikTok, Instagram, YouTube, LinkedIn, Reddit, Threads, the whole roster), SerpApi for YouTube transcripts, and Firecrawl when a page blocks basic fetching. The OpenAPI routing is smart: you specify intent (profile, posts, single video), it fetches the spec for that endpoint, then makes the call. Costs money per request, so use native fetch first and invoke this when you actually need profiles, comments, transcripts, or anti-bot evasion. The scope is huge but the use case is clear: structured extraction from platforms that don't want to be scraped.

Install to Claude Code

npx -y skills add starchild-ai-agent/official-skills --skill web-crawler --agent claude-code

Installs into .claude/skills of the current project.

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Files

SKILL.mdView on GitHub

Preferred entry: call exports.py (don't hand-roll requests)

Ready-made helpers live in skills/web-crawler/exports.py. Prefer them over writing your own proxied_get/proxied_post calls — they already inject the proxy credentials, so there is no API key to find (don't read $SCRAPECREATORS_API_KEY / $FIRECRAWL_API_KEY, don't check .env, don't ask the user).

import sys; sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import scrape_markdown, youtube_transcript, sc_get
scrape_markdown("https://example.com/article")          # Firecrawl fallback
youtube_transcript("https://youtube.com/watch?v=ID")     # ScrapeCreators
sc_get("/v1/tiktok/profile", handle="charlidamelio")     # any SC endpoint

from exports import archive_fallback                      # paywall / Firecrawl-403
archive_fallback("https://www.nytimes.com/.../article.html")  # archive snapshot

Named wrappers exist for the high-frequency actions (YouTube/TikTok transcript & video, IG/Twitter/Reddit posts, profiles, Google/Reddit search). For any other ScrapeCreators endpoint use sc_get(path, **params) — it auto-strips leading @/# from handles/hashtags. The intent-routing tables below still tell you which endpoint to pass. Pass caller_id="chat:<thread>" (or job:/preview:) for cost tracking.

Quick trigger rules (read this first)

Use this skill immediately when any of these conditions is true:

web_fetch returns HTTP 401/403/429/5xx
Response is an anti-bot challenge page (for example Cloudflare "Attention Required", "Just a moment", or challenge/captcha pages)
The page is JS-heavy and the first fetch misses required detail fields (for example publish time, author, listing code, updated time, price breakdown)
Search results do not contain the requested field and the value must be extracted from the target page itself

Fallback rule:

If ordinary fetch is blocked or incomplete, switch to Firecrawl fallback in this skill before asking the user for screenshots/manual text.

Error signatures -> action

Signature	Action
`web_fetch` HTTP 401/403/429/5xx	Call Firecrawl `POST /v2/scrape` once with `formats:["markdown","links"]` + `onlyMainContent:true`
Cloudflare/challenge page text in body	Same Firecrawl call as above
Markdown still misses key fields	Retry once with `formats:["rawHtml"]`
Firecrawl returns 403 / empty for a social media URL	Check the intent-routing tables below for a ScrapeCreators platform-specific endpoint for this domain (e.g. `sc_get('/v1/instagram/post', url=...)`, `sc_get('/v2/tiktok/video', url=...)`). If one exists, use it — these have dedicated extraction that bypasses anti-bot. If no dedicated endpoint exists, fall through to `archive_fallback` or ask the user.
Firecrawl itself returns 403 / empty (hard paywall: NYT, WSJ, Economist, FT, Bloomberg)	Call `archive_fallback(url)` — recovers full text from a web archive snapshot
Need structured data from a China app (抖音/小红书/微博/B站/京东/淘宝/1688/闲鱼/得物 etc.)	Call `apify_run()` — Apify Store has purpose-built actors for these platforms that Firecrawl/ScrapeCreators don't cover

Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)

When Firecrawl can't get the page either (it returns 403, or markdown comes back empty) the site is behind a hard paywall or aggressive WAF. Do NOT keep retrying Firecrawl. Recover the article from a web archive snapshot instead:

from exports import archive_fallback
res = archive_fallback("https://www.nytimes.com/.../article.html")
res["markdown"]      # full text, or "" if no snapshot exists anywhere
res["source"]        # "archive.today" | "wayback" | None
res["snapshot_url"]  # the snapshot that was scraped

How it works (and why this order):

archive.today first (archive.ph / archive.is mirrors). User-triggered, real-browser captures; historically preserves full text behind paywalls. Best bet for NYT/WSJ/Economist. We scrape its /newest/ snapshot via Firecrawl (archive.today has its own Cloudflare, so scrape it through Firecrawl, never web_fetch it directly).
Wayback Machine second (archive.org). Automated crawler that honors robots.txt and paywalls, so it often has NO full text for hard paywalls — but it's a good fallback for ordinary 403/Cloudflare pages that aren't paywalled.

Limitation: archives only return text someone already saved. If res["markdown"] == "", no snapshot exists — stop, tell the user, and try the outlet's official API/RSS or a different source. Do not fabricate the article.

China-app structured data (use `apify_run`)

When you need structured data from a China app — Douyin video search, Xiaohongshu notes, Weibo posts, Bilibili videos, JD/Taobao product prices, 1688 wholesale listings, Xianyu second-hand, Dewu sneakers, etc. — Firecrawl and ScrapeCreators don't cover these platforms. Use the Apify Store fallback instead. Apify is a serverless scraper marketplace with hundreds of community-maintained actors that run real browsers + proxy pools against Chinese platforms.

Auth: No user-supplied key needed. sc-proxy injects the platform Apify token automatically. The Authorization: Bearer header can be any fake value — the proxy replaces it with the real token. Do NOT read $APIFY_TOKEN from env, do NOT check .env, do NOT ask the user for an Apify key.

When to use Apify (vs Firecrawl/ScrapeCreators):

✅ China apps: 抖音, 小红书, 微博, B站, 京东, 淘宝, 1688, 闲鱼, 得物, 携程, 知乎, 豆瓣, 雪球, 快手, 爱奇艺, 优酷
✅ Southeast Asia e-commerce: Shopee, Lazada, Temu
❌ Western social media (TikTok/Instagram/YouTube/X/Reddit) → use ScrapeCreators first (cheaper)
❌ Generic web page scraping → use Firecrawl first (cheaper)
❌ Hard paywall articles → use archive_fallback (Apify doesn't help here)

How to pick an actor: The reliable-actors catalog is at output/apify_china_reliable.json (sorted by 30-day success count). Pick the top actor for the target platform. A few common ones:

Platform	Actor ID	Input key
抖音 search	`zen-studio~douyin-search-scraper`	`{"keywords": [...], "maxResultsPerQuery": N}`
小红书 search	`zen-studio~rednote-search-scraper`	`{"keywords": [...], "maxResults": N}`
小红书 note detail	`sian.agency~xiaohongshu-rednote-scraper`	`{"operation": "noteDetail", "noteId": "...", "xsecToken": "..."}`
微博 hot search	`gentle_cloud~weibo-hot-search-scraper`	`{"mode": "hot_band", "includeScores": true}`
微博 posts	`zhorex~weibo-scraper`	(see actor input schema)
B站 videos	`zhorex~bilibili-scraper`	(see actor input schema)
京东 search	`zen-studio~jd-com-search-scraper`	`{"keyword": "...", "maxProducts": N}`
京东 products	`sian.agency~jd-com-product-scraper`	`{"operation": "productSearch", "keyword": "...", "maxPages": 1}`
淘宝 products	`sian.agency~taobao-tmall-product-scraper`	`{"operation": "keywordSearch", "keyword": "...", "maxPages": 1}`
1688 wholesale	`zen-studio~1688-wholesale-scraper`	(see actor input schema)
闲鱼 search	`zen-studio~goofish-xianyu-search-scraper`	(see actor input schema)
TikTok	`clockworks~tiktok-scraper`	`{"hashtags": [...]}` or `{"profiles": [...]}`

Two-step Xiaohongshu workflow (search → note detail): The search scraper (zen-studio~rednote-search-scraper) returns only a truncated desc (~60 chars). For the full post body, run a second call with sian.agency~xiaohongshu-rednote-scraper in noteDetail mode, passing the id and xsec_token from the search result row. This is the only reliable way to get full note text for price/lodge/itinerary details.

Usage:

import sys
sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import apify_run

# Sync run (small batches, ≤100 results): blocks until done, returns result list
results = apify_run("zen-studio~douyin-search-scraper",
                     {"keywords": ["MacBook"], "maxResultsPerQuery": 5})
for r in results:
    print(r.get("text", "")[:80])

# To discover the right input fields for an unfamiliar actor, fetch its
# input-schema page via Firecrawl first:
from exports import scrape_markdown
schema_md = scrape_markdown("https://apify.com/<user>/<actor>/input-schema")

Billing: Apify uses pay-per-event billing (actor-start + per-result + add-ons), billed at 2× the real Apify cost in credits. A typical small search (5–10 results) costs $0.02–0.05. The proxy auto-calculates the charge from the response — you don't need to estimate it yourself.

Error handling:

400 run-failed → bad input (wrong field name). Fetch the actor's input-schema page and fix.
401 → proxy misconfigured (should not happen). Report to user.
Empty result [] → actor ran but found nothing. Try different keywords or another actor.
Timeout → increase timeout param (default 180s). Some actors are slow.

Cost discipline: Apify actors are more expensive than Firecrawl/ScrapeCreators. Only use Apify when the cheaper options can't get the data (China apps, structured e-commerce fields). For a single web page, always try Firecrawl first.

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Use for any request involving social media profiles, posts, videos, comments, transcripts, search, ads, trending content, or engagement metrics. Covers TikTok, Instagram, YouTube, LinkedIn, Facebook, Twitter/X, Reddit, Threads, Bluesky, Pinterest, Snapchat, Twitch, Kick, Truth Social, TikTok Shop, Google search, and link-in-bio services (Linktree, Komi, Pillar, Linkbio, Linkme, Amazon Shop).

Base URL: https://api.scrapecreators.com Auth: No user-supplied key needed. sc-proxy injects platform credentials automatically — just send the request. The x-api-key header can be any value or omitted entirely. Do NOT bail out or ask the user for a key if $SCRAPECREATORS_API_KEY looks unset; that env var is intentionally not required. Method: All endpoints use GET requests with query params. Responses are JSON.

Firecrawl — Fallback web page scraper

Only a fallback crawler for one web page when ordinary fetching fails. Use POST /v2/scrape with a single url and focused formats like markdown, html, rawHtml, links, summary, or constrained json/question/highlights extraction.

Auth: No user-supplied key needed. sc-proxy injects the Firecrawl credential automatically when you call through core.http_client.proxied_post — just send the request. Do NOT read $FIRECRAWL_API_KEY from env, do NOT check .env for it, and do NOT ask the user for a Firecrawl key if it looks unset; that env var is intentionally not required. The same proxy-injection model as ScrapeCreators applies here.

Do not use Firecrawl crawl/map/search/agent/browser endpoints. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.

ScrapeCreators — Intent routing

Map user intent to the right endpoint. Endpoint paths use the pattern /v1/platform/action.

Important: After selecting an endpoint from the tables below, fetch its OpenAPI spec at https://docs.scrapecreators.com/{path}/openapi.json for full parameter details, types, and example response before making the actual API call. For example: https://docs.scrapecreators.com/v1/tiktok/profile/openapi.json

Profiles / User Info

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/profile`	handle	`stoolpresidente`
Instagram	`/v1/instagram/profile`	handle	`jane`
YouTube	`/v1/youtube/channel`	handle, channelId, or url	`ThePatMcAfeeShow`
LinkedIn (person)	`/v1/linkedin/profile`	url	`https://www.linkedin.com/in/parrsam/`
LinkedIn (company)	`/v1/linkedin/company`	url	`https://linkedin.com/company/shopify`
Facebook	`/v1/facebook/profile`	url	`https://www.facebook.com/mantraindianfolsom`
Twitter/X	`/v1/twitter/profile`	handle	`elonmusk`
Reddit	`/v1/reddit/subreddit/details`	subreddit or url	`AskReddit`
Threads	`/v1/threads/profile`	handle	`zuck`
Bluesky	`/v1/bluesky/profile`	handle	`jay.bsky.team`
Pinterest	`/v1/pinterest/user/boards`	handle	`pinterest`
Truth Social	`/v1/truthsocial/profile`	handle	`realDonaldTrump`
Twitch	`/v1/twitch/profile`	handle	`ninja`
Snapchat	`/v1/snapchat/profile`	handle	`djkhaled`

Posts / Content Feeds

Platform	Endpoint	Primary Param	Example
TikTok videos	`/v3/tiktok/profile/videos`	handle	`stoolpresidente`
Instagram posts	`/v2/instagram/user/posts`	handle	`jane`
Instagram reels	`/v1/instagram/user/reels`	handle or user_id	`jane` or `2700692569`
Instagram highlights	`/v1/instagram/user/highlights`	handle or user_id	`jane` or `2700692569`
YouTube videos	`/v1/youtube/channel/videos`	handle or channelId	`ThePatMcAfeeShow`
YouTube shorts	`/v1/youtube/channel/shorts`	handle or channelId	`starterstory`
YouTube playlist	`/v1/youtube/playlist`	playlist_id	`PLP32wGpgzmIlInfgKVFfCwVsxgGqZNIiS`
LinkedIn posts	`/v1/linkedin/company/posts`	url	`https://linkedin.com/company/shopify`
Facebook posts	`/v1/facebook/profile/posts`	url or pageId	`https://www.facebook.com/pacemorby`
Facebook reels	`/v1/facebook/profile/reels`	url	`https://www.facebook.com/Spurs`
Facebook photos	`/v1/facebook/profile/photos`	url	`https://www.facebook.com/Spurs`
Facebook group posts	`/v1/facebook/group/posts`	url or group_id	`742354120555345`
Twitter tweets	`/v1/twitter/user/tweets`	handle	`elonmusk`
Reddit posts	`/v1/reddit/subreddit`	subreddit	`AskReddit`
Threads posts	`/v1/threads/user/posts`	handle	`zuck`
Bluesky posts	`/v1/bluesky/user/posts`	handle or user_id	`jay.bsky.team`
Truth Social posts	`/v1/truthsocial/user/posts`	handle or user_id	`realDonaldTrump`
Pinterest board	`/v1/pinterest/board`	url	`https://www.pinterest.com/...`

Single Post / Video Details

Platform	Endpoint	Primary Param	Example
TikTok	`/v2/tiktok/video`	url	`https://www.tiktok.com/@randomspamvideos25/video/7251387037834595630`
Instagram	`/v1/instagram/post`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
Instagram highlight	`/v1/instagram/user/highlight/detail`	id	`18067016518767507`
YouTube	`/v1/youtube/video`	url	`https://www.youtube.com/watch?v=Y2Ah_DFr8cw`
YouTube community post	`/v1/youtube/community-post`	url	`https://www.youtube.com/post/Ugkxvj2KoApYAXoqLWnKVr6zZe5JjeHrQeP8`
LinkedIn	`/v1/linkedin/post`	url	`https://www.linkedin.com/pulse/being-father-has-made-me-better-leader...`
Facebook	`/v1/facebook/post`	url	`https://www.facebook.com/reel/1535656380759655`
Twitter/X	`/v1/twitter/tweet`	url	`https://twitter.com/elonmusk/status/...`
Twitter/X community	`/v1/twitter/community`	url	`https://twitter.com/i/communities/...`
Twitter/X community tweets	`/v1/twitter/community/tweets`	url	`https://twitter.com/i/communities/...`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`
Threads	`/v1/threads/post`	url	`https://www.threads.net/@zuck/post/...`
Bluesky	`/v1/bluesky/post`	url	`https://bsky.app/profile/.../post/...`
Truth Social	`/v1/truthsocial/post`	url	`https://truthsocial.com/@realDonaldTrump/posts/...`
Pinterest	`/v1/pinterest/pin`	url	`https://www.pinterest.com/pin/...`
Twitch clip	`/v1/twitch/clip`	url	`https://clips.twitch.tv/...`
Kick clip	`/v1/kick/clip`	url	`https://kick.com/...`

Comments

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/video/comments`	url	`https://www.tiktok.com/@stoolpresidente/video/7499229683859426602`
Instagram	`/v2/instagram/post/comments`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
YouTube	`/v1/youtube/video/comments`	url	`https://www.youtube.com/watch?v=dQw4w9WgXcQ`
Facebook	`/v1/facebook/post/comments`	url or feedback_id	`https://www.facebook.com/reel/753347914167361`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`

Transcripts

Platform	Endpoint	Example	Note
TikTok	`/v1/tiktok/video/transcript`	`url=https://www.tiktok.com/...&lang=en`	also via `/v2/tiktok/video` with get_transcript=true
Instagram	`/v2/instagram/media/transcript`	`url=https://www.instagram.com/reel/...`	AI-powered, 10-30s, under 2min
YouTube	`/v1/youtube/video/transcript`	`url=https://www.youtube.com/watch?v=bjVIDXPP7Uk`	also included in `/v1/youtube/video` response
Facebook	`/v1/facebook/post/transcript`	`url=https://www.facebook.com/reel/...`	under 2min only
Twitter/X	`/v1/twitter/tweet/transcript`	`url=https://twitter.com/...`	AI-powered, slow

Search

Platform	Endpoint	Primary Param	Example
TikTok users	`/v1/tiktok/search/users`	query	`funny`
TikTok videos (keyword)	`/v1/tiktok/search/keyword`	query	`funny`
TikTok videos (hashtag)	`/v1/tiktok/search/hashtag`	hashtag	`fyp`
TikTok top (photos+videos)	`/v1/tiktok/search/top`	query	`funny`
Instagram reels	`/v2/instagram/reels/search`	query	`dogs`
YouTube	`/v1/youtube/search`	query	`funny`
YouTube hashtag	`/v1/youtube/search/hashtag`	hashtag	`funny`
Reddit (all)	`/v1/reddit/search`	query	`best programming languages`
Reddit (in subreddit)	`/v1/reddit/subreddit/search`	subreddit + query	`AskReddit` + `funny`
Threads posts	`/v1/threads/search`	query	`AI`
Threads users	`/v1/threads/search/users`	query	`zuck`
Pinterest	`/v1/pinterest/search`	query	`home decor`
Google	`/v1/google/search`	query	`best restaurants in NYC`

Ad Libraries

Platform	Endpoint	Primary Param	Example
Facebook ads search	`/v1/facebook/adLibrary/search/ads`	query	`running`
Facebook company ads	`/v1/facebook/adLibrary/company/ads`	pageId or companyName	`Lululemon`
Facebook ad detail	`/v1/facebook/adLibrary/ad`	id or url	`702369045530963`
Facebook find companies	`/v1/facebook/adLibrary/search/companies`	query	`Nike`
Google company ads	`/v1/google/company/ads`	domain or advertiser_id	`nike.com`
Google ad detail	`/v1/google/ad`	url	`https://adstransparency.google.com/...`
Google find advertisers	`/v1/google/adLibrary/advertisers/search`	query	`Nike`
LinkedIn ads search	`/v1/linkedin/ads/search`	company or keyword	`Shopify`
LinkedIn ad detail	`/v1/linkedin/ad`	url	`https://www.linkedin.com/ad/...`
Reddit ads search	`/v1/reddit/ads/search`	query	`gaming`
Reddit ad detail	`/v1/reddit/ad`	id	`t3_abc123`

Trending / Popular

Content	Endpoint	Param	Example
Trending feed	`/v1/tiktok/get-trending-feed`	region (required)	`US`
Popular videos	`/v1/tiktok/videos/popular`
Popular creators	`/v1/tiktok/creators/popular`
Popular hashtags	`/v1/tiktok/hashtags/popular`
Popular songs	`/v1/tiktok/songs/popular`
Song details	`/v1/tiktok/song`	clipId	`7439295283975702544`
Videos using song	`/v1/tiktok/song/videos`	clipId	`7439295283975702544`
Trending shorts (YT)	`/v1/youtube/shorts/trending`

Followers / Following / Live (TikTok only)

Type	Endpoint	Example
Following	`/v1/tiktok/user/following`	`handle=stoolpresidente`
Followers	`/v1/tiktok/user/followers`	`handle=stoolpresidente`
Audience demographics	`/v1/tiktok/user/audience` (26 credits!)	`handle=shakira`
Live stream	`/v1/tiktok/user/live`	`handle=thejustalex`

TikTok Shop

Type	Endpoint	Primary Param	Example
Search products	`/v1/tiktok/shop/search`	query	`shoes`
Store products	`/v1/tiktok/shop/products`	url	`https://www.tiktok.com/shop/store/goli-nutrition/7495794203056835079`
Product detail	`/v1/tiktok/product`	url	`https://www.tiktok.com/shop/pdp/goli-ashwagandha-gummies.../1729587769570529799`
Product reviews	`/v1/tiktok/shop/product/reviews`	url or product_id	`1731578642912612516`
User showcase	`/v1/tiktok/user/showcase`	handle	`mrtiktokreviews`

Link-in-Bio / Other

Service	Endpoint	Param	Example
Linktree	`/v1/linktree`	url	`https://linktr.ee/...`
Komi	`/v1/komi`	url	`https://komi.io/...`
Pillar	`/v1/pillar`	url	`https://pillar.io/...`
Linkbio	`/v1/linkbio`	url	`https://linkbio.co/...`
Linkme	`/v1/linkme`	url	`https://linkme.bio/...`
Amazon Shop	`/v1/amazon/shop`	url	`https://www.amazon.com/shop/...`
Instagram basic profile	`/v1/instagram/basic/profile`	userId	`314216`
Instagram embed HTML	`/v1/instagram/user/embed`	handle	`jane`
Age/Gender detect	`/v1/detect/age-gender`	url (social profile)	`https://www.tiktok.com/@charlidamelio`
Credit balance	`/v1/credit/balance`	(none)

ScrapeCreators pagination

Paginated endpoints return a cursor/token in the response. Pass it back as a query param to get the next page.

Cursor Field	Used By
`cursor`	TikTok comments/search/song videos, Instagram comments, Reddit subreddit search, Pinterest, Bluesky, Facebook reels/photos/posts/comments, TikTok Shop products/user showcase
`max_cursor`	TikTok profile videos
`min_time`	TikTok following/followers
`continuationToken`	YouTube (all paginated endpoints)
`after`	Reddit posts, Reddit search
`next_max_id`	Instagram posts, Truth Social posts
`max_id`	Instagram reels
`page`	TikTok popular/shop, Instagram reels search, LinkedIn company posts, TikTok Shop reviews
`paginationToken`	LinkedIn ads

ScrapeCreators known limitations

Handles: pass without the @ symbol. Use charlidamelio not @charlidamelio. Applies to TikTok, Instagram, Twitter, Threads, Bluesky, Snapchat, Twitch, Pinterest, Truth Social
YouTube handles: pass without the @ symbol. Use ThePatMcAfeeShow not @ThePatMcAfeeShow. You can also pass a channelId or full URL instead
Hashtags: pass without the # symbol. Use fyp not #fyp. Applies to TikTok and YouTube hashtag search endpoints
Twitter: returns ~100 most popular tweets, not chronological/latest
Threads: only last 20-30 posts visible publicly
Facebook posts: only 3 posts per page (API limitation)
Facebook group posts: only 3 posts per page (same limitation)
LinkedIn company posts: max 7 pages
Instagram play counts: IG-only views (excludes cross-posted FB views)
Truth Social: only prominent users (Trump, Vance, etc.) work publicly
Transcripts: all transcript endpoints require video under 2 minutes
Reddit subreddit names: case-sensitive! Use "AskReddit" not "askreddit"

Access patterns

ScrapeCreators (social media)

Use Python with core.http_client.proxied_get so sc-proxy injects credentials and bills correctly. Include a typed SC-CALLER-ID header (chat:, job:, preview:, etc.) for cost tracking. Do not read $SCRAPECREATORS_API_KEY from env and do not ask the user for a key — the proxy handles it.

from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

transcript = proxied_get(
    "https://api.scrapecreators.com/v1/youtube/video/transcript",
    params={"url": "https://www.youtube.com/watch?v=VIDEO_ID", "language": "en"},
    headers=headers,
    timeout=20,
).json()

Bash/curl works too (proxy is transparent), but Python is preferred for cost tracking:

curl -s "https://api.scrapecreators.com/v1/tiktok/profile?handle=charlidamelio" \
  -H "x-api-key: any"

Each endpoint has its own OpenAPI spec at https://docs.scrapecreators.com/{path}/openapi.json. Always fetch the per-endpoint spec first to get full parameter details before making the actual API call. The full spec is at https://docs.scrapecreators.com/openapi.json (large file — prefer per-endpoint specs).

Common optional params:

trim (boolean): reduces response payload size. Use when you only need key metrics.
region (string): 2-letter country code for proxy location. Does NOT filter by region — just routes through that country's proxy.

Firecrawl (web page fallback via transparent proxy)

No Firecrawl API key required — sc-proxy injects it. Don't look it up in .env or ask the user. Just call:

from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()

Decision rules

Route every request to the right backend. The user should never need to specify which API to use.

Social media request (profile, posts, comments, search, ads, trending, transcripts)

Use ScrapeCreators. Match the user's intent to an endpoint from the routing tables above. Strip @ from handles and # from hashtags before calling. Fetch the per-endpoint OpenAPI spec first for full param details.

YouTube URL or YouTube content request

Use ScrapeCreators for YouTube (channel info, videos, shorts, playlists, comments, transcripts, search, trending shorts). Match the user's intent to the appropriate YouTube endpoint from the routing tables above.

For a YouTube URL, use /v1/youtube/video to get video details and transcript. If the user's goal is content analysis, summarization, quote extraction, or topic mining, the transcript is included in the video response.

For a YouTube topic query, use /v1/youtube/search to find relevant videos, then call /v1/youtube/video only for the videos needed. Avoid fetching many videos by default.

Blocked or JS-heavy web page

Use Firecrawl once with formats:["markdown","links"] and onlyMainContent:true. Treat the returned Markdown as the extraction substrate, not as final truth: parse the title, price/value fields, specs, body description, image URLs, outbound links, and obvious contact/location hints from the page structure.

General web-page extraction lessons:

Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. web_fetch may return boilerplate while Firecrawl can still recover the real main content.
Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
If Markdown misses important layout or structured fields, retry once with rawHtml; use json, question, or highlights only when the user asked for narrow extraction and the schema/prompt is specific.

Cost discipline

ScrapeCreators — most endpoints cost 1 credit per request. Exceptions: /v1/tiktok/user/audience costs 26 credits; /v1/tiktok/video/transcript with use_ai_as_fallback=true costs +10 credits; /v1/google/company/ads with get_ad_details=true costs 25 credits. Warn users before calling expensive endpoints.

Firecrawl — billed per page plus expensive modifiers.

Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites or bulk-scrape entire feeds with this skill.

If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.

If the proxy returns 429, back off; do not parallelize around the limit.

If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.

Featured

CodeRabbit

AI writes the code. CodeRabbit catches the slop.

Try For Free →

Keep your Mac awake

Keep your Mac awake while Claude Code and 40+ AI agents run. Sleeps when they're idle.

One time payment $9 →

Context.dev

Integrate web data into your AI product. One API to scrape website & brand data.

Get API Key Now →

Make your agent a DeFi expert

Agent, run crypto. Access onchain data & trade routes via 1inch.

Install now →

Make money from your Skills

On Capafy, your Skill runs online 24/7 as an agent product, and you get paid every time someone uses it.

Start earning →

AppSignal

Monitor with ease. Code with confidence.

Start Free Trial →

Preferred entry: call exports.py (don't hand-roll requests)

import sys; sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import scrape_markdown, youtube_transcript, sc_get
scrape_markdown("https://example.com/article")          # Firecrawl fallback
youtube_transcript("https://youtube.com/watch?v=ID")     # ScrapeCreators
sc_get("/v1/tiktok/profile", handle="charlidamelio")     # any SC endpoint

from exports import archive_fallback                      # paywall / Firecrawl-403
archive_fallback("https://www.nytimes.com/.../article.html")  # archive snapshot

Quick trigger rules (read this first)

Use this skill immediately when any of these conditions is true:

web_fetch returns HTTP 401/403/429/5xx
Response is an anti-bot challenge page (for example Cloudflare "Attention Required", "Just a moment", or challenge/captcha pages)
The page is JS-heavy and the first fetch misses required detail fields (for example publish time, author, listing code, updated time, price breakdown)
Search results do not contain the requested field and the value must be extracted from the target page itself

Fallback rule:

If ordinary fetch is blocked or incomplete, switch to Firecrawl fallback in this skill before asking the user for screenshots/manual text.

Error signatures -> action

Signature	Action
`web_fetch` HTTP 401/403/429/5xx	Call Firecrawl `POST /v2/scrape` once with `formats:["markdown","links"]` + `onlyMainContent:true`
Cloudflare/challenge page text in body	Same Firecrawl call as above
Markdown still misses key fields	Retry once with `formats:["rawHtml"]`
Firecrawl returns 403 / empty for a social media URL	Check the intent-routing tables below for a ScrapeCreators platform-specific endpoint for this domain (e.g. `sc_get('/v1/instagram/post', url=...)`, `sc_get('/v2/tiktok/video', url=...)`). If one exists, use it — these have dedicated extraction that bypasses anti-bot. If no dedicated endpoint exists, fall through to `archive_fallback` or ask the user.
Firecrawl itself returns 403 / empty (hard paywall: NYT, WSJ, Economist, FT, Bloomberg)	Call `archive_fallback(url)` — recovers full text from a web archive snapshot
Need structured data from a China app (抖音/小红书/微博/B站/京东/淘宝/1688/闲鱼/得物 etc.)	Call `apify_run()` — Apify Store has purpose-built actors for these platforms that Firecrawl/ScrapeCreators don't cover

Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)

from exports import archive_fallback
res = archive_fallback("https://www.nytimes.com/.../article.html")
res["markdown"]      # full text, or "" if no snapshot exists anywhere
res["source"]        # "archive.today" | "wayback" | None
res["snapshot_url"]  # the snapshot that was scraped

How it works (and why this order):

archive.today first (archive.ph / archive.is mirrors). User-triggered, real-browser captures; historically preserves full text behind paywalls. Best bet for NYT/WSJ/Economist. We scrape its /newest/ snapshot via Firecrawl (archive.today has its own Cloudflare, so scrape it through Firecrawl, never web_fetch it directly).
Wayback Machine second (archive.org). Automated crawler that honors robots.txt and paywalls, so it often has NO full text for hard paywalls — but it's a good fallback for ordinary 403/Cloudflare pages that aren't paywalled.

China-app structured data (use `apify_run`)

When to use Apify (vs Firecrawl/ScrapeCreators):

✅ China apps: 抖音, 小红书, 微博, B站, 京东, 淘宝, 1688, 闲鱼, 得物, 携程, 知乎, 豆瓣, 雪球, 快手, 爱奇艺, 优酷
✅ Southeast Asia e-commerce: Shopee, Lazada, Temu
❌ Western social media (TikTok/Instagram/YouTube/X/Reddit) → use ScrapeCreators first (cheaper)
❌ Generic web page scraping → use Firecrawl first (cheaper)
❌ Hard paywall articles → use archive_fallback (Apify doesn't help here)

How to pick an actor: The reliable-actors catalog is at output/apify_china_reliable.json (sorted by 30-day success count). Pick the top actor for the target platform. A few common ones:

Platform	Actor ID	Input key
抖音 search	`zen-studio~douyin-search-scraper`	`{"keywords": [...], "maxResultsPerQuery": N}`
小红书 search	`zen-studio~rednote-search-scraper`	`{"keywords": [...], "maxResults": N}`
小红书 note detail	`sian.agency~xiaohongshu-rednote-scraper`	`{"operation": "noteDetail", "noteId": "...", "xsecToken": "..."}`
微博 hot search	`gentle_cloud~weibo-hot-search-scraper`	`{"mode": "hot_band", "includeScores": true}`
微博 posts	`zhorex~weibo-scraper`	(see actor input schema)
B站 videos	`zhorex~bilibili-scraper`	(see actor input schema)
京东 search	`zen-studio~jd-com-search-scraper`	`{"keyword": "...", "maxProducts": N}`
京东 products	`sian.agency~jd-com-product-scraper`	`{"operation": "productSearch", "keyword": "...", "maxPages": 1}`
淘宝 products	`sian.agency~taobao-tmall-product-scraper`	`{"operation": "keywordSearch", "keyword": "...", "maxPages": 1}`
1688 wholesale	`zen-studio~1688-wholesale-scraper`	(see actor input schema)
闲鱼 search	`zen-studio~goofish-xianyu-search-scraper`	(see actor input schema)
TikTok	`clockworks~tiktok-scraper`	`{"hashtags": [...]}` or `{"profiles": [...]}`

Usage:

import sys
sys.path.insert(0, "/data/workspace/skills/web-crawler")
from exports import apify_run

# Sync run (small batches, ≤100 results): blocks until done, returns result list
results = apify_run("zen-studio~douyin-search-scraper",
                     {"keywords": ["MacBook"], "maxResultsPerQuery": 5})
for r in results:
    print(r.get("text", "")[:80])

# To discover the right input fields for an unfamiliar actor, fetch its
# input-schema page via Firecrawl first:
from exports import scrape_markdown
schema_md = scrape_markdown("https://apify.com/<user>/<actor>/input-schema")

Error handling:

400 run-failed → bad input (wrong field name). Fetch the actor's input-schema page and fix.
401 → proxy misconfigured (should not happen). Report to user.
Empty result [] → actor ran but found nothing. Try different keywords or another actor.
Timeout → increase timeout param (default 180s). Some actors are slow.

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Firecrawl — Fallback web page scraper

Do not use Firecrawl crawl/map/search/agent/browser endpoints. Do not request screenshots, audio, branding, images, or browser actions unless the proxy policy is expanded later.

ScrapeCreators — Intent routing

Map user intent to the right endpoint. Endpoint paths use the pattern /v1/platform/action.

Profiles / User Info

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/profile`	handle	`stoolpresidente`
Instagram	`/v1/instagram/profile`	handle	`jane`
YouTube	`/v1/youtube/channel`	handle, channelId, or url	`ThePatMcAfeeShow`
LinkedIn (person)	`/v1/linkedin/profile`	url	`https://www.linkedin.com/in/parrsam/`
LinkedIn (company)	`/v1/linkedin/company`	url	`https://linkedin.com/company/shopify`
Facebook	`/v1/facebook/profile`	url	`https://www.facebook.com/mantraindianfolsom`
Twitter/X	`/v1/twitter/profile`	handle	`elonmusk`
Reddit	`/v1/reddit/subreddit/details`	subreddit or url	`AskReddit`
Threads	`/v1/threads/profile`	handle	`zuck`
Bluesky	`/v1/bluesky/profile`	handle	`jay.bsky.team`
Pinterest	`/v1/pinterest/user/boards`	handle	`pinterest`
Truth Social	`/v1/truthsocial/profile`	handle	`realDonaldTrump`
Twitch	`/v1/twitch/profile`	handle	`ninja`
Snapchat	`/v1/snapchat/profile`	handle	`djkhaled`

Posts / Content Feeds

Platform	Endpoint	Primary Param	Example
TikTok videos	`/v3/tiktok/profile/videos`	handle	`stoolpresidente`
Instagram posts	`/v2/instagram/user/posts`	handle	`jane`
Instagram reels	`/v1/instagram/user/reels`	handle or user_id	`jane` or `2700692569`
Instagram highlights	`/v1/instagram/user/highlights`	handle or user_id	`jane` or `2700692569`
YouTube videos	`/v1/youtube/channel/videos`	handle or channelId	`ThePatMcAfeeShow`
YouTube shorts	`/v1/youtube/channel/shorts`	handle or channelId	`starterstory`
YouTube playlist	`/v1/youtube/playlist`	playlist_id	`PLP32wGpgzmIlInfgKVFfCwVsxgGqZNIiS`
LinkedIn posts	`/v1/linkedin/company/posts`	url	`https://linkedin.com/company/shopify`
Facebook posts	`/v1/facebook/profile/posts`	url or pageId	`https://www.facebook.com/pacemorby`
Facebook reels	`/v1/facebook/profile/reels`	url	`https://www.facebook.com/Spurs`
Facebook photos	`/v1/facebook/profile/photos`	url	`https://www.facebook.com/Spurs`
Facebook group posts	`/v1/facebook/group/posts`	url or group_id	`742354120555345`
Twitter tweets	`/v1/twitter/user/tweets`	handle	`elonmusk`
Reddit posts	`/v1/reddit/subreddit`	subreddit	`AskReddit`
Threads posts	`/v1/threads/user/posts`	handle	`zuck`
Bluesky posts	`/v1/bluesky/user/posts`	handle or user_id	`jay.bsky.team`
Truth Social posts	`/v1/truthsocial/user/posts`	handle or user_id	`realDonaldTrump`
Pinterest board	`/v1/pinterest/board`	url	`https://www.pinterest.com/...`

Single Post / Video Details

Platform	Endpoint	Primary Param	Example
TikTok	`/v2/tiktok/video`	url	`https://www.tiktok.com/@randomspamvideos25/video/7251387037834595630`
Instagram	`/v1/instagram/post`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
Instagram highlight	`/v1/instagram/user/highlight/detail`	id	`18067016518767507`
YouTube	`/v1/youtube/video`	url	`https://www.youtube.com/watch?v=Y2Ah_DFr8cw`
YouTube community post	`/v1/youtube/community-post`	url	`https://www.youtube.com/post/Ugkxvj2KoApYAXoqLWnKVr6zZe5JjeHrQeP8`
LinkedIn	`/v1/linkedin/post`	url	`https://www.linkedin.com/pulse/being-father-has-made-me-better-leader...`
Facebook	`/v1/facebook/post`	url	`https://www.facebook.com/reel/1535656380759655`
Twitter/X	`/v1/twitter/tweet`	url	`https://twitter.com/elonmusk/status/...`
Twitter/X community	`/v1/twitter/community`	url	`https://twitter.com/i/communities/...`
Twitter/X community tweets	`/v1/twitter/community/tweets`	url	`https://twitter.com/i/communities/...`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`
Threads	`/v1/threads/post`	url	`https://www.threads.net/@zuck/post/...`
Bluesky	`/v1/bluesky/post`	url	`https://bsky.app/profile/.../post/...`
Truth Social	`/v1/truthsocial/post`	url	`https://truthsocial.com/@realDonaldTrump/posts/...`
Pinterest	`/v1/pinterest/pin`	url	`https://www.pinterest.com/pin/...`
Twitch clip	`/v1/twitch/clip`	url	`https://clips.twitch.tv/...`
Kick clip	`/v1/kick/clip`	url	`https://kick.com/...`

Comments

Platform	Endpoint	Primary Param	Example
TikTok	`/v1/tiktok/video/comments`	url	`https://www.tiktok.com/@stoolpresidente/video/7499229683859426602`
Instagram	`/v2/instagram/post/comments`	url	`https://www.instagram.com/reel/DOq6eV6iIgD`
YouTube	`/v1/youtube/video/comments`	url	`https://www.youtube.com/watch?v=dQw4w9WgXcQ`
Facebook	`/v1/facebook/post/comments`	url or feedback_id	`https://www.facebook.com/reel/753347914167361`
Reddit	`/v1/reddit/post/comments`	url	`https://www.reddit.com/r/AskReddit/comments/...`

Transcripts

Platform	Endpoint	Example	Note
TikTok	`/v1/tiktok/video/transcript`	`url=https://www.tiktok.com/...&lang=en`	also via `/v2/tiktok/video` with get_transcript=true
Instagram	`/v2/instagram/media/transcript`	`url=https://www.instagram.com/reel/...`	AI-powered, 10-30s, under 2min
YouTube	`/v1/youtube/video/transcript`	`url=https://www.youtube.com/watch?v=bjVIDXPP7Uk`	also included in `/v1/youtube/video` response
Facebook	`/v1/facebook/post/transcript`	`url=https://www.facebook.com/reel/...`	under 2min only
Twitter/X	`/v1/twitter/tweet/transcript`	`url=https://twitter.com/...`	AI-powered, slow

Search

Platform	Endpoint	Primary Param	Example
TikTok users	`/v1/tiktok/search/users`	query	`funny`
TikTok videos (keyword)	`/v1/tiktok/search/keyword`	query	`funny`
TikTok videos (hashtag)	`/v1/tiktok/search/hashtag`	hashtag	`fyp`
TikTok top (photos+videos)	`/v1/tiktok/search/top`	query	`funny`
Instagram reels	`/v2/instagram/reels/search`	query	`dogs`
YouTube	`/v1/youtube/search`	query	`funny`
YouTube hashtag	`/v1/youtube/search/hashtag`	hashtag	`funny`
Reddit (all)	`/v1/reddit/search`	query	`best programming languages`
Reddit (in subreddit)	`/v1/reddit/subreddit/search`	subreddit + query	`AskReddit` + `funny`
Threads posts	`/v1/threads/search`	query	`AI`
Threads users	`/v1/threads/search/users`	query	`zuck`
Pinterest	`/v1/pinterest/search`	query	`home decor`
Google	`/v1/google/search`	query	`best restaurants in NYC`

Ad Libraries

Platform	Endpoint	Primary Param	Example
Facebook ads search	`/v1/facebook/adLibrary/search/ads`	query	`running`
Facebook company ads	`/v1/facebook/adLibrary/company/ads`	pageId or companyName	`Lululemon`
Facebook ad detail	`/v1/facebook/adLibrary/ad`	id or url	`702369045530963`
Facebook find companies	`/v1/facebook/adLibrary/search/companies`	query	`Nike`
Google company ads	`/v1/google/company/ads`	domain or advertiser_id	`nike.com`
Google ad detail	`/v1/google/ad`	url	`https://adstransparency.google.com/...`
Google find advertisers	`/v1/google/adLibrary/advertisers/search`	query	`Nike`
LinkedIn ads search	`/v1/linkedin/ads/search`	company or keyword	`Shopify`
LinkedIn ad detail	`/v1/linkedin/ad`	url	`https://www.linkedin.com/ad/...`
Reddit ads search	`/v1/reddit/ads/search`	query	`gaming`
Reddit ad detail	`/v1/reddit/ad`	id	`t3_abc123`

Trending / Popular

Content	Endpoint	Param	Example
Trending feed	`/v1/tiktok/get-trending-feed`	region (required)	`US`
Popular videos	`/v1/tiktok/videos/popular`
Popular creators	`/v1/tiktok/creators/popular`
Popular hashtags	`/v1/tiktok/hashtags/popular`
Popular songs	`/v1/tiktok/songs/popular`
Song details	`/v1/tiktok/song`	clipId	`7439295283975702544`
Videos using song	`/v1/tiktok/song/videos`	clipId	`7439295283975702544`
Trending shorts (YT)	`/v1/youtube/shorts/trending`

Followers / Following / Live (TikTok only)

Type	Endpoint	Example
Following	`/v1/tiktok/user/following`	`handle=stoolpresidente`
Followers	`/v1/tiktok/user/followers`	`handle=stoolpresidente`
Audience demographics	`/v1/tiktok/user/audience` (26 credits!)	`handle=shakira`
Live stream	`/v1/tiktok/user/live`	`handle=thejustalex`

TikTok Shop

Type	Endpoint	Primary Param	Example
Search products	`/v1/tiktok/shop/search`	query	`shoes`
Store products	`/v1/tiktok/shop/products`	url	`https://www.tiktok.com/shop/store/goli-nutrition/7495794203056835079`
Product detail	`/v1/tiktok/product`	url	`https://www.tiktok.com/shop/pdp/goli-ashwagandha-gummies.../1729587769570529799`
Product reviews	`/v1/tiktok/shop/product/reviews`	url or product_id	`1731578642912612516`
User showcase	`/v1/tiktok/user/showcase`	handle	`mrtiktokreviews`

Link-in-Bio / Other

Service	Endpoint	Param	Example
Linktree	`/v1/linktree`	url	`https://linktr.ee/...`
Komi	`/v1/komi`	url	`https://komi.io/...`
Pillar	`/v1/pillar`	url	`https://pillar.io/...`
Linkbio	`/v1/linkbio`	url	`https://linkbio.co/...`
Linkme	`/v1/linkme`	url	`https://linkme.bio/...`
Amazon Shop	`/v1/amazon/shop`	url	`https://www.amazon.com/shop/...`
Instagram basic profile	`/v1/instagram/basic/profile`	userId	`314216`
Instagram embed HTML	`/v1/instagram/user/embed`	handle	`jane`
Age/Gender detect	`/v1/detect/age-gender`	url (social profile)	`https://www.tiktok.com/@charlidamelio`
Credit balance	`/v1/credit/balance`	(none)

ScrapeCreators pagination

Paginated endpoints return a cursor/token in the response. Pass it back as a query param to get the next page.

Cursor Field	Used By
`cursor`	TikTok comments/search/song videos, Instagram comments, Reddit subreddit search, Pinterest, Bluesky, Facebook reels/photos/posts/comments, TikTok Shop products/user showcase
`max_cursor`	TikTok profile videos
`min_time`	TikTok following/followers
`continuationToken`	YouTube (all paginated endpoints)
`after`	Reddit posts, Reddit search
`next_max_id`	Instagram posts, Truth Social posts
`max_id`	Instagram reels
`page`	TikTok popular/shop, Instagram reels search, LinkedIn company posts, TikTok Shop reviews
`paginationToken`	LinkedIn ads

ScrapeCreators known limitations

Handles: pass without the @ symbol. Use charlidamelio not @charlidamelio. Applies to TikTok, Instagram, Twitter, Threads, Bluesky, Snapchat, Twitch, Pinterest, Truth Social
YouTube handles: pass without the @ symbol. Use ThePatMcAfeeShow not @ThePatMcAfeeShow. You can also pass a channelId or full URL instead
Hashtags: pass without the # symbol. Use fyp not #fyp. Applies to TikTok and YouTube hashtag search endpoints
Twitter: returns ~100 most popular tweets, not chronological/latest
Threads: only last 20-30 posts visible publicly
Facebook posts: only 3 posts per page (API limitation)
Facebook group posts: only 3 posts per page (same limitation)
LinkedIn company posts: max 7 pages
Instagram play counts: IG-only views (excludes cross-posted FB views)
Truth Social: only prominent users (Trump, Vance, etc.) work publicly
Transcripts: all transcript endpoints require video under 2 minutes
Reddit subreddit names: case-sensitive! Use "AskReddit" not "askreddit"

Access patterns

ScrapeCreators (social media)

from core.http_client import proxied_get

headers = {"SC-CALLER-ID": "chat:youtube-transcript"}

transcript = proxied_get(
    "https://api.scrapecreators.com/v1/youtube/video/transcript",
    params={"url": "https://www.youtube.com/watch?v=VIDEO_ID", "language": "en"},
    headers=headers,
    timeout=20,
).json()

Bash/curl works too (proxy is transparent), but Python is preferred for cost tracking:

curl -s "https://api.scrapecreators.com/v1/tiktok/profile?handle=charlidamelio" \
  -H "x-api-key: any"

Common optional params:

trim (boolean): reduces response payload size. Use when you only need key metrics.
region (string): 2-letter country code for proxy location. Does NOT filter by region — just routes through that country's proxy.

Firecrawl (web page fallback via transparent proxy)

No Firecrawl API key required — sc-proxy injects it. Don't look it up in .env or ask the user. Just call:

from core.http_client import proxied_post

headers = {"SC-CALLER-ID": "chat:web-crawl-fallback"}

page = proxied_post(
    "https://api.firecrawl.dev/v2/scrape",
    json={
        "url": "https://example.com/article",
        "formats": ["markdown", "links"],
        "onlyMainContent": True,
        "timeout": 60000,
    },
    headers=headers,
).json()

Decision rules

Route every request to the right backend. The user should never need to specify which API to use.

Social media request (profile, posts, comments, search, ads, trending, transcripts)

YouTube URL or YouTube content request

For a YouTube topic query, use /v1/youtube/search to find relevant videos, then call /v1/youtube/video only for the videos needed. Avoid fetching many videos by default.

Blocked or JS-heavy web page

General web-page extraction lessons:

Many listing/detail pages render important content with JavaScript, image galleries, hidden sections, or repeated UI labels. web_fetch may return boilerplate while Firecrawl can still recover the real main content.
Do not hard-code site-specific labels. Convert page text into a generic structured summary: what it is, where it is, key numbers, evidence snippets, media/links, and caveats.
Preserve source URLs for images and links when they help verify the page, but do not download or batch-process every media asset unless the user asks.
If Markdown misses important layout or structured fields, retry once with rawHtml; use json, question, or highlights only when the user asked for narrow extraction and the schema/prompt is specific.

Cost discipline

Firecrawl — billed per page plus expensive modifiers.

Keep calls tight: one page, one video, or a small shortlist. Never batch-crawl whole websites or bulk-scrape entire feeds with this skill.

If the proxy returns 403, the request is outside the allowed use case. Change the approach instead of retrying.

If the proxy returns 429, back off; do not parallelize around the limit.

If the upstream returns a failure, report the exact failure and avoid repeated paid retries unless one parameter change is clearly justified.

Web Crawler

Install to Claude Code

Preferred entry: call exports.py (don't hand-roll requests)

Quick trigger rules (read this first)

Error signatures -> action

Paywall / Firecrawl-blocked fallback chain (use archive_fallback)

China-app structured data (use apify_run)

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Firecrawl — Fallback web page scraper

ScrapeCreators — Intent routing

Profiles / User Info

Posts / Content Feeds

Single Post / Video Details

Comments

Transcripts

Search

Ad Libraries

Trending / Popular

Followers / Following / Live (TikTok only)

TikTok Shop

Link-in-Bio / Other

ScrapeCreators pagination

ScrapeCreators known limitations

Access patterns

ScrapeCreators (social media)

Firecrawl (web page fallback via transparent proxy)

Decision rules

Social media request (profile, posts, comments, search, ads, trending, transcripts)

YouTube URL or YouTube content request

Blocked or JS-heavy web page

Cost discipline

Web Crawler

Install to Claude Code

Preferred entry: call exports.py (don't hand-roll requests)

Quick trigger rules (read this first)

Error signatures -> action

Paywall / Firecrawl-blocked fallback chain (use archive_fallback)

China-app structured data (use apify_run)

What each service is for

ScrapeCreators — Social media data extraction (27+ platforms)

Firecrawl — Fallback web page scraper

ScrapeCreators — Intent routing

Profiles / User Info

Posts / Content Feeds

Single Post / Video Details

Comments

Transcripts

Search

Ad Libraries

Trending / Popular

Followers / Following / Live (TikTok only)

TikTok Shop

Link-in-Bio / Other

ScrapeCreators pagination

ScrapeCreators known limitations

Access patterns

ScrapeCreators (social media)

Firecrawl (web page fallback via transparent proxy)

Decision rules

Social media request (profile, posts, comments, search, ads, trending, transcripts)

YouTube URL or YouTube content request

Blocked or JS-heavy web page

Cost discipline

Recommended

Recommended

Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)

China-app structured data (use `apify_run`)

Paywall / Firecrawl-blocked fallback chain (use `archive_fallback`)

China-app structured data (use `apify_run`)