webEmbedding

26 tools · STDIO, HTTP · registry active

Summary

This is a Playwright-based URL cloning engine that tries to reuse embeddable sources before rebuilding blocked pages from captured DOM, styles, assets, and HAR network traces. It exposes MCP tools for URL inspection, clone route classification, live browser capture, and visual/DOM/computed-style verification across desktop, tablet, and mobile breakpoints. The stdio server runs full capture and rebuild locally, while the hosted endpoint at webembedding-mcp.vercel.app provides read-only routing helpers for Apps SDK integrations. Reach for it when you need to recreate marketing pages, documentation sites, or iframe-blocked surfaces with self-verified fidelity scores, or when you want HAR replay and responsive breakpoint evidence instead of raw screenshots.

Install to Claude Code

verified

claude mcp add --transport http webembedding https://webembedding-mcp.vercel.app/mcp

Run in your terminal. Add --scope user to make it available in every project.

Review the command, arguments, and environment values before installing — MCP servers run with your local permissions.

Tools

Verified live against the running server on Jun 10, 2026.

Verified live: 6 tools

detect_runtime_capabilities

Report the hosted Apps SDK intake runtime capabilities and explain when the local stdio MCP is required.

No parameters; call it with no arguments.
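
A no-argument call can be sketched as the JSON-RPC 2.0 message an MCP client sends for a tool invocation. The helper name `build_tool_call` and the request id are illustrative assumptions. A real client also completes the MCP initialize handshake before calling tools, which this sketch leaves out.

```python
import json


def build_tool_call(name, arguments=None, request_id=1):
    """Build an MCP `tools/call` JSON-RPC 2.0 request (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        # A tool without parameters is called with an empty arguments object.
        "params": {"name": name, "arguments": arguments or {}},
    }


request = build_tool_call("detect_runtime_capabilities")
print(json.dumps(request, indent=2))
```

The same helper covers the tools that take parameters, for example `build_tool_call("inspect_url", {"url": "https://www.example.com"})`.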

inspect_url (2 params)

Fetch a public or user-authorized URL and inspect title, metadata, frame policy, and likely source/embed candidates. Does not capture screenshots or persist artifacts.

Parameters (* required)

url* (string)

timeout_seconds (integer)
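
The "frame policy" part of an inspection comes down to two standard response headers, `X-Frame-Options` and the CSP `frame-ancestors` directive. The function below is a minimal sketch of that check and not the tool's actual implementation; the name `is_probably_frameable` is an assumption.

```python
def is_probably_frameable(headers):
    """Return False when response headers clearly forbid third-party framing."""
    lowered = {key.lower(): value.lower() for key, value in headers.items()}

    # X-Frame-Options: DENY and SAMEORIGIN both block cross-origin iframes.
    if lowered.get("x-frame-options", "").strip() in ("deny", "sameorigin"):
        return False

    # CSP frame-ancestors: 'none' or 'self' (or an empty list) blocks embedding.
    csp = lowered.get("content-security-policy", "")
    for directive in csp.split(";"):
        parts = directive.split()
        if parts and parts[0] == "frame-ancestors":
            if all(source in ("'none'", "'self'") for source in parts[1:]):
                return False

    # An explicit host allowlist may or may not include the embedder,
    # so this sketch only answers "probably".
    return True


print(is_probably_frameable({"X-Frame-Options": "DENY"}))
```

A page that passes this check can still refuse to render in an iframe for other reasons, such as frame-busting scripts, so treat the result as a routing hint.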

discover_embed_candidates (2 params)

Extract likely embed, preview, viewer, remix, and source URLs from a public or user-authorized page.

Parameters (* required)

url* (string)

timeout_seconds (integer)

classify_clone_mode (5 params)

Decide whether a reference should be embedded, sourced, locally captured, bounded-rebuilt, or blocked before reproduction.

Parameters (* required)

candidates (array)

license_text (string)

site_profile (object)

source_signals (array)

exact_requested (boolean)
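
The source-first ordering behind this decision can be illustrated with a small routing function. Everything here is hypothetical: the `RouteSignals` fields are not the tool's real input schema, and the sketch does not model the "locally captured" outcome or the `exact_requested` flag.

```python
from dataclasses import dataclass


@dataclass
class RouteSignals:
    # Hypothetical inputs for illustration only.
    access_controlled: bool = False
    frameable: bool = False
    has_source_route: bool = False


def choose_route(signals):
    """Pick the first applicable route in source-first order."""
    if signals.access_controlled:
        return "blocked"
    if signals.frameable:
        return "embed"
    if signals.has_source_route:
        return "source"
    # Rebuild is the last resort, used only when exact reuse is unavailable.
    return "bounded_rebuild"


print(choose_route(RouteSignals(frameable=True)))
```

Checking the blocking condition first keeps an access-controlled page from being embedded just because it happens to be frameable.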

generate_embed_snippet (3 params)

Generate an iframe snippet for a known frameable and authorized URL. Does not verify frameability by itself.

Parameters (* required)

url* (string)

title (string)

framework (string): one of html · nextjs
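
For a sense of what such a snippet looks like, here is a minimal sketch that renders an iframe for the two framework values. The function name, the default title, the `loading="lazy"` attribute, and the `Embed` component wrapper are assumptions; the tool's real output may differ.

```python
from html import escape


def iframe_snippet(url, title="Embedded page", framework="html"):
    """Render an iframe for a URL already known to be frameable."""
    if framework not in ("html", "nextjs"):
        raise ValueError("framework must be 'html' or 'nextjs'")
    # Escape attribute values so query strings and quotes stay valid markup.
    src = escape(url, quote=True)
    label = escape(title, quote=True)
    if framework == "html":
        return f'<iframe src="{src}" title="{label}" loading="lazy"></iframe>'
    # JSX variant: self-closing tag inside a hypothetical component.
    return (
        "export default function Embed() {\n"
        f'  return <iframe src="{src}" title="{label}" loading="lazy" />;\n'
        "}"
    )


print(iframe_snippet("https://www.example.com/?a=1&b=2"))
```

Like the tool, the sketch does not check frameability; run an inspection first.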

plan_reproduction_path (6 params)

Create a source-first plan that separates exact embed/source reuse from local capture and bounded rebuild work.

Parameters (* required)

candidates (array)

license_text (string)

site_profile (object)

capture_bundle (object)

source_signals (array)

exact_requested (boolean)

webEmbedding

webEmbedding is a source-first website cloning engine for AI coding agents: it captures live pages with Playwright, replays network evidence from HAR artifacts, rebuilds only when direct reuse is blocked, and self-verifies the result.

It ships as a Skill + MCP server. Instead of asking a model to "clone this site" from a screenshot, it inspects the URL, chooses a reuse or rebuild route, captures DOM/runtime HTML/styles/assets/network traces, generates bounded frontend reconstruction artifacts, and checks the output with visual, DOM, computed-style, interaction, and responsive-breakpoint verification.

GitHub listing, social preview, and launch-copy recommendations are in docs/github-listing.md.

Current Status

The current pipeline is strongest for static and semi-static web pages:

company, brand, marketing, and documentation pages
public landing pages
iframe-blocked pages that need capture-based reconstruction
responsive page snapshots across desktop, tablet, and mobile

It is not a full backend or app-logic clone engine. Login-only screens, app-first or native-app-required services, captcha-heavy sites, maps, games, canvas/WebGL-heavy pages, real-time feeds, payments, booking flows, and private server behavior still need separate handling.

Operationally, the repo is now a production-candidate clone engine for URL-based capture and bounded reconstruction: jobs can be queued, network evidence can be replay-audited from HAR artifacts, authenticated dashboard runs can be driven from user-owned browser state, and local gates verify the route corpus, score checks, package contents, and CI wiring. The remaining hard boundary is server-side product behavior, not front-end evidence capture and reconstruction.

Measured Checkpoints

Recent local benchmark runs from this repo:

| URL | Path | Score |
| --- | --- | --- |
| `https://developer.mozilla.org/en-US/` | iframe-blocked bounded rebuild | root `94`, visual `95`, mobile `94`, tablet `94`, breakpoint average `94` |
| `https://www.mozilla.org/` | bounded rebuild | root `94`, visual `100` |
| `https://www.python.org` | harder bounded rebuild sample | root `90`, visual `100` |
| `https://www.example.com` | exact reuse | ready `yes` |

These are generated by the local self-verify pipeline, not manually assigned ratings. The reproducible commands and score thresholds are tracked in docs/benchmark-evidence.json. Production readiness gates are tracked in docs/production-pipeline-gates.json.
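
As a worked example of the "breakpoint average" column, the sketch below takes the MDN mobile and tablet scores from the table and averages them. Using a plain mean is an assumption about how the pipeline aggregates, and the threshold of 90 is a placeholder; the real thresholds are the ones tracked in docs/benchmark-evidence.json.

```python
def breakpoint_average(scores):
    """Mean of per-breakpoint scores, rounded to the nearest integer."""
    return round(sum(scores.values()) / len(scores))


mdn_breakpoints = {"mobile": 94, "tablet": 94}  # values from the table above
average = breakpoint_average(mdn_breakpoints)

HYPOTHETICAL_THRESHOLD = 90  # placeholder, not the project's real gate
print(average, average >= HYPOTHETICAL_THRESHOLD)
```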

Core Features

Source-first routing:
- direct iframe or embed reuse when it is safe and frameable
- original preview, export, remix, or source routes when available
- bounded rebuild only when exact reuse is unavailable
Live browser capture:
- DOM snapshot
- runtime HTML
- full-page screenshot
- computed style summaries
- CSS analysis
- asset inventory
- HAR-like network metadata
- interaction states and replay traces
- storage state export for session-aware flows
Blocked-site rebuild:
- handles X-Frame-Options and CSP-blocked pages by rebuilding from captured evidence
- generates reusable frontend reconstruction artifacts from captured page structure
- preserves custom tags, shadow-root host structure, and semantic document structure where captured
Evidence limitation reporting:
- separates directly captured artifacts from inferred or missing evidence in reproduction results and prompts
- marks app-gated, auth-gated, and native-app-led surfaces as bounded evidence, with recommendations for user screenshots or authenticated session capture
Operational failure classification:
- reports typed pipeline action codes such as network-replay-limited, auth-session-missing, public-app-gate, and canvas-visual-fallback
- exposes HAR/network replay_readiness before treating captured network evidence as replay-grade
Production pipeline helpers:
- filesystem-backed async clone job queue with durable JSON records, worker locks, retry scheduling, cancellation, and manifest annotation
- deterministic HAR replay engine for standard HAR, near-HAR, and captured network/manifest.json artifacts
- authenticated dashboard live corpus runner that accepts user-provided storage_state_path or user_data_dir outside the repo
Self-verification:
- screenshot similarity
- DOM snapshot similarity
- computed-style similarity
- hover/focus/click interaction state parity
- interaction trace parity
- desktop/mobile/tablet breakpoint reports
Responsive benchmark support:
- primary desktop viewport: 1440x1200
- tablet profile: 768x1024
- mobile profile: 390x844
Repair loop:
- bounded self-repair can run when the first scaffold misses the readiness threshold
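
The three viewport profiles listed above can be written down as data. The sketch assumes that the desktop viewport is always captured and that the names passed to `--breakpoints` add further profiles. The `width`/`height` shape matches what Playwright accepts for a viewport, but `resolve_breakpoints` itself is a hypothetical helper.

```python
VIEWPORTS = {
    "desktop": {"width": 1440, "height": 1200},  # primary viewport
    "tablet": {"width": 768, "height": 1024},
    "mobile": {"width": 390, "height": 844},
}


def resolve_breakpoints(names):
    """Return the desktop viewport plus any requested extra profiles."""
    selected = {"desktop": VIEWPORTS["desktop"]}
    for name in names:
        if name not in VIEWPORTS:
            raise ValueError(f"unknown breakpoint: {name}")
        selected[name] = VIEWPORTS[name]
    return selected


for name, size in resolve_breakpoints(["mobile", "tablet"]).items():
    print(f"{name}: {size['width']}x{size['height']}")
```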

Install

Requirements

Node.js 18 or newer
Python 3.9 or newer
Chrome or Chromium available locally for Playwright runtime capture

The package uses playwright-core; it does not download a browser by itself.

Installing this project adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server that exposes the URL inspection, capture, rebuild, and verification tools.

Install From npm

npm install -g web-embedding
web-embedding install
web-embedding doctor

Clone a public URL after installing:

web-embedding clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

If you already have an older local plugin installed, overwrite it with:

web-embedding install --force
web-embedding doctor

You can also run the installer without a global install:

npx web-embedding install

Use As An MCP Server

For MCP clients that can launch npm stdio servers:

{
  "mcpServers": {
    "source-first-clone": {
      "command": "npx",
      "args": ["-y", "web-embedding@latest", "mcp"]
    }
  }
}
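
If a client config already lists other servers, the entry has to be merged in instead of pasted over the file. A minimal sketch, assuming the config is a JSON object with an `mcpServers` key as shown above; the `add_server` helper and the `other` server are made up for the example.

```python
import json

# The stdio launch entry from the config above.
SERVER_ENTRY = {"command": "npx", "args": ["-y", "web-embedding@latest", "mcp"]}


def add_server(config, name="source-first-clone"):
    """Return a copy of an MCP client config with the stdio server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = SERVER_ENTRY
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other": {"command": "node", "args": ["server.js"]}}}
updated = add_server(existing)
print(json.dumps(updated, indent=2))
```

The function copies the outer dictionaries, so the caller's original config object is left unchanged.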

For local smoke testing:

npx web-embedding@latest mcp

The MCP Registry identity is io.github.jongko54/web-embedding; server.json and package.json#mcpName are kept in sync for registry ownership verification.

Hosted Apps SDK Intake Endpoint

The public remote MCP intake endpoint for Apps SDK Developer Mode is:

https://webembedding-mcp.vercel.app/mcp

It exposes low-risk source-first routing tools such as URL inspection, embed candidate discovery, clone-mode classification, and embed snippet generation. Full browser capture, HAR replay, queues, bounded rebuilds, and one-pass clone execution remain local-first through the stdio MCP package.

Apps SDK review pages are hosted alongside the endpoint: https://webembedding-mcp.vercel.app/privacy.html, https://webembedding-mcp.vercel.app/terms.html, and https://webembedding-mcp.vercel.app/submission.html.

Sandboxing And Approvals

webEmbedding has two execution boundaries, hosted and local, plus two policies for authenticated capture and access-controlled surfaces:

Hosted Apps SDK intake: read-only URL routing and classification only. It accepts absolute http and https URLs, does not run Playwright, does not read local files, does not use browser profiles or storage state, and does not persist capture artifacts.
Local stdio MCP and CLI: full capture, HAR replay, queues, rebuild scaffolds, and self-verify run on the user's machine under the user's local agent and filesystem permissions. Output is written only to caller-provided paths such as output_dir or queue_root.
Authenticated capture: session-aware runs require the caller to intentionally provide a storage_state_path or user_data_dir. webEmbedding does not collect credentials, perform login bypasses, or treat a public login shell as private app evidence.
Access-controlled surfaces: paywalls, captcha flows, private dashboards, payment/checkout/account/admin flows, and native-app-led screens should be blocked, marked needs_session, or sent to manual review unless the user has explicit authorization and supplies the needed evidence.

Local URL entrypoints reject non-HTTP schemes such as file:// so an agent cannot use clone/capture tools as a local file reader. Telemetry is disabled by default and, when enabled, excludes target URLs, local paths, captured HTML, screenshots, storage state, environment variables, API keys, and command output.
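
The scheme check described above can be sketched with the standard library. This is an illustration of the rule, not the project's code, and `require_http_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def require_http_url(url):
    """Accept only absolute http(s) URLs; reject file:// and other schemes."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"unsupported URL: {url!r}")
    return url


for candidate in ("https://www.example.com/", "file:///etc/passwd"):
    try:
        print("ok:", require_http_url(candidate))
    except ValueError as error:
        print("rejected:", error)
```

Requiring a non-empty host as well as an allowed scheme also rejects relative paths, which matches the "absolute http and https URLs" wording used for the hosted endpoint.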

Agent Marketplaces

This repository includes marketplace metadata for the two local agent surfaces:

Codex: .agents/plugins/marketplace.json points to ./bundle/source-first-clone.
Claude Code: .claude-plugin/marketplace.json points to the same bundle and the bundle includes .claude-plugin/plugin.json.

Claude Code users can add the marketplace from GitHub with:

/plugin marketplace add jongko54/webEmbedding
/plugin install source-first-clone@webembedding

AI auto-selection expectations and golden prompts live in docs/ai-distribution.md and evals/ai-selection/webembedding-golden-prompts.json.

Install From Release

curl -fsSL https://github.com/jongko54/webEmbedding/releases/latest/download/install.sh | bash

Install From This Checkout

git clone https://github.com/jongko54/webEmbedding.git
cd webEmbedding
npm install
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor

Install Into A Temporary Home

Useful for testing without touching your real agent home:

python3 python/web_embedding/installer.py install --target-home ./.tmp/home
python3 python/web_embedding/installer.py doctor --target-home ./.tmp/home
python3 python/web_embedding/installer.py uninstall --target-home ./.tmp/home

Opt-in Telemetry

Telemetry is disabled by default. On an interactive first install, web-embedding install asks once and defaults to No. Non-interactive installs such as CI and curl | bash do not prompt. If you opt in, web-embedding sends a small anonymous command-completion event to a JSON POST endpoint you control. It does not send target URLs, local paths, captured HTML, screenshots, storage state, environment variables, API keys, or command output.

Enable it during install:

web-embedding install --telemetry --telemetry-endpoint https://your-collector.example/events

Or manage it later:

web-embedding telemetry enable --endpoint https://your-collector.example/events
web-embedding telemetry status
web-embedding telemetry disable
web-embedding telemetry reset-id

Each event contains an anonymous install id, package version, command name, success/failure status, OS/runtime basics, and coarse option flags such as breakpoint_count or install_source.

Environment controls:

WEB_EMBEDDING_TELEMETRY=1
WEB_EMBEDDING_NO_TELEMETRY=1
WEB_EMBEDDING_TELEMETRY_PROMPT=0
WEB_EMBEDDING_TELEMETRY_ENDPOINT=https://your-collector.example/events
WEB_EMBEDDING_TELEMETRY_LOG=./telemetry.jsonl

Run a local/self-hosted JSONL collector:

npm run telemetry:collector -- --host 127.0.0.1 --port 8765 --out ./telemetry.jsonl
WEB_EMBEDDING_TELEMETRY=1 \
WEB_EMBEDDING_TELEMETRY_ENDPOINT=http://127.0.0.1:8765/events \
web-embedding doctor

Summarize collected usage:

npm run telemetry:summarize -- ./telemetry.jsonl

The summary includes install and clone executions, total command executions, unique anonymous install IDs, command counts, and version counts. See docs/telemetry.md for collector and analyzer details.
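
To make the summary concrete, here is a sketch of counting events from a JSONL log. The field names `install_id`, `command`, and `version` and the sample events are assumptions made for illustration; docs/telemetry.md describes the real event shape and analyzer.

```python
import json
from collections import Counter


def summarize(lines):
    """Aggregate telemetry events from JSONL lines (hypothetical field names)."""
    events = [json.loads(line) for line in lines if line.strip()]
    return {
        "total": len(events),
        "unique_installs": len({event["install_id"] for event in events}),
        "commands": Counter(event["command"] for event in events),
        "versions": Counter(event["version"] for event in events),
    }


# Made-up sample events, one JSON object per line.
sample = [
    '{"install_id": "a1", "command": "install", "version": "1.0.0"}',
    '{"install_id": "a1", "command": "clone", "version": "1.0.0"}',
    '{"install_id": "b2", "command": "clone", "version": "1.1.0"}',
]
report = summarize(sample)
print(report["total"], report["unique_installs"], dict(report["commands"]))
```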

Quick Start

Inspect a URL and get route hints:

node ./bin/web-embedding.mjs inspect \
  --url https://developer.mozilla.org/en-US/

Run a safe preflight audit before capture or clone:

node ./bin/web-embedding.mjs audit \
  --url https://developer.mozilla.org/en-US/

The audit reports whether the reference is ready for exact/embed reuse, needs local capture, needs an authenticated session, requires manual review, or should be blocked before any browser capture or filesystem output runs.

Run the full clone workflow:

node ./bin/web-embedding.mjs clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

Run a lightweight quality benchmark:

python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --output-root ./.tmp/clone-quality-bench \
  --wait-seconds 1 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

The benchmark prints compact rows for root, visual, and breakpoint scores. The full artifacts are written under the output directory.

CLI Commands

node ./bin/web-embedding.mjs capabilities
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
node ./bin/web-embedding.mjs uninstall
node ./bin/web-embedding.mjs paths
node ./bin/web-embedding.mjs telemetry status

node ./bin/web-embedding.mjs inspect --url https://www.mozilla.org/

node ./bin/web-embedding.mjs audit --url https://www.mozilla.org/

node ./bin/web-embedding.mjs capture \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/capture-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs reproduce \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/reproduce-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs clone \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/clone-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs verify \
  --reference-bundle ./.tmp/reference/capture.json \
  --candidate-bundle ./.tmp/candidate/capture.json

Output Artifacts

A clone run can produce:

capture.json
pipeline-run-manifest.json
dom/snapshot.json
dom/runtime.html
styles/computed-summary.json
styles/css-analysis.json
network/manifest.json
network/har.json
network/har-like.json
network/replay-report.json
assets/inventory.json
interactions/states.json
interactions/trace.json
screenshots/runtime.png
session/storage-state.json
reproduction/plan.json
reproduction/evidence-limitations.json
reproduction/rebuild-prompt.txt
reproduction/rebuild/starter.html
reproduction/rebuild/starter.css
reproduction/rebuild/starter.tsx
reproduction/rebuild/next-app/
reproduction/self-verify/summary.json
reproduction/self-verify/renderers/*/verification.json
reproduction/self-verify/renderers/*/visual-qa.json
reproduction/self-verify/renderers/*/breakpoints/*-verification.json
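
After a run, a quick way to see which of these files were written is to check the output directory against a list of paths. The sketch below uses a handful of paths from the list above; the helper name and the choice of which files to check are assumptions. Because a run "can" produce these artifacts, an absent file is not necessarily an error.

```python
import tempfile
from pathlib import Path

# A subset of the artifact paths listed above.
EXPECTED = [
    "capture.json",
    "pipeline-run-manifest.json",
    "dom/snapshot.json",
    "screenshots/runtime.png",
    "reproduction/self-verify/summary.json",
]


def missing_artifacts(output_dir, expected=EXPECTED):
    """Return the expected relative paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in expected if not (root / rel).exists()]


# Demo against a temporary directory that holds only capture.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "capture.json").write_text("{}")
    missing = missing_artifacts(tmp)
    print(missing)
```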

Quality Benchmark

Run the default small benchmark:

npm run check:clone-bench:local

Run the universal route regression corpus and expectations gate:

npm run check:benchmark-routes:local

Run a lightweight clone score gate:

npm run check:clone-score-gate:local

Validate the committed benchmark evidence manifest:

npm run check:benchmark-evidence:local

Validate production pipeline gates:

npm run check:production-readiness:local

Run the operational smokes individually:

npm run check:job-queue:local
npm run check:har-replay:local
npm run check:authenticated-corpus:local

Classify failure/action codes from a route report:

npm run classify:pipeline-failures -- --report ./.tmp/universal-route-benchmark/universal-route-report.json

Find low-scoring persisted benchmark artifacts:

npm run summarize:benchmark-scores -- --root ./.tmp --min-score 60 --max-score 70

Run specific URLs:

python3 scripts/check_clone_quality_bench.py \
  https://www.example.com \
  https://www.mozilla.org/ \
  --no-breakpoints

Run a responsive benchmark:

python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --breakpoints mobile tablet

Development Checks

python3 -m py_compile \
  bundle/source-first-clone/mcp/source_first_clone/*.py \
  scripts/check_integration_smoke.py \
  scripts/check_clone_quality_bench.py

npm run check:integration:local

git diff --check

Repo Layout

bundle/source-first-clone: Installed plugin bundle, MCP server, and exact-clone intake skill.
bundle/source-first-clone/mcp/source_first_clone: Capture, planning, rebuild, repair, and verification engine.
bin/web-embedding.mjs: Node CLI wrapper.
python/web_embedding/installer.py: Shared installer and command dispatcher.
scripts/check_clone_quality_bench.py: URL clone quality benchmark helper.
scripts/benchmark_routes.py: Universal route/capture-depth regression benchmark helper.
scripts/check_benchmark_report.py: Benchmark expectation validator for exact, minimum, and contains-style checks.
scripts/check_benchmark_evidence.py: Benchmark evidence manifest validator.
scripts/check_job_queue_smoke.py: Filesystem async clone job queue smoke test.
scripts/check_har_replay_smoke.py: Deterministic HAR replay engine smoke test.
scripts/benchmark_authenticated_corpus.py: User-provided authenticated dashboard corpus runner.
scripts/summarize_benchmark_scores.py: Utility for finding low or high scoring persisted benchmark artifacts under an output root.
scripts/classify_pipeline_failures.py: Operational failure/action taxonomy summarizer for reports and capture artifacts.
scripts/check_production_readiness.py: Production readiness gate validator for corpus, failure taxonomy, CI wiring, and policy docs.
scripts/check_integration_smoke.py: Release, install, and URL-only clone smoke test.
scripts/release_bundle.py: Release artifact builder.
docs/: Architecture notes and universal benchmark documentation.

Positioning

The strongest claim for this project is:

A source-first website cloning engine that combines Playwright capture, HAR replay, MCP tools, and self-verification to rebuild iframe-blocked public pages with reproducible visual, DOM, style, interaction, and responsive scores.

Avoid treating the output as a legal or ownership bypass. The engine can reconstruct public page structure, but permission, licensing, and acceptable use still matter.

License

MIT

webEmbedding

webEmbedding Skill and MCP workflow

GitHub listing, social preview, and launch-copy recommendations are in docs/github-listing.md.

Current Status

The current pipeline is strongest for static and semi-static web pages:

company, brand, marketing, and documentation pages
public landing pages
iframe-blocked pages that need capture-based reconstruction
responsive page snapshots across desktop, tablet, and mobile

Measured Checkpoints

Recent local benchmark runs from this repo:

URL	Path	Score
`https://developer.mozilla.org/en-US/`	iframe-blocked bounded rebuild	root `94`, visual `95`, mobile `94`, tablet `94`, breakpoint average `94`
`https://www.mozilla.org/`	bounded rebuild	root `94`, visual `100`
`https://www.python.org`	harder bounded rebuild sample	root `90`, visual `100`
`https://www.example.com`	exact reuse	ready `yes`

Core Features

Source-first routing:
- direct iframe or embed reuse when it is safe and frameable
- original preview, export, remix, or source routes when available
- bounded rebuild only when exact reuse is unavailable
Live browser capture:
- DOM snapshot
- runtime HTML
- full-page screenshot
- computed style summaries
- CSS analysis
- asset inventory
- HAR-like network metadata
- interaction states and replay traces
- storage state export for session-aware flows
Blocked-site rebuild:
- handles X-Frame-Options and CSP-blocked pages by rebuilding from captured evidence
- generates reusable frontend reconstruction artifacts from captured page structure
- preserves custom tags, shadow-root host structure, and semantic document structure where captured
Evidence limitation reporting:
- separates directly captured artifacts from inferred or missing evidence in reproduction results and prompts
- marks app-gated, auth-gated, and native-app-led surfaces as bounded evidence, with recommendations for user screenshots or authenticated session capture
Operational failure classification:
- reports typed pipeline action codes such as network-replay-limited, auth-session-missing, public-app-gate, and canvas-visual-fallback
- exposes HAR/network replay_readiness before treating captured network evidence as replay-grade
Production pipeline helpers:
- filesystem-backed async clone job queue with durable JSON records, worker locks, retry scheduling, cancellation, and manifest annotation
- deterministic HAR replay engine for standard HAR, near-HAR, and captured network/manifest.json artifacts
- authenticated dashboard live corpus runner that accepts user-provided storage_state_path or user_data_dir outside the repo
Self-verification:
- screenshot similarity
- DOM snapshot similarity
- computed-style similarity
- hover/focus/click interaction state parity
- interaction trace parity
- desktop/mobile/tablet breakpoint reports
Responsive benchmark support:
- primary desktop viewport: 1440x1200
- tablet profile: 768x1024
- mobile profile: 390x844
Repair loop:
- bounded self-repair can run when the first scaffold misses the readiness threshold

Install

Requirements

Node.js 18 or newer
Python 3.9 or newer
Chrome or Chromium available locally for Playwright runtime capture

The package uses playwright-core; it does not download a browser by itself.

Installing this project adds the source-first-clone plugin bundle, the exact-clone-intake skill, and the MCP server that exposes the URL inspection, capture, rebuild, and verification tools.

Install From npm

npm install -g web-embedding
web-embedding install
web-embedding doctor

Clone a public URL after installing:

web-embedding clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

If you already have an older local plugin installed, overwrite it with:

web-embedding install --force
web-embedding doctor

You can also run the installer without a global install:

npx web-embedding install

Use As An MCP Server

For MCP clients that can launch npm stdio servers:

{
  "mcpServers": {
    "source-first-clone": {
      "command": "npx",
      "args": ["-y", "web-embedding@latest", "mcp"]
    }
  }
}

For local smoke testing:

npx web-embedding@latest mcp

The MCP Registry identity is io.github.jongko54/web-embedding; server.json and package.json#mcpName are kept in sync for registry ownership verification.

Hosted Apps SDK Intake Endpoint

The public remote MCP intake endpoint for Apps SDK Developer Mode is:

https://webembedding-mcp.vercel.app/mcp

Sandboxing And Approvals

webEmbedding has two different execution boundaries:

Hosted Apps SDK intake: read-only URL routing and classification only. It accepts absolute http and https URLs, does not run Playwright, does not read local files, does not use browser profiles or storage state, and does not persist capture artifacts.
Local stdio MCP and CLI: full capture, HAR replay, queues, rebuild scaffolds, and self-verify run on the user's machine under the user's local agent and filesystem permissions. Output is written only to caller-provided paths such as output_dir or queue_root.
Authenticated capture: session-aware runs require the caller to intentionally provide a storage_state_path or user_data_dir. webEmbedding does not collect credentials, perform login bypasses, or treat a public login shell as private app evidence.
Access-controlled surfaces: paywalls, captcha flows, private dashboards, payment/checkout/account/admin flows, and native-app-led screens should be blocked, marked needs_session, or sent to manual review unless the user has explicit authorization and supplies the needed evidence.

Agent Marketplaces

This repository includes marketplace metadata for the two local agent surfaces:

Codex: .agents/plugins/marketplace.json points to ./bundle/source-first-clone.
Claude Code: .claude-plugin/marketplace.json points to the same bundle and the bundle includes .claude-plugin/plugin.json.

Claude Code users can add the marketplace from GitHub with:

/plugin marketplace add jongko54/webEmbedding
/plugin install source-first-clone@webembedding

AI auto-selection expectations and golden prompts live in docs/ai-distribution.md and evals/ai-selection/webembedding-golden-prompts.json.

Install From Release

curl -fsSL https://github.com/jongko54/webEmbedding/releases/latest/download/install.sh | bash

Install From This Checkout

git clone https://github.com/jongko54/webEmbedding.git
cd webEmbedding
npm install
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor

Install Into A Temporary Home

Useful for testing without touching your real agent home:

python3 python/web_embedding/installer.py install --target-home ./.tmp/home
python3 python/web_embedding/installer.py doctor --target-home ./.tmp/home
python3 python/web_embedding/installer.py uninstall --target-home ./.tmp/home

Opt-in Telemetry

Enable it during install:

web-embedding install --telemetry --telemetry-endpoint https://your-collector.example/events

Or manage it later:

web-embedding telemetry enable --endpoint https://your-collector.example/events
web-embedding telemetry status
web-embedding telemetry disable
web-embedding telemetry reset-id

Each event contains an anonymous install id, package version, command name, success/failure status, OS/runtime basics, and coarse option flags such as breakpoint_count or install_source.

Environment controls:

WEB_EMBEDDING_TELEMETRY=1
WEB_EMBEDDING_NO_TELEMETRY=1
WEB_EMBEDDING_TELEMETRY_PROMPT=0
WEB_EMBEDDING_TELEMETRY_ENDPOINT=https://your-collector.example/events
WEB_EMBEDDING_TELEMETRY_LOG=./telemetry.jsonl

Run a local/self-hosted JSONL collector:

npm run telemetry:collector -- --host 127.0.0.1 --port 8765 --out ./telemetry.jsonl
WEB_EMBEDDING_TELEMETRY=1 \
WEB_EMBEDDING_TELEMETRY_ENDPOINT=http://127.0.0.1:8765/events \
web-embedding doctor

Summarize collected usage:

npm run telemetry:summarize -- ./telemetry.jsonl

Quick Start

Inspect a URL and get route hints:

node ./bin/web-embedding.mjs inspect \
  --url https://developer.mozilla.org/en-US/

Run a safe preflight audit before capture or clone:

node ./bin/web-embedding.mjs audit \
  --url https://developer.mozilla.org/en-US/

Run the full clone workflow:

node ./bin/web-embedding.mjs clone \
  --url https://developer.mozilla.org/en-US/ \
  --output-dir ./.tmp/mdn-clone \
  --wait-seconds 2 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

Run a lightweight quality benchmark:

python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --output-root ./.tmp/clone-quality-bench \
  --wait-seconds 1 \
  --timeout-seconds 35 \
  --breakpoints mobile tablet

The benchmark prints compact rows for root, visual, and breakpoint scores. The full artifacts are written under the output directory.

CLI Commands

node ./bin/web-embedding.mjs capabilities
node ./bin/web-embedding.mjs install
node ./bin/web-embedding.mjs doctor
node ./bin/web-embedding.mjs uninstall
node ./bin/web-embedding.mjs paths
node ./bin/web-embedding.mjs telemetry status

node ./bin/web-embedding.mjs inspect --url https://www.mozilla.org/

node ./bin/web-embedding.mjs audit --url https://www.mozilla.org/

node ./bin/web-embedding.mjs capture \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/capture-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs reproduce \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/reproduce-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs clone \
  --url https://www.mozilla.org/ \
  --output-dir ./.tmp/clone-mozilla \
  --breakpoints mobile tablet

node ./bin/web-embedding.mjs verify \
  --reference-bundle ./.tmp/reference/capture.json \
  --candidate-bundle ./.tmp/candidate/capture.json

Output Artifacts

A clone run can produce:

capture.json
pipeline-run-manifest.json
dom/snapshot.json
dom/runtime.html
styles/computed-summary.json
styles/css-analysis.json
network/manifest.json
network/har.json
network/har-like.json
network/replay-report.json
assets/inventory.json
interactions/states.json
interactions/trace.json
screenshots/runtime.png
session/storage-state.json
reproduction/plan.json
reproduction/evidence-limitations.json
reproduction/rebuild-prompt.txt
reproduction/rebuild/starter.html
reproduction/rebuild/starter.css
reproduction/rebuild/starter.tsx
reproduction/rebuild/next-app/
reproduction/self-verify/summary.json
reproduction/self-verify/renderers/*/verification.json
reproduction/self-verify/renderers/*/visual-qa.json
reproduction/self-verify/renderers/*/breakpoints/*-verification.json
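
Because a clone run "can" produce these files rather than always producing all of them, a quick presence check is useful before further processing. The sketch below is a hypothetical helper (`artifact_report` is not part of the project) that globs a subset of the paths listed above and reports what it finds. Which artifacts a given run actually writes is not specified here, so an empty match is not necessarily an error.

```python
from pathlib import Path
from typing import Dict, List, Sequence

# A subset of the artifact paths listed above; extend as needed.
EXPECTED_PATTERNS = (
    "capture.json",
    "pipeline-run-manifest.json",
    "dom/snapshot.json",
    "network/har.json",
    "screenshots/runtime.png",
    "reproduction/self-verify/summary.json",
    "reproduction/self-verify/renderers/*/verification.json",
)


def artifact_report(
    output_dir: str, patterns: Sequence[str] = EXPECTED_PATTERNS
) -> Dict[str, List[str]]:
    """Map each pattern to the matching paths, relative to output_dir."""
    root = Path(output_dir)
    report: Dict[str, List[str]] = {}
    for pattern in patterns:
        matches = sorted(
            p.relative_to(root).as_posix() for p in root.glob(pattern)
        )
        report[pattern] = matches  # empty list means nothing matched
    return report
```

For example, `artifact_report("./.tmp/mdn-clone")` returns a dictionary whose empty lists point at artifacts that were not written for that run.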

Quality Benchmark

Run the default small benchmark:

npm run check:clone-bench:local

Run the universal route regression corpus and expectations gate:

npm run check:benchmark-routes:local

Run a lightweight clone score gate:

npm run check:clone-score-gate:local

Validate the committed benchmark evidence manifest:

npm run check:benchmark-evidence:local

Validate production pipeline gates:

npm run check:production-readiness:local

Run the operational smokes individually:

npm run check:job-queue:local
npm run check:har-replay:local
npm run check:authenticated-corpus:local

Classify failure/action codes from a route report:

npm run classify:pipeline-failures -- --report ./.tmp/universal-route-benchmark/universal-route-report.json

Find low-scoring persisted benchmark artifacts:

npm run summarize:benchmark-scores -- --root ./.tmp --min-score 60 --max-score 70
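
The --min-score and --max-score flags select a score band. As a rough illustration of that idea only, the sketch below filters an in-memory mapping of scores. The function name `scores_in_band`, the input shape, and the inclusive bounds are all assumptions. The real script reads persisted artifacts under --root, and its exact boundary handling is not described here.

```python
from typing import Dict, List, Tuple


def scores_in_band(
    scores: Dict[str, float], min_score: float, max_score: float
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs with min_score <= score <= max_score.

    Inclusive bounds are an assumption made for this sketch.
    Results are sorted from lowest to highest score.
    """
    hits = [
        (name, score)
        for name, score in scores.items()
        if min_score <= score <= max_score
    ]
    return sorted(hits, key=lambda item: item[1])
```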

Run specific URLs:

python3 scripts/check_clone_quality_bench.py \
  https://www.example.com \
  https://www.mozilla.org/ \
  --no-breakpoints

Run a responsive benchmark:

python3 scripts/check_clone_quality_bench.py \
  https://developer.mozilla.org/en-US/ \
  --breakpoints mobile tablet

Development Checks

python3 -m py_compile \
  bundle/source-first-clone/mcp/source_first_clone/*.py \
  scripts/check_integration_smoke.py \
  scripts/check_clone_quality_bench.py

npm run check:integration:local

git diff --check

Repo Layout

bundle/source-first-clone: Installed plugin bundle, MCP server, and exact-clone intake skill.
bundle/source-first-clone/mcp/source_first_clone: Capture, planning, rebuild, repair, and verification engine.
bin/web-embedding.mjs: Node CLI wrapper.
python/web_embedding/installer.py: Shared installer and command dispatcher.
scripts/check_clone_quality_bench.py: URL clone quality benchmark helper.
scripts/benchmark_routes.py: Universal route/capture-depth regression benchmark helper.
scripts/check_benchmark_report.py: Benchmark expectation validator for exact, minimum, and contains-style checks.
scripts/check_benchmark_evidence.py: Benchmark evidence manifest validator.
scripts/check_job_queue_smoke.py: Filesystem async clone job queue smoke test.
scripts/check_har_replay_smoke.py: Deterministic HAR replay engine smoke test.
scripts/benchmark_authenticated_corpus.py: User-provided authenticated dashboard corpus runner.
scripts/summarize_benchmark_scores.py: Utility for finding low or high scoring persisted benchmark artifacts under an output root.
scripts/classify_pipeline_failures.py: Operational failure/action taxonomy summarizer for reports and capture artifacts.
scripts/check_production_readiness.py: Production readiness gate validator for corpus, failure taxonomy, CI wiring, and policy docs.
scripts/check_integration_smoke.py: Release, install, and URL-only clone smoke test.
scripts/release_bundle.py: Release artifact builder.
docs/: Architecture notes and universal benchmark documentation.

Positioning

The strongest claim for this project is:

A source-first website cloning engine that combines Playwright capture, HAR replay, MCP tools, and self-verification to rebuild iframe-blocked public pages with reproducible visual, DOM, style, interaction, and responsive scores.

Do not treat the output as a way around legal or ownership restrictions. The engine can reconstruct public page structure, but permission, licensing, and acceptable use still apply.

License

MIT
