This server turns any social video into agent-readable material: evenly spaced JPG frames plus a full transcript. Under the hood it chains yt-dlp, ffmpeg, and Kyma's Whisper-class ASR into a single stdio call. Works across YouTube, X, LinkedIn, TikTok, Reddit, Vimeo, and Facebook, with automatic browser cookie fallback for login-walled posts. You get the raw frames and transcript text instead of a pre-digested summary, so your agent can reason over what was actually shown and said. Useful when you want Claude to extract architecture diagrams from conference talks, clone a UI demo into working React, or convert a coding walkthrough into runnable project files.
Watch any social video → get an architecture diagram, working component, runnable notebook, or step-by-step cheat sheet — automatically.
Eyes and ears for your AI agent. watch-cli composes yt-dlp + ffmpeg + a Whisper-class ASR into a single command that hands an agent the raw materials to "watch" any video: VIDEO + FRAMES + TRANSCRIPT, ready for an LLM to read frames as images and transcript as text.
watch https://twitter.com/anyone/status/12345
Works on YouTube, X, LinkedIn, TikTok, Reddit, Vimeo, and Facebook. Login-walled posts (LinkedIn, private X, FB) fall back to your browser cookies automatically.
Hand the watch output to your agent with one of five prompts in prompts/:
| Drop in a video of… | Get back |
|---|---|
| A coding walkthrough | Working project files |
| A system architecture talk | Interactive architecture diagram |
| A UI / motion demo | Working React component |
| A paper or research talk | Runnable notebook |
| A long tutorial | Step-by-step cheat sheet |
The prompt library is what turns "video → frames + transcript" into "video → working artifact". The full Prompt library section below has copy-paste templates.
Large language models can't watch video natively — they read text and look at still images. You can hand a video to a multimodal API and get back a chat-style summary, but for an agent workflow that's the wrong artifact: the agent wants the raw frames and the full transcript so it can reason for itself, not someone else's pre-digested recap.
A video is just frames + audio, and each piece already has a fast, near-free primitive:
yt-dlp downloads from any social platformffmpeg extracts evenly-spaced framesCompose them and your agent has the materials to watch any social video.
$ watch https://www.linkedin.com/posts/some-talk_activity-12345
VIDEO: /tmp/dl-video/abc123.mp4
DURATION: 218
FRAMES:
/tmp/frames_abc123/frame_01.jpg
/tmp/frames_abc123/frame_02.jpg
…
TRANSCRIPT:
Today I want to talk about how decomposition unlocks 10× cost reduction in
multimodal pipelines …
Your agent reads the JPGs and the transcript. That's the whole watch.
Most subscription summary tools start around $15/month and deliver a polished, human-readable summary. If you're feeding an AI agent, that's the wrong artifact — agents need raw frames and the full transcript to reason for themselves, not someone else's pre-digested recap.
A typical research session is 1–3 videos, not 100. Through Kyma — the default backend — a 1-hour video costs ~$0.05 (transcribe is the only paid step; frame extraction is local ffmpeg).
| This month you watch | You pay |
|---|---|
| 0 videos | $0 |
| 1 one-hour video | ~$0.05 |
| 100 one-hour videos | ~$5 |
No monthly minimum, no seat license, no lock-in. The free credit at Kyma signup is enough to run the full pipeline end-to-end before you spend a cent.
# macOS — Homebrew (recommended)
brew tap sonpiaz/tap
brew install watch-cli
# Any OS — curl
curl -fsSL https://github.com/sonpiaz/watch-cli/releases/latest/download/install.sh | bash
The curl one-liner auto-falls back to
git cloneofmainif no published release tarball is reachable.
If you use Claude Code, install watch-cli as a skill:
/plugin marketplace add sonpiaz/watch-cli
/plugin install watch-cli@watch-cli
The agent then picks up watch <url> as a first-class command.
Pin a specific version:
WATCH_CLI_VERSION=0.3.0 curl -fsSL \
https://github.com/sonpiaz/watch-cli/releases/download/v0.3.0/install.sh | bash
Or from a clone:
git clone https://github.com/sonpiaz/watch-cli ~/.watch-cli
cd ~/.watch-cli && ./install.sh
The installer checks for yt-dlp, ffmpeg, jq, curl, python3 and
symlinks the commands into ~/.local/bin. On macOS:
brew install yt-dlp ffmpeg jq
On Debian/Ubuntu:
sudo apt install yt-dlp ffmpeg jq python3 curl
./install.sh --with-skill # also drop SKILL.md into ~/.claude/skills/watch-cli/
./install.sh --with-mcp # print the npm install hint for the MCP stdio server
--with-skill copies the portable SKILL.md into ~/.claude/skills/watch-cli/
so Claude Code picks up watch-cli as a skill on next start. The same file
works in OpenClaw and hermes-agent — see SKILL.md.--with-mcp prints the manual install line for @sonpiaz/watch-cli-mcp,
the MCP stdio server that exposes watch-cli to Claude Desktop, Cursor, Cline,
Continue.dev, Windsurf, Zed, and any other MCP-capable client. The flag will
auto-install once the package is published to npm.export KYMA_API_KEY=kyma-xxxxxxxx
Get a Kyma key at kymaapi.com — 60 seconds, no card.
Prefer bring-your-own-keys? Comment in GROQ_API_KEY and GOOGLE_AI_KEY
in .env.example and watch-cli falls back to direct provider calls.
watch-cli uses Kyma as its AI backend. A few things you get for free:
transcribe, audio-understand). When Kyma swaps
in a better model behind the alias, your scripts keep working unchanged.The badges above pull live from api.kymaapi.com/api/stats, so the model
count and free-credit number stay current without a watch-cli release.
watch <url> [frame-count] [--cookies <file>]
Orchestrator. Downloads, extracts frames, transcribes — one block out.
dl-video <url> [out-dir] [--cookies <file>]
Just download the video. Returns the local mp4 path.
extract-frames <video> [count] [out-dir]
Pull N evenly-spaced JPG frames. Default 8.
transcribe <audio-or-video> [language]
Speech-to-text. Auto-extracts audio from video first.
audio-q <audio-or-video> "<question>"
Audio scene Q&A — tone, music, SFX, language, emotion.
Beyond pure transcription.
models [--all]
List audio models available on Kyma (live, no hardcoded list).
--all to see every Kyma SKU (text + image + video + audio).
transcribe and audio-q stay currentThe scripts call Kyma using the transcribe and audio-understand aliases,
not raw model IDs. When Kyma swaps the underlying model (Whisper v4,
Voxtral, a faster ASR), watch-cli keeps working without an update — the
alias points to whichever model is current. Run watch-cli models any time
to see what's behind the alias today.
Most YouTube / TikTok / Reddit / Vimeo / public X work without setup. LinkedIn, private X posts, and Facebook need a session.
watch-cli auto-detects cookies from any signed-in browser (Chrome → Firefox → Safari → Edge → Brave → Chromium). Just sign in normally and re-run.
For servers / CI without browsers, pass a manual cookies file:
watch <url> --cookies ~/cookies.txt
Full setup walkthrough: docs/cookies.md.
You have access to a `watch` command that takes a URL and returns
a video, 8 frames, and the transcript. Read the frames as images and
the transcript as text — that's enough to "watch" any social video.
The output block is structured so an agent can parse it without help:
VIDEO: line, FRAMES: block (one path per line), TRANSCRIPT: block.
Beyond the generic prompt above, five copy-paste prompts in
prompts/ turn watch output into a specific artifact:
| Goal | File |
|---|---|
| Coding walkthrough → working project | implement-from-video.md |
| System talk → interactive architecture diagram | extract-architecture.md |
| UI / motion demo → working React component | clone-ux.md |
| Paper / research talk → runnable notebook | paper-to-code.md |
| Long tutorial → step-by-step cheat sheet | tutorial-walkthrough.md |
Paste the chosen prompt above the watch output, hand the whole thing
to your agent.
Drop skills/watch-cli/ into your
~/.claude/skills/ folder and the agent will pick up /watch <url>
as a first-class command, including the prompt library above.
mkdir -p ~/.claude/skills
cp -r skills/watch-cli ~/.claude/skills/
URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
│
└──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ Kyma /v1/audio/transcriptions
│ (Whisper Large v3 Turbo, 228× realtime)
│
└──▶ Kyma /v1/audio/understand
(Gemini 3 Flash audio — tone/music/SFX)
Each step is a primitive. None of them needs a vision LLM.
Built something cool from a video? Drop it in Discussions under Show and tell. Post the source URL, the prompt you used, and your artifact. Curated highlights make it back into the README.
Watch-cli is fast and cheap because it composes primitives instead of calling a video LLM. The tradeoffs are honest.
Transcription is the only paid step. Frame extraction is local ffmpeg, free.
| Video length | Transcribe cost |
|---|---|
| 5 minutes (tweet, short demo) | ~$0.005 |
| 1 hour (LinkedIn talk, podcast) | ~$0.05 |
| 2 hours (conference talk) | ~$0.11 |
Free credit at Kyma signup covers about 9 hours of transcribe. A BYOK
path is available — see .env.example.
watch <url> 24.audio-q for a scene description instead.ffmpeg -ss before piping.audio-q for any sound design instead.| Video type | Recommended frame-count |
|---|---|
| Short tweet / clip (<2 min) | 4 to 8 (default) |
| Standard tutorial / talk (5–20 min) | 8 to 16 |
| Long talk / lecture (20–60 min) | 16 to 24 |
| Conference talk / multi-hour (>1 hr) | 24 to 32 |
| Fast-cut or dense UI demo | Double the recommendation for that length |
MIT. © 2026 Son Piaz.
KYMA_API_KEYsecretKyma API key (default backend). Get one at https://kymaapi.com — first $0.50 free, then pay-per-use (~$0.05 per 1-hour video). Optional if WATCH_AUDIO_MODE=local (whisper.cpp) or GROQ_API_KEY is set instead.
GROQ_API_KEYsecretGroq API key for direct BYOK transcription (alternative to Kyma). Set this instead of KYMA_API_KEY if you prefer direct provider.
WATCH_AUDIO_MODEBackend routing override. Values: kyma (default), byok (use GROQ_API_KEY), local (whisper.cpp on $PATH, fully offline).
io.github.ericm1018/skillfm-llm-cost-optimizer-openai-anthropic-usage
io.github.mikerawsonnz/llm-orchestration-agent
io.github.mikerawsonnz/authenticated-llm-agent
labforgedev/copilot-memory-mcp
csoai-org/agent-prompt-injection-firewall-mcp
io.github.mikerawsonnz/authenticated-multi-llm-agent