This orchestrates Nano Banana and Kling AI to generate talking head videos that don't scream "AI generated." The workflow is thorough: generate a base image with intentional imperfections like pores and mixed lighting, chunk your script into 55-60 syllable segments for natural pacing, then generate 10-second video clips with choreographed micro-movements. The documentation is refreshingly honest about the hands problem (AI models can't do realistic hand movement, so crop them out or keep them static). It's a multi-step process that requires post-production, but if you need UGC-style spokesperson videos and care about them looking real, this gives you the specific prompting strategies and pacing math to pull it off.
npx -y skills add dennisonbertram/claude-media-skills --skill realistic-ugc-video --agent claude-code

Installs into .claude/skills of the current project.
Create long-form AI videos that look and sound authentically human. This skill orchestrates a multi-step workflow using Nano Banana for realistic base images and Kling AI for video generation, with specific techniques to avoid the "AI look".
| Problem | Solution |
|---|---|
| Too perfect/clean skin | Add imperfections: micro-pores, natural oils, fine lines |
| Studio lighting | Use available/natural light, mixed color temps |
| Character too still | Add micro-movements, head tilts, natural sway |
| Inconsistent pacing | Use 55-60 syllables per clip |
| Robotic voice | Process through Adobe Podcast or Resemble AI |
| Obvious jump cuts | Cover with B-roll or animations |
| Weird AI hands | Crop hands out of frame or keep completely static |
AI video models (Kling, Veo, etc.) struggle with realistic hand movement. Fingers morph, gestures look unnatural, and hands are often the biggest tell.
Best practice: Keep hands OUT of frame or completely static.
Options:
Before starting, gather from the user:
Use the Nano Banana skill to generate the character image. Critical: Apply the imperfection techniques from CHARACTER-PROMPTING.md.
Key elements for realistic UGC:
Command:
~/.claude/skills/nano-banana/scripts/generate.sh "[enhanced prompt]" --aspect 9:16 --size 2K
Then optionally upscale through Enhancor AI for additional texture.
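When the enhanced prompt contains quotes or line breaks, pasting it into a shell can get awkward. The sketch below builds the same invocation as an argument list, so the prompt is passed through untouched. The script path and the `--aspect` / `--size` flags are copied from the command above; the function name `build_generate_command` and the sample prompt are illustrative assumptions, not part of the skill.

```python
import os
import shlex

# Path and flags are taken from the command shown above.
GENERATE_SCRIPT = "~/.claude/skills/nano-banana/scripts/generate.sh"


def build_generate_command(prompt: str, aspect: str = "9:16", size: str = "2K") -> list[str]:
    """Return the argv list for the Nano Banana generate script (hypothetical helper)."""
    return [
        os.path.expanduser(GENERATE_SCRIPT),  # expand "~" to the home directory
        prompt,                                # passed as one argument, no shell quoting needed
        "--aspect", aspect,
        "--size", size,
    ]


if __name__ == "__main__":
    cmd = build_generate_command('A vertical 9:16 UGC-style frame, "no filter" look')
    # Print a copy-pasteable, safely quoted version of the command.
    print(shlex.join(cmd))
    # To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids a shell entirely, which is why the prompt needs no escaping.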
This is the most important step for natural pacing. See SCRIPT-CHUNKING.md.
The 55-60 Syllable Rule:
Example chunking:
Chunk 1 (58 syllables):
"Hey everyone, I wanted to share something that completely changed how I think about productivity. It's not another app or system."
Chunk 2 (56 syllables):
"It's actually about understanding your own energy patterns throughout the day. Once I figured this out, everything clicked into place."
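Counting syllables by hand gets tedious for long scripts. As a rough aid, the sketch below estimates syllables with a vowel-group heuristic and greedily packs whole sentences into chunks that stay under a ceiling (60 by default, the top of the 55-60 range, which over a 10-second clip works out to about 5.5 to 6 syllables per second). The function names `estimate_syllables` and `chunk_script` are hypothetical, and the heuristic is an assumption that miscounts some English words, so treat its numbers as a first pass and check each chunk by reading it aloud.

```python
import re


def estimate_syllables(word: str) -> int:
    """Rough estimate: count vowel groups and discount a trailing silent 'e'."""
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return 0
    count = len(re.findall(r"[aeiouy]+", w))
    # "place" -> 1, but keep "table" (ends in "le") and "three" (ends in "ee").
    if w.endswith("e") and not w.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def chunk_script(script: str, max_syllables: int = 60) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_syllables."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current: list[str] = []
    total = 0
    for sentence in sentences:
        if not sentence:
            continue
        n = sum(estimate_syllables(w) for w in sentence.split())
        # Start a new chunk when adding this sentence would exceed the ceiling.
        if current and total + n > max_syllables:
            chunks.append(" ".join(current))
            current, total = [], 0
        current.append(sentence)
        total += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than the ceiling is kept as its own oversized chunk instead of being split mid-sentence; such sentences are better rewritten by hand. The sketch also only enforces the upper bound, so short trailing chunks still need padding or merging to reach the 55-syllable floor.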
For each script chunk, generate a 10-second video clip. Use the Kling AI skill with movement prompts from MOVEMENT-PROMPTING.md.
Key elements:
Spawn background agent for each clip:
Task tool:
- subagent_type: "general-purpose"
- run_in_background: true
- prompt: [Include image URL, script chunk, movement prompt, output path]
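To keep the per-clip prompts consistent, it can help to plan all the jobs up front before spawning any agents. The sketch below only assembles the four pieces of information listed in the prompt field above (image URL, script chunk, movement prompt, output path); it does not call the Task tool. The `ClipJob` class, `plan_clip_jobs` function, output naming scheme, and prompt wording are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClipJob:
    """One background-agent job for a single 10-second clip (hypothetical structure)."""
    index: int
    image_url: str
    script_chunk: str
    movement_prompt: str
    output_path: str

    def to_prompt(self) -> str:
        """Render the text handed to the background agent."""
        return (
            f"Generate clip {self.index} with the Kling AI skill.\n"
            f"Image URL: {self.image_url}\n"
            f"Script chunk: {self.script_chunk}\n"
            f"Movement prompt: {self.movement_prompt}\n"
            f"Output path: {self.output_path}"
        )


def plan_clip_jobs(image_url: str, chunks: list[str], movement_prompt: str,
                   out_dir: str = "clips") -> list[ClipJob]:
    """Create one job per script chunk, numbered from 1 so clips sort in order."""
    return [
        ClipJob(i, image_url, chunk, movement_prompt, f"{out_dir}/clip_{i:02d}.mp4")
        for i, chunk in enumerate(chunks, start=1)
    ]


if __name__ == "__main__":
    jobs = plan_clip_jobs("https://example.com/base.png",
                          ["First chunk.", "Second chunk."],
                          "Subtle head tilts, natural blinks")
    for job in jobs:
        print(job.to_prompt(), end="\n\n")
```

Zero-padded file names keep the clips in script order when they are later assembled in post-production.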
See POST-PRODUCTION.md for detailed guidance.
Before generating the full video, test the workflow with one clip:
A vertical 9:16 UGC-style video frame captured on an iPhone 11 resting on a tripod.
Medium-wide portrait at true eye-level with slightly forward-leaning posture.
[CHARACTER]: A [age] [gender] with [ethnicity] complexion, [eye color] eyes beneath
[eyebrow description], [nose description]. [Jawline/facial hair]. [Hair description].
Expression is [emotion]—[specific expression details].
[CLOTHING]: [Fitted/casual garment], [collar detail].
[SKIN TEXTURE]: Visible pores across T-zone, faint smile lines, natural oils catching
light on forehead and nose. [Age-appropriate details]. No filter, no foundation.
[FOREGROUND]: Hands rest naturally on [surface], fingers relaxed, visible veins and
knuckle texture. Nearby: [everyday objects like water bottle, phone, notebook].
[CAMERA]: Native iPhone 11 lens (26mm equivalent), slightly wide perspective, mild
barrel softness at edges. Only tiny pockets of neural blur around hair edges.
[LIGHTING]: Available light mix—cool overcast daylight from window left, warm tungsten
from desk lamp right. Soft asymmetric shadows, natural falloff. ISO noise 500-900.
[BACKGROUND]: [Realistic home/office elements]—bookshelf, [furniture], clearly visible
not heavily blurred.
[REALITY DETAILS]: Gentle 35mm film grain, light fingerprint smudge on lens, tiny dust
haze in air. No cinematic bloom, no studio finish.
Styling: raw UGC realism, available indoor light, mixed color temperature, minimal
depth blur, visible ISO noise, emphasis on authenticity.
Hand "Home Base" Protocol: Hands default to Active Idle. Fingers shift, thumbs rub,
wrists rotate slightly while anchored. Gestures only for key emphasis.
[0.0s-0.5s] Pre-roll: Sharp inhale, eyes lock to lens, head still
[0.5s-3.0s] Hands in Active Idle (fingers interlocked), head tilts slightly right,
brows furrow in seriousness
[3.0s-6.0s] Hands break clasp for quick open-palm rotation then return, head drifts
forward, natural blink
[6.0s-8.0s] Hands return to Active Idle (loose clasp, thumbs tapping), head nods
encouragingly, cheeks lift in natural smile
[8.0s-10.0s] Hands anchored (wrist shifts), chin lifts in quick final nod, natural blink
[Script]: "[CHUNK TEXT HERE]"
[Tone]: [Urgent/Conversational/Professional/etc.]
[Pacing]: Rapid fire delivery, high energy, viral UGC style, confident, 2x speed
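When the timeline is edited per clip, it is easy to leave a gap or overrun the 10-second window. The sketch below renders a movement prompt in the same layout as the template above and first checks that the beats start at 0.0s, run back to back, and end at 10.0s. The beat text is copied from the template; the names `BEATS`, `validate_timeline`, and `render_movement_prompt` are hypothetical, and the check itself is an assumption about what makes a timeline well formed, not a requirement documented by Kling AI.

```python
import math

# (start_seconds, end_seconds, description), copied from the template above.
BEATS = [
    (0.0, 0.5, "Pre-roll: Sharp inhale, eyes lock to lens, head still"),
    (0.5, 3.0, "Hands in Active Idle (fingers interlocked), head tilts slightly right, brows furrow in seriousness"),
    (3.0, 6.0, "Hands break clasp for quick open-palm rotation then return, head drifts forward, natural blink"),
    (6.0, 8.0, "Hands return to Active Idle (loose clasp, thumbs tapping), head nods encouragingly, cheeks lift in natural smile"),
    (8.0, 10.0, "Hands anchored (wrist shifts), chin lifts in quick final nod, natural blink"),
]


def validate_timeline(beats, clip_seconds: float = 10.0) -> None:
    """Raise ValueError unless beats are contiguous from 0 to clip_seconds."""
    expected_start = 0.0
    for start, end, _ in beats:
        if not math.isclose(start, expected_start) or end <= start:
            raise ValueError(f"beat {start}s-{end}s does not follow {expected_start}s")
        expected_start = end
    if not math.isclose(expected_start, clip_seconds):
        raise ValueError(f"timeline ends at {expected_start}s, expected {clip_seconds}s")


def render_movement_prompt(beats, script: str, tone: str, pacing: str) -> str:
    """Render the beats plus script, tone, and pacing lines as one prompt string."""
    validate_timeline(beats)
    lines = [f"[{start:.1f}s-{end:.1f}s] {text}" for start, end, text in beats]
    lines += [f'[Script]: "{script}"', f"[Tone]: {tone}", f"[Pacing]: {pacing}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_movement_prompt(BEATS, "CHUNK TEXT HERE", "Conversational",
                                 "Rapid fire delivery, high energy"))
```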
For simpler long-form videos, consider InfiniteTalk (infinitetalk.ai).
Pros: Single generation for longer videos

Cons: Less control over pacing (no timer/duration control), charged by output length
Use the syllable method above when precise pacing control is needed.