Remotion + React 19 + AI

HackStudio Pro

AI-powered video production studio that turns research into broadcast-ready YouTube documentaries — in two languages, from a single codebase.

17+ Research Platforms
2 Languages (CN/EN)
1 Codebase
0 Video Editors
5 Pre-Render Checks

Video production is a software problem

Making one high-quality documentary takes weeks. Doing it in two languages doubles the work. Scaling to 100+ videos? Impossible by hand. HackStudio Pro collapses that timeline by treating every part of the pipeline as code.

Scripts are TypeScript data

Not timelines dragged in Premiere. Content lives in .ts files with full type safety and IDE autocomplete.

Timing is computed from audio

Not dragged on a scrubber. Word-level timestamps from MiniMax TTS drive the entire timeline automatically.

Animations are React components

Version-controlled, composable, reusable. Charts, diagrams, and maps render as JSX over B-roll backgrounds.

Bilingual output is a prop change

lang="cn" or lang="en" — same pipeline, same components, same render command.
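As a rough illustration of "scripts as TypeScript data" and "bilingual output as a prop change", the sketch below keeps one typed array per language and selects between them with a single `lang` value. The `ScriptLine` shape, the `contentByLang` map, and `getLines` are hypothetical names chosen for this example; the real content-cn.ts and content-en.ts files may be organized differently.

```typescript
// Hypothetical shapes: the real content-cn.ts / content-en.ts may differ.
type Lang = "cn" | "en";

interface ScriptLine {
  part: number; // which Part the line belongs to
  lineIdx: number; // position of the line inside its Part
  text: string; // narration as it will be spoken
}

// Placeholder narration, not taken from a real script.
const contentCn: ScriptLine[] = [
  { part: 1, lineIdx: 0, text: "第一句旁白。" },
  { part: 1, lineIdx: 1, text: "第二句旁白。" },
];

const contentEn: ScriptLine[] = [
  { part: 1, lineIdx: 0, text: "First narration line." },
  { part: 1, lineIdx: 1, text: "Second narration line." },
];

const contentByLang: Record<Lang, ScriptLine[]> = {
  cn: contentCn,
  en: contentEn,
};

// The only input that changes between the two outputs is `lang`.
function getLines(lang: Lang, part: number): ScriptLine[] {
  return contentByLang[lang].filter((line) => line.part === part);
}
```

Because `Lang` is a union type, a typo such as `getLines("jp", 1)` fails at compile time rather than at render time.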

From raw sources to rendered .mp4

Nine phases. Three of them — Concept, Editor Pass, Validation — exist because we learned the hard way that render time is too expensive to waste on avoidable mistakes.

Phase 01.5

Concept

Every video starts with an editorial angle — what's the gap between how Chinese and Western audiences see this story? video-concept.md pins down spine and tone before any researcher runs.

Human-First

Phase 01

Research

AI agents search 17+ platforms across Chinese and English ecosystems. Every claim gets bilingual triangulation with 3+ independent sources, saved into a dossier of transcripts, facts, perspectives, and visuals.

AI-Driven

Phase 02

Script

Research becomes structured TypeScript — narration lines, section titles, chart labels, and verified data points. Must sound spoken, not written: short sentences, no em dashes.

AI-Driven

Phase 03

B-Roll Sourcing

Videos from official channels only, analyzed frame-by-frame with Gemini 3.1 Flash Lite — fast, cheap, excellent Chinese OCR. Each clip gets a .analysis.md saved beside the .mp4.

AI-Driven

Phase 03.5

Editor Pass

The video-editor skill scores the script and auto-picks a documentary director persona — Adam Curtis, Errol Morris, or Alex Gibney — emitting role-tagged B-roll with director-voiced rationale.

AI + Human Review

Phase 04

TTS Generation

MiniMax T2A v2 generates voiceover with voice_modify for passionate delivery. Word-level timestamps mean every subtitle highlights in perfect sync with speech.

AI + Code

Phase 05

Build (Sequence Kinds)

Each line becomes a typed sequence — video, chart, title, quote, or ending. PartRenderer dispatches to focused renderers. Calm static backgrounds for data; moving B-roll for narrative.

Code

Phase 05.5

Validation Harness

Five pre-render checks: counts consistency, TTS integrity, breathing time, B-roll overlap, text density. Must pass before the expensive render. Catches what humans miss.

Code

Phase 06

Render

One remotion render command per language outputs a broadcast-ready .mp4, one for Chinese and one for English. Add --gl=angle --concurrency=1 for Mapbox maps.

Code

Sequences have kinds now

A Part used to be "video background with chart overlays". That model broke the moment data got complex — charts fought moving B-roll for attention, and silent title cards dropped the audio. The new model treats each narration line as a typed entry with a kind that picks the right renderer and the right background.

kind: "video"

Moving B-roll

Standard narration. VideoBackground plays a B-roll clip with startFrom trimming. Glassmorphism caption floats on top.

kind: "chart"

Calm static canvas

Data visualizations get a StaticBackground, a tonal gradient that doesn't compete for attention. The breathing-time validator enforces a minimum of 4 seconds.

kind: "title"

No more silent cards

Part titles are tied to a narration line (typically lineIdx: 0). Audio never drops out. Minimum 2.5s breathing time.

kind: "quote"

Pull quotes + key claims

Typography-first composition on calm background. Minimum 3.5s breathing time for the line to land.

kind: "ending"

Closing beat + CTA

Returns to moving B-roll for emotional punctuation. Consumes the final slot in the broll-manifest.

PartRenderer

The dispatcher

Shared across every video. Routes each SequenceEntry to one of five focused renderers. Data flows in as arguments — zero video-specific imports in src/shared/.

Reusable
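The kind-based model described above maps naturally onto a TypeScript discriminated union. The sketch below is illustrative only: the five `kind` values and the `VideoBackground` / `StaticBackground` names come from the text, while the per-kind fields and the `backgroundFor` helper are assumptions. It returns a background name instead of JSX so it stays free of Remotion imports.

```typescript
// Hypothetical field names; only the five `kind` values come from the text.
type SequenceEntry =
  | { kind: "video"; lineIdx: number; src: string; startFrom: number }
  | { kind: "chart"; lineIdx: number; chartId: string }
  | { kind: "title"; lineIdx: number; title: string }
  | { kind: "quote"; lineIdx: number; quote: string }
  | { kind: "ending"; lineIdx: number; src: string };

type Background = "VideoBackground" | "StaticBackground";

// Picks the background for an entry. The `never` assignment makes the
// compiler reject this function if a kind is added without a matching case.
function backgroundFor(entry: SequenceEntry): Background {
  switch (entry.kind) {
    case "video":
    case "ending":
      return "VideoBackground"; // moving B-roll for narrative beats
    case "chart":
    case "title":
    case "quote":
      return "StaticBackground"; // calm canvas for data and typography
    default: {
      const unreachable: never = entry;
      throw new Error(`Unhandled sequence kind: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

A real dispatcher would return a renderer component per case rather than a string, but the exhaustiveness check works the same way.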

Auto-director for B-roll

Picking B-roll is an editorial decision, not a technical one. The video-editor skill reads the script, scores it against three documentary director profiles, and picks the one whose voice matches the story. The result is B-roll that feels edited, not assembled.

Persona A

Adam Curtis

Systems, irony, archive juxtaposition. Picks texture clips that quietly undercut the narration. Works for stories about institutions or ideologies.

Systems

Persona B

Errol Morris

Human portraiture, close-ups, interrogative stillness. Favors faces and objects over action. Works for stories about individuals and their contradictions.

Portraiture

Persona C

Alex Gibney

Institutional accountability, evidence, tension. Favors documents, newsreels, official footage. Works for stories about power and its consequences.

Accountability

Each clip is assigned a role (anchor, texture, counterpoint, or transition) along with a director-voiced rationale. A validator confirms that the proposed distribution matches the chosen director's target role mix. After human review, rename broll-manifest.proposed.ts to broll-manifest.ts and move on.
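One way to picture the role-mix validator is as a comparison between the share of each role in the proposal and a target share per director. In the sketch below the four role names come from the text; the target numbers, the 0.15 tolerance, and the function names are invented for illustration and are not the skill's real values.

```typescript
type Role = "anchor" | "texture" | "counterpoint" | "transition";
type RoleMix = Record<Role, number>; // fraction of clips per role

const ROLES: Role[] = ["anchor", "texture", "counterpoint", "transition"];

// Illustrative numbers only: the real per-director targets are not documented here.
const exampleTarget: RoleMix = {
  anchor: 0.4,
  texture: 0.3,
  counterpoint: 0.2,
  transition: 0.1,
};

// Share of each role among the proposed clips.
function roleShares(roles: Role[]): RoleMix {
  const counts: RoleMix = { anchor: 0, texture: 0, counterpoint: 0, transition: 0 };
  for (const role of roles) counts[role] += 1;
  const total = Math.max(roles.length, 1); // avoid dividing by zero on an empty manifest
  return {
    anchor: counts.anchor / total,
    texture: counts.texture / total,
    counterpoint: counts.counterpoint / total,
    transition: counts.transition / total,
  };
}

// Roles whose share deviates from the target by more than `tolerance`.
function rolesOutOfRange(roles: Role[], target: RoleMix, tolerance = 0.15): Role[] {
  const actual = roleShares(roles);
  return ROLES.filter((role) => Math.abs(actual[role] - target[role]) > tolerance);
}
```

An empty result means the proposal is within tolerance; a non-empty one tells the reviewer which roles to rebalance before promoting the manifest.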

Five checks that must pass before render

Remotion renders are expensive: each Part takes minutes, and every re-render costs that time again. The validation harness runs five static checks on the manifests before you ever spin up the encoder. A 🔴 fatal finding blocks the render; a ⚠ informational finding is a warning you can accept.

counts-consistency
Sequences, content, and alignment manifests must all agree on line count. Off-by-one errors here would mean subtitles dropping mid-sentence.
tts-integrity
Flags audio truncation. The tail of each file should be silent, so amplitude above 0.08 within the closing 200ms window means the MiniMax model cut off the sentence before it finished.
breathing-time
Minimum sequence durations: chart ≥ 4s, title ≥ 2.5s, quote ≥ 3.5s. Anything less reads as rushed.
broll-overlap
Two sequences may never share an overlapping time range from the same source file. Visual repetition kills editorial trust.
text-density
B-roll clip start-frame must land before 67% of the text. If the image only arrives after the caption finishes, there's no breathing room.
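Two of the checks above, counts-consistency and breathing-time, can be sketched as pure functions that return findings with a severity. The minimum durations come from the list; the `Finding` shape, the function names, and the choice to treat both checks as fatal are assumptions made for this example, not the actual validate-video.ts implementation.

```typescript
type Kind = "video" | "chart" | "title" | "quote" | "ending";
type Severity = "fatal" | "warning";

interface Finding {
  check: string;
  severity: Severity;
  message: string;
}

interface TimedSequence {
  kind: Kind;
  lineIdx: number;
  durationSec: number;
}

// Minimums from the breathing-time rule; video and ending have none stated.
const MIN_SECONDS: Partial<Record<Kind, number>> = { chart: 4, title: 2.5, quote: 3.5 };

function checkBreathingTime(sequences: TimedSequence[]): Finding[] {
  const findings: Finding[] = [];
  for (const seq of sequences) {
    const min = MIN_SECONDS[seq.kind];
    if (min !== undefined && seq.durationSec < min) {
      findings.push({
        check: "breathing-time",
        severity: "fatal", // assumption: the text does not say which checks are fatal
        message: `line ${seq.lineIdx} (${seq.kind}) lasts ${seq.durationSec}s, needs ${min}s`,
      });
    }
  }
  return findings;
}

function checkCountsConsistency(counts: {
  sequences: number;
  content: number;
  alignment: number;
}): Finding[] {
  const { sequences, content, alignment } = counts;
  if (sequences === content && content === alignment) return [];
  return [
    {
      check: "counts-consistency",
      severity: "fatal",
      message: `line counts differ: sequences=${sequences}, content=${content}, alignment=${alignment}`,
    },
  ];
}

// The render is blocked only when at least one finding is fatal.
function canRender(findings: Finding[]): boolean {
  return findings.every((finding) => finding.severity !== "fatal");
}
```

Keeping each check as a function from manifest data to findings means the harness can run all five, print every problem at once, and decide about the render in a single place.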

Every video is self-contained

Shared rendering infrastructure is reused across all videos. Adding a new video = new folder + data files + one import.

src/
├── shared/               # Reusable across ALL videos
│   ├── components/     # PartRenderer, SubtitleOverlay...
│   ├── lib/            # colors, fonts, timing, audio math
│   └── schemas/        # VideoSchema (lang: cn | en)
└── videos/
    └── xiaomi-su7/     # One folder per video
        ├── index.tsx     # Composition registry
        ├── components/   # Parts + animated overlays
        └── data/         # Scripts, B-roll, audio, charts

public/<slug>/          # Assets namespaced per video
    ├── audio/{cn,en}/  # TTS .mp3 files
    └── videos/         # B-roll .mp4 files

Audio-Driven Timeline

Sequence durations are computed from TTS output. Change the script and timing updates automatically — no manual scrubbing.
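A minimal sketch of how a timeline could be derived from word-level timestamps, assuming each narration line arrives as a list of words with start and end times in milliseconds. The `WordTiming` shape and both function names are hypothetical; the format MiniMax actually returns is not shown in this document.

```typescript
// Hypothetical shape for one word-level timestamp.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

// Duration of one narration line in frames, measured from the earliest word
// start to the latest word end and rounded up so the last syllable is not cut.
function lineDurationInFrames(words: WordTiming[], fps: number): number {
  if (words.length === 0) return 0;
  let startMs = Infinity;
  let endMs = -Infinity;
  for (const w of words) {
    startMs = Math.min(startMs, w.startMs);
    endMs = Math.max(endMs, w.endMs);
  }
  return Math.ceil(((endMs - startMs) / 1000) * fps);
}

// Start frame of each line when lines play back to back.
function startFrames(durations: number[]): number[] {
  const starts: number[] = [];
  let cursor = 0;
  for (const duration of durations) {
    starts.push(cursor);
    cursor += duration;
  }
  return starts;
}
```

With this arrangement, editing a script line and regenerating audio changes the timestamps, which changes the durations, which shifts every later start frame without any manual adjustment.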

Adaptive Animation Timing

All overlays use useTimeScale() so keyframes scale proportionally to the actual sequence duration.
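The source of useTimeScale() is not shown here, so the following is only a guess at the idea behind it: keyframes are authored against a reference duration and rescaled to whatever duration the audio produced. `makeTimeScale` and its parameters are hypothetical, and it is written as a plain function rather than a React hook so it can run on its own.

```typescript
// Returns a function that maps a keyframe authored against `designFrames`
// onto a sequence that actually lasts `actualFrames`.
function makeTimeScale(designFrames: number, actualFrames: number): (frame: number) => number {
  if (designFrames <= 0) {
    throw new Error("designFrames must be positive");
  }
  const ratio = actualFrames / designFrames;
  return (frame: number): number => Math.round(frame * ratio);
}

// An overlay designed for 120 frames, playing in a 180-frame sequence:
const scale = makeTimeScale(120, 180);
const fadeInEnd = scale(40); // keyframe moves from frame 40 to frame 60
```

Inside a Remotion component, the actual duration would typically come from the composition or sequence config rather than being passed in by hand.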

TypeScript Data, Not JSON

Type safety, imports, and IDE autocomplete for all content, manifests, and chart data.

Zero Clip Repetition

B-roll validation ensures no two sequences share overlapping time ranges from the same source file.
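The overlap rule amounts to an interval-intersection test grouped by source file. The sketch below assumes each sequence records which file it uses and the time range it plays; the `ClipUse` shape and `findOverlaps` name are hypothetical, and ranges are treated as half-open so a clip ending at 8s does not collide with one starting at 8s.

```typescript
interface ClipUse {
  lineIdx: number;
  src: string; // source file, e.g. a B-roll .mp4
  startSec: number; // inclusive
  endSec: number; // exclusive
}

// Returns every pair of uses that share a source file and an overlapping range.
function findOverlaps(clips: ClipUse[]): Array<[ClipUse, ClipUse]> {
  // Group uses by source file; clips from different files can never collide.
  const bySource = new Map<string, ClipUse[]>();
  for (const clip of clips) {
    const group = bySource.get(clip.src) ?? [];
    group.push(clip);
    bySource.set(clip.src, group);
  }

  const overlaps: Array<[ClipUse, ClipUse]> = [];
  bySource.forEach((group) => {
    group.forEach((a, i) => {
      for (const b of group.slice(i + 1)) {
        // Two half-open ranges overlap when each starts before the other ends.
        if (a.startSec < b.endSec && b.startSec < a.endSec) {
          overlaps.push([a, b]);
        }
      }
    });
  });
  return overlaps;
}
```

The pairwise comparison is quadratic per source file, which is fine for the handful of uses a single clip gets in one video and keeps the reported pairs complete.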

"Precision Editorial"

A cinematic visual language inspired by modern data journalism and Xiaomi's product design precision.

#131313
#1C1B1B
#2A2A2A
#393939
#9DCAFF
#FFB595
#FF6700

No Borders

Boundaries through tonal shifts, negative space, and radial gradients. Ghost borders at 15% opacity only when required for accessibility.

Glassmorphism

Floating cards use semi-transparent surfaces with backdrop-blur: 20-40px over video backgrounds.

Ambient Glows

No standard drop shadows. Instead: 60-80px blur, 6-10% opacity, tinted with surface color — never pure black.

135° Gradient Accents

CTAs and data highlights use a warm gradient from #FFB595 to #FF6700 at 135 degrees.
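The design rules above could be captured as a small set of tokens and helpers so that components never hard-code colors. The hex values and the 135 degree angle come from the text; the token names, the helper functions, and the default blur and opacity (picked from the middle of the stated ranges) are assumptions for illustration.

```typescript
// Hex values from the palette above; token names are hypothetical.
const palette = {
  surfaces: ["#131313", "#1C1B1B", "#2A2A2A", "#393939"],
  coolAccent: "#9DCAFF",
  warmLight: "#FFB595",
  warmStrong: "#FF6700",
} as const;

// Converts "#RRGGBB" into an rgba() string with the given alpha.
function hexToRgba(hex: string, alpha: number): string {
  const n = parseInt(hex.slice(1), 16);
  const r = (n >> 16) & 255;
  const g = (n >> 8) & 255;
  const b = n & 255;
  return `rgba(${r}, ${g}, ${b}, ${alpha})`;
}

// Warm accent gradient for CTAs and data highlights.
function accentGradient(): string {
  return `linear-gradient(135deg, ${palette.warmLight}, ${palette.warmStrong})`;
}

// Ambient glow: wide blur, low opacity, tinted rather than pure black.
function ambientGlow(tintHex: string, blurPx = 70, opacity = 0.08): string {
  return `0 0 ${blurPx}px ${hexToRgba(tintHex, opacity)}`;
}
```

The returned strings can be assigned directly to the `background` and `boxShadow` style properties of a component.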

The Stack

Video Engine: Remotion 4.0
UI Framework: React 19 + TailwindCSS v4
Language: TypeScript (strict)
Text-to-Speech: MiniMax T2A v2
Transcription: OpenAI Whisper
Video Analysis: OpenRouter Vision
Research: 17+ AI Agents
Runtime: Bun

Key Decisions

Sequence kind discriminated union
A Part is no longer "video with chart overlays" — it's a sequence of typed entries. Chart, title, and quote kinds use a calm StaticBackground instead of fighting moving B-roll for attention.
No silent title cards
Part titles are their own kind: "title" sequence tied to a narration line. Audio never drops out.
Auto-director for B-roll
The video-editor skill picks Curtis / Morris / Gibney based on story shape. B-roll feels edited, not assembled.
Pre-render validation harness
Five static checks on manifests before the expensive render. Counts, TTS integrity, breathing time, overlap, text density.
One audio file per part
Natural prosody — MiniMax produces better speech when it sees the full paragraph context, not isolated sentences.
Word-level timestamps
Subtitle highlighting syncs to actual speech timing from MiniMax, not estimated durations.
Audio-driven timeline
Sequence durations computed from TTS output. All animations use useTimeScale() so keyframes stay proportional.
Gemini 3.1 Flash Lite for vision
Swapped in as the default B-roll analyzer after testing alternatives. Fast, cheap, excellent Chinese OCR.
TypeScript data, not JSON
Full type safety, imports, and IDE autocomplete for all content and manifests.
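The word-level timestamp decision is what makes subtitle highlighting a simple lookup: given the current frame, find the word whose time range contains it. This is a sketch under the assumption that timestamps are available as start and end times in milliseconds; `WordTiming` and `activeWordIndex` are hypothetical names, not the project's actual API.

```typescript
// Hypothetical shape for one word-level timestamp.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

// Index of the word being spoken at `frame`, or -1 during a pause.
function activeWordIndex(words: WordTiming[], frame: number, fps: number): number {
  const ms = (frame / fps) * 1000;
  return words.findIndex((w) => ms >= w.startMs && ms < w.endMs);
}
```

A subtitle component can call this once per frame and style the word at the returned index differently from its neighbors; a result of -1 leaves every word in its resting style.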

Quick Start

Open this repo in Claude Code, Cursor, or any IDE with an AI agent. Tell it what video you want to make. It walks the pipeline with you — research, script, B-roll, TTS, render.

Prerequisites: Bun installed locally, Claude Code (or another agent-capable IDE) configured, and API keys for MiniMax TTS, OpenAI Whisper, and OpenRouter in your shell environment.
01 Concept
I want to make a video about Xiaomi SU7. Before any research, develop the editorial angle. What's the gap between how Chinese and Western audiences see this car?
Claude writes video-concept.md — the story spine, tone, and Part structure. Never start research without this.
02 Research
Research this across both language ecosystems. Use agent-reach for Bilibili, Weibo, WeChat, Xueqiu. Use tavily-research for the English side. Build the dossier in research/xiaomi-su7/.
Produces transcript.md, facts.md, perspectives.md, visuals.md — triangulated across 3+ sources per claim.
03 Script
Draft the bilingual script from the research. Four to five Parts. Seven to ten lines each. Make it sound spoken, not written. No em dashes.
Writes content-cn.ts, content-en.ts, and chart-data.ts with verified values.
04 B-Roll
Source B-roll from official channels only. Download with yt-dlp. Analyze each clip with video-describe-fast using Gemini 3.1 Flash Lite and the --context flag so it knows what to look for.
Saves .analysis.md beside each .mp4 with OCR and entity inference.
04.5 Editor Pass
Run the video-editor skill to auto-pick a director persona and tag every B-roll clip with its emotional role. Then review the proposal and promote it.
Produces broll-manifest.proposed.ts with role tags and director-voiced rationale per clip. Override auto-pick with --director curtis|morris|gibney.
05 TTS + Build
Generate TTS with MiniMax at passionate intensity. Then build the Part components using sequence kinds — video, chart, title, quote, ending — and wire them into the MasterComposition.
Calls generate-tts.ts, writes alignment-manifest.ts, and scaffolds each Part following design.md.
05.5 Validate
Run the 5-check validation harness before render. Fix anything fatal. Review warnings.
Calls validate-video.ts with checks for counts, TTS integrity, breathing time, B-roll overlap, and text density.
06 Render
Render both language versions to mp4. Add --gl=angle --concurrency=1 if any Part contains a Mapbox map.
Outputs broadcast-ready .mp4 files in both CN and EN.
Commands the agent runs for you
# Install dependencies
bun install

# Preview in Remotion Studio
bun run dev

# Generate TTS for a video (MiniMax T2A v2 with word-level timestamps)
bun run scripts/generate-tts.ts --video xiaomi-su7

# Editor Pass — auto-director B-roll tagging
/video-editor --video xiaomi-su7

# Pre-render validation harness — 5 checks must pass
bun run scripts/validate-video.ts --video xiaomi-su7

# Render final video (Chinese + English)
bunx remotion render XiaomiSU7-CN --codec=h264
bunx remotion render XiaomiSU7-EN --codec=h264