HackStudio Pro

The Problem

Video production is a software problem

Making one high-quality documentary takes weeks. Doing it in two languages doubles the work. Scaling to 100+ videos? Impossible by hand. HackStudio Pro collapses that timeline by treating every part of the pipeline as code.

Scripts are TypeScript data

Not timelines dragged in Premiere. Content lives in .ts files with full type safety and IDE autocomplete.

Timing is computed from audio

Not dragged on a scrubber. Word-level timestamps from MiniMax TTS drive the entire timeline automatically.

Animations are React components

Version-controlled, composable, reusable. Charts, diagrams, and maps render as JSX over B-roll backgrounds.

Bilingual output is a prop change

lang="cn" or lang="en" — same pipeline, same components, same render command.

End-to-End Pipeline

From raw sources to rendered .mp4

Nine phases. Three of them — Concept, Editor Pass, Validation — exist because we learned the hard way that render time is too expensive to waste on avoidable mistakes.

Phase 00.5

Concept

Every video starts with an editorial angle — what's the gap between how Chinese and Western audiences see this story? video-concept.md pins down spine and tone before any researcher runs.

Human-First

Phase 01

Research

AI agents search 17+ platforms across Chinese and English ecosystems. Every claim gets bilingual triangulation with 3+ independent sources, saved into a dossier of transcripts, facts, perspectives, and visuals.

AI-Driven

Phase 02

Script

Research becomes structured TypeScript — narration lines, section titles, chart labels, and verified data points. Must sound spoken, not written: short sentences, no em dashes.

AI-Driven

Phase 03

B-Roll Sourcing

Videos from official channels only, analyzed frame-by-frame with Gemini 3.1 Flash Lite — fast, cheap, excellent Chinese OCR. Each clip gets a .analysis.md saved beside the .mp4.

AI-Driven

Phase 03.5

Editor Pass

The video-editor skill scores the script and auto-picks a documentary director persona — Adam Curtis, Errol Morris, or Alex Gibney — emitting role-tagged B-roll with director-voiced rationale.

AI + Human Review

Phase 04

TTS Generation

MiniMax T2A v2 generates voiceover with voice_modify for passionate delivery. Word-level timestamps mean every subtitle highlights in perfect sync with speech.

AI + Code
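
As a rough illustration of how word-level timestamps can drive subtitle highlighting, the sketch below finds which word is being spoken at a given playback time. The WordTiming shape and the activeWordIndex helper are assumptions made for illustration; the actual MiniMax response fields and the project's alignment format are not shown in this document.

```typescript
// Hypothetical timestamp shape; the real MiniMax payload may differ.
interface WordTiming {
  word: string;
  startMs: number;
  endMs: number;
}

// Returns the index of the word being spoken at timeMs, or -1 during a pause.
function activeWordIndex(words: WordTiming[], timeMs: number): number {
  return words.findIndex((w) => timeMs >= w.startMs && timeMs < w.endMs);
}

// Placeholder timings, not real TTS output.
const sampleWords: WordTiming[] = [
  { word: "Video", startMs: 0, endMs: 320 },
  { word: "is", startMs: 320, endMs: 450 },
  { word: "code", startMs: 500, endMs: 900 },
];
```

A subtitle component could call a helper like this once per frame and style the returned index differently from its neighbours.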

Phase 05

Build (Sequence Kinds)

Each line becomes a typed sequence — video, chart, title, quote, or ending. PartRenderer dispatches to focused renderers. Calm static backgrounds for data; moving B-roll for narrative.

Code

Phase 05.5

Validation Harness

Five pre-render checks: counts consistency, TTS integrity, breathing time, B-roll overlap, text density. Must pass before the expensive render. Catches what humans miss.

Code

Phase 06

Render

One remotion render command per language outputs a broadcast-ready .mp4, so the Chinese and English versions come out of the same pipeline. Add --gl=angle --concurrency=1 for Mapbox maps.

Code

Architecture Highlight

Sequences have kinds now

A Part used to be "video background with chart overlays". That model broke the moment data got complex — charts fought moving B-roll for attention, and silent title cards dropped the audio. The new model treats each narration line as a typed entry with a kind that picks the right renderer and the right background.

kind: "video"

Moving B-roll

Standard narration. VideoBackground plays a B-roll clip with startFrom trimming. Glassmorphism caption floats on top.

kind: "chart"

Calm static canvas

Data visualizations get a StaticBackground — a tonal gradient that doesn't compete for attention. Breathing time validator enforces minimum 4 seconds.

kind: "title"

No more silent cards

Part titles are tied to a narration line (typically lineIdx: 0). Audio never drops out. Minimum 2.5s breathing time.

kind: "quote"

Pull quotes + key claims

Typography-first composition on calm background. Minimum 3.5s breathing time for the line to land.

kind: "ending"

Closing beat + CTA

Returns to moving B-roll for emotional punctuation. Consumes the final slot in the broll-manifest.

PartRenderer

The dispatcher

Shared across every video. Routes each SequenceEntry to one of five focused renderers. Data flows in as arguments — zero video-specific imports in src/shared/.

Reusable
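
The kind-based dispatch described above can be sketched as a discriminated union with an exhaustive switch. This is a minimal sketch, not the project's actual PartRenderer: the field names other than kind, lineIdx, and startFrom are assumptions, and the function returns the background name as a string rather than rendering JSX so the example stays dependency-free.

```typescript
// Hypothetical entry shapes; only kind, lineIdx, and startFrom come from the text.
type SequenceEntry =
  | { kind: "video"; lineIdx: number; clip: string; startFrom: number }
  | { kind: "chart"; lineIdx: number; chartId: string }
  | { kind: "title"; lineIdx: number; text: string }
  | { kind: "quote"; lineIdx: number; text: string }
  | { kind: "ending"; lineIdx: number; clip: string };

type BackgroundName = "VideoBackground" | "StaticBackground";

// Picks the background for an entry: moving B-roll for narrative, calm canvas for data.
function backgroundFor(entry: SequenceEntry): BackgroundName {
  switch (entry.kind) {
    case "video":
    case "ending":
      return "VideoBackground";
    case "chart":
    case "title":
    case "quote":
      return "StaticBackground";
    default: {
      // Compile-time exhaustiveness check: adding a sixth kind fails here.
      const unreachable: never = entry;
      throw new Error(`Unknown kind: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The never assignment in the default branch is the reason to prefer a union over a plain string field: a new kind cannot be added without the compiler pointing at every switch that needs updating.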

New in Phase 3.5

Auto-director for B-roll

Picking B-roll is an editorial decision, not a technical one. The video-editor skill reads the script, scores it against three documentary director profiles, and picks the one whose voice matches the story. The result is B-roll that feels edited, not assembled.

Persona A

Adam Curtis

Systems, irony, archive juxtaposition. Picks texture clips that quietly undercut the narration. Works for stories about institutions or ideologies.

Systems

Persona B

Errol Morris

Human portraiture, close-ups, interrogative stillness. Favors faces and objects over action. Works for stories about individuals and their contradictions.

Portraiture

Persona C

Alex Gibney

Institutional accountability, evidence, tension. Favors documents, newsreels, official footage. Works for stories about power and its consequences.

Accountability

Each clip is assigned a role (anchor, texture, counterpoint, or transition) with a director-voiced rationale. A validator confirms the proposed distribution matches the chosen director's target role mix. After human review, rename broll-manifest.proposed.ts to broll-manifest.ts and move on.
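
One way the role-mix validation might work is sketched below: count how often each role appears, convert to fractions, and compare against a target within a tolerance. The target numbers, the tolerance, and the helper names are all hypothetical; the document does not state the real per-director mixes.

```typescript
type ClipRole = "anchor" | "texture" | "counterpoint" | "transition";
type RoleMix = Record<ClipRole, number>; // fraction of clips per role

// Converts a list of assigned roles into per-role fractions.
function roleDistribution(roles: ClipRole[]): RoleMix {
  const mix: RoleMix = { anchor: 0, texture: 0, counterpoint: 0, transition: 0 };
  for (const role of roles) mix[role] += 1;
  const total = roles.length || 1; // avoid dividing by zero for an empty list
  for (const role of Object.keys(mix) as ClipRole[]) mix[role] /= total;
  return mix;
}

// True when every role's share is within `tolerance` of the target share.
function matchesTarget(actual: RoleMix, target: RoleMix, tolerance = 0.15): boolean {
  return (Object.keys(target) as ClipRole[]).every(
    (role) => Math.abs(actual[role] - target[role]) <= tolerance,
  );
}

// Hypothetical target mix, for illustration only.
const exampleTarget: RoleMix = { anchor: 0.4, texture: 0.3, counterpoint: 0.2, transition: 0.1 };
```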

New in Phase 5.5

Five checks that must pass before render

Remotion renders are expensive: each Part takes minutes, and every re-render costs real time. The validation harness runs five static checks on the manifests before you ever spin up the encoder. A 🔴 fatal result blocks the render; a ⚠ informational result is a warning you can accept.

counts-consistency

Sequences, content, and alignment manifests must all agree on line count. Off-by-one errors here would mean subtitles dropping mid-sentence.
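
A minimal sketch of this check, assuming each manifest can be reduced to a line count beforehand. The ManifestCounts shape and the countsMismatch name are illustrative and not taken from validate-video.ts.

```typescript
// Hypothetical summary of the three manifests' line counts.
interface ManifestCounts {
  sequences: number;
  content: number;
  alignment: number;
}

// Returns null when all counts agree, otherwise a message describing the mismatch.
function countsMismatch(counts: ManifestCounts): string | null {
  const { sequences, content, alignment } = counts;
  if (sequences === content && content === alignment) return null;
  return `line counts differ: sequences=${sequences}, content=${content}, alignment=${alignment}`;
}
```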

tts-integrity

Detects audio truncation. If the amplitude within the final 200ms window still exceeds 0.08, the file ends without tail silence, which means the MiniMax model cut off the sentence before it finished.
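
One reading of this check is sketched below: measure the peak absolute amplitude in the last 200ms of decoded audio and treat anything above 0.08 as a sign the clip ended mid-word. It assumes samples are already decoded to mono floats in the range -1 to 1; the function names are hypothetical, and the real harness may measure the tail differently.

```typescript
// Peak absolute amplitude within the final `windowMs` of the clip.
function tailPeak(samples: number[], sampleRate: number, windowMs = 200): number {
  const windowSize = Math.round((windowMs / 1000) * sampleRate);
  const start = Math.max(0, samples.length - windowSize);
  let peak = 0;
  for (const s of samples.slice(start)) peak = Math.max(peak, Math.abs(s));
  return peak;
}

// A loud tail suggests the sentence was cut off before it finished.
function looksTruncated(samples: number[], sampleRate: number, threshold = 0.08): boolean {
  return tailPeak(samples, sampleRate) > threshold;
}
```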

breathing-time

Minimum sequence durations: chart ≥ 4s, title ≥ 2.5s, quote ≥ 3.5s. Anything less reads as rushed.
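
The thresholds above translate directly into a lookup table. The sketch below assumes each sequence exposes a kind and a durationInFrames, and that the frame rate is passed in; those field and function names are illustrative.

```typescript
type SequenceKind = "video" | "chart" | "title" | "quote" | "ending";

interface TimedSequence {
  kind: SequenceKind;
  durationInFrames: number;
}

// Minimums from the rule above; kinds without an entry have no minimum.
const MIN_SECONDS: Partial<Record<SequenceKind, number>> = {
  chart: 4,
  title: 2.5,
  quote: 3.5,
};

// Returns one message per sequence that is shorter than its minimum.
function breathingViolations(sequences: TimedSequence[], fps: number): string[] {
  const problems: string[] = [];
  sequences.forEach((seq, i) => {
    const min = MIN_SECONDS[seq.kind];
    const seconds = seq.durationInFrames / fps;
    if (min !== undefined && seconds < min) {
      problems.push(`#${i} ${seq.kind}: ${seconds.toFixed(2)}s < ${min}s`);
    }
  });
  return problems;
}
```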

broll-overlap

Two sequences may never share an overlapping time range from the same source file. Visual repetition kills editorial trust.
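
A simple pairwise version of this rule, treating each use as a half-open time range so that clips which merely touch end-to-start do not count as overlapping. The ClipUse shape is an assumption; the real broll-manifest.ts fields are not shown here.

```typescript
// Hypothetical record of one B-roll usage.
interface ClipUse {
  source: string; // source file, e.g. "launch.mp4"
  startSec: number;
  endSec: number;
}

// Returns index pairs [i, j] of uses that overlap within the same source file.
function findOverlaps(uses: ClipUse[]): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  uses.forEach((a, i) => {
    uses.forEach((b, j) => {
      if (j <= i || a.source !== b.source) return;
      if (a.startSec < b.endSec && b.startSec < a.endSec) pairs.push([i, j]);
    });
  });
  return pairs;
}
```

The quadratic loop is fine for a manifest with a few dozen clips; a sort-and-sweep would only matter at much larger sizes.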

text-density

The B-roll clip's start frame must land before 67% of the caption text has appeared. If the image only arrives after the caption finishes, there's no breathing room.
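
Taken together, the five checks can be run by a small harness that collects findings and blocks the render only on fatal ones. This is a sketch under assumptions: the Finding shape and runHarness name are hypothetical, and the document does not say which checks report fatal versus informational results.

```typescript
type Severity = "fatal" | "warning";

interface Finding {
  check: string;
  severity: Severity;
  message: string;
}

// Each check returns zero or more findings.
type CheckFn = () => Finding[];

// Runs every check; rendering is allowed only when no finding is fatal.
function runHarness(checks: CheckFn[]): { findings: Finding[]; canRender: boolean } {
  const findings: Finding[] = [];
  for (const check of checks) findings.push(...check());
  const canRender = !findings.some((f) => f.severity === "fatal");
  return { findings, canRender };
}
```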

Architecture

Every video is self-contained

Shared rendering infrastructure is reused across all videos. Adding a new video = new folder + data files + one import.

src/
├── shared/               # Reusable across ALL videos
│   ├── components/       # PartRenderer, SubtitleOverlay...
│   ├── lib/              # colors, fonts, timing, audio math
│   └── schemas/          # VideoSchema (lang: cn | en)
└── videos/
    └── xiaomi-su7/       # One folder per video
        ├── index.tsx     # Composition registry
        ├── components/   # Parts + animated overlays
        └── data/         # Scripts, B-roll, audio, charts

public/<slug>/            # Assets namespaced per video
├── audio/{cn,en}/        # TTS .mp3 files
└── videos/               # B-roll .mp4 files

Audio-Driven Timeline

Sequence durations are computed from TTS output. Change the script and timing updates automatically — no manual scrubbing.
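
To make the idea concrete, here is a minimal sketch of turning a line's word timestamps into a frame range. The SpokenWord shape, the helper names, and the 30 fps default are assumptions; the project's own timing utilities in src/shared/lib/ may differ.

```typescript
// Hypothetical timestamp shape for one spoken word.
interface SpokenWord {
  word: string;
  startMs: number;
  endMs: number;
}

const ASSUMED_FPS = 30; // assumption; use the composition's real frame rate

function msToFrames(ms: number, fps: number = ASSUMED_FPS): number {
  return Math.round((ms / 1000) * fps);
}

// A line starts at its first word and ends at its last word.
function lineFrameRange(
  words: SpokenWord[],
  fps: number = ASSUMED_FPS,
): { from: number; durationInFrames: number } {
  const first = words[0];
  const last = words[words.length - 1];
  if (!first || !last) throw new Error("line has no word timestamps");
  const from = msToFrames(first.startMs, fps);
  const end = msToFrames(last.endMs, fps);
  return { from, durationInFrames: Math.max(1, end - from) };
}
```

Because the range is derived from the audio, regenerating TTS after a script change yields new frame ranges without anyone touching a timeline.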

Adaptive Animation Timing

All overlays use useTimeScale() so keyframes scale proportionally to the actual sequence duration.
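
The implementation of useTimeScale() is not shown here, so the following is only a guess at the underlying arithmetic: keyframes authored against a designed duration are multiplied by the ratio of actual to designed duration. The scaleKeyframes name and its parameters are hypothetical.

```typescript
// Rescales keyframes authored for `designedFrames` to fit `actualFrames`.
function scaleKeyframes(
  keyframes: number[],
  designedFrames: number,
  actualFrames: number,
): number[] {
  if (designedFrames <= 0) throw new Error("designedFrames must be positive");
  const scale = actualFrames / designedFrames;
  return keyframes.map((frame) => Math.round(frame * scale));
}
```

For example, an overlay designed with keyframes at frames 0, 30, and 60 of a 60-frame sequence would animate at frames 0, 45, and 90 when the narration runs 90 frames.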

TypeScript Data, Not JSON

Type safety, imports, and IDE autocomplete for all content, manifests, and chart data.
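
A sketch of what typed content files could look like, and how a lang value might select between them. The interfaces, the placeholder text, and the getContent helper are assumptions for illustration; the actual shapes of content-cn.ts and content-en.ts are not shown in this document.

```typescript
type Lang = "cn" | "en";

// Hypothetical content shapes.
interface NarrationLine {
  text: string; // spoken narration
  chartLabel?: string; // optional on-screen label
}

interface PartContent {
  title: string;
  lines: NarrationLine[];
}

// Placeholder data standing in for content-en.ts and content-cn.ts.
const contentEn: PartContent[] = [
  {
    title: "Placeholder part title",
    lines: [
      { text: "Placeholder narration line one." },
      { text: "Placeholder narration line two.", chartLabel: "Placeholder label" },
    ],
  },
];

const contentCn: PartContent[] = [
  {
    title: "占位标题",
    lines: [
      { text: "占位旁白第一句。" },
      { text: "占位旁白第二句。", chartLabel: "占位标签" },
    ],
  },
];

const contentByLang: Record<Lang, PartContent[]> = { cn: contentCn, en: contentEn };

// Switching language is a single argument change.
function getContent(lang: Lang): PartContent[] {
  return contentByLang[lang];
}
```

With Record<Lang, ...> as the lookup type, forgetting to supply one language is a compile error rather than a missing subtitle at render time.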

Zero Clip Repetition

B-roll validation ensures no two sequences share overlapping time ranges from the same source file.

Design System

"Precision Editorial"

A cinematic visual language inspired by modern data journalism and Xiaomi's product design precision.

#131313

#1C1B1B

#2A2A2A

#393939

#9DCAFF

#FFB595

#FF6700

No Borders

Boundaries through tonal shifts, negative space, and radial gradients. Ghost borders at 15% opacity only when required for accessibility.

Glassmorphism

Floating cards use semi-transparent surfaces with backdrop-blur: 20-40px over video backgrounds.

Ambient Glows

No standard drop shadows. Instead: 60-80px blur, 6-10% opacity, tinted with surface color — never pure black.

135° Gradient Accents

CTAs and data highlights use a warm gradient from #FFB595 to #FF6700 at 135 degrees.

Engineering

Key Decisions

Sequence kind discriminated union

A Part is no longer "video with chart overlays" — it's a sequence of typed entries. Chart, title, and quote kinds use a calm StaticBackground instead of fighting moving B-roll for attention.

No silent title cards

Part titles are their own kind: "title" sequence tied to a narration line. Audio never drops out.

Auto-director for B-roll

The video-editor skill picks Curtis / Morris / Gibney based on story shape. B-roll feels edited, not assembled.

Pre-render validation harness

Five static checks on manifests before the expensive render. Counts, TTS integrity, breathing time, overlap, text density.

One audio file per part

Natural prosody — MiniMax produces better speech when it sees the full paragraph context, not isolated sentences.

Word-level timestamps

Subtitle highlighting syncs to actual speech timing from MiniMax, not estimated durations.

Audio-driven timeline

Sequence durations computed from TTS output. All animations use useTimeScale() so keyframes stay proportional.

Gemini 3.1 Flash Lite for vision

Swapped in as the default B-roll analyzer after testing alternatives. Fast, cheap, excellent Chinese OCR.

TypeScript data, not JSON

Full type safety, imports, and IDE autocomplete for all content and manifests.

Get Started

Quick Start

Open this repo in Claude Code, Cursor, or any IDE with an AI agent. Tell it what video you want to make. It walks the pipeline with you — research, script, B-roll, TTS, render.

Prerequisites. Bun installed locally. Claude Code (or another agent-capable IDE) configured. API keys for MiniMax TTS, OpenAI Whisper, and OpenRouter in your shell environment.

01 Concept

I want to make a video about Xiaomi SU7. Before any research, develop the editorial angle. What's the gap between how Chinese and Western audiences see this car?

Claude writes video-concept.md — the story spine, tone, and Part structure. Never start research without this.

02 Research

Research this across both language ecosystems. Use agent-reach for Bilibili, Weibo, WeChat, Xueqiu. Use tavily-research for the English side. Build the dossier in research/xiaomi-su7/.

Produces transcript.md, facts.md, perspectives.md, visuals.md — triangulated across 3+ sources per claim.

03 Script

Draft the bilingual script from the research. Four to five Parts. Seven to ten lines each. Make it sound spoken, not written. No em dashes.

Writes content-cn.ts, content-en.ts, and chart-data.ts with verified values.

04 B-Roll

Source B-roll from official channels only. Download with yt-dlp. Analyze each clip with video-describe-fast using Gemini 3.1 Flash Lite and the --context flag so it knows what to look for.

Saves .analysis.md beside each .mp4 with OCR and entity inference.

04.5 Editor Pass

Run the video-editor skill to auto-pick a director persona and tag every B-roll clip with its emotional role. Then review the proposal and promote it.

Produces broll-manifest.proposed.ts with role tags and director-voiced rationale per clip. Override auto-pick with --director curtis|morris|gibney.

05 TTS + Build

Generate TTS with MiniMax at passionate intensity. Then build the Part components using sequence kinds — video, chart, title, quote, ending — and wire them into the MasterComposition.

Calls generate-tts.ts, writes alignment-manifest.ts, and scaffolds each Part following design.md.

05.5 Validate

Run the 5-check validation harness before render. Fix anything fatal. Review warnings.

Calls validate-video.ts with checks for counts, TTS integrity, breathing time, B-roll overlap, and text density.

06 Render

Render both language versions to mp4. Add --gl=angle --concurrency=1 if any Part contains a Mapbox map.

Outputs broadcast-ready .mp4 files in both CN and EN.

Commands the agent runs for you

# Install dependencies
bun install

# Preview in Remotion Studio
bun run dev

# Generate TTS for a video (MiniMax T2A v2 with word-level timestamps)
bun run scripts/generate-tts.ts --video xiaomi-su7

# Editor Pass — auto-director B-roll tagging
/video-editor --video xiaomi-su7

# Pre-render validation harness — 5 checks must pass
bun run scripts/validate-video.ts --video xiaomi-su7

# Render final video (Chinese + English)
bunx remotion render XiaomiSU7-CN --codec=h264
bunx remotion render XiaomiSU7-EN --codec=h264
