The stack · transparency

Every model has limits. Here are ours.

Most AI products won’t tell you what their model can’t do. We will. Below: every layer in the Hybrig pipeline, in the order data flows through it. What each layer does, what it CAN’T do, and which other layers it stacks on. If a layer is mediocre at hands, needs 15 photos to train, or refuses real-person likeness via cloud content policy — this page says so.

For the per-model catalog (Flux dev vs. Flux schnell vs. Pony vs. SD 3.5 etc.), see /models. This page is the architecture: how those models stack into shipping pipelines.

Architecture

Nine layers, top to bottom

Data flows downward. Cloud-tinted bands lean on a hosted service; everything else runs on your GPU. The rectangles ARE the pipeline — every render touches every band, in this order, whether you click one button or ten.

Layer by layer

What each layer does, what it can’t

Each card is the same shape: plain-English description, the explicit limits we’ve hit in production, and the layers above and below that it stacks with.

Your inputs
local
A seed idea, a script, reference photos of the person, optional reference audio. Everything starts here. None of this leaves your machine until a layer downstream specifically calls out to a cloud service (those layers are flagged with a CLOUD badge below).
What it can’t do
- Generate anything by itself — this layer is just the raw material the rest of the stack works from.
- Read your mind. Vague seeds ("make a cool ad") fan out into vague scripts. Specific seeds ("30-second testimonial about a roofing job, casual tone, no jargon") give the rest of the stack something to lock onto.
Stacks with
- Feeds into the script layer (Claude) when you want N variants from one seed.
- Feeds reference photos into the identity layer (LoRA training or PuLID-Flux conditioning).
- Feeds reference audio into the voice layer (F5-TTS local clone, or skip TTS entirely with bring-your-own-audio; ElevenLabs is wired as a cloud fallback).
Script generation (Claude)
cloud
Anthropic's Claude API turns one seed idea into N script variants — different hook, different angle, different emotional register. Word count is targeted to your clip duration (around 2.5-3 words/sec for natural cadence) so you don't get a 60-word monologue jammed into an 8-second slot. Translation pass is the same provider, separate prompt.
What it can’t do
- Run locally — Claude is cloud-only by API. The brain slot accepts local providers too (Ollama, llama.cpp, Gemma); we default to Claude on multi-constraint rewriter jobs where small models silently drop rules. See /learn/cloud-llm-tradeoffs for the full breakdown.
- Match a brand voice you've never described. Tone-preference and avoid-phrases inputs help; full brand-voice cloning needs a longer style sheet.
- Replace a copywriter on a campaign that depends on insider context the LLM doesn't have.
Stacks with
- Each script's word count drives the video layer's clip duration.
- Each script feeds the voice layer (TTS) verbatim.
- Script + industry preset → /api/hybrig/batch fans out N renders in one queue.
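The word-count targeting described in this card can be sketched as a small budget check. This is a minimal sketch under stated assumptions, not Hybrig's implementation: the names wordBudget, countWords and fitsSlot are hypothetical, and only the 2.5-3 words/sec cadence comes from this page.

```typescript
// Hypothetical helper names; only the 2.5-3 words/sec cadence comes from the page.
const MIN_WORDS_PER_SEC = 2.5;
const MAX_WORDS_PER_SEC = 3;

interface WordBudget {
  min: number; // fewest words that still fill the clip at a natural pace
  max: number; // most words that fit without rushing
}

// Translate a clip duration in seconds into a target word range.
function wordBudget(clipSeconds: number): WordBudget {
  if (!(clipSeconds > 0)) {
    throw new RangeError("clipSeconds must be a positive number");
  }
  return {
    min: Math.floor(clipSeconds * MIN_WORDS_PER_SEC),
    max: Math.floor(clipSeconds * MAX_WORDS_PER_SEC),
  };
}

// Count whitespace-separated words; an empty script counts as zero.
function countWords(script: string): number {
  const trimmed = script.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// True when the script is short enough for the clip.
function fitsSlot(script: string, clipSeconds: number): boolean {
  return countWords(script) <= wordBudget(clipSeconds).max;
}
```

Under these assumptions an 8-second clip gets a budget of 20 to 24 words, so a 60-word monologue would be rejected before it reaches the voice layer.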
Still-image generation (Flux / SDXL / SD 3.5 / PixArt)
local
All seven still-image models run on your own GPU via ComfyUI. Flux dev is the default for photoreal portraits; the others cover specific needs: SDXL+Pony for stylized, SDXL+Illustrious for anime, SD 3.5 Large for cinematic, HiDream-I1 when Flux falls short on detail, Flux schnell for fast drafts, PixArt-Σ for low-VRAM rigs. Full per-model breakdown is on /models.
What it can’t do
- Render video. Stills only — feed the keeper into the video layer (image-to-video).
- Lock identity by themselves. The base models will give you 'a person who looks vaguely like the reference' without a LoRA or PuLID-Flux on top.
- Run on a 6 GB GPU for the heavy models. Flux dev needs ~12 GB; HiDream needs ~16 GB. PixArt-Σ is the 6 GB option.
Stacks with
- Feeds the identity layer — the still gets re-rendered with LoRA / PuLID / Redux conditioning for face lock.
- The keeper still becomes the reference frame for image-to-video in the video layer.
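The VRAM limits in this card suggest a simple pre-flight check. The sketch below is an illustration only: the function names and preference order are assumptions, and the VRAM floors are the approximate figures quoted above (Flux dev ~12 GB, HiDream ~16 GB, PixArt-Σ 6 GB). The other still-image models are left out because this page gives no figures for them.

```typescript
// VRAM floors are the approximate figures quoted on this page.
// Function names and the preference order are illustrative, not Hybrig's API.
interface StillModel {
  name: string;
  minVramGb: number;
}

// Flux dev is the page's default; PixArt-Σ is the low-VRAM option.
const PREFERENCE: StillModel[] = [
  { name: "Flux dev", minVramGb: 12 },
  { name: "PixArt-Σ", minVramGb: 6 },
];

const HIDREAM: StillModel = { name: "HiDream-I1", minVramGb: 16 };

// Pick the first preferred model that fits the card, or null if none does.
function pickStillModel(vramGb: number): string | null {
  for (const model of PREFERENCE) {
    if (vramGb >= model.minVramGb) return model.name;
  }
  return null;
}

// HiDream-I1 is only an option when the card clears its higher floor.
function canUseHiDream(vramGb: number): boolean {
  return vramGb >= HIDREAM.minVramGb;
}
```

With these numbers a 24 GB card resolves to Flux dev, an 8 GB card falls back to PixArt-Σ, and a 4 GB card gets no local option.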
Identity & conditioning (LoRA / PuLID / Redux / IPAdapter / Wardrobe)
local + cloud
The hardest problem in AI video isn't generating one good frame — it's generating the SAME face across fifty frames, twenty shots, six campaigns. Hybrig stacks five identity layers, ranked by lock strength: trained LoRA (strongest), PuLID-Flux (single-photo geometry lock), Flux Redux (style+composition), IPAdapter (SDXL family), and wardrobe-lock via Gemini 2.5 Flash Image (the one cloud entry in this layer, used only for outfit swaps).
What it can’t do
- Trained LoRA: needs 15-30 varied photos. Ten photos in one outfit collapse the LoRA to a generic face. Training takes 1-2 hours of GPU time.
- PuLID-Flux: locks face geometry well, but won't carry hairstyle or clothing — those drift. Recommended strength 0.9-1.0 per the v0.9.1 README.
- Flux Redux: drags the reference's backdrop, pose, AND clothing into the output. Use it for variations of an existing photo — NOT for putting your face in a new place. PuLID is the better tool for new-scene work.
- IPAdapter: SDXL-family only. It never sees a Flux model.
- Wardrobe-lock (Gemini): cloud-only — files leave the device for this one step. Outfit fidelity drops on heavily patterned or branded garments. Won't change pose or scene.
Stacks with
- LoRA + Flux dev = strongest still-image identity lock available on Hybrig.
- LoRA + Wan 2.2 = identity-locked video on free local.
- PuLID + scene prompt = drop a face into a new scene without training a LoRA first.
- PuLID and Redux conflict on the same Flux conditioning slot — Hybrig picks PuLID when both are available.
- Gemini wardrobe edit → re-shoot the edited image into a LoRA training set when you want a new wardrobe permanently locked into the character.
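The compatibility rules in this card (IPAdapter is SDXL-only, PuLID and Redux share one Flux slot, PuLID wins) can be written down as a small resolver. This is a sketch with hypothetical type and function names. Letting a LoRA pass through for either model family is an assumption; this page only confirms LoRA with Flux dev and Wan 2.2.

```typescript
// Illustrative types; the rules encoded here are the ones listed in the card.
// ASSUMPTION: a LoRA is allowed on either family.
type Family = "flux" | "sdxl";
type Conditioner = "lora" | "pulid" | "redux" | "ipadapter";

// Strongest lock first, following the ranking in the card.
const RANKED: Conditioner[] = ["lora", "pulid", "redux", "ipadapter"];

function resolveConditioners(
  family: Family,
  requested: Conditioner[]
): Conditioner[] {
  const wants = (c: Conditioner): boolean => requested.indexOf(c) !== -1;
  const applied: Conditioner[] = [];
  for (const c of RANKED) {
    if (!wants(c)) continue;
    // IPAdapter is SDXL-family only.
    if (c === "ipadapter" && family !== "sdxl") continue;
    // PuLID-Flux and Flux Redux only apply to Flux bases.
    if ((c === "pulid" || c === "redux") && family !== "flux") continue;
    // PuLID and Redux share one Flux conditioning slot; PuLID wins.
    if (c === "redux" && wants("pulid")) continue;
    applied.push(c);
  }
  return applied;
}
```

Requesting everything on a Flux base would resolve to LoRA plus PuLID, with Redux and IPAdapter dropped.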
Video generation (Wan 2.2 / Wan 2.2 FLF2V / Wan 2.1 / Hunyuan / LTX / Seedance / Gemini Omni)
local + cloud
Default is Wan 2.2: it runs on your GPU, free forever, at 3-8 min per 5-second clip on a 4090. The auto-fallback chain tries Wan 2.2 → Wan 2.1 → Hunyuan → LTX before reaching for cloud. Wan 2.2 FLF2V is the dedicated first/last-frame interpolation node: feed it two images (start + end) and get the morph clip between them. It is the only LOCAL model with this capability. Cloud has two distinct roles, not one: Seedance 2.0 (ByteDance via fal.ai) is the rescue valve for face lock + lip-sync; Gemini Omni (Google DeepMind, via Google AI subscription) is the physics + world-knowledge polish for hero beats. BOTH are polish-after-farming, never the farm engine, because per-render cloud pricing stops making sense at cohort scale. Wan 2.2 Animate and VEO are also wired through fal as alternative cloud options.
What it can’t do
- Wan 2.2: ~10-20% behind Seedance on tight close-ups; weaker on hands and crowd scenes. No native lip-sync — needs the lipsync layer below.
- Wan 2.2 FLF2V: requires Wan 2.2 Q5_K_M GGUF (~12 GB quantized build) AND kijai/WanVideoWrapper custom nodes installed in ComfyUI. Both source images must have identical framing/composition or the morph looks broken. Caps at 5 sec output.
- Wan 2.1: ~70-80% of Seedance quality; kept around as fallback when 2.2 misbehaves on a specific install. ALSO — for standard image-to-video on a 4090, 2.1 still beats 2.2's quantized build in our tests. Newer is not always better; pick by capability needed, not version number.
- HunyuanVideo: 6-12 min per 5-second clip — slower than Wan. Identity lock weaker on close-ups. No native lip-sync.
- LTX-Video: visibly lower quality than Wan 2.x or Hunyuan. Best for drafting motion direction, not finals.
- Seedance: cloud-only, billed per second of output. Failed renders still consume the credit. Backgrounds drift on clips longer than ~6 sec. AND Seedance refuses real-person likeness via content policy on premium SaaS routes — this is part of WHY Hybrig defaults to local: cloud video models are increasingly locked down on real faces.
- Gemini Omni: subscription-gated (Google AI), region-limited, allowlist-controlled. Per-render metered. CANNOT be used as a farming engine — unit economics break the moment you multiply by cohort size. Polish layer only, applied after local farm has finalized the render direction, on the one or two hero beats where physics simulation or world-knowledge compositing actually buys you something local can't fake.
Stacks with
- Image-to-video: keeper still from the image layer + LoRA conditioning → silent video clip.
- First/last frame interpolation: two matched Flux stills (e.g. fresh shingle year 0 + aged shingle year 25) → Wan 2.2 FLF2V → local weathering time-lapse, no cloud.
- Wan 2.2 + LoRA = identity-locked video on free local.
- Wan 2.2 for drafts, Seedance for the keeper finals when budget allows.
- Wan farm settles direction → identify the one or two hero beats that need physics or world-knowledge compositing → Omni polish on those beats only.
- Output goes to the lipsync layer if the script needs visible mouth movement, or skips straight to mux for B-roll.
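The auto-fallback chain described in this card can be sketched as a loop over engines. The engine IDs, the tryRender callback and the single generic "cloud" step are all hypothetical, and the sketch is synchronous for brevity where a real render queue would be asynchronous.

```typescript
// Engine IDs and the tryRender callback are hypothetical; the order is the
// fallback chain described in the card. Real renders would be asynchronous.
const LOCAL_CHAIN: string[] = ["wan-2.2", "wan-2.1", "hunyuan", "ltx"];

interface FallbackResult {
  engine: string | null; // engine that produced the clip, or null if none did
  attempts: string[]; // every engine tried, in order
}

function renderWithFallback(
  tryRender: (engine: string) => boolean,
  allowCloud: boolean
): FallbackResult {
  const attempts: string[] = [];
  // Cloud is appended last and only when the caller opts in.
  const chain = allowCloud ? LOCAL_CHAIN.concat(["cloud"]) : LOCAL_CHAIN;
  for (const engine of chain) {
    attempts.push(engine);
    if (tryRender(engine)) {
      return { engine: engine, attempts: attempts };
    }
  }
  return { engine: null, attempts: attempts };
}
```

Keeping the attempts list makes it easy to see on the dashboard which local engines failed before a cloud render was billed.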
Lipsync (EchoMimicV2 / LatentSync / fal-sync)
local + cloud
Three lipsync paths, ranked by where the work runs: EchoMimicV2 regenerates the upper body from a single reference frame + audio (highest fidelity for spokesperson shots, but loses background motion); LatentSync 1.6 edits the mouth region of an existing video (preserves background, lower fidelity than EchoMimic); fal-sync is the cloud sync-lipsync API (production default, billed per render). Hybrig probes ComfyUI's /object_info to detect which local lipsync nodes are actually installed before submitting a workflow.
What it can’t do
- EchoMimicV2: regenerates the body from one frame, so you LOSE any background motion that was in the input video. Trade-off — better lipsync, worse continuity.
- EchoMimicV2: requires sd-vae-ft-mse VAE in models/vae/, four .pth files in models/echo_mimic/v2/, plus whisper_tiny.pt and the sd-image-variations init UNet. If any of those are missing, the /object_info probe silently disables the local path.
- LatentSync: lipsync only. Doesn't add expressions, doesn't fix gaze drift. Edits the mouth region — leaves the rest of the frame alone.
- fal-sync: cloud-only, billed per render, and a failed render still consumes the credit.
- All three: none of them match emotion across long takes. If the script swings from calm to angry, the lipsync handles phonemes but the FACE keeps the input video's emotional register.
Stacks with
- Video layer output (silent clip) + voice layer output (audio) → lipsync layer → mouth-synced clip.
- Local-first routing: pipeline tries EchoMimicV2 → LatentSync → fal-sync if enableLipsync=true.
- Skip this layer entirely on B-roll, voiceover-only, or ambient-music shots.
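The local-first routing in this card can be sketched as a pure function over the /object_info response, which ComfyUI returns as a JSON object keyed by installed node class name. The two node names below are placeholders (this page does not give the real class names), and pickLipsyncPath is a hypothetical helper.

```typescript
// Node class names below are PLACEHOLDERS, not the real ComfyUI node names.
const ECHOMIMIC_NODE = "EchoMimicV2Node_PLACEHOLDER";
const LATENTSYNC_NODE = "LatentSyncNode_PLACEHOLDER";

type LipsyncPath = "echomimic-v2" | "latentsync" | "fal-sync" | "skip";

// objectInfo is the parsed JSON body of ComfyUI's GET /object_info,
// which is keyed by installed node class name.
function pickLipsyncPath(
  objectInfo: Record<string, unknown>,
  enableLipsync: boolean
): LipsyncPath {
  // B-roll, voiceover-only and ambient shots skip the layer entirely.
  if (!enableLipsync) return "skip";
  const has = (node: string): boolean =>
    Object.prototype.hasOwnProperty.call(objectInfo, node);
  if (has(ECHOMIMIC_NODE)) return "echomimic-v2";
  if (has(LATENTSYNC_NODE)) return "latentsync";
  // Neither local node is installed: fall through to the cloud API.
  return "fal-sync";
}
```

Because a missing node falls through silently, surfacing the chosen path next to each render helps explain an unexpected cloud charge.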
Voice (F5-TTS local clone / BYO audio / ElevenLabs fallback)
local + cloud
Three voice paths. F5-TTS is the default: it clones from a 10–30 second sample and runs on your GPU, so audio never leaves the machine. By default, Hybrig's F5-TTS node chunks input per sentence and inserts a 250 ms silence pad between chunks, because long run-on input makes any TTS engine (cloud or local) overlap sentences. Bring-your-own-audio is the free path: upload a recording, Hybrig muxes it onto the video. ElevenLabs Multilingual v2 is wired as a cloud fallback when F5-TTS isn't installed or the user wants its prosody.
What it can’t do
- F5-TTS: ~10–30s of clean reference audio is mandatory; noisy or compressed samples produce noisy clones. Long run-on text without sentence-level chunking still causes overlap — the chunker fixes the call shape, not the input prose.
- BYO audio: requires you to record clean audio yourself (any USB mic or the Sony A7 IV's XLR does the job). No clone — every script is a fresh take.
- ElevenLabs: cloud-only — script + voice sample upload to ElevenLabs servers. Subscription required. Failed renders still consume credits. Used as the cloud fallback path now, not the default.
Stacks with
- Script layer text → voice layer audio → lipsync layer (if enabled) → mux.
- F5-TTS chunked-VO benchmark observed 2026-05-08: chunking + silence inserts ≈ 2× compute time (one model invocation per sentence vs. one for the whole block) with a categorical quality jump in sentence-to-sentence transitions. Same model, same voice clone, same speed setting — only the input segmentation changes.
- BYO audio is the cheapest, most private path: record yourself once, queue 50 renders, pay $0 in voice fees.
- Full technique walk-through (punctuation discipline, what works and what doesn't): /learn/voice-cadence.
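The per-sentence chunking with a silence pad can be sketched as a plan builder. The names below are hypothetical and the splitter is deliberately naive; only the 250 ms default and the one-invocation-per-sentence shape come from this page.

```typescript
// Function and type names are illustrative; 250 ms is the default pad
// quoted in the card.
const SILENCE_PAD_MS = 250;

type Segment =
  | { kind: "speech"; text: string }
  | { kind: "silence"; ms: number };

// Naive splitter: breaks after ., ! or ? and will mis-split abbreviations
// and decimals such as "2.5".
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
  return parts.map((p) => p.trim()).filter((p) => p.length > 0);
}

// One speech segment per sentence, with a silence pad between neighbours.
function buildVoicePlan(
  text: string,
  padMs: number = SILENCE_PAD_MS
): Segment[] {
  const plan: Segment[] = [];
  splitSentences(text).forEach((sentence, i) => {
    if (i > 0) plan.push({ kind: "silence", ms: padMs });
    plan.push({ kind: "speech", text: sentence });
  });
  return plan;
}

// Each speech segment corresponds to one TTS invocation.
function ttsCalls(plan: Segment[]): number {
  return plan.filter((s) => s.kind === "speech").length;
}
```

A three-sentence script yields three TTS invocations and two pads, which is where the extra compute in the benchmark above comes from.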
Mux & post (ffmpeg-api / local ffmpeg)
local + cloud
The audio + video combiner. fal-ai/ffmpeg-api/merge-audio-video is the default cloud mux (fast, no local ffmpeg dependency); local ffmpeg is wired as a fallback for offline / private workflows. This layer is dumb plumbing — it doesn't generate anything, it just sticks the audio onto the silent video.
What it can’t do
- Generate any media of its own. Strictly mux + cut + format conversion.
- Fix a sync mismatch. If the audio is 8.2 sec and the video is 7.9 sec, the mux just cuts the longer one. Lipsync and clip-duration alignment happen upstream.
Stacks with
- Lipsync output (or video output if no lipsync) + voice output → muxed mp4 with audio.
- Output goes to the export layer for final delivery.
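For the local ffmpeg fallback, the "cuts the longer one" behaviour corresponds to ffmpeg's -shortest flag. The sketch below builds an argument list and computes what gets trimmed. It assumes the fallback maps one video and one audio stream and copies the video without re-encoding; this page does not show the actual command, and the function names are hypothetical.

```typescript
// A sketch of the local-ffmpeg fallback, ASSUMING it trims to the shorter
// stream. Function names are illustrative.
function muxArgs(
  videoPath: string,
  audioPath: string,
  outPath: string
): string[] {
  return [
    "-i", videoPath, // input 0: silent video
    "-i", audioPath, // input 1: voice track
    "-map", "0:v:0", // take the video stream from input 0
    "-map", "1:a:0", // take the audio stream from input 1
    "-c:v", "copy", // do not re-encode the picture
    "-c:a", "aac", // encode audio for the mp4 container
    "-shortest", // stop at the end of the shorter stream
    outPath,
  ];
}

// Duration of the muxed file when the longer stream is cut.
function muxedDurationSec(videoSec: number, audioSec: number): number {
  return Math.min(videoSec, audioSec);
}

// How much material the cut discards, rounded to milliseconds.
function trimmedSec(videoSec: number, audioSec: number): number {
  return Math.round(Math.abs(videoSec - audioSec) * 1000) / 1000;
}
```

For the 8.2 sec audio and 7.9 sec video example, the output runs 7.9 sec and the last 0.3 sec of audio is dropped, which is why duration alignment has to happen upstream.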
Export & delivery
local
Final muxed mp4 lands in your storage bucket (Supabase) and on your dashboard. Platform-export presets (vertical 9:16 for TikTok / Reels, square 1:1 for feed, horizontal 16:9 for YouTube) re-frame the keeper without re-rendering. Originals stay on disk — you own the file forever.
What it can’t do
- Auto-publish to social. By design — Hybrig is a render tool, not a scheduler.
- Apply brand-style color grading. If you need DaVinci Resolve grading, the mp4 export is the handoff to your NLE.
Stacks with
- Mux output → platform-export reframe → final delivery file.
- Files persist to Supabase storage with signed-URL refresh so dashboard thumbnails don't go blank after a week.
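Re-framing without re-rendering amounts to computing a crop rectangle per preset. The sketch below assumes a centered crop with even pixel dimensions. That is an assumption, since this page does not say how the reframe picks its window, and the preset keys and function names are hypothetical.

```typescript
// Center-crop is an ASSUMPTION; the page only says presets re-frame without
// re-rendering. Preset keys and function names are illustrative.
type Preset = "vertical" | "square" | "horizontal";

interface CropRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Aspect ratios quoted in the card: 9:16, 1:1, 16:9.
const ASPECTS: Record<Preset, [number, number]> = {
  vertical: [9, 16],
  square: [1, 1],
  horizontal: [16, 9],
};

// Most video encoders want even dimensions, so round down to even.
const toEven = (n: number): number => Math.floor(n / 2) * 2;

function cropFor(srcW: number, srcH: number, preset: Preset): CropRect {
  const [aw, ah] = ASPECTS[preset];
  let width: number;
  let height: number;
  // Compare ratios by cross-multiplying to avoid floating-point ties.
  if (srcW * ah > srcH * aw) {
    // Source is wider than the target: keep full height, trim the sides.
    height = toEven(srcH);
    width = toEven((srcH * aw) / ah);
  } else {
    // Source is taller or equal: keep full width, trim top and bottom.
    width = toEven(srcW);
    height = toEven((srcW * ah) / aw);
  }
  return {
    x: Math.floor((srcW - width) / 2),
    y: Math.floor((srcH - height) / 2),
    width: width,
    height: height,
  };
}
```

On a 1920×1080 keeper this gives a 606×1080 window for vertical, 1080×1080 for square, and the untouched frame for horizontal. A subject-aware crop would need face or saliency detection on top.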

Benchmark callout · the brain slot

The pluggable LLM slot is the most-debated part of this stack. Local models in the 7B–13B range excel at summarization, transcription, classification, tagging, and short generation, which is the bulk of what an LLM gets asked to do inside a pipeline. They choke on multi-constraint generation with hard never-rules.

The watchOut field in a typical Deep Intel cohort is the canonical example: eight simultaneous “do not say X” constraints per prospect. A small model silently drops a subset of them, and in our testing the failure rate on cohort-scale generation runs at roughly 15–30%, with nobody noticing until QA. Frontier cloud models hold all eight cleanly. Hybrig picks per task; users override.

Full breakdown — constraint-holding capacity, negation handling, instruction-tuning depth, effective attention across long structured input — in When to use a local LLM and when to reach for cloud.
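Silent constraint drops are cheap to catch mechanically before QA. The sketch below checks generated scripts against a list of banned phrases and reports a cohort failure rate. The shape of watchOut (a plain list of phrases) and every name here are assumptions, and case-insensitive substring matching is a crude stand-in for real never-rule checking.

```typescript
// The shape of watchOut (a list of banned phrases) is an ASSUMPTION, as are
// the Draft type and function names.
interface Draft {
  prospectId: string;
  script: string;
  watchOut: string[];
}

// Return every banned phrase that appears in the script (case-insensitive).
function findViolations(script: string, watchOut: string[]): string[] {
  const haystack = script.toLowerCase();
  return watchOut.filter((phrase) => {
    const needle = phrase.trim().toLowerCase();
    // Ignore blank rules, which would otherwise match everything.
    return needle !== "" && haystack.indexOf(needle) !== -1;
  });
}

// Share of drafts in the cohort that break at least one rule.
function cohortFailureRate(drafts: Draft[]): number {
  if (drafts.length === 0) return 0;
  const failed = drafts.filter(
    (d) => findViolations(d.script, d.watchOut).length > 0
  );
  return failed.length / drafts.length;
}
```

A check like this only catches literal phrases. Paraphrased violations still need a stronger model or a human reviewer.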

Model lineage

Each foundational model, every public version

Hybrig doesn’t hide the model history behind a brand name. Below: every public version of the five foundational models in the local stack, in release order, with the one-line positioning each version was the first to deliver. The highlighted card is the build Hybrig actually ships with today.

Wan (Alibaba)

Local video generation — image-to-video and text-to-video.

v01 · 2025
Wan 2.1 (14B)
First Wan release that hit production-grade i2v on a single 24 GB GPU. Hybrig's reliable fallback when 2.2 misbehaves on a specific install.
v02 · 2025 · current
Wan 2.1 GGUF Q5_K_M
Quantized port that lets the 14B model run on a 4090 with VRAM headroom. The default Wan build Hybrig ships with.
v03 · 2025
Wan 2.2
Sharper close-ups and better motion than 2.1; full weights don't fit a single 24 GB card so Hybrig holds at 2.1 for now. Tracked for future swap.
v04 · 2025
Wan 2.2 Animate
Cloud variant on fal.ai with stronger motion vocabulary. Used as a contingency, never the default.

Wan model choice memorialised in project_wan_model_choice.md — Hybrig pins Wan 2.1 14B Q5_K_M GGUF on the 4090.

Flux (Black Forest Labs)

Local still-image generation — photoreal portraits and scenes.

v01 · 2024 · current
Flux.1 dev
The first open-weights Flux release that beat SDXL on photoreal faces. Hybrig's default still-image base.
v02 · 2024
Flux.1 schnell
4-step distilled variant for fast drafts. Slight quality dip; same prompt grammar as dev.
v03 · 2024
Flux.1 Redux
Image-conditioning sibling — drags the reference's backdrop, pose AND clothing into the output. Use for variations of an existing photo, not for new-scene work.
v04 · 2024
Flux.1 Fill / Canny / Depth
ControlNet-style conditioners: inpainting + structural guidance. Lets the same Flux base do edits without retraining.
v05 · 2025
Flux.1 Kontext
In-context image editing — describe the change in plain English, base model executes. Roadmap pickup once the open weights stabilise.

Flux versions sourced from the BFL public release notes; verify before quoting on press pages.

F5-TTS

Local voice cloning — flow-matching TTS that runs on your GPU.

v01 · 2024
F5-TTS Base
First public release. Flow-matching architecture; clones a voice from ~30 seconds of clean reference audio.
v02 · 2025 · current
F5-TTS v1.0
Stabilised checkpoint with better prosody on long takes. The build Hybrig ships with by default.

F5-TTS is the local-first voice path per project_voice_local_first.md. ElevenLabs is cloud contingency only.

EchoMimic

Local lipsync — regenerates the upper body from one reference frame + audio.

v01 · 2024
EchoMimic V1
Original release — face-only lipsync from a single still + audio. Worked but lost identity on long takes.
v02 · 2025 · current
EchoMimicV2
Upper-body extension. Higher fidelity than V1 but discards background motion from the input clip — trade-off Hybrig calls out explicitly.

EchoMimicV2 install requirements documented in src/app/stack/page.tsx layer 'lipsync'.

LatentSync (ByteDance)

Local lipsync — edits the mouth region of an existing video.

v01 · 2024
LatentSync 1.0
First public release. Mouth-region inpainting via a SyncNet-supervised diffusion model.
v02 · 2025
LatentSync 1.5
Tighter sync on fast speech; reduced jaw flicker on profile shots.
v03 · 2025 · current
LatentSync 1.6
The build Hybrig ships with. Best fidelity-to-speed ratio of the family; preserves background motion that EchoMimicV2 throws away.

LatentSync is the second-stop on Hybrig's local lipsync chain (EchoMimicV2 → LatentSync → fal-sync).

Use-case galleries

What each model does, broken out by intent

The same model behaves differently when asked for editorial vs. lifestyle vs. cinematic output. Each shelf below is a sub-category of one foundational model — Realistic, Cinematic, Editorial, Lifestyle, Diversity. Thumbs marked PLACEHOLDER are stand-ins until the per-category renders land in public/stack-gallery/.

Wan (Alibaba)

Local video generation — image-to-video and text-to-video.

Realistic

4 samples

Hand-held selfie cadence, available-light interiors, pickup-truck B-roll. The bread-and-butter Hybrig render.

[4 placeholder thumbnails]

Cinematic

4 samples

Anamorphic-style camera moves, slow dolly-in, late-day sun. Wan handles motion direction better than identity here — pair with a strong still.

[4 placeholder thumbnails]

Lifestyle

4 samples

Walk-and-talk, kitchen counter explainers, jobsite drive-up. Default cadence for spokesperson-style work.

[4 placeholder thumbnails]

Diversity

4 samples

Different faces, different rigs (car, van, pickup), different time-of-day. Wan's identity wobble shows up here — a good stress test.

[4 placeholder thumbnails]