The stack · transparency

Every model has limits. Here are ours.

Most AI products won’t tell you what their model can’t do. We will. Below: every layer in the Hybrig pipeline, in the order data flows through it. What each layer does, what it CAN’T do, and which other layers it stacks on. If a layer is mediocre at hands, needs 15 photos to train, or refuses real-person likeness via cloud content policy — this page says so.

For the per-model catalog (Flux dev vs. Flux schnell vs. Pony vs. SD 3.5 etc.), see /models. This page is the architecture: how those models stack into shipping pipelines.

Architecture

Nine layers, top to bottom

Data flows downward. Cloud-tinted bands lean on a hosted service; everything else runs on your GPU. The rectangles ARE the pipeline — every render touches every band, in this order, whether you click one button or ten.

  1. Inputs: Seed · photos · audio
  2. Script (Claude): N variants from one seed [CLOUD]
  3. Image (Flux / SDXL / SD 3.5): Local, on your GPU
  4. Identity (LoRA / PuLID / Redux): Lock face across shots
  5. Video (Wan / Hunyuan / LTX → Seedance / Omni): Local farm, cloud polish on hero beats
  6. Lipsync (EchoMimic / LatentSync / fal-sync): Mouth on phonemes
  7. Voice (F5-TTS / BYO / ElevenLabs fallback): Local clone, or your own recording
  8. Mux + post (ffmpeg): Audio + video → mp4
  9. Export: Your file. You own it forever.

Layer by layer

What each layer does, what it can’t

Each card is the same shape: plain-English description, the explicit limits we’ve hit in production, and the layers above and below that it stacks with.

  • Your inputs

    local

    A seed idea, a script, reference photos of the person, optional reference audio. Everything starts here. None of this leaves your machine until a layer downstream specifically calls out to a cloud service (those layers are flagged with a CLOUD badge below).

    What it can’t do
    • Generate anything by itself — this layer is just the raw material the rest of the stack works from.
    • Read your mind. Vague seeds ("make a cool ad") fan out into vague scripts. Specific seeds ("30-second testimonial about a roofing job, casual tone, no jargon") give the rest of the stack something to lock onto.
    Stacks with
    • Feeds into the script layer (Claude) when you want N variants from one seed.
    • Feeds reference photos into the identity layer (LoRA training or PuLID-Flux conditioning).
    • Feeds reference audio into the voice layer (F5-TTS local clone, or skip TTS entirely with bring-your-own-audio; ElevenLabs is wired as a cloud fallback).
  • Script generation (Claude)

    cloud

    Anthropic's Claude API turns one seed idea into N script variants — different hook, different angle, different emotional register. Word count is targeted to your clip duration (around 2.5-3 words/sec for natural cadence) so you don't get a 60-word monologue jammed into an 8-second slot. Translation pass is the same provider, separate prompt.

    What it can’t do
    • Run locally — Claude is cloud-only by API. The brain slot accepts local providers too (Ollama, llama.cpp, Gemma); we default to Claude on multi-constraint rewriter jobs where small models silently drop rules. See /learn/cloud-llm-tradeoffs for the full breakdown.
    • Match a brand voice you've never described. Tone-preference and avoid-phrases inputs help; full brand-voice cloning needs a longer style sheet.
    • Replace a copywriter on a campaign that depends on insider context the LLM doesn't have.
    Stacks with
    • Each script's word count drives the video layer's clip duration.
    • Each script feeds the voice layer (TTS) verbatim.
    • Script + industry preset → /api/hybrig/batch fans out N renders in one queue.
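
The word-count targeting described in this card can be sketched as below. This is a minimal illustration that assumes the 2.5-3 words/sec range quoted above; the function name `target_word_count` is hypothetical and not part of Hybrig's API.

```python
import math

def target_word_count(clip_seconds: float,
                      min_wps: float = 2.5,
                      max_wps: float = 3.0) -> tuple:
    """Return (low, high) word-count bounds for a clip of the given length."""
    if clip_seconds <= 0:
        raise ValueError("clip_seconds must be positive")
    low = math.floor(clip_seconds * min_wps)
    high = math.floor(clip_seconds * max_wps)
    return low, high

print(target_word_count(8))   # (20, 24) for an 8-second slot
print(target_word_count(30))  # (75, 90) for a 30-second testimonial
```

A 60-word script falls far outside the 8-second bounds, which is the mismatch the targeting is meant to prevent.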
  • Still-image generation (Flux / SDXL / SD 3.5 / PixArt)

    local

    All seven still-image models run on your own GPU via ComfyUI. Default is Flux dev for photoreal portraits. SDXL+Pony for stylized, SDXL+Illustrious for anime, SD 3.5 Large for cinematic, HiDream-I1 when Flux falls short on detail, Flux schnell for fast drafts, PixArt-Σ for low-VRAM rigs. Full per-model breakdown is on /models.

    What it can’t do
    • Render video. Stills only — feed the keeper into the video layer (image-to-video).
    • Lock identity by themselves. The base models will give you 'a person who looks vaguely like the reference' without a LoRA or PuLID-Flux on top.
    • Run on a 6 GB GPU for the heavy models. Flux dev needs ~12 GB; HiDream needs ~16 GB. PixArt-Σ is the 6 GB option.
    Stacks with
    • Feeds the identity layer — the still gets re-rendered with LoRA / PuLID / Redux conditioning for face lock.
    • The keeper still becomes the reference frame for image-to-video in the video layer.
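
As a rough illustration of the VRAM limits in this card, the sketch below filters models by the floors quoted above. Only the three models with a stated figure are included; the dictionary and the function name `models_that_fit` are hypothetical.

```python
# Approximate VRAM floors in GB, taken from the limits listed in this card.
# Models without a stated figure on this page are deliberately left out.
VRAM_FLOOR_GB = {
    "HiDream-I1": 16,
    "Flux dev": 12,
    "PixArt-Σ": 6,
}

def models_that_fit(vram_gb: float) -> list:
    """Return the listed models whose quoted VRAM floor fits the given card."""
    return [name for name, floor in VRAM_FLOOR_GB.items() if vram_gb >= floor]

print(models_that_fit(8))  # ['PixArt-Σ']
```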
  • Identity & conditioning (LoRA / PuLID / Redux / IPAdapter / Wardrobe)

    local + cloud

    The hardest problem in AI video isn't generating one good frame — it's generating the SAME face across fifty frames, twenty shots, six campaigns. Hybrig stacks five identity layers, ranked by lock strength: trained LoRA (strongest), PuLID-Flux (single-photo geometry lock), Flux Redux (style+composition), IPAdapter (SDXL family), and wardrobe-lock via Gemini 2.5 Flash Image (the one cloud entry in this layer, used only for outfit swaps).

    What it can’t do
    • Trained LoRA: needs 15-30 varied photos. Ten photos in one outfit collapse the LoRA to a generic face. Training takes 1-2 hours of GPU time.
    • PuLID-Flux: locks face geometry well, but won't carry hairstyle or clothing — those drift. Recommended strength 0.9-1.0 per the v0.9.1 README.
    • Flux Redux: drags the reference's backdrop, pose, AND clothing into the output. Use it for variations of an existing photo — NOT for putting your face in a new place. PuLID is the better tool for new-scene work.
    • IPAdapter: SDXL-family only. It never sees a Flux model.
    • Wardrobe-lock (Gemini): cloud-only — files leave the device for this one step. Outfit fidelity drops on heavily patterned or branded garments. Won't change pose or scene.
    Stacks with
    • LoRA + Flux dev = strongest still-image identity lock available on Hybrig.
    • LoRA + Wan 2.2 = identity-locked video on free local.
    • PuLID + scene prompt = drop a face into a new scene without training a LoRA first.
    • PuLID and Redux conflict on the same Flux conditioning slot — Hybrig picks PuLID when both are available.
    • Gemini wardrobe edit → re-shoot the edited image into a LoRA training set when you want a new wardrobe permanently locked into the character.
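
The stacking rules in this card (PuLID beats Redux on the shared Flux slot, IPAdapter is SDXL-only) can be sketched as a small resolver. This is an assumption-laden illustration, not Hybrig's routing code: the function name, the `"flux"` / `"sdxl"` labels, and the conditioner keys are all hypothetical.

```python
def resolve_identity_stack(base_family: str, available: set) -> list:
    """Pick identity conditioners for one render.

    base_family: "flux" or "sdxl" (hypothetical labels).
    available:   subset of {"lora", "pulid", "redux", "ipadapter"}.
    """
    chosen = []
    if "lora" in available:
        chosen.append("lora")  # strongest lock, when a trained LoRA exists
    if base_family == "flux":
        # PuLID and Redux compete for the same Flux conditioning slot;
        # the card says PuLID wins when both are available.
        if "pulid" in available:
            chosen.append("pulid")
        elif "redux" in available:
            chosen.append("redux")
    elif base_family == "sdxl" and "ipadapter" in available:
        chosen.append("ipadapter")  # IPAdapter never sees a Flux model
    return chosen

print(resolve_identity_stack("flux", {"pulid", "redux"}))  # ['pulid']
```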
  • Video generation (Wan 2.2 / Wan 2.2 FLF2V / Wan 2.1 / Hunyuan / LTX / Seedance / Gemini Omni)

    local + cloud

    Default is Wan 2.2: it runs on your GPU, free forever, at 3-8 min per 5-second clip on a 4090. The auto-fallback chain tries Wan 2.2 → Wan 2.1 → Hunyuan → LTX before reaching for cloud. Wan 2.2 FLF2V is the dedicated first/last-frame interpolation node: feed it two images (start + end) and get the morph clip between them. It is the only local model with this capability. Cloud has two distinct roles, not one: Seedance 2.0 (ByteDance via fal.ai) is the rescue valve for face lock + lip-sync; Gemini Omni (Google DeepMind, via Google AI subscription) is the physics + world-knowledge polish for hero beats. Both are polish-after-farming, never the farm engine, because cloud per-render economics break at cohort scale. Wan 2.2 Animate and VEO are also wired through fal as alternative cloud options.

    What it can’t do
    • Wan 2.2: ~10-20% behind Seedance on tight close-ups; weaker on hands and crowd scenes. No native lip-sync — needs the lipsync layer below.
    • Wan 2.2 FLF2V: requires Wan 2.2 Q5_K_M GGUF (~12 GB quantized build) AND kijai/WanVideoWrapper custom nodes installed in ComfyUI. Both source images must have identical framing/composition or the morph looks broken. Caps at 5 sec output.
    • Wan 2.1: ~70-80% of Seedance quality; kept around as fallback when 2.2 misbehaves on a specific install. ALSO — for standard image-to-video on a 4090, 2.1 still beats 2.2's quantized build in our tests. Newer is not always better; pick by capability needed, not version number.
    • HunyuanVideo: 6-12 min per 5-second clip — slower than Wan. Identity lock weaker on close-ups. No native lip-sync.
    • LTX-Video: visibly lower quality than Wan 2.x or Hunyuan. Best for drafting motion direction, not finals.
    • Seedance: cloud-only, billed per second of output. Failed renders still consume the credit. Backgrounds drift on clips longer than ~6 sec. AND Seedance refuses real-person likeness via content policy on premium SaaS routes — this is part of WHY Hybrig defaults to local: cloud video models are increasingly locked down on real faces.
    • Gemini Omni: subscription-gated (Google AI), region-limited, allowlist-controlled. Per-render metered. CANNOT be used as a farming engine — unit economics break the moment you multiply by cohort size. Polish layer only, applied after local farm has finalized the render direction, on the one or two hero beats where physics simulation or world-knowledge compositing actually buys you something local can't fake.
    Stacks with
    • Image-to-video: keeper still from the image layer + LoRA conditioning → silent video clip.
    • First/last frame interpolation: two matched Flux stills (e.g. fresh shingle year 0 + aged shingle year 25) → Wan 2.2 FLF2V → local weathering time-lapse, no cloud.
    • Wan 2.2 + LoRA = identity-locked video on free local.
    • Wan 2.2 for drafts, Seedance for the keeper finals when budget allows.
    • Wan farm settles direction → identify the one or two hero beats that need physics or world-knowledge compositing → Omni polish on those beats only.
    • Output goes to the lipsync layer if the script needs visible mouth movement, or skips straight to mux for B-roll.
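
The auto-fallback chain described in this card could look roughly like the sketch below. The model labels, the `render` callable, and the `allow_cloud` switch are hypothetical; the only thing taken from the page is the order Wan 2.2 → Wan 2.1 → Hunyuan → LTX, with cloud last.

```python
from typing import Callable, Optional, Tuple

LOCAL_CHAIN = ["wan-2.2", "wan-2.1", "hunyuan", "ltx"]  # order from this card

def render_with_fallback(
    render: Callable[[str], Optional[str]],
    allow_cloud: bool = False,
    cloud_model: str = "seedance-2.0",
) -> Tuple[str, str]:
    """Try each local model in order; reach for cloud only if allowed.

    `render(model)` is a hypothetical callable that returns an output path,
    or returns None / raises RuntimeError when that model fails.
    """
    chain = LOCAL_CHAIN + ([cloud_model] if allow_cloud else [])
    for model in chain:
        try:
            out = render(model)
        except RuntimeError:
            out = None
        if out is not None:
            return model, out
    raise RuntimeError("all models in the chain failed")
```

Keeping cloud behind an explicit flag mirrors the polish-after-farming rule: a failed local render should not silently start a metered one.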
  • Lipsync (EchoMimicV2 / LatentSync / fal-sync)

    local + cloud

    Three lipsync paths, ranked by where the work runs: EchoMimicV2 regenerates the upper body from a single reference frame + audio (highest fidelity for spokesperson shots, but loses background motion); LatentSync 1.6 edits the mouth region of an existing video (preserves background, lower fidelity than EchoMimic); fal-sync is the cloud sync-lipsync API (production default, billed per render). Hybrig probes ComfyUI's /object_info to detect which local lipsync nodes are actually installed before submitting a workflow.

    What it can’t do
    • EchoMimicV2: regenerates the body from one frame, so you LOSE any background motion that was in the input video. Trade-off — better lipsync, worse continuity.
    • EchoMimicV2: requires sd-vae-ft-mse VAE in models/vae/, four .pth files in models/echo_mimic/v2/, plus whisper_tiny.pt and the sd-image-variations init UNet. If any of those are missing the /object_info probe disables the local path silently.
    • LatentSync: lipsync only. Doesn't add expressions, doesn't fix gaze drift. Edits the mouth region — leaves the rest of the frame alone.
    • fal-sync: cloud-only, billed per render, and a failed render still consumes the credit.
    • All three: don't match emotion across long takes. If the script swings from calm to angry, the lipsync handles phonemes but the FACE keeps the input video's emotional register.
    Stacks with
    • Video layer output (silent clip) + voice layer output (audio) → lipsync layer → mouth-synced clip.
    • Local-first routing: pipeline tries EchoMimicV2 → LatentSync → fal-sync if enableLipsync=true.
    • Skip this layer entirely on B-roll, voiceover-only, or ambient-music shots.
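
The local-first routing above can be sketched against the parsed response of ComfyUI's /object_info endpoint, which is a JSON object keyed by node class name. The node class names below are placeholders, not the real ones; check your own install before relying on them.

```python
from typing import Optional

# Hypothetical node class names, used for illustration only.
LOCAL_LIPSYNC_NODES = {
    "echomimic_v2": "EchoMimicV2Sampler",
    "latentsync": "LatentSyncNode",
}

def pick_lipsync_path(object_info: dict, enable_lipsync: bool = True) -> Optional[str]:
    """Return the first usable lipsync path: local first, cloud last."""
    if not enable_lipsync:
        return None  # B-roll, voiceover-only and ambient shots skip this layer
    for path in ("echomimic_v2", "latentsync"):
        if LOCAL_LIPSYNC_NODES[path] in object_info:
            return path
    return "fal-sync"

print(pick_lipsync_path({"LatentSyncNode": {}}))  # latentsync
```

Because a missing dependency simply drops a node from the probe result, logging which path was picked makes the silent fallback visible.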
  • Voice (F5-TTS local clone / BYO audio / ElevenLabs fallback)

    local + cloud

    Three voice paths. F5-TTS is the default: it clones from a 10–30 second sample, runs on your GPU, and audio never leaves the machine. Hybrig's F5-TTS node ships per-sentence chunking with a 250 ms silence pad between sentences by default, because long run-on input makes any TTS engine, cloud or local, run sentences into each other. Bring-your-own-audio is the free path: upload a recording, Hybrig muxes it onto the video. ElevenLabs Multilingual v2 is wired as a cloud fallback when F5-TTS isn't installed or the user wants its prosody.

    What it can’t do
    • F5-TTS: ~10–30s of clean reference audio is mandatory; noisy or compressed samples produce noisy clones. Long run-on text without sentence-level chunking still causes overlap — the chunker fixes the call shape, not the input prose.
    • BYO audio: requires you to record clean audio yourself (any USB mic or the Sony A7 IV's XLR does the job). No clone — every script is a fresh take.
    • ElevenLabs: cloud-only — script + voice sample upload to ElevenLabs servers. Subscription required. Failed renders still consume credits. Used as the cloud fallback path now, not the default.
    Stacks with
    • Script layer text → voice layer audio → lipsync layer (if enabled) → mux.
    • F5-TTS chunked-VO benchmark observed 2026-05-08: chunking + silence inserts ≈ 2× compute time (one model invocation per sentence vs. one for the whole block) with a categorical quality jump in sentence-to-sentence transitions. Same model, same voice clone, same speed setting — only the input segmentation changes.
    • BYO audio is the cheapest, most private path: record yourself once, queue 50 renders, pay $0 in voice fees.
    • Full technique walk-through (punctuation discipline, what works and what doesn't): /learn/voice-cadence.
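
Per-sentence chunking with a silence pad can be sketched as below. This is not Hybrig's node: the splitting regex is a simple stand-in, the function names are hypothetical, and the 24 kHz sample rate is an assumption (use whatever rate your TTS output reports).

```python
import re

def chunk_script(text: str) -> list:
    """Split a script into sentences on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def silence_samples(pad_ms: int = 250, sample_rate: int = 24_000) -> int:
    """Number of zero samples needed for one silence pad."""
    return pad_ms * sample_rate // 1000

def total_pad_seconds(sentences: list, pad_ms: int = 250) -> float:
    """Silence added between sentences (none after the last one)."""
    return max(len(sentences) - 1, 0) * pad_ms / 1000

script = "We fixed the roof. It took two days! Would you call them again?"
sentences = chunk_script(script)
print(len(sentences), total_pad_seconds(sentences))  # 3 0.5
```

Each sentence becomes one model invocation, which is where the roughly 2× compute cost noted above comes from.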
  • Mux & post (ffmpeg-api / local ffmpeg)

    local + cloud

    The audio + video combiner. fal-ai/ffmpeg-api/merge-audio-video is the default cloud mux (fast, no local ffmpeg dependency); local ffmpeg is wired as a fallback for offline / private workflows. This layer is dumb plumbing — it doesn't generate anything, it just sticks the audio onto the silent video.

    What it can’t do
    • Generate any media of its own. Strictly mux + cut + format conversion.
    • Fix a sync mismatch. If the audio is 8.2 sec and the video is 7.9 sec, the mux just cuts the longer one. Lipsync and clip-duration alignment happen upstream.
    Stacks with
    • Lipsync output (or video output if no lipsync) + voice output → muxed mp4 with audio.
    • Output goes to the export layer for final delivery.
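
For the local ffmpeg fallback, a mux command along these lines would reproduce the behaviour described above, including cutting to the shorter input. It is a sketch using standard ffmpeg options; the exact command Hybrig runs is not shown on this page, and `build_mux_command` is a hypothetical helper.

```python
def build_mux_command(video_path: str, audio_path: str, out_path: str) -> list:
    """Build an ffmpeg argument list that lays audio over a silent video."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-i", audio_path,
        "-map", "0:v:0",   # video stream from the first input
        "-map", "1:a:0",   # audio stream from the second input
        "-c:v", "copy",    # do not re-encode the picture
        "-c:a", "aac",     # encode audio for an mp4 container
        "-shortest",       # stop at the shorter input; no sync repair here
        out_path,
    ]

print(" ".join(build_mux_command("clip.mp4", "voice.wav", "final.mp4")))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed.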
  • Export & delivery

    local

    Final muxed mp4 lands in your storage bucket (Supabase) and on your dashboard. Platform-export presets (vertical 9:16 for TikTok / Reels, square 1:1 for feed, horizontal 16:9 for YouTube) re-frame the keeper without re-rendering. Originals stay on disk — you own the file forever.

    What it can’t do
    • Auto-publish to social. By design — Hybrig is a render tool, not a scheduler.
    • Apply brand-style color grading. If you need grading in DaVinci Resolve or another NLE, the mp4 export is the handoff.
    Stacks with
    • Mux output → platform-export reframe → final delivery file.
    • Files persist to Supabase storage with signed-URL refresh so dashboard thumbnails don't go blank after a week.
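
One way to re-frame a keeper without re-rendering is a centered crop to the target aspect ratio. The sketch below assumes center cropping, which this page does not confirm; the preset names and `center_crop_box` are hypothetical.

```python
PRESETS = {"vertical": (9, 16), "square": (1, 1), "horizontal": (16, 9)}

def center_crop_box(src_w: int, src_h: int, ratio_w: int, ratio_h: int) -> tuple:
    """Return (x, y, width, height) of the largest centered crop at ratio_w:ratio_h."""
    if src_w * ratio_h > src_h * ratio_w:
        # Source is wider than the target: keep full height, trim the sides.
        h = src_h
        w = src_h * ratio_w // ratio_h
    else:
        # Source is taller or equal: keep full width, trim top and bottom.
        w = src_w
        h = src_w * ratio_h // ratio_w
    return (src_w - w) // 2, (src_h - h) // 2, w, h

print(center_crop_box(1080, 1920, *PRESETS["square"]))  # (0, 420, 1080, 1080)
```

Integer division can yield odd dimensions for some ratios, so a real exporter would likely round to even values before encoding.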

Benchmark callout · the brain slot

The pluggable LLM slot is the most-debated part of this stack. Local models in the 7B–13B range excel on summarization, transcription, classification, tagging, short generation — the bulk of what an LLM gets asked to do inside a pipeline. They choke on multi-constraint generation with hard never-rules.

The watchOut field in a typical Deep Intel cohort is the canonical example: eight simultaneous “do not say X” constraints per prospect. A small model silently drops a subset of them, and in our testing the failure rate on cohort-scale generation runs at roughly 15–30%, unnoticed until QA. Frontier cloud models hold all eight cleanly. Hybrig picks per task; users override.

Full breakdown — constraint-holding capacity, negation handling, instruction-tuning depth, effective attention across long structured input — in When to use a local LLM and when to reach for cloud.
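
A dropped never-rule is easy to miss by eye, so a mechanical check before QA helps. The sketch below is a crude proxy that only does case-insensitive substring matching; it is not Hybrig's QA step, and both function names are hypothetical.

```python
def violated_rules(text: str, never_phrases: list) -> list:
    """Return the banned phrases that appear in text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in never_phrases if p.lower() in lowered]

def failure_rate(outputs: list, never_phrases: list) -> float:
    """Share of outputs that break at least one never-rule."""
    if not outputs:
        return 0.0
    failed = sum(1 for o in outputs if violated_rules(o, never_phrases))
    return failed / len(outputs)

rules = ["cheapest", "guarantee"]
drafts = ["We guarantee results.", "Solid work, fair price.", "Fast and tidy."]
print(violated_rules(drafts[0], rules))  # ['guarantee']
```

Running `failure_rate` over a whole cohort gives a number comparable to the 15–30% range quoted above, though paraphrased violations will slip past a substring match.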

Model lineage

Each foundational model, every public version

Hybrig doesn’t hide the model history behind a brand name. Below: every public version of the five foundational models in the local stack, in release order, with the one-line positioning each version was the first to deliver. The highlighted card is the build Hybrig actually ships with today.

Wan (Alibaba)

Local video generation — image-to-video and text-to-video.

  1. v01 · 2025
    Wan 2.1 (14B)

    First Wan release that hit production-grade i2v on a single 24 GB GPU. Hybrig's reliable fallback when 2.2 misbehaves on a specific install.

  2. v02 · 2025 · current
    Wan 2.1 GGUF Q5_K_M

    Quantized port that lets the 14B model run on a 4090 with VRAM headroom. The default Wan build Hybrig ships with.

  3. v03 · 2025
    Wan 2.2

    Sharper close-ups and better motion than 2.1; full weights don't fit a single 24 GB card so Hybrig holds at 2.1 for now. Tracked for future swap.

  4. v04 · 2025
    Wan 2.2 Animate

    Cloud variant on fal.ai with stronger motion vocabulary. Used as a contingency, never the default.

Wan model choice memorialised in project_wan_model_choice.md — Hybrig pins Wan 2.1 14B Q5_K_M GGUF on the 4090.

Flux (Black Forest Labs)

Local still-image generation — photoreal portraits and scenes.

  1. v01 · 2024 · current
    Flux.1 dev

    The first open-weights Flux release that beat SDXL on photoreal faces. Hybrig's default still-image base.

  2. v02 · 2024
    Flux.1 schnell

    4-step distilled variant for fast drafts. Slight quality dip; same prompt grammar as dev.

  3. v03 · 2024
    Flux.1 Redux

    Image-conditioning sibling — drags the reference's backdrop, pose AND clothing into the output. Use for variations of an existing photo, not for new-scene work.

  4. v04 · 2024
    Flux.1 Fill / Canny / Depth

    ControlNet-style conditioners: inpainting + structural guidance. Lets the same Flux base do edits without retraining.

  5. v05 · 2025
    Flux.1 Kontext

    In-context image editing — describe the change in plain English, base model executes. Roadmap pickup once the open weights stabilise.

Flux versions sourced from the BFL public release notes; verify before quoting on press pages.

F5-TTS

Local voice cloning — flow-matching TTS that runs on your GPU.

  1. v01 · 2024
    F5-TTS Base

    First public release. Flow-matching architecture; clones a voice from ~30 seconds of clean reference audio.

  2. v02 · 2025 · current
    F5-TTS v1.0

    Stabilised checkpoint with better prosody on long takes. The build Hybrig ships with by default.

F5-TTS is the local-first voice path per project_voice_local_first.md. ElevenLabs is cloud contingency only.

EchoMimic

Local lipsync — regenerates the upper body from one reference frame + audio.

  1. v01 · 2024
    EchoMimic V1

    Original release — face-only lipsync from a single still + audio. Worked but lost identity on long takes.

  2. v02 · 2025 · current
    EchoMimicV2

    Upper-body extension. Higher fidelity than V1 but discards background motion from the input clip — trade-off Hybrig calls out explicitly.

EchoMimicV2 install requirements documented in src/app/stack/page.tsx layer 'lipsync'.

LatentSync (ByteDance)

Local lipsync — edits the mouth region of an existing video.

  1. v01 · 2024
    LatentSync 1.0

    First public release. Mouth-region inpainting via a SyncNet-supervised diffusion model.

  2. v02 · 2025
    LatentSync 1.5

    Tighter sync on fast speech; reduced jaw flicker on profile shots.

  3. v03 · 2025 · current
    LatentSync 1.6

    The build Hybrig ships with. Best fidelity-to-speed ratio of the family; preserves background motion that EchoMimicV2 throws away.

LatentSync is the second-stop on Hybrig's local lipsync chain (EchoMimicV2 → LatentSync → fal-sync).

Use-case galleries

What each model does, broken out by intent

The same model behaves differently when asked for editorial vs. lifestyle vs. cinematic output. Each shelf below is a sub-category of one foundational model — Realistic, Cinematic, Editorial, Lifestyle, Diversity. Thumbs marked PLACEHOLDER are stand-ins until the per-category renders land in public/stack-gallery/.

Wan (Alibaba)

Local video generation — image-to-video and text-to-video.

Realistic

4 samples

Hand-held selfie cadence, available-light interiors, pickup-truck B-roll. The bread-and-butter Hybrig render.

  • 4 placeholder stills

Cinematic

4 samples

Anamorphic-style camera moves, slow dolly-in, late-day sun. Wan handles motion direction better than identity here — pair with a strong still.

  • 4 placeholder stills

Lifestyle

4 samples

Walk-and-talk, kitchen counter explainers, jobsite drive-up. Default cadence for spokesperson-style work.

  • 4 placeholder stills

Diversity

4 samples

Different faces, different rigs (car, van, pickup), different time-of-day. Wan's identity wobble shows up here — a good stress test.

  • 4 placeholder stills

Flux (Black Forest Labs)

Local still-image generation — photoreal portraits and scenes.

Realistic

4 samples

Photoreal portraits with a trained LoRA on top. The strongest identity-locked still Hybrig can produce on local.

  • 4 placeholder stills

Cinematic

4 samples

Wide environmental portraits, telephoto compression, controlled rim light. Flux dev + style LoRA territory.

  • 4 placeholder stills

Editorial

4 samples

Magazine-style framing, intentional negative space, restrained colour. Where Flux beats SDXL most clearly.

  • 4 placeholder stills

Lifestyle

4 samples

Soft-lit kitchen, jobsite mid-morning, golden-hour driveway. The 'feels like a real photo' zone.

  • 4 placeholder stills

Diversity

4 samples

Multiple subjects, varied skin tones, varied hairstyles. Honest-look check that the LoRA isn't collapsing to one face.

  • 4 placeholder stills

F5-TTS

Local voice cloning — flow-matching TTS that runs on your GPU.

Realistic

3 samples

Conversational read of a 60-word script. Reference-clone fidelity check: prosody matches, not just timbre.

  • 3 placeholder stills

Cinematic

3 samples

Slower, deliberate cadence with intentional breath pauses. Stress test for long-form narration.

  • 3 placeholder stills

Lifestyle

3 samples

Casual UGC tone — laughing-while-talking, conversational hesitations preserved.

  • 3 placeholder stills

EchoMimic

Local lipsync — regenerates the upper body from one reference frame + audio.

Realistic

3 samples

Single-frame upper-body regen tied to an F5-TTS clone. Hybrig's default local lipsync path.

  • 3 placeholder stills

Cinematic

3 samples

Held close-up with shallow depth-of-field; EchoMimic preserves the bokeh on the regenerated body.

  • 3 placeholder stills

Lifestyle

3 samples

Talking-head on a kitchen counter, soft window light. Background motion is intentionally minimal — plays to EchoMimic's strengths.

  • 3 placeholder stills

LatentSync (ByteDance)

Local lipsync — edits the mouth region of an existing video.

Realistic

3 samples

Mouth-only edit on existing Wan 2.1 footage — background motion preserved. The everyday LatentSync output.

  • 3 placeholder stills

Cinematic

3 samples

Tracking shot through a doorway, sync edit applied to the speaker only. LatentSync's compositional respect shows up here.

  • 3 placeholder stills

Lifestyle

3 samples

Walk-and-talk with hand gestures and head turns. LatentSync handles the off-axis sync without breaking the rest of the frame.

  • 3 placeholder stills

Focal isolation · Flux + LoRA

Which face regions the identity layer can actually target

Cloud avatar tools treat “face lock” as one button. The reality on Flux + LoRA + PuLID is six separate regions, each with different behaviour and different limits. The diagram below is the cheat-sheet — what each region locks, what it leaves drifting, and which prompt you actually need to add to compensate.

face · skin · nose · mouth · eye · hair

static diagram · no interactivity yet

  • Face geometry

    face

    PuLID-Flux locks the underlying bone structure across shots. Strongest single anchor for 'is this the same person'.

    Won't hold hairstyle, won't hold clothing — those drift. Pair with a dedicated LoRA for full lock.

  • Skin texture

    skin

    Skin-detail LoRAs add pore-level realism the base Flux model smooths out. Preserves age and weathering.

    Stack too high and faces start to look hyperreal — sweet spot is 0.4-0.6 strength.

  • Nose shape

    nose

    Trained LoRA preserves the bridge, tip and nostril shape that PuLID alone can drift on at 3/4 angles.

    The hardest single feature for face-swap models — expect 1-in-5 frames to need a re-roll.

  • Mouth shape

    mouth

    Lip thickness, smile asymmetry, philtrum depth. Identity layer locks shape; lipsync layer drives motion.

    Identity LoRA only sets shape, NOT motion. Lipsync is a separate layer downstream.

  • Eye shape + colour

    eyes

    Iris colour, eye shape, eyelid fold. Where PuLID-Flux reliably wins over IPAdapter for identity.

    Gaze direction is set by the prompt, not the identity LoRA. 'Looking at camera' has to be in the prompt.

  • Hair (line + colour)

    hair

    Trained LoRA locks hairline and dominant colour. PuLID alone will not hold these.

    Hairstyle (length, parting, texture) drifts with each render — describe it in the prompt every time.

Pipeline flows

Three concrete pipelines, end to end

Same nine layers, three different routings. The first stays on your rig as much as possible. The second pays a cloud bill to save a critical shot. The third runs zero cloud video at all — bring your own audio, the rig does the rest.

  • Full-local spokesperson

    You have a trained LoRA of yourself, a reference recording for voice clone, and you want a 30-second testimonial. Everything after the script call stays on your rig, and the voice clone runs locally on F5-TTS.

    1. local · Inputs: Seed idea + LoRA + 30-sec voice sample
    2. cloud · Script (Claude): 1 variant, 75-90 word target
    3. local · Image (Flux dev + LoRA): Reference still, identity locked
    4. local · Video (Wan 2.2 + LoRA): Image-to-video, silent clip
    5. local · Voice (F5-TTS chunked): Per-sentence + 250ms silence pad, local on your GPU
    6. local · Lipsync (EchoMimicV2): Local lipsync if installed, else LatentSync
    7. local · Mux + export: Local ffmpeg if available, else fal-ffmpeg

    Cost: $0 per render after the script-layer LLM call. Voice runs free on your rig.

  • Cloud rescue for the keeper

    Wan 2.2 missed on a critical close-up. The face goes off-model on second 4. You burn a Seedance render to save the shot.

    1. local · Inputs: Same seed, same reference still
    2. local · Image (Flux dev + LoRA): Same reference still, identity layer pre-locked
    3. cloud · Video (Seedance 2.0 via fal): Cloud, billed per second
    4. cloud · Voice (ElevenLabs): Reuse the audio from the local pass
    5. cloud · Lipsync (fal-sync): Tight cloud lipsync to match the cloud video grade
    6. cloud · Mux + export: fal-ffmpeg for parity with the cloud chain

    Cost: Per-second cloud video bill + lipsync bill. Use sparingly: this is the rescue valve, not the default.

  • Bring-your-own everything

    Maximum privacy mode. You record your voice yourself (no clone), you bring your own script, and you have a LoRA. With script generation skipped, every step runs on your GPU.

    1. local · Inputs: Pre-recorded audio file + LoRA + reference photo
    2. local · Script: Skip, bring your own script
    3. local · Image (Flux dev + LoRA): Reference still, identity locked
    4. local · Video (Wan 2.2 + LoRA): Silent clip
    5. local · Voice: Skip TTS, your file IS the audio
    6. local · Lipsync (EchoMimicV2 / LatentSync): Local lipsync to your real voice
    7. local · Mux + export: Local ffmpeg

    Cost: $0 per render, period. The rig is the only cost, and you already own it.
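
The three routings above can be written down as data, which makes it easy to see where a render leaves the machine. This is an illustrative structure only: the `Step` type and layer labels are hypothetical, and the local/cloud tags are copied from the step lists on this page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    layer: str
    where: str  # "local" or "cloud", as tagged in the step lists above

FULL_LOCAL = [
    Step("inputs", "local"), Step("script", "cloud"), Step("image", "local"),
    Step("video", "local"), Step("voice", "local"), Step("lipsync", "local"),
    Step("mux+export", "local"),
]
CLOUD_RESCUE = [
    Step("inputs", "local"), Step("image", "local"), Step("video", "cloud"),
    Step("voice", "cloud"), Step("lipsync", "cloud"), Step("mux+export", "cloud"),
]
BRING_YOUR_OWN = [
    Step("inputs", "local"), Step("script", "local"), Step("image", "local"),
    Step("video", "local"), Step("voice", "local"), Step("lipsync", "local"),
    Step("mux+export", "local"),
]

def cloud_steps(pipeline: list) -> list:
    """List the layers in a pipeline that call out to a hosted service."""
    return [s.layer for s in pipeline if s.where == "cloud"]

print(cloud_steps(FULL_LOCAL))      # ['script']
print(cloud_steps(BRING_YOUR_OWN))  # []
```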

Current verdict · 2026-05-01

Why this stack, today

Every layer above tells you what a model CAN’T do. This section tells you what we actually picked, and what we tried first that didn’t make the cut. Plain English. No methodology footnotes. Receipts on the roadmap.

STEP 01 · Flux + PuLID · Locked still frame
STEP 02 · Wan 2.2 I2V · Animate to 5-sec clip
STEP 03 · DaVinci Resolve · Voice + cadence post

Step 01 — Flux + PuLID-Flux for the still frame

pick

We tried this with a 21-photo studio pack of the user’s face. Flux gives the photoreal base; PuLID-Flux locks face geometry across all 21 reference frames at once — not just one. The keeper still comes out on-model on the first or second pass, no hand-tuning. This is the front door of every spokesperson render the user ships today.

Step 02 — Wan 2.2 image-to-video

pick

The Flux+PuLID still already nailed identity, so the video layer only has one job: animate ONE high-fidelity frame into a 5-second clip. Wan 2.2 i2v does that on a 4090 in 3-8 minutes, free, on the user’s rig. Cloud video models refuse real faces; Wan doesn’t care. This is where the whole local-first thesis pays off.

Step 03 — DaVinci Resolve for audio post

pick

Wan generates its own throwaway audio. We mute it, drop the ElevenLabs voice clone of the user onto the timeline in Resolve, and slide cadence to match the mouth motion the video already has. No lipsync model in the loop for this pipeline — Resolve is the user’s NLE anyway, and the manual slide is faster and cleaner than another generative pass.

What we tested and rejected

Six alternative routes we ran on the same 21-photo pack before settling on the picks above. Verdict tag, one line, no spin.

  • Flux + IP-Adapter (XLabs)

    Better at mood and color than identity geometry. PuLID wins for matching a real person across shots.

    style transfer, not face-lock
  • HiDream + IP-Adapter

    Workable, but the IP-Adapter port is newer and the LoRA library is thinner. No clear win over Flux+PuLID.

    smaller ecosystem
  • Flux + LoRA only

    Works once the dataset hits 15-30 varied shots. Today’s 21-photo single-session pack is usable but not bulletproof. PuLID gets there faster from the same pack.

    needs more photo variety
  • SDXL + IP-Adapter

    Installed, working, fine for stylized or painterly work. Won’t hold up as photoreal James.

    not photoreal enough
  • Wan + character pack alone (i2v with first frame)

    Silently uses 1 of N reference images. Architectural mismatch with multi-frame studio packs — the other 20 photos do nothing.

    throws away your reference set
  • Seedance multi-ref (cloud)

    Native multi-reference works. Reserved for hero-tier shots where the budget supports it. Not the local-first default.

    great, but $0.24-0.30/sec

These verdicts will change as the tech evolves. We’ll track them over time on a public score timeline. See the roadmap for what’s coming.

Positioning

Engineered for creators who own their rigs.

The cloud avatar industry sells rented seats and capped credits to people who don’t own a GPU. Hybrig is built for the other side of that line — creators who already have the iron, who want to OWN their stack instead of paying for the privilege of borrowing it. Every model in this page runs on hardware you control, with files you keep, on schedules you set. No queue. No cap. No upload. Pro tools for the rig in your room.

Observed benchmark · 2026-05-08

Wall-clock, on one 4090, rendered serial

A single test session, six 87-second 1080×1920 vertical commercials, one customization slot wired (contractor name overlaid on a truck door panel). Master stills + master voice-over generated once and reused; only the prospect data varied per render. Numbers are observed, not extrapolated.

Run         Wall-clock    Notes
Render 01   65.5s         Cold start, master stills loaded from disk.
Render 02   64.2s         Warm cache.
Render 03   66.7s         Warm cache.
Render 04   62.3s         Warm cache.
Render 05   61.5s         Warm cache.
Render 06   62.6s         Warm cache.
Average     ~64s/render   Single 4090, serial. Master assets reused across all 6 renders.

Linear projection from the same average across a 76-prospect cohort: ~80 minutes serial on one 4090; ~27–40 minutes with 2–3 concurrent renders on one rig, assuming near-linear scaling (within VRAM headroom for the 15-still + overlay path). Marginal cost per render: electricity. No per-render meter, no credit decrement, no failed-render charge.
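
The projection can be reproduced from the six observed wall-clock times. The sketch assumes perfectly linear scaling across concurrent renders, which real runs will not quite reach; the function names are hypothetical.

```python
import math

RENDER_SECONDS = [65.5, 64.2, 66.7, 62.3, 61.5, 62.6]  # observed runs 01-06

def average(values: list) -> float:
    """Arithmetic mean of the observed wall-clock times."""
    return sum(values) / len(values)

def cohort_minutes(per_render_s: float, cohort_size: int, concurrency: int = 1) -> float:
    """Linear projection: renders run in batches of `concurrency`."""
    batches = math.ceil(cohort_size / concurrency)
    return batches * per_render_s / 60

avg = average(RENDER_SECONDS)             # about 63.8 seconds
serial = cohort_minutes(64, 76)           # about 81 minutes
three_wide = cohort_minutes(64, 76, 3)    # about 28 minutes
print(round(avg, 1), round(serial, 1), round(three_wide, 1))
```

Under this assumption three concurrent renders land near 28 minutes; contention for VRAM or disk would push the real figure higher.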

Cloud equivalent for the same 76 renders, per-render video tools (Runway Gen-3, Sora, Veo, Kling) at a defensible $2–$15/render spread for an 87-second customized commercial — range depending on provider, length, and quality tier: $150 to $1,140 per cohort re-render. The spread is wide on purpose; pick a provider, pick a length, the bill lands somewhere in there. The point isn’t which end of the range — the point is that the range exists at all, every time.
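
The cohort range is straight multiplication of the per-render spread by the cohort size. A short sketch, with a hypothetical helper name, using only the figures quoted above:

```python
def cohort_cost_range(renders: int, low_per_render: float, high_per_render: float) -> tuple:
    """Return (low, high) total cost for a cohort at a per-render price spread."""
    return renders * low_per_render, renders * high_per_render

low, high = cohort_cost_range(76, 2, 15)
print(low, high)  # 152 1140
```

The low end comes out at $152, which the text rounds to $150.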

Methodology: 6 serial renders on one RTX 4090, Wan 2.2 + Flux + LoRA path, master stills + master VO held constant, contractor-name overlay swapped per run. Wall-clock measured end-to-end (queue submit to mp4 on disk). Cohort projection assumes the same per-render cost; real-world variance widens under prompt re-roll or asset re-bake. Cloud range cited from per-render and per-credit pricing tiers; failed-render charges excluded for the low end.

Cloud-impossible math

Why this stack can’t live on a render farm

Run this nine-layer stack on a cloud render farm and you pay three meters at once: GPU-hour for image + video + lipsync, per-second for any cloud video model in the chain, and a per-render fee for voice. A creator who renders 30 spokesperson clips a month on a managed cloud avatar service is looking at $30 to $61 a month on Synthesia or HeyGen — capped at 10 to 30 minutes of video output. Hit the cap on draft iterations alone and you’re paying for failed renders.

The same 30 clips on Hybrig: roughly $10/month in electricity (1.75 kWh on a 4090 at $0.15/kWh, rounded up for power-supply overhead and AC offset). No cap. No per-render meter. Failed renders are free: the GPU just runs again. After the GPU is paid off, drafts are free forever.
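
The parenthetical above does not say what the 1.75 kWh covers. The sketch below ASSUMES it is energy per clip, which is the reading that lands closest to the quoted monthly figure; treat that as a guess, not a stated fact. The function name is hypothetical.

```python
def monthly_electricity_cost(clips: int, kwh_per_clip: float, usd_per_kwh: float) -> float:
    """Electricity cost for a month of renders, before any overhead rounding."""
    return clips * kwh_per_clip * usd_per_kwh

# Assumption: 1.75 kWh is per clip (not stated on this page).
cost = monthly_electricity_cost(30, 1.75, 0.15)
print(round(cost, 2))
```

Under that assumption the raw figure is a little under $8, which is consistent with the page's "rounded up" total of about $10 once power-supply overhead and AC offset are added.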

The other half of the math: cloud avatar services increasingly refuse real-person likeness via content policy. Seedance, HeyGen, Synthesia all have human-image guardrails that flag user-uploaded faces. The local stack doesn’t — your face, your LoRA, your rig, your call. THAT’S the part you can’t buy on a render farm at any price.

Cost numbers cross-checked against /for-creators and /for-farms. Run your own scenario at /calculator.