Model transparency

Every model. What it does, what it can’t, when to use it.

Cloud avatar tools hide their model behind a single “Generate” button. You never learn what your tool can or can’t do — until it fails on the shot you needed. Hybrig is the inverse bet: every model on one rig, every limit out in the open.

Below: the seven image models, seven video models (five local, two cloud), five identity / conditioning layers, and three voice paths Hybrig knows about. Each entry lists what it’s best for, what it fails at, and which other layers it pairs with. No marketing fluff: if a model is mediocre at hands, this page says so.

Image generation

Seven still-image models, all local

Every still-image generator on Hybrig runs on your own GPU. There is deliberately no cloud image-gen entry. If you want cloud image work, the wardrobe-lock and inpainting tools handle specific edits, but free-form generation stays on your rig. Default is Flux dev.

  • Flux dev (fp8)

    local

    Black Forest Labs' general-purpose photoreal model. The default for real-people stills on Hybrig. ~30 sec per image on a 4090, ~12 GB GPU memory.

    Best for
    • Photoreal portraits of real people
    • Brand stills, product shots, lifestyle frames
    • LoRA training base — biggest LoRA ecosystem in the catalog
    Fails at
    • Stylized / painterly / anime — Flux is photoreal-first; for cartoons use SDXL+Pony or Illustrious
    • Low-VRAM rigs under 12 GB — won't fit; drop to PixArt-Σ instead
    • Text rendering inside the image is decent but not pixel-perfect
    Pairs with
    • Flux dev + a trained character LoRA = best identity lock on the platform
    • Flux dev + PuLID-Flux = single-photo identity conditioning when no LoRA exists yet
  • Flux schnell

    local

    Same Flux family as dev but distilled to 4 steps. ~8 sec per image. Slight quality dip, mostly invisible on simple shots.

    Best for
    • Drafts and prompt iteration
    • Style search before committing to a finished render
    • Anything where a 4x speedup matters more than the last 10% of detail
    Fails at
    • Detail-heavy finals — the 4-step distillation drops fine texture
    • Same 12 GB GPU memory floor as dev — schnell isn't the low-VRAM option
    Pairs with
    • Flux schnell for ideation, then re-render the keeper on Flux dev
  • SDXL + Pony Diffusion v6

    local

    SDXL with the Pony v6 fine-tune. The standard for stylized, painterly, cartoon-adjacent work. ~6.5 GB checkpoint, ~10 GB GPU memory, ~18 sec per image.

    Best for
    • Stylized portraits, painterly compositions, concept art
    • Huge community LoRA library — character + style
    • Faster than Flux when stylization matters more than photorealism
    Fails at
    • Photoreal faces — that is not Pony's job; use Flux instead
    • Default style is heavily NSFW-tilted; needs prompt discipline
    Pairs with
    • Pony + IPAdapter (SDXL) for reference-driven stylization
  • SDXL + Illustrious-XL

    local

    SDXL with the Illustrious-XL fine-tune. The standard for anime / illustration. Trained on Danbooru tags — responds to long comma-separated tag prompts.

    Best for
    • Anime portraits, illustration, manga panels
    • Tag-style prompting (Danbooru vocabulary)
    • Active community LoRA scene for anime characters
    Fails at
    • Photoreal — faces look painted
    • Prose prompts — it expects tags, not full sentences
  • Stable Diffusion 3.5 Large

    local

    Stability AI's flagship. Softer, more cinematic look than Flux. ~6 GB checkpoint, ~12 GB GPU memory, ~35 sec per image.

    Best for
    • Cinematic / painterly stills
    • Editorial illustration, stylized portraits
    • Single-file checkpoint — simpler install than Flux
    Fails at
    • Photoreal faces lose a touch of detail vs. Flux
    • Smaller LoRA ecosystem than Flux or SDXL
  • HiDream-I1 (Full)

    local

    2025 photoreal model competing with Flux on quality. ~17 GB on disk, ~16 GB GPU memory, ~45 sec per image. Smaller LoRA library because it's newer.

    Best for
    • Photoreal portraits where Flux occasionally falls short on fine detail
    • Detail-critical product shots
    Fails at
    • Heavy footprint — 17 GB on disk, 16 GB GPU memory floor
    • Newer model — community workflows still settling, fewer LoRAs
    • Currently disabled in Hybrig until a dedicated loader is wired
  • PixArt-Σ

    local

    Tiny, fast, prompt-following model. ~2.5 GB on disk, ~6 GB GPU memory. Good for drafts and weak GPUs.

    Best for
    • Quick drafts and prompt iteration
    • Weak / low-VRAM GPUs (6 GB minimum)
    • Strong prompt-following — does what you asked
    Fails at
    • Visible quality gap vs. Flux / SD 3.5
    • Faces and hands degrade fast on complex shots
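The GPU-memory figures quoted in the entries above can be turned into a simple pre-flight check. What follows is a minimal sketch, not Hybrig's own code: the `IMAGE_MODEL_VRAM_GB` table and the `models_that_fit` helper are hypothetical names, and the numbers are the approximate floors from this page. Illustrious-XL is left out because no figure is quoted for it.

```python
# Approximate GPU-memory floors (GB) as quoted on this page.
# Illustrative only: real usage varies with resolution, precision, and LoRAs.
IMAGE_MODEL_VRAM_GB = {
    "Flux dev (fp8)": 12,
    "Flux schnell": 12,
    "SDXL + Pony Diffusion v6": 10,
    "Stable Diffusion 3.5 Large": 12,
    "HiDream-I1 (Full)": 16,  # listed above as currently disabled in Hybrig
    "PixArt-Σ": 6,
}


def models_that_fit(vram_gb: float) -> list[str]:
    """Return the models whose quoted floor is at or below vram_gb."""
    return [name for name, floor in IMAGE_MODEL_VRAM_GB.items() if floor <= vram_gb]


if __name__ == "__main__":
    for vram in (6, 10, 12, 16):
        print(f"{vram} GB: {models_that_fit(vram)}")
```

A check like this only says whether a model is likely to load; it says nothing about speed or quality, which the entries above cover.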
Video generation

Five local video models, two cloud polish layers

Default is Wan 2.2: it runs on your GPU, free forever. The auto-fallback chain tries Wan 2.2 → Wan 2.1 → Hunyuan → LTX before reaching for cloud. Cloud has two roles, not one: Seedance for face lock + lip-sync rescue, Gemini Omni for physics + world-knowledge polish on hero beats. Both polish shots after the local farm has done the volume rendering; neither is the farm engine, because metered cloud billing breaks the unit economics of rendering at volume.
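The auto-fallback chain described above can be pictured as an ordered loop over local models. This is a sketch under stated assumptions, not Hybrig's implementation: `RenderError`, `render_with_fallback`, and the renderer callables are hypothetical, and only the model order comes from this page.

```python
from typing import Callable, Dict, Sequence, Tuple

# Order taken from the paragraph above. Everything else here is hypothetical.
LOCAL_CHAIN = ("Wan 2.2", "Wan 2.1", "HunyuanVideo", "LTX-Video")


class RenderError(Exception):
    """Raised by a renderer that could not produce the clip."""


def render_with_fallback(
    prompt: str,
    renderers: Dict[str, Callable[[str], str]],
    chain: Sequence[str] = LOCAL_CHAIN,
) -> Tuple[str, str]:
    """Try each local model in order and return (model_name, clip_path)."""
    errors = {}
    for name in chain:
        renderer = renderers.get(name)
        if renderer is None:  # model not installed on this rig
            continue
        try:
            return name, renderer(prompt)
        except RenderError as exc:
            errors[name] = str(exc)  # remember why, then try the next model
    # Only at this point would a caller consider a cloud model.
    raise RenderError(f"all local models failed or are missing: {errors}")
```

A caller would catch the final `RenderError` and decide whether a cloud render is worth paying for; that decision is left out of the sketch on purpose.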

  • Wan 2.2 (local)

    local
    VRAM: 16 GB · fits an RTX 4090 (24 GB)

    Alibaba's Wan 2.2. The default local video model on Hybrig. ~3-8 min per 5-second clip on a 4090. Free forever — runs on your own GPU.

    Face 7.8 · Motion 7.2 · Identity 8.1
    Best for
    • Spokesperson clips on your own GPU — no per-second meter
    • Privacy-required work — footage never leaves the machine
    • High-iteration drafting where re-takes are free
    Fails at
    • Demanding face-heavy shots vs. Seedance — still ~10-20% behind on tight close-ups
    • Hands and crowd scenes weaker than top cloud models
    • No native lip-sync — needs a separate lip-sync pass
    Pairs with
    • Wan 2.2 + a trained character LoRA = identity-locked video on free local
    • Wan 2.2 → cloud-burst Seedance fallback when 2.2 misses on a shot
  • Wan 2.1 (local)

    local
    VRAM: 16 GB · fits an RTX 4090 (24 GB)

    Older Wan generation, kept as a backup for when 2.2 misbehaves on your install. Same workflow, slightly weaker motion and identity. For standard image-to-video (one starting frame only), 2.1 still wins on quality; newer is not always better.

    Face 7.4 · Motion 6.5 · Identity 7.8
    Best for
    • Single-frame image-to-video where 2.1 still beats 2.2 in real-world tests on consumer GPUs
    • Fallback when Wan 2.2 chokes on a specific install
    • Same 16+ GB GPU memory profile, mature workflow
    Fails at
    • First/last frame interpolation: 2.1 doesn't have this capability; use Wan 2.2 FLF2V instead
    • ~70-80% of Seedance quality on demanding shots
    • Hands and complex motion noticeably weaker than 2.2
    • No native lip-sync
  • Wan 2.2 FLF2V (local)

    local

    The only local model with first/last frame interpolation: feed it two images, a starting frame and an ending frame, and it generates the video that morphs between them. It is the only way to do this locally for free; Wan 2.1, Hunyuan, and LTX do not have this capability. Runs ~5-8 min per 5-second clip on a 4090. Free forever.

    Best for
    • Weathering time-lapses — fresh shingle (year 0) → 25-year-aged shingle (year 25), morphed by the model
    • Before/after restoration shots — identical framing, only the surface changes
    • Age progression on locked portraits — same face, same camera, different years
    • Any morph where you want both ENDS of the clip locked to specific images
    Fails at
    • Single-frame image-to-video — when you only have a starting frame, Wan 2.1 wins on quality. Swap back.
    • Source images that don't match framing — both stills must have identical camera, crop, and subject position or the morph looks broken
    • Longer than 5s — caps at 5 seconds output; chain multiple clips for longer morphs
    • Wan 2.2 base model doesn't fit most consumer GPUs in full precision — we ship the Q5_K_M quantized build (compressed version that runs on smaller GPUs at slightly lower quality)
    Pairs with
    • Two Flux stills (year 0 fresh + year 25 aged) → Wan 2.2 FLF2V = local weathering time-lapse, no cloud, no per-second billing
    • When you only have ONE keyframe and need motion, swap back to Wan 2.1 — newer is not always better; pick by capability needed, not version number
  • HunyuanVideo (local)

    local
    VRAM: 16 GB · fits an RTX 4090 (24 GB)

    Tencent's HunyuanVideo. Strong on natural motion — body movement, physics, walking, gestures. Slower than Wan: 6-12 min per 5-second clip on a 4090. ~13 GB on disk.

    Face 7.0 · Motion 8.4 · Identity 7.0
    Best for
    • Action shots, walking, body physics, gesture-heavy scenes
    • When Wan's motion looks stiff and you need realistic limb movement
    Fails at
    • Slower than Wan — heavier model
    • Identity lock weaker than Wan on tight close-ups
    • No native lip-sync
    Pairs with
    • HunyuanVideo for motion + Wan 2.2 keyframe + identity-lock LoRA
  • LTX-Video (local)

    local
    VRAM: 12 GB · fits an RTX 4090 (24 GB)

    Lightricks' LTX-Video. Small, fast, lower quality. ~9 GB on disk, runs on as little as 12 GB GPU memory. ~1 min per 5-second clip on a 4090.

    Face 6.0 · Motion 6.0 · Identity 6.0
    Best for
    • Drafting motion direction before committing to a slow model
    • Weak GPUs that can't load Wan or Hunyuan
    • When you just need to see if the camera move idea works
    Fails at
    • Visibly lower quality than Wan 2.x or Hunyuan
    • Identity preservation weaker on close-ups
    • Less detailed motion — best for simple shots
  • Seedance 2.0 (cloud — rescue valve)

    cloud

    ByteDance's Seedance 2.0. Top-tier face lock + cinematic VFX. Cloud-only — billed per second of output. The rescue valve when local misses on a critical shot, not the default.

    Best for
    • Premium brand finals where face lock under heavy motion is non-negotiable
    • Cinematic VFX and lighting consistency
    • Lip-sync passes (native muxed audio support)
    Fails at
    • Mediocre on hands when the subject gestures
    • Backgrounds drift on clips longer than ~6 seconds
    • Per-second billing — failed renders still consume the credit
    Pairs with
    • Wan 2.2 for drafts + Seedance for the keeper finals
    • Seedance Standard for finals, Seedance Fast for A/B drafting
  • Gemini Omni (cloud — physics polish)

    cloud

    Google DeepMind's multimodal video model. Conversational editing, physics-aware motion, world-knowledge compositing. Subscription-gated through Google AI / Google Flow. Polish layer for specific hero beats, never the farm engine — cloud structurally can't farm.

    Best for
    • Hero shots that need physics simulation (water, gravity, kinetic objects) local models can't fake
    • Compositing beats that need world-knowledge context (a specific historical scene, a cultural reference)
    • Iterative conversational refinement on a SINGLE keeper after the local farm has settled the direction
    Fails at
    • Farming — Google AI subscription + per-render metering breaks unit economics at any scale beyond one-off hero shots
    • Privacy work — content leaves the machine
    • Access — gated behind Google AI subscription, region-limited, allowlist-controlled
    Pairs with
    • Local farm (Wan + LoRA) for the cohort + Omni polish on one or two hero beats per spot
    • Use AFTER the render direction is finalized locally — never as the drafting engine
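The Wan 2.2 FLF2V entry above caps output at 5 seconds and suggests chaining clips for longer morphs. One way to plan such a chain, sketched here with hypothetical helper names (`flf2v_segments`, `max_total_seconds`), is to pair each keyframe with the next so that every segment ends exactly where the following one starts.

```python
MAX_CLIP_SECONDS = 5  # per-clip cap quoted in the FLF2V entry


def flf2v_segments(keyframes: list[str]) -> list[tuple[str, str]]:
    """Pair consecutive keyframes into (first_frame, last_frame) render jobs."""
    if len(keyframes) < 2:
        raise ValueError("first/last frame interpolation needs at least two keyframes")
    # zip stops at the shorter input, so N keyframes give N - 1 segments.
    return list(zip(keyframes, keyframes[1:]))


def max_total_seconds(keyframes: list[str]) -> int:
    """Upper bound on the chained morph length, at 5 seconds per segment."""
    return len(flf2v_segments(keyframes)) * MAX_CLIP_SECONDS


if __name__ == "__main__":
    stills = ["year00.png", "year10.png", "year25.png"]  # hypothetical file names
    print(flf2v_segments(stills))
    print(max_total_seconds(stills), "seconds at most")
```

Because each segment reuses the previous segment's last frame as its first frame, the joins line up, provided every still shares the same camera, crop, and subject position, as the entry requires.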
Identity & conditioning

How identity locks across shots

The hardest problem in AI video isn't generating one good frame — it's generating the SAME face across fifty frames, twenty shots, six campaigns. Hybrig gives you five layers, ranked by lock strength: trained LoRA (strongest, slowest to set up) down through PuLID, Redux, IPAdapter, to wardrobe-lock (weakest, narrowest scope).

  • Per-character LoRA

    Train a small adapter (~50-200 MB .safetensors) on 15-30 reference photos of one person. The trained file lives on your drive forever — no subscription, no vendor lock-in.

    Best for
    • Strongest identity lock available on Hybrig — locks across many shots
    • Repeat characters across a series, a campaign, a multi-scene project
    • Portable: the .safetensors is a file you own
    Fails at
    • Needs 15-30 varied photos — different lighting, angles, expressions, wardrobe. 10 photos in one outfit collapses the LoRA to a generic face.
    • Watch for EXIF orientation: rotated photos train the LoRA on sideways faces. Hybrig auto-corrects, but check your inputs.
    • 1-2 hours of GPU time to train (one-time cost). Renting a SaaS face slot is faster up front, but you pay forever.
    Pairs with
    • LoRA + Flux dev = strongest still-image identity lock
    • LoRA + Wan 2.2 = identity-locked video on free local
    • LoRA + IPAdapter (SDXL) when the LoRA is on an SDXL base
  • PuLID-Flux

    Single-photo face-identity conditioning for Flux. Conditions on face geometry only — doesn't drag the reference's backdrop, pose, or clothing into the output. The scene prompt actually steers the render.

    Best for
    • Putting a face into a NEW scene from one reference photo (no LoRA needed)
    • Quick identity conditioning when you don't have 15-30 photos to train a LoRA
    • Recommended strength 0.9-1.0 (per the PuLID-Flux v0.9.1 README)
    Fails at
    • Identity lock not as tight as a trained LoRA across many shots
    • Flux-only — does not work on SDXL family or Wan video models
    • Requires the ComfyUI-PuLID-Flux-Enhanced custom node + InsightFace + EVA-CLIP weights
    Pairs with
    • PuLID-Flux + scene prompt at strength 0.9 = new-scene character generation from a single photo
    • PuLID-Flux takes precedence over Redux when both are available — PuLID locks identity, Redux pulls style
  • Flux Redux

    Black Forest Labs' style-and-composition conditioning for Flux. Stronger than IPAdapter for Flux because it ships native — no XLabs/PuLID custom node required. Runs on stock ComfyUI.

    Best for
    • Variations of an existing photo — same vibe, different angle
    • Style transfer when you have a reference image whose look you want to keep
    • Quick drop-in identity conditioning when PuLID isn't installed
    Fails at
    • Redux remakes the entire reference image — same backdrop, same pose, same clothes. Use it for variations of an existing photo, not for putting your face in a new place.
    • Won't respect the scene prompt as strongly as PuLID — the reference image's composition wins
    • Not a hard identity lock the way PuLID-Flux or a trained LoRA is
    Pairs with
    • Redux for style + a trained LoRA for identity = best of both
    • Skip Redux if you have PuLID — PuLID is the better identity path
  • IPAdapter (SDXL)

    Reference-image conditioning for the SDXL family (base, Pony, Illustrious). The current canonical file is ip-adapter-plus_sdxl_vit-h.safetensors via cubiq's ComfyUI_IPAdapter_plus.

    Best for
    • SDXL / Pony / Illustrious reference-image conditioning
    • Style + composition reference on stylized models
    • Adjustable strength via the IPAdapterAdvanced weight knob
    Fails at
    • SDXL-family only — Flux uses Redux or PuLID instead
    • Identity lock weaker than a trained LoRA on close-ups
    • Naming varies a lot — Hybrig probes ComfyUI's /object_info to find whatever's actually installed instead of hardcoding a filename
  • Wardrobe-lock (Gemini)

    cloud

    Google Gemini 2.5 Flash Image ("nano-banana") via Hybrig's image-edit provider. Person photo + outfit photo in, person wearing the new outfit out — face preserved.

    Best for
    • Swapping outfits without changing the face
    • Fast (single API call, ~10-20 sec) compared to re-rendering from scratch
    • Honest about being cloud — Gemini is the only model in this category that solves wardrobe swap reliably right now
    Fails at
    • Cloud-only — files leave the device for this one step (Gemini terms apply)
    • Outfit fidelity drops on heavily patterned or branded garments
    • Doesn't change pose or scene — wardrobe only
    Pairs with
    • Gemini wardrobe-edit → add the edited image to a LoRA training set when you want a new wardrobe permanently locked into the character
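The IPAdapter entry above says Hybrig probes ComfyUI's /object_info rather than hardcoding a filename. A rough sketch of that idea follows; it is not Hybrig's code. The helper names are hypothetical, and the node name `IPAdapterModelLoader`, its `ipadapter_file` input, and both combo-list response shapes handled here are assumptions about the installed ComfyUI and custom-node versions.

```python
import json
import re
from typing import List, Optional
from urllib.request import urlopen


def combo_choices(object_info: dict, node: str, field: str) -> List[str]:
    """Return the dropdown choices ComfyUI reports for one required node input."""
    spec = object_info.get(node, {}).get("input", {}).get("required", {}).get(field)
    if not spec:
        return []
    if isinstance(spec[0], list):  # assumed shape: [[choice, ...], {options}]
        return list(spec[0])
    if spec[0] == "COMBO" and len(spec) > 1 and isinstance(spec[1], dict):
        return list(spec[1].get("options", []))  # assumed alternative shape
    return []


def pick_ipadapter(files: List[str]) -> Optional[str]:
    """Prefer an SDXL 'plus' file, then any SDXL file, else None."""
    for pattern in (r"plus.*sdxl", r"sdxl"):
        for name in files:
            if re.search(pattern, name, re.IGNORECASE):
                return name
    return None


def probe_ipadapter(base_url: str = "http://127.0.0.1:8188") -> Optional[str]:
    """Ask a running ComfyUI server which IPAdapter files it can see."""
    with urlopen(f"{base_url}/object_info") as resp:
        info = json.load(resp)
    # Node and field names are assumptions; adjust to the installed node pack.
    return pick_ipadapter(combo_choices(info, "IPAdapterModelLoader", "ipadapter_file"))
```

Matching on a pattern instead of one exact filename is what lets the same code cope with renamed or differently packaged model files.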
Voice

Cloned, recorded, or local

Hybrig's voice path is honest: ElevenLabs is the production clone today (cloud, $22/mo), bring-your-own-audio is the free path if you'd rather record yourself, and local TTS (XTTS / Piper) is on the roadmap as the privacy-first replacement for ElevenLabs.

  • ElevenLabs (cloud)

    cloud

    Production voice-clone TTS path. Eleven Multilingual v2 by default. ~$22/mo subscription. Hybrig clones once from a 30-second sample, then synthesizes any script.

    Best for
    • Voice cloning from a 30-second reference recording
    • Multilingual scripts (v2 model)
    • Production-grade prosody and emotion
    Fails at
    • Cloud-only — script + voice sample upload to ElevenLabs servers
    • Subscription required ($22/mo at writing) — unlike the rest of Hybrig, this one isn't free
    • Failed renders still consume credits
    Pairs with
    • ElevenLabs voice clone → Wan 2.2 video + lip-sync pass = local-first spokesperson: the video renders free on your GPU, and only the voice is paid
  • Bring-your-own audio

    Skip TTS entirely. Upload a pre-recorded audio file and Hybrig muxes it onto the video. Authentic voice, no clone artifacts, no ElevenLabs cost.

    Best for
    • When you'd rather record yourself than clone yourself
    • Languages or accents the TTS doesn't nail
    • Zero-cost voice path — the upload is local, the mux is local
    Fails at
    • Requires you to actually record clean audio (Sony A7 IV's XLR or any USB mic does the job)
    • No clone — you have to record every script yourself
  • Local TTS (XTTS / Piper)

    roadmap

    Fully local voice synthesis. Not yet wired into Hybrig's pipeline — flagged here because James plans to add it. XTTS clones from a short reference; Piper is fast non-clone TTS.

    Best for
    • Privacy-required projects where the voice can't leave the machine
    • Offline workflows (planes, no-wifi shoots, Faraday environments)
    • Replacing the $22/mo ElevenLabs cost when quality is good enough
    Fails at
    • Not yet integrated — currently a roadmap item, not shippable
    • XTTS clone quality below ElevenLabs on prosody and emotion
    • Piper sounds robotic next to a clone — it's a non-clone TTS
Layer interaction

What stacks, what conflicts

Identity layers don’t all play nice together. Two layers conditioning the same Flux pathway will fight; a Flux layer and an SDXL layer just don’t see each other. Below: which combinations stack cleanly, which conflict.

Layer | LoRA | PuLID-Flux | Flux Redux | IPAdapter (SDXL)
LoRA | - | stacks | stacks | stacks
PuLID-Flux | stacks | - | conflicts | -
Flux Redux | stacks | conflicts | - | -
IPAdapter (SDXL) | stacks | - | - | -
  • LoRA: LoRA is the strongest layer. Anything else stacks on top — LoRA owns identity, the other layer adds style or reference.
  • PuLID-Flux: PuLID and Redux both condition the same Flux pathway — Hybrig picks PuLID when both are available because PuLID respects the prompt; Redux drags the reference image's composition.
  • Flux Redux: Redux + LoRA stacks fine — Redux pulls style, LoRA holds identity. Redux + PuLID conflicts (same conditioning slot).
  • IPAdapter (SDXL): IPAdapter is SDXL-family only. It stacks with SDXL-base LoRAs but never sees a Flux model.
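The stacking rules above are small enough to encode as a lookup and check before a render is queued. This is an illustrative sketch with hypothetical names (`interaction`, `conflicts_in`), not Hybrig's own logic; pairs the table leaves empty, such as a Flux layer with an SDXL layer, return `None` here.

```python
from itertools import combinations
from typing import List, Optional, Tuple

STACKS, CONFLICTS = "stacks", "conflicts"

# Unordered pairs copied from the table above; unlisted pairs have no entry.
_PAIRS = {
    frozenset({"LoRA", "PuLID-Flux"}): STACKS,
    frozenset({"LoRA", "Flux Redux"}): STACKS,
    frozenset({"LoRA", "IPAdapter (SDXL)"}): STACKS,
    frozenset({"PuLID-Flux", "Flux Redux"}): CONFLICTS,
}


def interaction(a: str, b: str) -> Optional[str]:
    """Return 'stacks', 'conflicts', or None when the table has no entry."""
    return _PAIRS.get(frozenset({a, b}))


def conflicts_in(layers: List[str]) -> List[Tuple[str, str]]:
    """List every pair in the requested stack that the table marks as a conflict."""
    return [(a, b) for a, b in combinations(layers, 2) if interaction(a, b) == CONFLICTS]


if __name__ == "__main__":
    print(conflicts_in(["LoRA", "PuLID-Flux", "Flux Redux"]))
```

Using `frozenset` keys makes the lookup order-independent, which matches the symmetric table.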