Every model. What it does, what it can’t, when to use it.
Cloud avatar tools hide their model behind a single “Generate” button. You never learn what your tool can or can’t do — until it fails on the shot you needed. Hybrig is the inverse bet: every model on one rig, every limit out in the open.
Below: the seven image models, five video models, five identity / conditioning layers, and three voice paths Hybrig knows about. Each one with what it’s best for, what it fails at, and which other layers it pairs with. No marketing fluff — if a model is mediocre at hands, this page says so.
Seven still-image models, all local
Every still-image generator on Hybrig runs on your own GPU. There is deliberately no cloud image-gen entry: the wardrobe-lock and inpainting tools handle specific edits, but free-form generation stays on your rig. Default is Flux dev.
Flux dev (fp8)
local · Black Forest Labs' general-purpose photoreal model. The default for real-people stills on Hybrig. ~30 sec per image on a 4090, ~12 GB GPU memory.
Best for- Photoreal portraits of real people
- Brand stills, product shots, lifestyle frames
- LoRA training base — biggest LoRA ecosystem in the catalog
Fails at- Stylized / painterly / anime — Flux is photoreal-first; for cartoons use SDXL+Pony or Illustrious
- Low-VRAM rigs under 12 GB — won't fit; drop to PixArt-Σ instead
- Text rendering inside the image is decent but not pixel-perfect
Pairs with- Flux dev + a trained character LoRA = best identity lock on the platform
- Flux dev + PuLID-Flux = single-photo identity conditioning when no LoRA exists yet
Flux schnell
local · Same Flux family as dev but distilled to 4 steps. ~8 sec per image. Slight quality dip, mostly invisible on simple shots.
Best for- Drafts and prompt iteration
- Style search before committing to a finished render
- Anything where a 4x speedup matters more than the last 10% of detail
Fails at- Detail-heavy finals — the 4-step distillation drops fine texture
- Same 12 GB GPU memory floor as dev — schnell isn't the low-VRAM option
Pairs with- Flux schnell for ideation, then re-render the keeper on Flux dev
SDXL + Pony Diffusion v6
local · SDXL with the Pony v6 fine-tune. The standard for stylized, painterly, cartoon-adjacent work. ~6.5 GB checkpoint, ~10 GB GPU memory, ~18 sec per image.
Best for- Stylized portraits, painterly compositions, concept art
- Huge community LoRA library — character + style
- Faster than Flux when stylization matters more than photorealism
Fails at- Photoreal faces — that is not Pony's job; use Flux instead
- Default style is heavily NSFW-tilted; needs prompt discipline
Pairs with- Pony + IPAdapter (SDXL) for reference-driven stylization
SDXL + Illustrious-XL
local · SDXL with the Illustrious-XL fine-tune. The standard for anime / illustration. Trained on Danbooru tags, so it responds to long comma-separated tag prompts.
Best for- Anime portraits, illustration, manga panels
- Tag-style prompting (Danbooru vocabulary)
- Active community LoRA scene for anime characters
Fails at- Photoreal — faces look painted
- Prose prompts — it expects tags, not full sentences
Stable Diffusion 3.5 Large
local · Stability AI's flagship. Softer, more cinematic look than Flux. ~6 GB checkpoint, ~12 GB GPU memory, ~35 sec per image.
Best for- Cinematic / painterly stills
- Editorial illustration, stylized portraits
- Single-file checkpoint — simpler install than Flux
Fails at- Photoreal faces lose a touch of detail vs. Flux
- Smaller LoRA ecosystem than Flux or SDXL
HiDream-I1 (Full)
local · Photoreal model released in 2025, competing with Flux on quality. ~17 GB on disk, ~16 GB GPU memory, ~45 sec per image. Smaller LoRA library because it's newer.
Best for- Photoreal portraits where Flux occasionally falls short on fine detail
- Detail-critical product shots
Fails at- Heavy footprint — 17 GB on disk, 16 GB GPU memory floor
- Newer model — community workflows still settling, fewer LoRAs
- Currently disabled in Hybrig until a dedicated loader is wired
PixArt-Σ
local · Tiny, fast, prompt-following model. ~2.5 GB on disk, ~6 GB GPU memory. Good for drafts and weak GPUs.
Best for- Quick drafts and prompt iteration
- Weak / low-VRAM GPUs (6 GB minimum)
- Strong prompt-following — does what you asked
Fails at- Visible quality gap vs. Flux / SD 3.5
- Faces and hands degrade fast on complex shots
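The GPU memory floors quoted for the image models above can be turned into a quick "what fits my card" check. This is a minimal sketch, not Hybrig's actual selection logic: the `ImageModel` class and `models_that_fit` function are hypothetical names, the figures are copied from this page, and Illustrious-XL is left out because no GPU memory figure is given for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageModel:
    name: str
    min_vram_gb: int      # approximate GPU memory figure quoted on this page
    enabled: bool = True  # HiDream-I1 is listed as currently disabled


# Hypothetical catalog; numbers come from the entries above.
CATALOG = [
    ImageModel("Flux dev (fp8)", 12),
    ImageModel("Flux schnell", 12),
    ImageModel("SDXL + Pony Diffusion v6", 10),
    ImageModel("Stable Diffusion 3.5 Large", 12),
    ImageModel("HiDream-I1 (Full)", 16, enabled=False),
    ImageModel("PixArt-Σ", 6),
]


def models_that_fit(vram_gb: int) -> list[str]:
    """Return the names of enabled models whose quoted floor fits the card."""
    return [m.name for m in CATALOG if m.enabled and m.min_vram_gb <= vram_gb]


if __name__ == "__main__":
    for card_gb in (6, 10, 12, 24):
        print(card_gb, "GB:", models_that_fit(card_gb))
```

On a 6 GB card only PixArt-Σ passes the check, which matches the advice to drop to PixArt-Σ on low-VRAM rigs.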
Four local video models, two cloud polish layers
Default is Wan 2.2 — runs on your GPU, free forever. The auto-fallback chain tries Wan 2.2 → Wan 2.1 → Hunyuan → LTX before reaching for cloud. Cloud has two roles, not one: Seedance for face lock + lip-sync rescue, Gemini Omni for physics + world-knowledge polish on hero beats. Both are polish-after-farming, never the farm engine — cloud structurally can't farm.
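The auto-fallback chain described above can be sketched as a simple ordered loop. This is an illustration under stated assumptions, not Hybrig's implementation: `render_with_fallback` and the `render` callable are hypothetical, and a failed render is assumed to show up either as a `RuntimeError` or as a `None` result.

```python
from typing import Callable, Optional, Sequence, Tuple

# Order taken from this page: Wan 2.2, then Wan 2.1, then Hunyuan, then LTX.
LOCAL_CHAIN = ("Wan 2.2", "Wan 2.1", "HunyuanVideo", "LTX-Video")


def render_with_fallback(
    render: Callable[[str], Optional[str]],
    chain: Sequence[str] = LOCAL_CHAIN,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each local model in order.

    `render` is a hypothetical callable that takes a model name and returns
    a clip path, or returns None / raises RuntimeError when the render fails.
    Returns (model_name, clip_path) for the first success, or (None, None)
    when every local model failed and a cloud rescue would be the next step.
    """
    for model in chain:
        try:
            clip = render(model)
        except RuntimeError:
            clip = None
        if clip is not None:
            return model, clip
    return None, None


if __name__ == "__main__":
    # Simulate Wan 2.2 failing on this install and Wan 2.1 succeeding.
    outcomes = {"Wan 2.2": None, "Wan 2.1": "clip_wan21.mp4"}
    print(render_with_fallback(outcomes.get))
```

Returning `(None, None)` rather than calling a cloud model directly keeps the decision to spend cloud credit explicit in the caller.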
Wan 2.2 (local)
local · needs 16 GB VRAM · fits an RTX 4090 (24 GB)
Alibaba's Wan 2.2. The default local video model on Hybrig. ~3-8 min per 5-second clip on a 4090. Free forever, since it runs on your own GPU.
Face 7.8 · Motion 7.2 · Identity 8.1
Best for- Spokesperson clips on your own GPU, with no per-second meter
- Privacy-required work — footage never leaves the machine
- High-iteration drafting where re-takes are free
Fails at- Demanding face-heavy shots vs. Seedance — still ~10-20% behind on tight close-ups
- Hands and crowd scenes weaker than top cloud models
- No native lip-sync — needs a separate lip-sync pass
Pairs with- Wan 2.2 + a trained character LoRA = identity-locked video on free local
- Wan 2.2 → cloud-burst Seedance fallback when 2.2 misses on a shot
Wan 2.1 (local)
local · needs 16 GB VRAM · fits an RTX 4090 (24 GB)
Older Wan generation kept around as a backup for when 2.2 misbehaves on your install. Same workflow, slightly weaker motion + identity. Also: when you're doing standard image-to-video (one starting frame only), 2.1 still wins on quality; newer is not always better.
Face 7.4 · Motion 6.5 · Identity 7.8
Best for- Single-frame image-to-video where 2.1 still beats 2.2 in real-world tests on consumer GPUs
- Fallback when Wan 2.2 chokes on a specific install
- Same 16+ GB GPU memory profile, mature workflow
Fails at- First/last frame interpolation — 2.1 simply doesn't have this capability, you must use Wan 2.2 FLF2V
- ~70-80% of Seedance quality on demanding shots
- Hands and complex motion noticeably weaker than 2.2
- No native lip-sync
Wan 2.2 FLF2V (local)
local · The only local model with first/last frame interpolation, and therefore the only way to do it locally for free: Wan 2.1, Hunyuan, and LTX do not have this capability. In plain English: feed it two images, a starting frame and an ending frame, and it generates the video that morphs between them. Runs ~5-8 min per 5-second clip on a 4090. Free forever.
Best for- Weathering time-lapses — fresh shingle (year 0) → 25-year-aged shingle (year 25), morphed by the model
- Before/after restoration shots — identical framing, only the surface changes
- Age progression on locked portraits — same face, same camera, different years
- Any morph where you want both ENDS of the clip locked to specific images
Fails at- Single-frame image-to-video — when you only have a starting frame, Wan 2.1 wins on quality. Swap back.
- Source images that don't match framing — both stills must have identical camera, crop, and subject position or the morph looks broken
- Longer than 5s — caps at 5 seconds output; chain multiple clips for longer morphs
- Wan 2.2 base model doesn't fit most consumer GPUs in full precision — we ship the Q5_K_M quantized build (compressed version that runs on smaller GPUs at slightly lower quality)
Pairs with- Two Flux stills (year 0 fresh + year 25 aged) → Wan 2.2 FLF2V = local weathering time-lapse, no cloud, no per-second billing
- When you only have ONE keyframe and need motion, swap back to Wan 2.1 — newer is not always better; pick by capability needed, not version number
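The "pick by capability, not version number" rule, together with the 5-second cap, can be written down as two small helpers. This is a hedged sketch: `pick_wan_model`, `flf2v_segments`, and `keyframes_needed` are hypothetical names, and the keyframe count assumes adjacent clips share the still at their boundary when you chain them.

```python
import math

MAX_FLF2V_SECONDS = 5  # output cap quoted above


def pick_wan_model(has_start_frame: bool, has_end_frame: bool) -> str:
    """Choose between Wan 2.1 and Wan 2.2 FLF2V by the keyframes you hold."""
    if has_start_frame and has_end_frame:
        return "Wan 2.2 FLF2V"  # both ends locked: first/last frame morph
    if has_start_frame:
        return "Wan 2.1"        # single-frame image-to-video
    # Other cases (no start frame) are not covered by this section.
    raise ValueError("a starting frame is required for image-to-video")


def flf2v_segments(total_seconds: float) -> int:
    """Number of chained FLF2V clips needed to cover a longer morph."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return math.ceil(total_seconds / MAX_FLF2V_SECONDS)


def keyframes_needed(total_seconds: float) -> int:
    """Stills to prepare, assuming adjacent clips share a boundary frame."""
    return flf2v_segments(total_seconds) + 1


if __name__ == "__main__":
    print(pick_wan_model(True, True))   # Wan 2.2 FLF2V
    print(pick_wan_model(True, False))  # Wan 2.1
    print(flf2v_segments(12), keyframes_needed(12))  # 3 4
```

A 12-second weathering time-lapse would therefore need three chained clips and four matching stills, all with identical camera, crop, and subject position.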
HunyuanVideo (local)
local · needs 16 GB VRAM · fits an RTX 4090 (24 GB)
Tencent's HunyuanVideo. Strong on natural motion: body movement, physics, walking, gestures. Slower than Wan: 6-12 min per 5-second clip on a 4090. ~13 GB on disk.
Face 7.0 · Motion 8.4 · Identity 7.0
Best for- Action shots, walking, body physics, gesture-heavy scenes
- When Wan's motion looks stiff and you need realistic limb movement
Fails at- Slower than Wan — heavier model
- Identity lock weaker than Wan on tight close-ups
- No native lip-sync
Pairs with- HunyuanVideo for motion + Wan 2.2 keyframe + identity-lock LoRA
LTX-Video (local)
local · needs 12 GB VRAM · fits an RTX 4090 (24 GB)
Lightricks' LTX-Video. Small, fast, lower quality. ~9 GB on disk, runs on as little as 12 GB GPU memory. ~1 min per 5-second clip on a 4090.
Face 6.0 · Motion 6.0 · Identity 6.0
Best for- Drafting motion direction before committing to a slow model
- Weak GPUs that can't load Wan or Hunyuan
- When you just need to see if the camera move idea works
Fails at- Visibly lower quality than Wan 2.x or Hunyuan
- Identity preservation weaker on close-ups
- Less detailed motion — best for simple shots
Seedance 2.0 (cloud — rescue valve)
cloud · ByteDance's Seedance 2.0. Top-tier face lock + cinematic VFX. Cloud-only, billed per second of output. The rescue valve when local misses on a critical shot, not the default.
Best for- Premium brand finals where face lock under heavy motion is non-negotiable
- Cinematic VFX and lighting consistency
- Lip-sync passes (native muxed audio support)
Fails at- Mediocre on hands when subject gestures
- Backgrounds drift on clips longer than ~6 seconds
- Per-second billing — failed renders still consume the credit
Pairs with- Wan 2.2 for drafts + Seedance for the keeper finals
- Seedance Standard for finals, Seedance Fast for A/B drafting
Gemini Omni (cloud — physics polish)
cloud · Google DeepMind's multimodal video model. Conversational editing, physics-aware motion, world-knowledge compositing. Subscription-gated through Google AI / Google Flow. Polish layer for specific hero beats, never the farm engine: cloud structurally can't farm.
Best for- Hero shots that need physics simulation (water, gravity, kinetic objects) local models can't fake
- Compositing beats that need world-knowledge context (a specific historical scene, a cultural reference)
- Iterative conversational refinement on a SINGLE keeper after the local farm has settled the direction
Fails at- Farming — Google AI subscription + per-render metering breaks unit economics at any scale beyond one-off hero shots
- Privacy work — content leaves the machine
- Access — gated behind Google AI subscription, region-limited, allowlist-controlled
Pairs with- Local farm (Wan + LoRA) for the cohort + Omni polish on one or two hero beats per spot
- Use AFTER the render direction is finalized locally — never as the drafting engine
How identity locks across shots
The hardest problem in AI video isn't generating one good frame — it's generating the SAME face across fifty frames, twenty shots, six campaigns. Hybrig gives you five layers, ranked by lock strength: trained LoRA (strongest, slowest to set up) down through PuLID, Redux, IPAdapter, to wardrobe-lock (weakest, narrowest scope).
Per-character LoRA
Train a small adapter (~50-200 MB .safetensors) on 15-30 reference photos of one person. The trained file lives on your drive forever — no subscription, no vendor lock-in.
Best for- Strongest identity lock available on Hybrig — locks across many shots
- Repeat characters across a series, a campaign, a multi-scene project
- Portable: the .safetensors is a file you own
Fails at- Needs 15-30 varied photos: different lighting, angles, expressions, wardrobe. Ten photos in one outfit collapse the LoRA into a generic face.
- Watch for EXIF orientation: rotated photos train the LoRA on sideways faces. Hybrig auto-corrects, but check your inputs.
- 1-2 hours of GPU time to train (one-time cost). Renting a SaaS face slot is faster up front, but you pay forever.
Pairs with- LoRA + Flux dev = strongest still-image identity lock
- LoRA + Wan 2.2 = identity-locked video on free local
- LoRA + IPAdapter (SDXL) when the LoRA is on an SDXL base
PuLID-Flux
Single-photo face-identity conditioning for Flux. Conditions on face geometry only — doesn't drag the reference's backdrop, pose, or clothing into the output. The scene prompt actually steers the render.
Best for- Putting a face into a NEW scene from one reference photo (no LoRA needed)
- Quick identity conditioning when you don't have 15-30 photos to train a LoRA
- Recommended strength 0.9-1.0 (per the PuLID-Flux v0.9.1 README)
Fails at- Identity lock not as tight as a trained LoRA across many shots
- Flux-only — does not work on SDXL family or Wan video models
- Requires the ComfyUI-PuLID-Flux-Enhanced custom node + InsightFace + EVA-CLIP weights
Pairs with- PuLID-Flux + scene prompt at strength 0.9 = new-scene character generation from a single photo
- PuLID-Flux takes precedence over Redux when both are available — PuLID locks identity, Redux pulls style
Flux Redux
Black Forest Labs' style-and-composition conditioning for Flux. Stronger than IPAdapter for Flux because it ships native — no XLabs/PuLID custom node required. Runs on stock ComfyUI.
Best for- Variations of an existing photo — same vibe, different angle
- Style transfer when you have a reference image whose look you want to keep
- Quick drop-in identity conditioning when PuLID isn't installed
Fails at- Redux remakes the entire reference image — same backdrop, same pose, same clothes. Use it for variations of an existing photo, not for putting your face in a new place.
- Won't respect the scene prompt as strongly as PuLID — the reference image's composition wins
- Not a hard identity lock the way PuLID-Flux or a trained LoRA is
Pairs with- Redux for style + a trained LoRA for identity = best of both
- Skip Redux if you have PuLID — PuLID is the better identity path
IPAdapter (SDXL)
Reference-image conditioning for the SDXL family (base, Pony, Illustrious). The current canonical file is ip-adapter-plus_sdxl_vit-h.safetensors via cubiq's ComfyUI_IPAdapter_plus.
Best for- SDXL / Pony / Illustrious reference-image conditioning
- Style + composition reference on stylized models
- Adjustable strength via the IPAdapterAdvanced weight knob
Fails at- SDXL-family only — Flux uses Redux or PuLID instead
- Identity lock weaker than a trained LoRA on close-ups
- Naming varies a lot — Hybrig probes ComfyUI's /object_info to find whatever's actually installed instead of hardcoding a filename
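Probing `/object_info` instead of hardcoding a filename could look roughly like the sketch below. It is not Hybrig's code. The node name `IPAdapterModelLoader` and the input key `ipadapter_file` are assumptions about how ComfyUI_IPAdapter_plus exposes its model list, the two option layouts handled are the ones I am aware of, and the default address assumes a stock local ComfyUI on port 8188.

```python
import json
from urllib.request import urlopen


def fetch_object_info(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Download ComfyUI's node catalog (requires a running ComfyUI server)."""
    with urlopen(f"{base_url}/object_info") as resp:
        return json.load(resp)


def combo_options(spec) -> list:
    """Extract the option list from one input spec.

    Handles two layouts (assumed): [[opt, ...], ...] and
    ["COMBO", {"options": [opt, ...]}].
    """
    if not spec:
        return []
    first = spec[0]
    if isinstance(first, list):
        return list(first)
    if first == "COMBO" and len(spec) > 1 and isinstance(spec[1], dict):
        return list(spec[1].get("options", []))
    return []


def find_ipadapter_files(
    object_info: dict,
    node: str = "IPAdapterModelLoader",  # assumed node name
    field: str = "ipadapter_file",       # assumed input key
) -> list:
    """Return installed IPAdapter files whose name mentions SDXL."""
    inputs = object_info.get(node, {}).get("input", {})
    for group in ("required", "optional"):
        spec = inputs.get(group, {}).get(field)
        if spec is not None:
            return [f for f in combo_options(spec) if "sdxl" in f.lower()]
    return []


if __name__ == "__main__":
    sample = {
        "IPAdapterModelLoader": {
            "input": {"required": {"ipadapter_file": [[
                "ip-adapter-plus_sdxl_vit-h.safetensors",
                "ip-adapter_sd15.safetensors",
            ]]}}
        }
    }
    print(find_ipadapter_files(sample))
```

Filtering on the substring `sdxl` is a deliberately loose match, since the point of probing is that filenames vary between installs.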
Wardrobe-lock (Gemini)
cloud · Google Gemini 2.5 Flash Image ("nano-banana") via Hybrig's image-edit provider. Person photo + outfit photo in, person wearing the new outfit out, with the face preserved.
Best for- Swapping outfits without changing the face
- Fast (single API call, ~10-20 sec) compared to re-rendering from scratch
- Honest about being cloud — Gemini is the only model in this category that solves wardrobe swap reliably right now
Fails at- Cloud-only — files leave the device for this one step (Gemini terms apply)
- Outfit fidelity drops on heavily patterned or branded garments
- Doesn't change pose or scene — wardrobe only
Pairs with- Gemini wardrobe-edit → re-shoot the edited image into a LoRA training set when you want a new wardrobe permanently locked into the character
Cloned, recorded, or local
Hybrig's voice path is honest: ElevenLabs is the production clone today (cloud, $22/mo), bring-your-own-audio is the free path if you'd rather record yourself, and local TTS (XTTS / Piper) is on the roadmap as the privacy-first replacement for ElevenLabs.
ElevenLabs (cloud)
cloud · Production voice-clone TTS path. Eleven Multilingual v2 by default. ~$22/mo subscription. Hybrig clones once from a 30-second sample, then synthesizes any script.
Best for- Voice cloning from a 30-second reference recording
- Multilingual scripts (v2 model)
- Production-grade prosody and emotion
Fails at- Cloud-only — script + voice sample upload to ElevenLabs servers
- Subscription required ($22/mo at writing) — unlike the rest of Hybrig, this one isn't free
- Failed renders still consume credits
Pairs with- ElevenLabs voice clone → Wan 2.2 video + lip-sync pass = a local-first spokesperson: free local video, paid only for the voice
Bring-your-own audio
Skip TTS entirely. Upload a pre-recorded audio file and Hybrig muxes it onto the video. Authentic voice, no clone artifacts, no ElevenLabs cost.
Best for- When you'd rather record yourself than clone yourself
- Languages or accents the TTS doesn't nail
- Zero-cost voice path — the upload is local, the mux is local
Fails at- Requires you to actually record clean audio (Sony A7 IV's XLR or any USB mic does the job)
- No clone — you have to record every script yourself
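Muxing a recorded audio file onto a rendered clip can be done locally with ffmpeg. The sketch below is an assumption about how such a step might look, not a description of Hybrig's pipeline: `build_mux_command` and `mux` are hypothetical helpers, and ffmpeg must be installed and on the PATH for `mux` to run.

```python
import subprocess


def build_mux_command(video_path: str, audio_path: str, out_path: str) -> list:
    """Build an ffmpeg command that replaces the clip's audio track."""
    return [
        "ffmpeg", "-y",     # overwrite the output file if it exists
        "-i", video_path,   # input 0: the rendered video
        "-i", audio_path,   # input 1: your recording
        "-map", "0:v:0",    # take the video stream from input 0
        "-map", "1:a:0",    # take the audio stream from input 1
        "-c:v", "copy",     # do not re-encode the video
        "-c:a", "aac",      # encode audio to AAC for MP4 containers
        "-shortest",        # stop at the shorter of the two streams
        out_path,
    ]


def mux(video_path: str, audio_path: str, out_path: str) -> None:
    """Run the mux locally; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_mux_command(video_path, audio_path, out_path), check=True)


if __name__ == "__main__":
    print(" ".join(build_mux_command("clip.mp4", "voice.wav", "final.mp4")))
```

Copying the video stream keeps the render untouched, so only the audio is encoded during the mux.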
Local TTS (XTTS / Piper)
roadmap · Fully local voice synthesis. Not yet wired into Hybrig's pipeline; flagged here because James plans to add it. XTTS clones from a short reference; Piper is fast non-clone TTS.
Best for- Privacy-required projects where the voice can't leave the machine
- Offline workflows (planes, no-wifi shoots, Faraday environments)
- Replacing the $22/mo ElevenLabs cost when quality is good enough
Fails at- Not yet integrated — currently a roadmap item, not shippable
- XTTS clone quality below ElevenLabs on prosody and emotion
- Piper sounds robotic next to a clone — it's a non-clone TTS
What stacks, what conflicts
Identity layers don’t all play nice together. Two layers conditioning the same Flux pathway will fight; a Flux layer and an SDXL layer just don’t see each other. Below: which combinations stack cleanly, which conflict.
| Layer | LoRA | PuLID-Flux | Flux Redux | IPAdapter (SDXL) |
|---|---|---|---|---|
| LoRA | — | stacks | stacks | stacks |
| PuLID-Flux | stacks | — | conflicts | — |
| Flux Redux | stacks | conflicts | — | — |
| IPAdapter (SDXL) | stacks | — | — | — |
- LoRA: LoRA is the strongest layer. Anything else stacks on top — LoRA owns identity, the other layer adds style or reference.
- PuLID-Flux: PuLID and Redux both condition the same Flux pathway — Hybrig picks PuLID when both are available because PuLID respects the prompt; Redux drags the reference image's composition.
- Flux Redux: Redux + LoRA stacks fine — Redux pulls style, LoRA holds identity. Redux + PuLID conflicts (same conditioning slot).
- IPAdapter (SDXL): IPAdapter is SDXL-family only. It stacks with SDXL-base LoRAs but never sees a Flux model.
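The matrix and notes above can be encoded as a small lookup, which makes the PuLID-over-Redux rule explicit. This is a sketch under assumptions: `relation` and `resolve_layers` are hypothetical names, and pairs the table leaves blank (a Flux layer next to an SDXL layer) are reported as "no interaction".

```python
# Pairwise relations copied from the table above; order does not matter.
PAIRS = {
    frozenset({"LoRA", "PuLID-Flux"}): "stacks",
    frozenset({"LoRA", "Flux Redux"}): "stacks",
    frozenset({"LoRA", "IPAdapter (SDXL)"}): "stacks",
    frozenset({"PuLID-Flux", "Flux Redux"}): "conflicts",
}


def relation(a: str, b: str) -> str:
    """Return 'stacks', 'conflicts', or 'no interaction' for two layers."""
    if a == b:
        raise ValueError("compare two different layers")
    return PAIRS.get(frozenset({a, b}), "no interaction")


def resolve_layers(requested: list) -> list:
    """Drop Flux Redux when PuLID-Flux is also requested.

    Mirrors the stated rule that PuLID wins when both are available,
    because both condition the same Flux pathway.
    """
    layers = list(requested)
    if "PuLID-Flux" in layers and "Flux Redux" in layers:
        layers.remove("Flux Redux")
    return layers


if __name__ == "__main__":
    print(relation("Flux Redux", "PuLID-Flux"))
    print(relation("PuLID-Flux", "IPAdapter (SDXL)"))
    print(resolve_layers(["LoRA", "PuLID-Flux", "Flux Redux"]))
```

Using `frozenset` keys keeps the lookup symmetric, so each pair is stored once regardless of argument order.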