Intermediate · 7 min read · Lipsync

Why your Seedance lipsync drifts on long audio — and the partitioning fix

The model isn’t broken. The audio you handed it is too long, in the wrong container, in one piece. Four layers of fix — chop, wrap, run-one-at-a-time, stitch — ship by default in Hybrig’s Seedance path.

The symptom

You hand Seedance 2.0 a thirty-second voiceover and a face image and ask for a talking-head clip. The first few seconds look great. Then the lip-sync starts to slip. By ten seconds in, the mouth is moving but not in time with the words. By fifteen seconds, the model has quietly reshaped the tempo to suit its own internal pacing, and on singing inputs it will start hallucinating lyrics that aren’t in the source audio at all.

That’s not a quality problem. The model is excellent. The model is being asked to do something it isn’t built for.

Why it happens

Seedance is a video model, not an audio-to-video translator. When you hand it raw audio, the model has to invent a timeline for what it’s seeing: frame rate, scene cadence, where one beat ends and the next begins. The longer the clip, the more degrees of freedom it has, and the more its internal pacing drifts away from the actual audio waveform. Past roughly the ten-to-fifteen-second mark, the drift becomes obvious on screen.

Two craft details compound the problem.

One, raw audio files don’t carry the metadata Seedance uses to anchor itself. A video file does — frame rate, duration, container timing — even when the visuals are blank. The fix: wrap the audio inside a video container before submitting. None of this is in fal.ai’s documentation.

Two, the model performs best on shots under about ten seconds, and best of all on close framing where the mouth is the dominant feature. Hand it a wide shot of someone talking for thirty seconds and the drift accelerates. Hand it a tight close-up for eight seconds and the sync is locked.

Layer 1 — Chop along phrase boundaries

Don’t hand Seedance the whole voiceover. Cut it into segments of about eight seconds each. The cut has to land on a natural phrase boundary — the silence between a sentence and the next, never mid-word — or the stitched output sounds chopped.

Why eight seconds and not ten? Two reasons. First, Seedance reliably lipsyncs under ten; past that it’s a coin flip. Second, the last few seconds of any Seedance render tend to go janky — lip droop, identity drift, the model running out of conditioning. Eight seconds keeps you comfortably inside the clean window with headroom.

How Hybrig finds the boundaries

The chopper runs an ffmpeg silence-detect pass over your audio, finds every pause longer than about 200ms, and cuts at the longest silence inside each eight-second window. If your input has no usable silences — rare, but possible on rapid-fire reads — the chopper falls back to a hard cut at the eight-second mark and logs a warning.
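The boundary search described above can be sketched in Python. This is an illustration, not Hybrig's actual chopper: the function names, the -30dB noise threshold, and the choice to cut at the midpoint of the chosen silence are all assumptions. It parses the `silence_start` / `silence_end` lines that ffmpeg's `silencedetect` filter writes to stderr, then picks one cut per window.

```python
import logging
import re

# Log text is assumed to come from something like:
#   ffmpeg -i voiceover.wav -af silencedetect=noise=-30dB:d=0.2 -f null -
# (d=0.2 matches the ~200ms pause length; the -30dB threshold is a guess.)
SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log: str) -> list[tuple[float, float]]:
    """Pair up silence_start / silence_end lines into (start, end) tuples."""
    silences = []
    start = None
    for line in log.splitlines():
        match = SILENCE_START.search(line)
        if match:
            start = float(match.group(1))
            continue
        match = SILENCE_END.search(line)
        if match and start is not None:
            silences.append((start, float(match.group(1))))
            start = None
    return silences


def pick_cuts(silences, total, window=8.0):
    """Return cut times in seconds, one per window of at most `window` seconds."""
    cuts = []
    seg_start = 0.0
    while total - seg_start > window:
        limit = seg_start + window
        # Only silences that sit entirely inside the current window qualify.
        inside = [(s, e) for s, e in silences if s > seg_start and e <= limit]
        if inside:
            s, e = max(inside, key=lambda pair: pair[1] - pair[0])
            cut = (s + e) / 2  # assumption: cut in the middle of the pause
        else:
            logging.warning("no usable silence before %.1fs, hard cut", limit)
            cut = limit
        cuts.append(cut)
        seg_start = cut
    return cuts
```

Each new window starts at the previous cut rather than at a fixed multiple of eight, so no segment ever exceeds the window length.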

Layer 2 — Wrap each segment in a video container

Each chopped audio segment goes through an ffmpeg pass that mounts it onto a still image (your face reference, or a black frame) and writes an MP4 the same length as the segment, about eight seconds. The container supplies the frame rate, duration, and timing Seedance needs to anchor its timeline; the audio inside the container is the track it actually syncs to.
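A wrap like this can be expressed as a single ffmpeg invocation. The sketch below builds the argument list in Python; the function name, the 1280x720 frame size, the 25 fps rate, and the H.264/AAC codec choices are assumptions for illustration, not Hybrig's actual settings.

```python
def wrap_command(audio_path, out_path, still_path=None, size="1280x720", fps=25):
    """Build an ffmpeg command that muxes one audio segment onto a still frame."""
    if still_path is not None:
        # Loop a single image for as long as the audio lasts.
        video_in = ["-loop", "1", "-framerate", str(fps), "-i", still_path]
    else:
        # No face reference: generate a black frame with the lavfi color source.
        video_in = ["-f", "lavfi", "-i", f"color=c=black:s={size}:r={fps}"]
    return [
        "ffmpeg", "-y",
        *video_in,
        "-i", audio_path,
        "-c:v", "libx264", "-tune", "stillimage", "-pix_fmt", "yuv420p",
        "-c:a", "aac",
        "-shortest",  # the video input is endless, so stop when the audio ends
        out_path,
    ]
```

To run it, pass the list to `subprocess.run(wrap_command("seg_01.wav", "seg_01.mp4", "face.png"), check=True)`. Note that libx264 with `yuv420p` expects even pixel dimensions, so an odd-sized still may need scaling first.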

We picked ffmpeg over a Remotion composition deliberately for this step. The wrap is a still-image-plus-audio mux — no animation, no overlays, no conditional layout. ffmpeg does it in under four seconds per clip with one process and no Chrome boot. Remotion is the right tool when you need React-driven composition logic; for blank visual plus muxed audio, it would be overkill. The assembly step at the end of the pipeline is where Remotion earns its keep.

Layer 3 — Run Seedance one segment at a time

Hybrig submits one wrapped segment to Seedance, waits for the lipsynced clip to land, then submits the next. Never batches.

This isn’t a quality decision — it’s a money decision. Seedance is metered cloud compute. The fal queue has been known to silently park jobs and bill on retry; one bad batch fan-out can run a five-dollar render into a forty-dollar one. One job, watch for completion, next job. The whole pipeline is a sequence, not a parallel scatter.

Local renders fan out happily — your rig isn’t metered. Cloud renders go in single file.
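The single-file rule is simple enough to sketch. In the Python below, `render` is a hypothetical callable standing in for whatever submits one wrapped segment and blocks until the finished clip is available; it is not a real fal.ai or Hybrig API. The point is the shape: one job in flight, and a failure stops the queue instead of triggering retries.

```python
from typing import Callable, Iterable


def run_sequentially(segments: Iterable[str], render: Callable[[str], str]) -> list[str]:
    """Render segments strictly one after another and return results in order.

    `render` is a placeholder supplied by the caller: it takes a wrapped
    segment path, waits for the job to finish, and returns the clip path.
    """
    clips = []
    for segment in segments:
        # Blocks here until this job completes. If it raises, the exception
        # propagates and no later segment is ever submitted (or billed).
        clips.append(render(segment))
    return clips
```

Because nothing is submitted until the previous job returns, a parked or failed job costs at most one segment's worth of compute.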

Layer 4 — Stitch the segments back

The N lipsynced clips come back in order. A Remotion composition stitches them into one continuous timeline with the original master audio laid over the top (the same voiceover the chopper started with, end to end) and optional crossfades at the segment boundaries to soften the cuts.

Two practical reasons to lay the master audio over the assembled video, instead of just concatenating Seedance’s per-segment audio:

  • Seedance can subtly re-encode the audio on its side. The master VO you authored is always cleaner than the round-tripped version.
  • The lipsync model doesn’t need the original; it only needs the cadence. Putting the original back at assembly time gets you the best of both worlds — clean audio, locked mouth.
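For readers without Remotion, the same idea (join the clips, then swap in the master audio) can be approximated with plain ffmpeg. This is a swapped-in technique, not how Hybrig assembles: it uses ffmpeg's concat demuxer, does no crossfades, and assumes every returned clip shares the same codec and resolution so stream copy works. The function names and the `joined.mp4` intermediate file are hypothetical.

```python
def concat_list(clip_paths):
    """Return the text of a concat-demuxer list file, one clip per line."""
    lines = []
    for path in clip_paths:
        escaped = path.replace("'", "'\\''")  # escape single quotes in paths
        lines.append(f"file '{escaped}'\n")
    return "".join(lines)


def stitch_commands(list_path, master_audio, out_path, joined_path="joined.mp4"):
    """Build two ffmpeg commands: join the clips, then lay the master audio over them."""
    join = [
        "ffmpeg", "-y", "-f", "concat", "-safe", "0",
        "-i", list_path, "-c", "copy", joined_path,
    ]
    overlay = [
        "ffmpeg", "-y", "-i", joined_path, "-i", master_audio,
        "-map", "0:v:0",  # video from the joined clips
        "-map", "1:a:0",  # audio from the master voiceover
        "-c:v", "copy", "-c:a", "aac", "-shortest", out_path,
    ]
    return [join, overlay]
```

Write the output of `concat_list` to the list file, then run both commands in order with `subprocess.run(cmd, check=True)`. The per-segment audio Seedance returned is dropped by the second command's `-map` options.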

Where this lives in Hybrig

The pipeline is wired in two places, and you don’t have to pick.

Regular product UI. The /tools/lipsync page (and any spokesperson render that hits Seedance) auto-runs the chop-wrap-one-at-a-time-stitch flow when your audio is longer than the clean-window threshold. You upload a file, click the button, the pipeline handles it. The render page shows the segment count and which one is currently in flight so nothing feels like a black box.

Studio node graph. The same four steps exist as visible nodes in the /studio palette — audio chopper, ffmpeg audio-to-video wrap, Seedance segment runner, Remotion assembler. Drop them on the canvas and wire them by hand when you want explicit control over segment length, wrap visuals, or the assembly composition.

Both paths run the same underlying code. The graph is the transparency layer; the button is the convenience layer.

Doing it by hand (the manual workflow)

If you’re working outside Hybrig — running Seedance directly against fal.ai, or testing the technique in your own NLE — the manual version is doable. Tedious, but doable.

  1. Drop your voiceover on a timeline in Premiere, DaVinci, or your NLE of choice.
  2. Slice it into segments of roughly eight seconds each. Cut at natural phrase boundaries, never mid-word.
  3. For each segment: place a black solid (or a still of your subject’s face) on a video track of matching length. Export the result as MP4 with audio muxed in.
  4. Upload each segment’s MP4 to Seedance as the reference video. Include the segment’s exact words in the prompt, phrased as “The lyrics of the song are: <words>.”
  5. Submit one at a time. Wait for completion before submitting the next. Save the returned clips.
  6. Drop the returned clips back on your NLE timeline in order, lay your original master audio over them, optionally crossfade the cuts.

This is the workflow we’re replacing with one button. The manual path is here so the technique is understandable, not so you have to live with it.