Intermediate · 6 min read · Brain slot

When to use a local LLM and when to reach for cloud

Hybrig is hybrid. The LLM brain slot is a swappable part, the same way the render engine and the voice engine are. Local wins on most jobs. Cloud wins on a specific one. Here’s how to tell which is which.

The setup

Hybrig (Hybrid plus Rig) ships with a pluggable brain slot. You pick the LLM the same way you pick the render engine: local (Llama, Mistral, Qwen, or Gemma, run through a local runtime such as Ollama) or cloud (Claude, GPT). Cloud isn’t a compromise in this product. It’s a deliberate part of the stack.

The question is never “local vs cloud as ideology.” It’s “which tool wins for THIS specific job.” The render steps — Flux, F5-TTS, ComfyUI, Remotion — stay local. That’s where the unit-cost math destroys cloud economics, and where farming becomes possible. The LLM-as-rewriter step is the one place where the math flips for a specific kind of job.

Where local LLMs win

Local models in the 7B–13B range are excellent at a long list of jobs — and on your rig, every one of those jobs is real-time and free. Anything that fits this shape runs great:

  • Short text generation. Headlines, hooks, captions, alt-text. One clean prompt, one clean output.
  • Summarization. Long input, short output. The model isn’t fighting to remember ten different rules — it’s compressing.
  • Classification and tagging. Pick a label from a set, attach a tag, route a record. Small models nail these.
  • Basic Q&A on a known doc. Retrieval-augmented questions over your own knowledge base.
  • Voice transcription. Whisper-family models run on your GPU and never send a syllable out.
  • Embeddings. Semantic search, similarity ranking, vector indexing. Pure local territory.

Llama 8B, Mistral 7B, Qwen 2.5, Gemma — your rig handles them all in real time. Zero marginal cost. Your data never leaves the machine. This is the default and it covers most of what an LLM gets asked to do inside a creative pipeline.

Where local LLMs hit their limit — and why

There is one specific shape of LLM job where small local models materially under-perform, and it matters because it’s the shape that shows up in cohort farming with deep per-prospect personalization. Four reasons, in plain language.

Reason 1 — constraint-holding capacity

As a rough rule of thumb, the number of rules a model can hold in mind during a single generation grows with its parameter count. An 8B model can usually juggle two or three active constraints cleanly. A frontier cloud model holds ten or fifteen. When your sales-script rewriter has eight different “do not say X” rules per prospect plus structural cadence rules plus a locked closer line, the smaller model starts dropping constraints silently, and you don’t notice until you spot-check the output.
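Silent drops like these are easier to catch with an automated check than with spot-checks alone. The sketch below is a minimal illustration, not Hybrig’s actual QA step: `ScriptRules`, `find_violations`, and the sample phrases are hypothetical names invented here. It only catches literal phrase matches and an altered closer line. A script that implies a banned idea without using the banned words would still need a human or a second model pass.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptRules:
    """Hypothetical per-prospect rule set; field names are illustrative."""
    banned_phrases: list[str] = field(default_factory=list)
    locked_closer: str = ""


def find_violations(script: str, rules: ScriptRules) -> list[str]:
    """Return a readable list of rules the script appears to break."""
    violations = []
    lowered = script.lower()

    # Literal, case-insensitive match for each "do not say X" rule.
    for phrase in rules.banned_phrases:
        if phrase.lower() in lowered:
            violations.append(f"banned phrase present: {phrase!r}")

    # The locked closer must be the final non-empty line, unchanged.
    lines = [ln.strip() for ln in script.splitlines() if ln.strip()]
    last_line = lines[-1] if lines else ""
    if rules.locked_closer and last_line != rules.locked_closer.strip():
        violations.append("locked closer line missing or altered")

    return violations


rules = ScriptRules(
    banned_phrases=["owner Jonathan", "guaranteed results"],
    locked_closer="Reply YES and I'll send the walkthrough.",
)
good = "Saw your new storefront.\nReply YES and I'll send the walkthrough."
bad = "We promise guaranteed results.\nReply yes for more!"

print(find_violations(good, rules))
print(find_violations(bad, rules))
```

Running a check like this over every script in a cohort turns “you don’t notice until you spot-check” into a count you can read before anything ships.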

Reason 2 — negation handling

“Do NOT name Jonathan as the owner” is one of the hardest categories of instruction for small models. Negation is a well-documented weakness of sub-30B models — they hit some negations and miss others, with no warning sign in the output. Frontier models handle long negation chains far more reliably because their instruction-tuning data covered the pattern at much higher density.

Reason 3 — instruction-tuning depth

Frontier cloud models have been trained on orders of magnitude more multi-constraint instruction-following examples than the open-weights local models you can run today. Open-weights are catching up fast, but the data gap is real. Smaller models know the shape of a constrained response — they lose precision when pressure stacks up.

Reason 4 — long-context effective attention

Even when a local model advertises a long context window, its effective attention degrades across long structured inputs. A ten-field structured record per prospect, fed in as JSON or markdown, is exactly the kind of input where the model starts forgetting fields three and seven by the time it’s writing paragraph four. Frontier models are much better at keeping every field in play while they write, instead of just storing the context.
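One rough way to see the forgotten-field problem in your own outputs is to check which record values never made it into the script. This is a sketch under a strong simplifying assumption: that each field is meant to appear verbatim in the output, which will not hold for fields that only steer tone or structure. The function name `missing_fields` and the sample record are hypothetical.

```python
def missing_fields(record: dict[str, str], script: str) -> list[str]:
    """Return names of record fields whose values never appear in the script."""
    lowered = script.lower()
    return [
        name
        for name, value in record.items()
        if value and value.lower() not in lowered
    ]


# Illustrative three-field record; a real prospect record would be longer.
record = {
    "first_name": "Dana",
    "business": "Harbor Cycle",
    "city": "Portland",
}
script = "Hi Dana, I walked past Harbor Cycle last week."

print(missing_fields(record, script))  # ['city']
```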

The honest trade-off math

When you’re farming a cohort of one to two hundred prospects with watchOut-grade personalization, the math runs like this. The numbers below are illustrative, pulled from our own testing on cohort jobs the same week we wrote this. They are not a promise:

  • Cloud LLM call: pennies per prospect once prompt caching is on. A 184-prospect cohort lands at roughly $4–$9 in Claude usage, one time. Illustrative range; your prompt size and cache-hit ratio move it.
  • Local LLM call: zero marginal cost. But the failure rate on multi-constraint generation in our testing was roughly 15–30% on 8B–13B class models — meaning a meaningful slice of the cohort ships a script with a dropped constraint that you either catch in QA or fix manually.
  • The renders themselves stay local. Flux, F5-TTS, Remotion, ComfyUI. That’s where, in our testing, the cloud equivalent for a cohort of that size lands in a defensible $1,500–$3,000 range — and Hybrig moves it to approximately zero.
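The arithmetic behind those three bullets can be worked through directly. The sketch below uses only the illustrative figures quoted above (184 prospects, $4–$9 of cloud LLM usage, a 15–30% local failure rate, $1,500–$3,000 for all-cloud renders). The constant and function names are ours, and the results inherit the same caveat: illustrative, not promised.

```python
COHORT_SIZE = 184                    # prospects in the example cohort
CLOUD_LLM_COST = (4.0, 9.0)          # USD, whole cohort, low/high
LOCAL_FAILURE_RATE = (0.15, 0.30)    # share of scripts with a dropped constraint
CLOUD_RENDER_COST = (1500.0, 3000.0) # USD, all-cloud render equivalent


def per_prospect(total_range: tuple[float, float], n: int) -> tuple[float, float]:
    """Spread a low/high cohort total evenly across n prospects."""
    low, high = total_range
    return low / n, high / n


def expected_failures(rate_range: tuple[float, float], n: int) -> tuple[float, float]:
    """Expected number of scripts with a dropped constraint, low/high."""
    low, high = rate_range
    return low * n, high * n


llm_low, llm_high = per_prospect(CLOUD_LLM_COST, COHORT_SIZE)
fail_low, fail_high = expected_failures(LOCAL_FAILURE_RATE, COHORT_SIZE)

# LLM spend as a share of the all-cloud render bill (best and worst case).
share_low = CLOUD_LLM_COST[0] / CLOUD_RENDER_COST[1]
share_high = CLOUD_LLM_COST[1] / CLOUD_RENDER_COST[0]

print(f"Cloud LLM per prospect: ${llm_low:.3f} to ${llm_high:.3f}")
print(f"Local scripts needing a fix: {fail_low:.0f} to {fail_high:.0f}")
print(f"LLM spend vs cloud renders: {share_low:.2%} to {share_high:.2%}")
```

On these inputs the cloud rewriter works out to roughly 2 to 5 cents per prospect, while the local route leaves somewhere around 28 to 55 scripts to catch in QA or fix by hand.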

The trade, said plainly

Cloud LLM at the rewriter step in our $4–$9 illustrative range, on top of a render workload whose all-cloud equivalent would have been $1,500–$3,000, is a defensible trade. The farm is still local. The bottleneck step is the one place we spend.

Why we’re telling you this

Hybrig ships the receipts and lets the audience decide. When we tell you cloud is the right call for a step, that isn’t us pushing you toward spend — it’s us telling you what we’d do on our own cohort. Build the best thing for the least money. Share what works and what doesn’t.

If a future open-weights model in the 30B range handles eight-constraint negation chains cleanly, this lesson updates the week we re-test it. The brain slot stays pluggable so the recommendation can move when the evidence does.

The pluggable brain slot

Hybrig’s LLM slot accepts both local providers (Ollama, llama.cpp, LM Studio, running open-weights models such as Gemma) and cloud providers (Claude, GPT, others). The slot is swappable per task: you can run the summarizer on local Llama and the cohort rewriter on cloud Claude inside the same workflow. You pick. We tell you what we’d pick and why.
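Per-task swapping can be pictured as a small routing table. The sketch below is a conceptual illustration only, not Hybrig’s configuration format or API: `ROUTES`, `run_task`, and the two provider functions are hypothetical, and the providers are stand-ins that tag the prompt instead of calling a real model.

```python
from typing import Callable

# A provider is anything that takes a prompt and returns text.
Provider = Callable[[str], str]


def local_llama(prompt: str) -> str:
    """Stand-in for a call to a local runtime (hypothetical)."""
    return f"[local] {prompt}"


def cloud_claude(prompt: str) -> str:
    """Stand-in for a call to a cloud API (hypothetical)."""
    return f"[cloud] {prompt}"


# Task name -> provider. Local is the default; cloud is the exception.
ROUTES: dict[str, Provider] = {
    "summarize": local_llama,
    "classify": local_llama,
    "cohort_rewrite": cloud_claude,
}


def run_task(task: str, prompt: str, default: Provider = local_llama) -> str:
    """Send the prompt to the provider mapped to this task, else the default."""
    provider = ROUTES.get(task, default)
    return provider(prompt)


print(run_task("summarize", "Condense this brief."))
print(run_task("cohort_rewrite", "Rewrite for prospect 17."))
```

Keeping the mapping in one place is what lets a recommendation change later: moving the rewriter back to a local model would be a one-line edit to the table.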

The render engine is pluggable the same way. The voice engine too. Interchangeable parts, every layer. That’s the whole shape of the product.

Try it

Open the Studio canvas, drop an LLM node onto the graph, and look at the provider dropdown. Local first, cloud listed alongside. Switch between them on the same prompt and compare outputs — that’s the shortest path to feeling the difference for your specific job.