/local LLMs
The private AI coding box, why not just use XYZ?)
A common question is why run InferHaven instead of Ollama, or Continue, or Tabby, or a cloud IDE? Most of those things run on InferHaven. Here's the honest map of what we are, what we aren't, and when you should pick something else.
Common questions that come up regarding InferHaven are “how does it work” or “is it any good.” It’s some version of: why would I use this when I could just use XYZ? And XYZ is always something real and good — Ollama, Continue, Aider, Cursor, Tabby, a cloud IDE, the docker-compose file you already half-wrote one Saturday.
It’s a fair question, so I want to answer it properly. And the honest answer starts with a twist: pull up your list of XYZ’s. For most of them, the answer is “you already use XYZ — and they run on InferHaven.”
The tools aren’t the competition
Here’s the thing people miss when they line InferHaven up against Ollama or Continue. Those aren’t rivals we’re trying to beat. Ollama is the main inference engine inside InferHaven currently. Continue is one of the assistants we install and wire up for you. Aider, Cline, Goose, OpenCode, Avante, Cursor, Claude Code — they all speak the OpenAI protocol, and InferHaven hands them a local endpoint to speak it to.
So when someone asks “why not just use Continue?” — use Continue. Please. It’s great. The question that actually has teeth is the next one.
The real question: why not build it myself?
This is the honest competitor. Not a product — a weekend. You know the shape of it: install Ollama, pull a model, stand up a web UI, install the IDE extension, paste the localhost endpoint into its config, do that again for the second assistant, set up a reverse proxy so it’s reachable, bolt on HTTPS, sort out SSH keys, write a little script to back up your home directory, and — there, you have it. A private AI coding setup. It works. You built it. It’s yours.
And it is genuinely yours to maintain forever. The day a better model drops and you ollama pull it, you go re-edit every assistant’s config to point at the new name. The day code-server updates and the extension host falls over, that’s your evening. None of it is hard. All of it is a tax.
Build it yourself
- Wire Ollama, a UI, and each assistant's config by hand
- Re-point every assistant's endpoint on each new model
- Own the reverse proxy, HTTPS, SSH, and backups
- It works — but it's a project that's never quite done
Boot it whole
- docker compose up -d brings the entire stack up
- Assistants re-synced to your models on every pull
- Caddy HTTPS, key-only SSH, tmux, backups handled
- It's just Docker — the exit door is never locked
The whole pitch of InferHaven is that we did the weekend, and the boring part after the weekend — the keeping-it-in-tune part — so you don’t have to.
$ docker compose up -d
[+] Running 4/4
✓ ollama Started
✓ workspace Started
✓ code-server Started
✓ caddy Started
$ haven pull gemma4:12b
pulling manifest... downloading 7GB ✓
success
[InferHaven] Model gemma4:12b ready.
[03:00:32] [haven] aider: synced
[03:00:32] [haven] opencode: synced
[03:00:32] [haven] avante: synced
[03:00:32] [haven] pi: synced
[03:00:32] [haven] continue: synced
[03:00:32] [haven] qwencode: synced (merge)
[InferHaven] Auto-tuning for coding assistant use...
[InferHaven] Tuning gemma4:12b(family: generic)...
Uploading tuned Modelfile...
[InferHaven] 'gemma4:12b' context window optimised. Model unloaded — active on next chat.
Context window: set to 32768 tokens (was: unset)
Template and stop tokens: preserved from model defaults (unknown family).
Flash attention: on
KV cache quant: q8_0 (~50% less KV-cache VRAM)
# every installed assistant is already pointed at it. nothing to re-wire.
Where the batteries actually are
If there’s one thing I’d point to as the part that’s genuinely hard to reproduce by hand, it’s this: InferHaven installs and pre-configures up to ten coding assistants, and seven of them re-render their config automatically every time you pull a model — opencode, aider, qwencode, pi, goose, continue, and avante. Pull deepseek-coder-v2, and the next time you open any of them, it already knows the model exists and is pointed at it.
That sounds small. It’s the exact thing that rots in a DIY setup, because it’s not one config — it’s N configs, in N different formats, that all drift the moment your model list changes. Doing it once is easy. Keeping it true for multiple models and numerous assistant harnesses is the real work.
Anyone can assemble the stack once. The hard part is keeping seven assistants honest about your model list, forever. That’s the part we automated.
— The actual moat
When you should not use InferHaven
I’d rather tell you this than have you find out after. There are real cases where something else is the better call, and pretending otherwise would insult the kind of person who reads this far.
If what you want is code completion from a self-hosted model, full stop — you live in your own IDE, you just want good local autocomplete and nothing else — then Tabby is a focused, well-built tool that does exactly that. It’s a clean choice. InferHaven isn’t a better completion engine than Tabby; it’s a different thing entirely.
One clean job
- Self-hosted completion server, done well
- You bring your own IDE
- One assistant, one surface
- Great if autocomplete is the whole ask
The whole box
- Model server + workspace + browser IDE in one stack
- Your pick of assistants, wired up at once
- Switch between them, or run several
- Completion is one of many things it does
And if you’re a large org that wants a managed fleet of cloud dev environments today, with a platform team and an SSO mandate, the cloud-CDE players have a head start on that exact problem. InferHaven Cloud is coming for the part of it we care about — private, GPU-backed, no per-token meter — but I’m not going to pretend it’s shipping this afternoon.
About the models, honestly
The other thing worth saying plainly: for the genuinely hard reasoning problems, the frontier cloud models are still ahead of what you’ll run on a box under your desk. I’m not going to tell you a 14B local model matches the best hosted model on the nastiest refactor, because it doesn’t, and you’d catch me lying the first time you tried it.
But at least two things are true. For the actual texture of day-to-day work — completions, edits, test scaffolding, “rename this everywhere,” “explain this stack trace” — Qwen 2.5 Coder, DeepSeek, and the Llama coders on right-sized hardware are genuinely good now, and the gap narrows with every release. Also, InferHaven was never local-only.
So, why InferHaven?
Because the question was never really “InferHaven or Ollama.” It’s “wire it all together yourself and maintain it forever, or run one command and get the assembled, tuned, private version.” If you love building the stack by hand — genuinely, some people do, and I respect it — the repo is right there and you can read exactly how every piece fits. If you’d rather skip to the part where you’re writing code on your own hardware with your own models, that’s what we’re for.
The box is the box. Your code stays on it. Your tools point at it. Nothing leaves unless you say so.
$ git clone https://github.com/InferHaven/inferhaven-core
$ cd inferhaven-core
$ cp .env.example .env
$ docker compose up -d
$ open https://localhost && ssh haven@localhost
Float your boat up to the dock, clone the repo, and if you want the managed version when it’s ready, the waitlist on the homepage is the way in. The lighthouse is operating. The beacon is active.
— Ethan L.