> /inferhaven

A safe haven
for AI inference.

Your own private AI coding server — local models, your editor, your terminal, on hardware you control. No per-token meter, no code leaving your perimeter. Self-host it, or boot one in our cloud.

haven up /reserve your workstation try in codespaces →

FSL-1.1-Apache-2.0 · zero telemetry

/TWO JOURNEYS, ONE CHARTED COURSE

Choose your harbor.

Both stacks run the same workspace, the same models, the same tools. The difference is who owns the ship; either way, you're still the captain.

$0 forever

/core · self-host

Run it on your hardware.

The full self-hostable stack. Ollama, terminal-first workspace, code-server IDE, OpenAI-compatible API. Your code never leaves your machine.

10 coding assistants pre-configured
OpenAI-compatible API on localhost
Multi-user SSH + tmux workspace
GPU auto-detect (NVIDIA + AMD ROCm)

$ haven up view docs →

beta · reservations open

/cloud · personal workstation

A workstation that's yours, in the cloud.

A persistent private dev environment with secure access, durable storage, and optional GPU power on demand. Boots in seconds, picks up exactly where you left off — no session roulette, no token meter.

Persistent home directory + secrets
Boot in seconds, resume your tmux session
Optional GPU tiers — fair-use, no per-token bills
Bring your own keys on top, no lock-in (OpenAI, Anthropic, etc.)

/reserve your spot see pricing details

/WHY INFERHAVEN

Three things we will not compromise.

/privacy by default

Your code stays on your ship

Inference can run entirely on your hardware — local on Core, your private workstation on Cloud. Nothing leaves your perimeter unless you opt it in. Zero telemetry. No public listeners.

0 bytes shipped to vendor APIs unless you explicitly say so.

/YOUR TOOLS, YOUR METHODS

We're the box, not another tool in it.

OpenAI-compatible gateway on localhost. Point Cursor, Cline, Continue, Claude Code, Aider, OpenCode, Goose, Avante, Qwen Code, Pi — all at the same local models. InferHaven isn't competing with your assistant; it's the harbor every one of them docks in.

Up to 10 assistants installed and pre-pointed at your models; 7 re-synced automatically on every model pull.

/PERSISTENCE

Your workspace stays yours.

Flat monthly. Persistent home, persistent secrets, persistent shell sessions. No per-token meter, no session timeouts, no "your container was reclaimed" at 2am. The closest thing to owning the machine without buying the rack.

Subscription that feels like ownership — not a metered tab.

GitHub Copilot
Cursor
Claude Code
Cline
Continue
Aider
Zed
JetBrains AI
Cody
Tabby
Codex CLI
Goose
OpenCode
Avante.nvim
Crush
Plandex
Open WebUI
LibreChat
AnythingLLM
LobeChat
LangChain
LlamaIndex
Vercel AI SDK
openai-python
openai-node
LiteLLM
Qwen Code

/WORKS WITH

Not just another ship. The dock they all tie up to.

InferHaven speaks the OpenAI schema natively, so the IDE plugins, CLIs, and agents you already use talk to it with a one-line endpoint swap. It isn't competing for your editor — it's the harbor they all dock in: your models, your machine, every tool pointed at the same local gateway. Privacy by default, security baked in.

/swap one line

# point any OpenAI SDK at your local InferHaven
export OPENAI_BASE_URL="http://localhost/v1"
export OPENAI_API_KEY="haven"  # local server, any string works

+ any OpenAI-compatible client
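
For clients that don't use an SDK, the same swap can be sketched with nothing but the Python standard library. This is a minimal illustration, not InferHaven's own client: it assumes the gateway follows the standard OpenAI chat-completions request and response shape, and the helper names (`build_chat_request`, `first_reply`, `chat`) are made up for this example. The base URL, key, and model tag are the ones used elsewhere on this page.

```python
import json
import os
import urllib.request

# Values exported in the snippet above; the fallbacks mirror them.
BASE_URL = os.environ.get("OPENAI_BASE_URL", "http://localhost/v1")
API_KEY = os.environ.get("OPENAI_API_KEY", "haven")


def build_chat_request(prompt, model="qwen2.5-coder:14b",
                       base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) a POST to the chat-completions endpoint."""
    payload = {"model": model,
               "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat-completions JSON response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def chat(prompt, **kwargs):
    """Send the request; this needs a running server at base_url."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return first_reply(resp.read())
```

Calling `chat("hi")` against a running stack should return the model's reply as a plain string; pointing the same code at another OpenAI-compatible server only requires a different `base_url`.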

/WHERE IT FITS

InferHaven is the box, not another tool in it.

Most things you'd compare it to actually run on top of it. The only real question is whether you build the stack yourself, or boot it whole.

/THE STACK, ASSEMBLED

Inference runtime: Ollama out of the box, or swap in vLLM or LocalAI if you'd rather.
Private gateway: OpenAI-compatible API on localhost. Nothing listens on a public interface unless you add it yourself.
Workspace: SSH + mosh, tmux with auto-save, your dotfiles, root on your own machine.
Browser IDE: Full VS Code in the browser via code-server.
Coding assistants: Continue, Aider, Cline, Goose, OpenCode, Avante and more, installed and pre-pointed at your models.
Edge: Caddy reverse proxy with automatic HTTPS.
Fleet control plane: Optional. Provision and manage remote GPU boxes from one dashboard with InferHaven Cloud.

Each of these is something people self-host on its own. InferHaven is all of them — assembled, wired together, and kept in sync.

Roll your own

Build it yourself

Hand-wire Ollama, a web UI, and each assistant's config
Re-edit every assistant's endpoint on each new model
Own the HTTPS, SSH, tmux, and backups yourself
It's all yours to debug at 2am

InferHaven

Boot it whole

One command — docker compose up -d — and the whole stack is live
Assistants auto-pointed at your local models, re-synced on every pull
HTTPS, SSH, tmux, and backups handled for you
Same exit door as the front door — it's just Docker

read: the private AI coding box → view the source →

/SEE IT RUN

One command. The whole stack.

From git clone to the first model response in ~3 minutes. Point any OpenAI-compatible tool at localhost/v1. Your PC / laptop is the model server.

haven@laptop · ~/inferhaven-core

haven@laptop $ git clone https://github.com/InferHaven/inferhaven-core

Cloning into 'inferhaven-core'... done.

haven@laptop $ cd inferhaven-core && cp .env.example .env

haven@laptop $ docker compose build workspace

[+] Building workspace image locally... done ✓

haven@laptop $ docker compose up -d

[+] Running 4/4

✓ ollama Started

✓ workspace Started

✓ code-server Started

✓ caddy Started

haven@laptop $ ssh -p 2222 haven@localhost

Welcome to InferHaven · tmux session 'Haven' restored

haven@haven $ haven pull qwen2.5-coder:14b

pulling manifest... downloading 8.5GB ✓

haven@haven $ curl http://localhost/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"hi"}]}'

{"choices":[{"message":{"role":"assistant","content":"Hello, world."}}]}

haven@haven $ haven chat qwen2.5-coder:14b

> write a tiny haiku about safe harbors

Anchor holds the line,

lighthouse cuts the storm in two —

code sleeps in the bay.
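
A quick way to confirm which model ids the gateway will accept in the `model` field is to list them. The sketch below assumes the gateway also serves the standard OpenAI list-models route at `/v1/models`, which this page does not state explicitly; `models_url`, `model_ids`, and `list_models` are illustrative names, not part of InferHaven.

```python
import json
import urllib.request


def models_url(base_url="http://localhost/v1"):
    """URL of the OpenAI-style list-models route (assumed to be served)."""
    return base_url.rstrip("/") + "/models"


def model_ids(response_body):
    """Return the ids from an OpenAI-style list-models JSON response."""
    return [entry["id"] for entry in json.loads(response_body).get("data", [])]


def list_models(base_url="http://localhost/v1"):
    """Fetch and parse the model list; needs a running server."""
    with urllib.request.urlopen(models_url(base_url)) as resp:
        return model_ids(resp.read())
```

If the route is available, `list_models()` after the pull shown above would be expected to include the tag you pulled, which is the exact string to send in chat requests.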

/TRY IT

Three portholes, the same InferHaven behind each.

No account, no credit card, no email gate. Pick the path with the least friction and be running inference in under two minutes.

/fastest

GitHub Codespaces

One-click in your browser. Full devcontainer with Docker-in-Docker. ~90 seconds to a running InferHaven. Test demo models on Codespaces CPU (no GPU here).

▶ Launch in Codespaces

/local

Self-host locally

Clone the repo, run docker compose up — or reopen the same devcontainer in VS Code. Runs on your machine, uses your own GPU.

docker compose → VS Code devcontainer →

/soon

SSH demo

Anonymous read-only InferHaven you can SSH into for a quick poke. Coming soon!

join waitlist →

/RESERVE YOUR WORKSTATION

Be first to boot
your private workstation.

A persistent dev environment with optional GPU, in the region you choose. No credit card to reserve. We email you when your tier opens — founding members get our best early-adopter pricing, and we'll always give advance notice before any change.

See tentative tier breakdown ↓

Your workspace, your data — encrypted at rest, never used to train models
Bring your own LLM keys (OpenAI, Anthropic) for hybrid local + remote flows
Resume your session in seconds. No session lockouts.

/PRICING · TENTATIVE

Flat monthly. Like owning the machine.

Start free, then pick the GPU power you want. No per-token bills, no minute-by-minute rental — your workspace stays yours.

Pricing below is tentative — final tier names and numbers may shift before billing opens in beta. Reserve a spot for founding-member pricing: we'll honor the most generous rate we can sustain and always give advance notice before any change.

no charge until upgrade

FREE TRIAL

$0 14 days

Try a real workstation with a tracked GPU-hour budget. We verify your card to prevent abuse but never bill until you choose to upgrade. Convert any time to keep your work; let it expire and your workspace archives for 30 days.

Full Workspace + 7 GPU hours (tracked live)
GPU-hour meter visible in dashboard
25 GB temporary storage
Bring-your-own LLM keys
Upgrade keeps your environment intact

$ join waitlist

WORKSPACE

$19 / month

A persistent dev environment with secure remote access, durable storage, and a gateway for your own LLM keys. No GPU included.

Persistent home + secrets vault
SSH + browser IDE (code-server)
50 GB durable storage
Remote LLM gateway (your OpenAI / Anthropic keys)
Custom domain support

/reserve workspace

$29 / month

More room and a second seat for solo devs with bigger projects or small teams sharing a single environment. Same persistent dev workstation, more headroom.

Everything in base Workspace
100 GB durable storage
2 included seats
Higher LLM-gateway request budget
Priority support response

/reserve workspace+

GPU

$59 / month

Modest GPU access for local coding models, code assistants, embeddings, and occasional inference. Generous fair-use.

Everything in Workspace
Shared GPU pool — curated catalog models, no per-token bills, no session timeouts
7B–14B local models, common harnesses
Generous fair-use; we only throttle clearly non-interactive abuse
100 GB durable storage

/reserve GPU Personal

$149 / month

Stronger GPU access with priority scheduling for larger local models and heavier coding assistant workflows.

Everything in GPU Personal
Priority scheduling — higher-tier GPU
32B+ models, large context windows
250 GB durable storage
Multi-user workspace (up to 3 seats)

/reserve GPU Studio

DEDICATED

$399 / month

Reserved single-tenant GPU with 20–24 GB VRAM, your machine alone. A comfortable home for 14B local models, or 32B with Q4 quantization. No sharing, no queue, no per-token meter.

20–24 GB VRAM, single-tenant
No sharing, no queue, no per-token meter
Unlimited fair-use — your machine, your rules
500 GB durable storage, daily encrypted backups
Audit logs + RBAC + custom domain support

/reserve dedicated

$799 / month

Reserved single-tenant GPU with 48 GB VRAM. Run 32B at Q8, 70B at Q4, large context windows. The ceiling for serious local-model AI dev work.

Everything in base Dedicated
48 GB VRAM, single-tenant
32B at Q8, 70B at Q4 comfortable
1 TB durable storage, hourly snapshots
SSO + SOC 2 path + white-glove onboarding

/reserve dedicated pro

ENTERPRISE · your requirements

Custom / contact us

For teams with specific compliance, data-residency, or hardware needs — regional single-tenant placement, bare-metal isolation, 80 GB+ class GPUs, and negotiated SLAs. We scope it to you.

Regional data-residency + single-tenant isolation
80 GB+ class GPUs (A100 / H100 on request)
SSO, audit logs, RBAC, custom SLA
Dedicated onboarding + private support channel
Annual or committed-use pricing

/contact sales

BYO-COMPUTE · your cloud, our stack

$79 / server / month

For teams that already have AWS, GCP, Hetzner, or on-prem GPU capacity. We install our agent, run your InferHaven dashboard, monitor and update. You pay your own cloud bills direct — we just manage.

Run on your AWS / GCP / Hetzner / bare-metal
We install, configure, monitor, update
Full InferHaven dashboard + agent included
Same SLA-grade support as Dedicated
Billed flat per managed server

/contact for BYO setup

All cloud tiers include daily backups, encrypted storage, data residency in the region you pick (GPU regions roll out through beta), and direct support. Fair-use means no per-token bills and no session timeouts: generous, with throttling only for clearly non-interactive abuse. If you outgrow Studio, Dedicated is built for you. common questions ↓

/FAQ

The questions we get most.

What exactly do I get with InferHaven Cloud?

A persistent private workstation — like a dev machine in the cloud that's yours. SSH in, write code, run your tools, pick up the same tmux session tomorrow. Optional GPU tiers add local model inference on top.
How does the free trial work?

Sign up with no charge — we verify your card to prevent abuse but don't bill until you upgrade. You get 14 days with a full Workspace plus 7 GPU-hours of compute, tracked live in your dashboard so you always know what's left. Upgrade any time to convert your environment to a paid tier — your code, secrets, and dotfiles carry over untouched. If you don't upgrade by day 14, the workspace archives for 30 days; resume in that window or it's purged. Note: hourly GPU limits and trial allotments are tentative and may change before beta release.
Is the GPU unlimited?

Shared GPU tiers run on a pool with generous fair-use: no per-token bills and no session timeouts, unlike metered or session-limited services. We keep it fast by fair-scheduling under load, and only throttle workloads that are clearly non-interactive (24/7 max-throughput batch or training). Dedicated tiers are single-tenant: the GPU is yours alone, genuinely unlimited. The Workspace tier has no GPU; it's meant for people who bring their own cloud-model keys.
Am I billed per token, per minute, or per request?

No. Flat monthly subscription. You're paying for the workstation, not the compute meter. The only variable cost is anything you opt into yourself, like an OpenAI or Anthropic API key you choose to wire up.
What happens to my workspace if I pause my subscription?

Paid tiers: your home directory, secrets, and storage are preserved for 90 days while paused — life happens, and we don't want a hard month to cost you your environment. Resume anytime in that window and pick up exactly where you left off. Free trial workspaces archive for 30 days after expiry. After the retention window, data is purged unless you've exported it.
Where does my data live?

You pick the region at signup. Storage is encrypted at rest; the network is inside our VPC and never exposed to the public internet by default. GPU availability expands by region through beta, so some GPU tiers start in a subset of regions. Your code is yours — we don't train on it, scan it, or share it.
How is this different from Codespaces or another cloud IDE?

The IDE is just one surface. Underneath, you get a real persistent workstation: SSH, tmux, root, your dotfiles, your tools, your shell. You're not in someone else's sandbox — you're in your own machine.
Why not just run Ollama and Continue myself?

You absolutely can — and if you do, you've built the first two rooms of what InferHaven ships whole. We're Ollama plus a terminal-first workspace, a browser IDE, auto-HTTPS, SSH key management, and up to ten coding assistants installed and pre-pointed at your local models — re-synced automatically every time you pull a new one. The DIY stack is a weekend of wiring and a maintenance tab that never closes; this is `docker compose up -d`. Same parts, assembled, kept in tune.
How is this different from Tabby or a self-hosted Copilot?

Tabby and friends are excellent at one job: code completion from a self-hosted model. InferHaven isn't trying to be a better completion engine — it's the whole machine that engine runs on. You get the model server, the workspace, the IDE, and your pick of assistants (Continue, Aider, Cline, Goose, and more) wired up at once, so you can switch between them or run several at once. If all you want is autocomplete, Tabby is a clean choice. If you want a private coding box you live in, that's us.
Do local models actually keep up with the cloud ones?

Honestly? For the hardest reasoning, the frontier cloud models are still ahead — usually by a few months. But for day-to-day coding, Qwen 2.5 Coder, DeepSeek, and Llama on right-sized hardware are genuinely good, and the gap closes every release. And InferHaven isn't local-only: wire up your own Anthropic or OpenAI key and route the gnarly refactor to the cloud per-task, while everything sensitive stays on your box. You pick what leaves, and when.
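
The per-task split can be pictured as a small client-side rule. This is a hypothetical sketch, not a built-in InferHaven routing API: `Route` and `pick_route` are names invented here, the cloud model is a placeholder you would replace with one your key can access, and the policy (sensitive work always stays local) is just one reasonable choice.

```python
from dataclasses import dataclass

LOCAL_BASE_URL = "http://localhost/v1"        # the local gateway from this page
CLOUD_BASE_URL = "https://api.openai.com/v1"  # OpenAI's public API base URL


@dataclass(frozen=True)
class Route:
    base_url: str
    model: str


def pick_route(sensitive, hard_reasoning,
               local_model="qwen2.5-coder:14b",
               cloud_model="CLOUD_MODEL_PLACEHOLDER"):
    """Send a task to the cloud only if it is hard AND not sensitive."""
    if sensitive or not hard_reasoning:
        return Route(LOCAL_BASE_URL, local_model)
    return Route(CLOUD_BASE_URL, cloud_model)
```

With this rule, a gnarly refactor of non-sensitive code goes to the cloud endpoint, while anything flagged sensitive resolves to the local gateway regardless of difficulty.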
Can I run my own models, or do I have to use yours?

Your hardware tier, your call. Pull any model the GPU class supports (Qwen 2.5 Coder, Llama 3, DeepSeek, etc.). The OpenAI-compatible gateway speaks the same protocol, so every harness (Cursor, Claude Code, Continue, Aider, OpenCode, and more) works without reconfiguration.
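
Whether a model fits a GPU class can be estimated with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is a rule of thumb under stated assumptions, not InferHaven's sizing logic. It counts weights only, the 4 GB headroom for KV cache and runtime overhead is an assumed figure, and real Q4 formats typically use somewhat more than 4 bits per weight.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, headroom_gb=4.0):
    """True if weights plus an assumed headroom fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# The pairings quoted for the 48 GB tier on this page:
print(weights_gb(32, 8))   # 32B at Q8 -> 32.0
print(weights_gb(70, 4))   # 70B at Q4 -> 35.0
print(fits(70, 8, 48))     # 70B at Q8 does not fit -> False
```

By the same estimate, a 32B model at Q4 needs about 16 GB for weights, which is why it is quoted as workable on a 20–24 GB card, while larger context windows will eat into the headroom.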
What about the self-hostable version?

InferHaven Core is the same stack, self-hostable on any machine with Docker. The Cloud product is built ON Core — you can move between them without rewriting anything. Many users will run Core on a home GPU and the Cloud workstation for roaming or specific access needs.
Is InferHaven open source? What license is it under?

InferHaven Core ships under FSL-1.1-Apache-2.0 (Functional Source License). The code is on GitHub and you can read, run, modify, and self-host it today — but it's source-available, not OSI open source. Each release converts to Apache-2.0 automatically two years after publication. FSL restricts one thing: building a managed-hosting service that competes head-on with InferHaven Cloud during that 2-year window. Everything else — personal use, internal commercial use, modification, consulting — is permitted.
- Fair Source explainer (fsl.software) →
- Why we picked FSL →

/CHECK THE SHIP LOGS

What we're shipping.

all posts →

InferHaven's source code is live!

The private AI coding box (why not just use XYZ?)

Why InferHaven runs your AI offline by default
