> /inferhaven

A safe haven
for AI inference.

Your own private AI coding server: local models, your editor, your terminal, on hardware you control. No per-token meter, no code leaving your perimeter. Self-host it, or boot one in our cloud.

FSL-1.1-Apache-2.0 · zero telemetry

/TWO JOURNEYS, ONE CHARTED COURSE

Choose your harbor.

Both stacks run the same workspace, the same models, the same tools. The only difference is who owns the ship; either way, you're the captain.

$0 forever

/core · self-host

Run it on your hardware.

The full self-hostable stack. Ollama, terminal-first workspace, code-server IDE, OpenAI-compatible API. Your code never leaves your machine.

  • 10 coding assistants pre-configured
  • OpenAI-compatible API on localhost
  • Multi-user SSH + tmux workspace
  • GPU auto-detect (NVIDIA + AMD ROCm)

beta · reservations open

/cloud · personal workstation

A workstation that's yours, in the cloud.

A persistent private dev environment with secure access, durable storage, and optional GPU power on demand. Boots in seconds, picks up exactly where you left off — no session roulette, no token meter.

  • Persistent home directory + secrets
  • Boot in seconds, resume your tmux session
  • Optional GPU tiers — fair-use, no per-token bills
  • Bring your own keys on top (OpenAI, Anthropic, etc.), no lock-in

/WHY INFERHAVEN

Three things we will not compromise.

/privacy by default

Your code stays on your ship

Inference can run entirely on your hardware: local on Core, your private workstation on Cloud. Nothing leaves your perimeter unless you opt in. Zero telemetry. No public listeners.

0 bytes shipped to vendor APIs unless you explicitly say so.

/YOUR TOOLS, YOUR METHODS

We're the box, not another tool in it.

OpenAI-compatible gateway on localhost. Point Cursor, Cline, Continue, Claude Code, Aider, OpenCode, Goose, Avante, Qwen Code, and Pi at the same local models. InferHaven isn't competing with your assistant; it's the harbor every one of them docks in.

Up to 10 assistants installed and pre-pointed at your models; 7 re-synced automatically on every model pull.

/PERSISTENCE

Your workspace stays yours.

Flat monthly. Persistent home, persistent secrets, persistent shell sessions. No per-token meter, no session timeouts, no "your container was reclaimed" at 2am. The closest thing to owning the machine without buying the rack.

Subscription that feels like ownership — not a metered tab.

/WORKS WITH

Not just another ship. The dock they all tie up to.

InferHaven speaks the OpenAI schema natively, so the IDE plugins, CLIs, and agents you already use talk to it with a one-line endpoint swap. It isn't competing for your editor — it's the harbor they all dock in: your models, your machine, every tool pointed at the same local gateway. Privacy by default, security baked in.

/swap one line

# point any OpenAI SDK at your local InferHaven
export OPENAI_BASE_URL="http://localhost/v1"
export OPENAI_API_KEY="haven"  # local server, any string works
/any + OpenAI-compatible client
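
With those two variables exported, a request can be sketched in plain shell. This is a minimal sketch, not part of InferHaven: the helper names chat_payload and chat_once are hypothetical, the /chat/completions path comes from the terminal demo on this page, and the payload builder assumes a prompt without double quotes or backslashes.

```shell
# Fall back to the values shown above if the variables are not already set.
: "${OPENAI_BASE_URL:=http://localhost/v1}"
: "${OPENAI_API_KEY:=haven}"

# Print a minimal chat-completions JSON body.
# $1 = model tag, $2 = prompt (naive: no JSON escaping is done).
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}' "$1" "$2"
}

# POST one user message to the gateway named by OPENAI_BASE_URL.
chat_once() {
  curl -sS "$OPENAI_BASE_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d "$(chat_payload "$1" "$2")"
}
```

For example, chat_once qwen2.5-coder:14b "hi" would send a single message to a model you have already pulled.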

/WHERE IT FITS

InferHaven is the box, not another tool in it.

Most things you'd compare it to actually run on top of it. The only real question is whether you build the stack yourself, or boot it whole.

/THE STACK, ASSEMBLED

  • Inference runtime: Ollama out of the box, or swap in vLLM or LocalAI if you'd rather.
  • Private gateway: OpenAI-compatible API on localhost. Nothing listens on a public interface unless you add it yourself.
  • Workspace: SSH + mosh, tmux with auto-save, your dotfiles, root on your own machine.
  • Browser IDE: Full VS Code in the browser via code-server.
  • Coding assistants: Continue, Aider, Cline, Goose, OpenCode, Avante, and more, installed and pre-pointed at your models.
  • Edge: Caddy reverse proxy with automatic HTTPS.
  • Fleet control plane: Optional. Provision and manage remote GPU boxes from one dashboard with InferHaven Cloud.

Each of these is something people self-host on its own. InferHaven is all of them — assembled, wired together, and kept in sync.

Roll your own

Build it yourself

  • Hand-wire Ollama, a web UI, and each assistant's config
  • Re-edit every assistant's endpoint on each new model
  • Own the HTTPS, SSH, tmux, and backups yourself
  • It's all yours to debug at 2am

InferHaven

Boot it whole

  • One command — docker compose up -d — and the whole stack is live
  • Assistants auto-pointed at your local models, re-synced on every pull
  • HTTPS, SSH, tmux, and backups handled for you
  • Same exit door as the front door — it's just Docker

/SEE IT RUN

One command. The whole stack.

From git clone to the first model response in ~3 minutes. Point any OpenAI-compatible tool at localhost/v1. Your PC / laptop is the model server.

haven@laptop · ~/inferhaven-core
haven@laptop $ git clone https://github.com/InferHaven/inferhaven-core
Cloning into 'inferhaven-core'... done.
haven@laptop $ cd inferhaven-core && cp .env.example .env
haven@laptop $ docker compose build workspace
[+] Building workspace image locally... done ✓
haven@laptop $ docker compose up -d
[+] Running 4/4
✓ ollama Started
✓ workspace Started
✓ code-server Started
✓ caddy Started
haven@laptop $ ssh -p 2222 haven@localhost
Welcome to InferHaven · tmux session 'Haven' restored
haven@haven $ haven pull qwen2.5-coder:14b
pulling manifest... downloading 8.5GB ✓
haven@haven $ curl http://localhost/v1/chat/completions \
>   -H "Content-Type: application/json" \
>   -d '{"model":"qwen2.5-coder:14b","messages":[{"role":"user","content":"hi"}]}'
{"choices":[{"message":{"role":"assistant","content":"Hello, world."}}]}
haven@haven $ haven chat qwen2.5-coder:14b
> write a tiny haiku about safe harbors
Anchor holds the line,
lighthouse cuts the storm in two —
code sleeps in the bay.

/TRY IT

Three portholes, the same InferHaven behind each.

No account, no credit card, no email gate. Pick the path with the least friction and be inferencing in under two minutes.

/RESERVE YOUR WORKSTATION

Be first to boot
your private workstation.

A persistent dev environment with optional GPU, in the region you choose. No credit card to reserve. We email you when your tier opens — founding members get our best early-adopter pricing, and we'll always give advance notice before any change.

See tentative tier breakdown ↓

  • Your workspace, your data — encrypted at rest, never used to train models
  • Bring your own LLM keys (OpenAI, Anthropic) for hybrid local + remote flows
  • Resume your session in seconds. No session lockouts.

/PRICING · TENTATIVE

Flat monthly. Like owning the machine.

Start free, then pick the GPU power you want. No per-token bills, no minute-by-minute rental — your workspace stays yours.

Pricing below is tentative — final tier names and numbers may shift before billing opens in beta. Reserve a spot for founding-member pricing: we'll honor the most generous rate we can sustain and always give advance notice before any change.

no charge until upgrade
FREE TRIAL
$0 / 14 days

Try a real workstation with a tracked GPU-hour budget. We verify your card to prevent abuse but never bill until you choose to upgrade. Convert any time to keep your work; let it expire and your workspace archives for 30 days.

  • Full Workspace + 7 GPU hours (tracked live)
  • GPU-hour meter visible in dashboard
  • 25 GB temporary storage
  • Bring-your-own LLM keys
  • Upgrade keeps your environment intact
$ join waitlist

WORKSPACE
$19 / month

A persistent dev environment with secure remote access, durable storage, and a gateway for your own LLM keys. No GPU included.

  • Persistent home + secrets vault
  • SSH + browser IDE (code-server)
  • 50 GB durable storage
  • Remote LLM gateway (your OpenAI / Anthropic keys)
  • Custom domain support
/reserve workspace

$29 / month

More room and a second seat for solo devs with bigger projects or small teams sharing a single environment. Same persistent dev workstation, more headroom.

  • Everything in base Workspace
  • 100 GB durable storage
  • 2 included seats
  • Higher LLM-gateway request budget
  • Priority support response
/reserve workspace+

GPU
$59 / month

Modest GPU access for local coding models, code assistants, embeddings, and occasional inference. Generous fair-use.

  • Everything in Workspace
  • Shared GPU pool — curated catalog models, no per-token bills, no session timeouts
  • 7B–14B local models, common harnesses
  • Generous fair-use; we only throttle clearly non-interactive abuse
  • 100 GB durable storage
/reserve GPU Personal

$149 / month

Stronger GPU access with priority scheduling for larger local models and heavier coding assistant workflows.

  • Everything in GPU Personal
  • Priority scheduling — higher-tier GPU
  • 32B+ models, large context windows
  • 250 GB durable storage
  • Multi-user workspace (up to 3 seats)
/reserve GPU Studio

DEDICATED
$399 / month

Reserved single-tenant GPU with 20–24 GB VRAM, yours alone. A comfortable home for 14B local models, or 32B with Q4 quantization. No sharing, no queue, no per-token meter.

  • 20–24 GB VRAM, single-tenant
  • No sharing, no queue, no per-token meter
  • Unlimited fair-use — your machine, your rules
  • 500 GB durable storage, daily encrypted backups
  • Audit logs + RBAC + custom domain support
/reserve dedicated

$799 / month

Reserved single-tenant GPU with 48 GB VRAM. Run 32B at Q8, 70B at Q4, large context windows. The ceiling for serious local-model AI dev work.

  • Everything in base Dedicated
  • 48 GB VRAM, single-tenant
  • 32B at Q8, 70B at Q4 comfortable
  • 1 TB durable storage, hourly snapshots
  • SSO + SOC 2 path + white-glove onboarding
/reserve dedicated pro
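
The model sizes quoted for the Dedicated tiers follow from rough arithmetic: weight memory in GB is roughly parameters (in billions) times bits per weight, divided by 8. A back-of-the-envelope sketch, assuming exactly 8 and 4 bits per weight for Q8 and Q4 (real quantized files carry some overhead) and ignoring KV cache and runtime memory; weights_gb is a hypothetical helper, not an InferHaven command.

```shell
# Approximate weight footprint in GB (integer math, rounds down).
# $1 = parameters in billions, $2 = bits per weight.
weights_gb() {
  echo $(( $1 * $2 / 8 ))
}

weights_gb 32 4   # 32B at Q4 -> prints 16
weights_gb 32 8   # 32B at Q8 -> prints 32
weights_gb 70 4   # 70B at Q4 -> prints 35
```

By this estimate the 16 GB result sits under the 20–24 GB tier, and the 32 GB and 35 GB results sit under the 48 GB tier, with the remainder left for context.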

ENTERPRISE · your requirements
Custom / contact us

For teams with specific compliance, data-residency, or hardware needs — regional single-tenant placement, bare-metal isolation, 80 GB+ class GPUs, and negotiated SLAs. We scope it to you.

  • Regional data-residency + single-tenant isolation
  • 80 GB+ class GPUs (A100 / H100 on request)
  • SSO, audit logs, RBAC, custom SLA
  • Dedicated onboarding + private support channel
  • Annual or committed-use pricing
/contact sales

BYO-COMPUTE · your cloud, our stack
$79 / server / month

For teams that already have AWS, GCP, Hetzner, or on-prem GPU capacity. We install our agent, run your InferHaven dashboard, and handle monitoring and updates. You pay your own cloud bills directly; we just manage the stack.

  • Run on your AWS / GCP / Hetzner / bare-metal
  • We install, configure, monitor, update
  • Full InferHaven dashboard + agent included
  • Same SLA-grade support as Dedicated
  • Billed flat per managed server
/contact for BYO setup

All cloud tiers include daily backups, encrypted storage, and direct support. Your data stays in the region you pick (GPU regions roll out through beta). Fair-use means no per-token bills and no session timeouts: generous, with throttling only for clearly non-interactive abuse. If you outgrow Studio, Dedicated is built for you.

common questions ↓

/FAQ

The questions we get most.

  • What exactly do I get with InferHaven Cloud?

    A persistent private workstation — like a dev machine in the cloud that's yours. SSH in, write code, run your tools, pick up the same tmux session tomorrow. Optional GPU tiers add local model inference on top.

  • How does the free trial work?

    Sign up with no charge — we verify your card to prevent abuse but don't bill until you upgrade. You get 14 days with a full Workspace plus 7 GPU-hours of compute, tracked live in your dashboard so you always know what's left. Upgrade any time to convert your environment to a paid tier — your code, secrets, and dotfiles carry over untouched. If you don't upgrade by day 14, the workspace archives for 30 days; resume in that window or it's purged. Note: hourly GPU limits and trial allotments are tentative and may change before beta release.

  • Is the GPU unlimited?

    Shared GPU tiers run on a pool with generous fair-use: no per-token bills and no session timeouts, unlike metered or session-limited services. We keep it fast by fair-scheduling under load, and only throttle workloads that are clearly non-interactive (24/7 max-throughput batch or training). Dedicated tiers are single-tenant: the GPU is yours alone, genuinely unlimited. Workspace tiers include no GPU; they are meant for people who use cloud models through their own keys.

  • Am I billed per token, per minute, or per request?

    No. Flat monthly subscription. You're paying for the workstation, not the compute meter. The only variable cost is anything you opt into yourself, like an OpenAI or Anthropic API key you choose to wire up.

  • What happens to my workspace if I pause my subscription?

    Paid tiers: your home directory, secrets, and storage are preserved for 90 days while paused — life happens, and we don't want a hard month to cost you your environment. Resume anytime in that window and pick up exactly where you left off. Free trial workspaces archive for 30 days after expiry. After the retention window, data is purged unless you've exported it.

  • Where does my data live?

    You pick the region at signup. Storage is encrypted at rest, and the network sits inside our VPC and is not exposed to the public internet by default. GPU availability expands by region through beta, so some GPU tiers start in a subset of regions. Your code is yours: we don't train on it, scan it, or share it.

  • How is this different from Codespaces or another cloud IDE?

    The IDE is just one surface. Underneath, you get a real persistent workstation: SSH, tmux, root, your dotfiles, your tools, your shell. You're not in someone else's sandbox — you're in your own machine.

  • Why not just run Ollama and Continue myself?

    You absolutely can — and if you do, you've built the first two rooms of what InferHaven ships whole. We're Ollama plus a terminal-first workspace, a browser IDE, auto-HTTPS, SSH key management, and up to ten coding assistants installed and pre-pointed at your local models — re-synced automatically every time you pull a new one. The DIY stack is a weekend of wiring and a maintenance tab that never closes; this is `docker compose up -d`. Same parts, assembled, kept in tune.

  • How is this different from Tabby or a self-hosted Copilot?

    Tabby and friends are excellent at one job: code completion from a self-hosted model. InferHaven isn't trying to be a better completion engine; it's the whole machine that engine runs on. You get the model server, the workspace, the IDE, and your pick of assistants (Continue, Aider, Cline, Goose, and more) wired up together, so you can switch between them or run several at once. If all you want is autocomplete, Tabby is a clean choice. If you want a private coding box you live in, that's us.

  • Do local models actually keep up with the cloud ones?

    Honestly? For the hardest reasoning, the frontier cloud models are still ahead — usually by a few months. But for day-to-day coding, Qwen 2.5 Coder, DeepSeek, and Llama on right-sized hardware are genuinely good, and the gap closes every release. And InferHaven isn't local-only: wire up your own Anthropic or OpenAI key and route the gnarly refactor to the cloud per-task, while everything sensitive stays on your box. You pick what leaves, and when.

  • Can I run my own models, or do I have to use yours?

    Your hardware tier, your call. Pull any model the GPU class supports (Qwen 2.5 Coder, Llama 3, DeepSeek, etc.). The OpenAI-compatible gateway speaks the same protocol, so every harness (Cursor, Claude Code, Continue, Aider, OpenCode, and more) works without reconfiguration.

  • What about the self-hostable version?

    InferHaven Core is the same stack, self-hostable on any machine with Docker. The Cloud product is built on Core, so you can move between them without rewriting anything. Many users will run Core on a home GPU and use the Cloud workstation for roaming or specific access needs.

  • Is InferHaven open source? What license is it under?

    InferHaven Core ships under FSL-1.1-Apache-2.0 (Functional Source License). The code is on GitHub and you can read, run, modify, and self-host it today — but it's source-available, not OSI open source. Each release converts to Apache-2.0 automatically two years after publication. FSL restricts one thing: building a managed-hosting service that competes head-on with InferHaven Cloud during that 2-year window. Everything else — personal use, internal commercial use, modification, consulting — is permitted.