Serve local models on your GPU (Llama, Ollama, LM Studio)

If you have a GPU, you can serve open models straight off your own hardware — no provider account, no API key, no per-token cost to you. Halo already runs local models like llama3.2, qwen3, gemma3, phi3 and deepseek-r1 this way.

This page covers the local option from what to serve. Prefer to resell a provider API instead? See run an operator.

1. Run a local model server

Use either runtime — both expose an OpenAI-compatible endpoint the halo CLI speaks to:

Ollama: run ollama pull llama3.2, then ollama serve.
LM Studio — download a model in the app and start its local server.

Pull a model that fits your GPU (see sizing below), and make sure it responds locally before connecting Halo.
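One way to confirm the runtime responds before connecting Halo is to request its OpenAI-compatible model list. This is a minimal sketch that assumes each runtime's default port (11434 for Ollama, 1234 for LM Studio); base_url and check_runtime are hypothetical helper names for illustration, not part of the halo CLI.

```shell
#!/bin/sh
# Hypothetical helpers. Ports are the runtimes' defaults (assumed);
# adjust them if you changed the server configuration.

# Print the OpenAI-compatible base URL for a runtime name.
base_url() {
  case "$1" in
    ollama)   echo "http://localhost:11434/v1" ;;
    lmstudio) echo "http://localhost:1234/v1" ;;
    *)        echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}

# Request the model list and report whether the server answered.
check_runtime() {
  url="$(base_url "$1")" || return 1
  if curl -fsS --max-time 3 "$url/models" >/dev/null 2>&1; then
    echo "$1: responding at $url"
  else
    echo "$1: no response at $url"
    return 1
  fi
}
```

Run check_runtime ollama (or check_runtime lmstudio) after starting the server. A "no response" result usually means the server is not running or is listening on a different port.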

2. Point your operator at it

halo setup --provider ollama --flat 0.20    # or --provider lmstudio
halo serve

halo serve connects outbound to the relay over WebSocket, so you need no public URL and no open inbound port. It announces your local models and serves until stopped. Your operator wallet needs no pre-funding: USDC arrives at settlement, and Halo sponsors the gas.

Pricing local models

Local models have no upstream per-token price to mark up, so price them flat:

halo setup --provider ollama --flat <usd-per-1k-tokens>

--flat sets a fixed USD price per 1,000 tokens. Pick a number that beats the cloud APIs for the same model while still paying for your electricity and time. More on this in operator pricing & earnings.
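To get a floor for the flat price, you can estimate the electricity cost per 1,000 tokens. The sketch below is a rough calculation, not a Halo feature: cost_per_1k is a hypothetical helper, and the wattage, electricity rate, and throughput are inputs you measure yourself. It assumes the GPU generates continuously at that throughput, so idle time and hardware wear push the real cost higher.

```shell
#!/bin/sh
# Hypothetical helper: electricity cost in USD per 1,000 generated tokens.
# Usage: cost_per_1k WATTS USD_PER_KWH TOKENS_PER_SEC
cost_per_1k() {
  awk -v w="$1" -v p="$2" -v t="$3" 'BEGIN {
    usd_per_hour  = (w / 1000) * p     # kW drawn times price per kWh
    ktok_per_hour = t * 3600 / 1000    # thousands of tokens generated per hour
    printf "%.6f\n", usd_per_hour / ktok_per_hour
  }'
}

# Example: a 360 W draw at $0.20/kWh and 50 tokens/s.
cost_per_1k 360 0.20 50   # prints 0.000400
```

Any flat price above this figure covers electricity under those assumptions. How much margin to add for your time and hardware is your call.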

Rough GPU sizing

A model needs to fit in VRAM (quantized weights + context):

~8B models (Llama 3.1 8B, Qwen 8B) — comfortable on ~8–12 GB VRAM.
~4B and smaller (gemma3:4b, qwen3:4b, phi3) — run on modest cards.
30B+ — needs a high-end or multi-GPU setup.
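A back-of-the-envelope estimate follows from the rule above: weight memory is roughly the parameter count times bits per weight divided by 8, plus an allowance for context and runtime overhead. In this sketch est_vram_gb is a hypothetical helper and the overhead figure is an assumption you supply. Real quantization formats use somewhat more than their nominal bit width, and context memory grows with context length.

```shell
#!/bin/sh
# Hypothetical helper: rough VRAM estimate in GB.
# Usage: est_vram_gb PARAMS_BILLIONS BITS_PER_WEIGHT OVERHEAD_GB
est_vram_gb() {
  awk -v n="$1" -v b="$2" -v o="$3" 'BEGIN {
    weights_gb = n * b / 8    # billions of params times bytes per param
    printf "%.1f\n", weights_gb + o
  }'
}

# Example: an 8B model at 4-bit with an assumed 2 GB allowance for context.
est_vram_gb 8 4 2   # prints 6.0
```

Treat the result as a lower bound and leave headroom. The runtime's reported memory use after loading the model is the number to trust.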

Start with a small, popular model to prove the flow, then scale up.

Keep it running

A local operator earns only while it’s online, so run it as a service:

halo service install serve
halo service status serve

See keep your operator online for the full setup.