Consume inference from a local endpoint (CLI)
Use the halo CLI to run a local OpenAI-compatible endpoint that pays per request from your wallet — deposit once, bill actual token usage, with spend guards.
The consume role runs a local OpenAI-compatible endpoint that pays per request
from your wallet — so any OpenAI client gets inference with no provider API keys in
its code. This guide uses the halo CLI directly. Prefer to let an agent handle
it? See consume with your agent.
Halo is in alpha and runs on Base mainnet, so it spends real USDC. Requires Node.js 20+.
Install the CLI
bash <(curl -fsSL https://raw.githubusercontent.com/warden-protocol/run-halo/main/skill/scripts/install.sh)
halo doctor --json # node version, install + wallet state, provider, endpoint + relay health
Configure and run the endpoint
# 1. one-time: wallet + a persisted consumer profile so `consume` needs no flags.
# (setup wants a --provider slug even for pure consume; openai is a fine placeholder.)
halo setup --provider openai --consume --consume-model gpt-4o-mini \
  --consume-allow "gpt-4o-mini,meta-llama/llama-3.1-8b-instruct" \
  --consume-max-usdc 0.05 --consume-port 8799
# 2. fund the printed wallet with USDC on Base mainnet (this pays for inference),
# plus a little ETH on Base for the vault deposit gas.
# 3. run the endpoint. --vault bills actual token usage; --vault-deposit funds it
# and auto-refills mid-run so the endpoint never drops off the rail.
halo consume --vault --vault-deposit 5
# endpoint : http://127.0.0.1:8799/v1
Call it like any OpenAI endpoint
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8799/v1", api_key="halo")  # api_key unused unless --api-key set
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize Base mainnet in one sentence."}],
)
print(resp.choices[0].message.content)
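Because the endpoint is OpenAI-compatible, the same call can in principle be made without the openai package. The sketch below builds the equivalent raw HTTP request with the Python standard library. It assumes the standard OpenAI chat-completions path and response shape, and that the placeholder bearer token is ignored unless --api-key is set; `build_chat_request` and `send` are illustrative names, not part of Halo.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8799/v1"  # endpoint printed by `halo consume`


def build_chat_request(prompt: str, model: str = "gpt-4o-mini") -> urllib.request.Request:
    """Build (but do not send) a chat-completions POST for the local endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",  # assumed: standard OpenAI path
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder token, mirroring api_key="halo" in the client example.
            "Authorization": "Bearer halo",
        },
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the first choice's text (needs a running endpoint)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumed: the response follows the OpenAI chat-completions shape.
    return payload["choices"][0]["message"]["content"]
```

With the endpoint running, `send(build_chat_request("Summarize Base mainnet in one sentence."))` should return the same kind of answer as the client example. Note that there is still no provider key anywhere in the request.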
Billing and guardrails
Vault mode (--vault) is the recommended billing rail. Pair it with
--vault-deposit <usd> so the endpoint funds itself and auto-refills mid-run. It
bills the actual tokens each request used (deposit once, settle per real usage),
which matches how margin-priced operators charge, so you pay the real per-model
cost rather than a flat quote. Halo sponsors settlement gas; you pay gas only for
deposits and withdrawals.
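To make usage-based billing concrete, here is a small arithmetic sketch of how a per-request cost follows from token counts. The per-million-token prices are hypothetical placeholders, not real Halo or operator rates, and `request_cost_usdc` is an illustrative helper rather than anything the CLI exposes.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def request_cost_usdc(prompt_tokens, completion_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request when billed on the tokens it actually used."""
    return (
        Decimal(prompt_tokens) * input_price_per_m
        + Decimal(completion_tokens) * output_price_per_m
    ) / MILLION


# Hypothetical prices in USDC per million tokens; not real Halo or operator rates.
INPUT_PRICE = Decimal("0.15")
OUTPUT_PRICE = Decimal("0.60")

short_call = request_cost_usdc(1_200, 300, INPUT_PRICE, OUTPUT_PRICE)
long_call = request_cost_usdc(12_000, 3_000, INPUT_PRICE, OUTPUT_PRICE)
print(f"short call: {short_call} USDC")
print(f"long call:  {long_call} USDC")
```

Under these assumed prices the long call costs ten times the short one, because it used ten times the tokens. A flat per-request quote would charge both the same.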
Key guards:
- --max-usdc <n>: per-request ceiling.
- --budget-usdc <n>: cumulative cap for the run.
- --consume-allow: model allowlist.
- --confidential: route only to TEE operators and end-to-end-encrypt the prompt to the enclave.
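One way to reason about how the allowlist, the per-request ceiling, and the run budget combine is the toy model below. It is not Halo's implementation: `SpendGuard` and `admit` are hypothetical names, and the order of checks and the exact point at which Halo rejects a request are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import FrozenSet


@dataclass
class SpendGuard:
    """Toy model of the three spend guards; not Halo's actual logic."""

    max_usdc: Decimal          # per-request ceiling (like --max-usdc)
    budget_usdc: Decimal       # cumulative cap for the run (like --budget-usdc)
    allow: FrozenSet[str]      # model allowlist (like --consume-allow)
    spent: Decimal = Decimal("0")

    def admit(self, model: str, cost: Decimal) -> bool:
        """Return True and record the spend only if every guard passes."""
        if model not in self.allow:
            return False       # model is not on the allowlist
        if cost > self.max_usdc:
            return False       # single request exceeds the ceiling
        if self.spent + cost > self.budget_usdc:
            return False       # request would push the run past its budget
        self.spent += cost
        return True


guard = SpendGuard(
    max_usdc=Decimal("0.05"),
    budget_usdc=Decimal("0.08"),
    allow=frozenset({"gpt-4o-mini"}),
)
```

In this model a rejected request leaves `spent` unchanged, so one oversized request does not consume any of the run budget.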
Keep it always-on
Don't launch the daemon in the foreground under an agent or gateway: a gateway restart kills its child processes, which takes the endpoint down with it. Install it as an OS service instead:
halo service install consume -- --vault --vault-deposit 5
halo service status consume
halo service logs consume
Related
- Serve inference and earn: run an operator (CLI).
- Monitor from the web: pair with the dashboard.
- Full CLI reference: warden-protocol/run-halo.