OpenClaw, Four Days In: Taming the API Bill
In my last post, I mentioned burning through $12 in API costs in the first twelve hours of running OpenClaw. I treated it as a data point. It was actually a warning.
Four days later: $60. Running Claude Sonnet 4.6 with a 150,000 token context window, every Telegram message — including “what’s the weather?” — was hitting the API at premium rates. That’s $15 a day, which annualizes to over $5,000 for a personal assistant I wouldn’t trust my kindergartner with.
I needed a cost strategy before this became a habit.
Layer One: The Obvious Fixes
Before reaching for local inference, I stacked three changes that should be the first moves for anyone running OpenClaw on cloud APIs.
Model tiering. Switched the default from Sonnet to Haiku — roughly 10x cheaper per token. OpenClaw supports a fallback chain, so if Haiku hits a rate limit or fails, it escalates to Sonnet automatically. Most requests never need Sonnet.
Prompt caching. Enabled cacheRetention: "short" on both models. Anthropic cache reads cost about 90% less than fresh input tokens. If your conversations have any repetition in the system prompt or history — and OpenClaw’s do — this is nearly free savings.
Context window cap. Dropped from 150,000 to 40,000 tokens. Every message was sending the model a novel-length context. Most conversations don’t need that. Cutting it down to 40k costs nothing in practice and saves significantly on token spend.
These three changes alone cut costs roughly 30-40x. That should have been enough. But at that point I was curious whether I could route routine queries to a local model entirely and save the API calls for work that actually needed Claude.
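The arithmetic behind that 30-40x is simple enough to sketch. A back-of-envelope calculation, treating the multipliers as rough approximations rather than exact billing figures:

```python
# Back-of-envelope for the stacked savings; the multipliers are rough
# approximations, not exact billing figures.

context_cap = 150_000 / 40_000   # ~3.75x fewer max input tokens per message
model_tier = 10                  # Haiku is roughly 10x cheaper than Sonnet

combined = model_tier * context_cap
print(f"~{combined:.0f}x before caching")   # lands in the 30-40x range

# Prompt caching stacks on top of this: cached input tokens cost ~90% less,
# so any repeated system prompt or history is nearly free on re-reads.
```

The two multipliers compound because they attack different factors: tiering cuts the price per token, the context cap cuts the number of tokens.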
The Hardware
Arthur — the repurposed desktop I set up for OpenClaw — has an NVIDIA GeForce GTX 1070 Ti sitting in it. It’s a 2017 gaming GPU. No tensor cores, no INT8 acceleration, nothing designed for inference. But it has 8GB of VRAM and CUDA support, which is enough to run a quantized 7B parameter model.
sudo apt install nvidia-driver-580
curl -fsSL https://ollama.com/install.sh | sudo sh
Ollama detected the GPU automatically, installed itself as a systemd service, and starts on boot. The install was genuinely uneventful.
Choosing a Model
With 8GB VRAM, you’re limited to roughly 7B parameter models at Q4 quantization. The candidates I considered:
- Qwen 2.5 7B — strongest tool-calling ability at this size, which matters because OpenClaw needs the model to invoke tools correctly
- Mistral 7B v0.3 — solid general model, weaker at structured tool calling
- Llama 3.1 8B — good general quality, less reliable tool use
I went with Qwen 2.5 7B, specifically qwen2.5:7b-instruct-q4_0. The Q4_0 quantization is slightly lower quality than Q4_K_M but faster on older hardware.
ollama pull qwen2.5:7b-instruct-q4_0
4.4GB on disk, 4.6GB in VRAM. Fits comfortably.
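Those numbers line up with what Q4_0 quantization predicts. A quick sanity check, assuming roughly 7.6B parameters and an effective 4.5 bits per weight for Q4_0 (4-bit weights plus one fp16 scale per 32-weight block; both figures are my assumptions, not something Ollama reports):

```python
# Rough size estimate for a Q4_0-quantized 7B-class model.
params = 7.6e9           # Qwen 2.5 7B is roughly 7.6B parameters
bits_per_weight = 4.5    # Q4_0 block: 32 x 4-bit weights + one fp16 scale
                         # = 144 bits / 32 weights = 4.5 bits per weight

size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.1f} GB of weights")
```

That lands close to the observed 4.4GB on disk; the extra VRAM on top is the KV cache and runtime overhead, which is also why the context window setting moves VRAM usage in the table below.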
The Context Window Problem
This is where it got interesting. OpenClaw declares the model’s context window in the provider config, and the 1070 Ti’s performance is extremely sensitive to that value.
| Context Window | VRAM Usage | Response Time | Result |
|---|---|---|---|
| 8,192 tokens | ~4.6GB | ~4 seconds | Fast, but OpenClaw flagged it as low |
| 15,000 tokens | ~5.0GB | ~6 seconds | Below OpenClaw’s 16k minimum — blocked entirely |
| 16,384 tokens | ~5.1GB | ~8 seconds | Sweet spot |
| 32,768 tokens | ~5.8GB | 18-25 seconds | Timeouts — OpenClaw killed requests before completion |
The 16k minimum was a surprise. I set it to 15,000 and OpenClaw silently refused to use the model at all, logging FailoverError: Model context window too small (15000 tokens). Minimum is 16000. The error never surfaced to the user — the bot just stopped responding. I had to dig through logs to find it:
sudo grep "blocked model" /tmp/openclaw/openclaw-2026-04-09.log
Same lesson as the billing failure in the first post: when the bot goes silent, check the gateway logs. The error messages are informative, but they’re buried.
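Since the failure mode is a silent bot, a preflight check before restarting the gateway is cheap insurance. A minimal sketch, assuming the config path and provider layout shown later in this post and hard-coding the minimum from the error message:

```python
import json

MIN_CONTEXT = 16_000  # the minimum OpenClaw's FailoverError reported


def check_context_windows(path="/home/claw/.openclaw/openclaw.json"):
    """Return (model_id, contextWindow) pairs that OpenClaw would block."""
    with open(path) as f:
        cfg = json.load(f)
    too_small = []
    for provider in cfg.get("models", {}).get("providers", {}).values():
        for model in provider.get("models", []):
            cw = model.get("contextWindow", 0)
            if cw < MIN_CONTEXT:
                too_small.append((model["id"], cw))
    return too_small
```

Running it after every config edit turns a silent failure into a visible one.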
The Config Corruption Incident
At one point I tried to update the config remotely by piping it through SSH:
ssh claw@arthur "cat ~/.openclaw/openclaw.json" | python3 -c "..." | ssh claw@arthur 'cat > ~/.openclaw/openclaw.json'
The SSH pipe failed mid-write and zeroed out the config file. OpenClaw’s config health monitor caught it — logged size-drop-vs-last-good:2923->0 and flagged it as suspicious — but the gateway was already in a bad state by then.
The fix: never pipe through SSH to overwrite a config file on a remote machine. Run the script entirely on the remote side instead:
ssh claw@arthur 'python3 -c "
import json
with open(\"/home/claw/.openclaw/openclaw.json\") as f:
    cfg = json.load(f)
cfg[\"models\"][\"providers\"][\"ollama\"][\"models\"][0][\"contextWindow\"] = 16384
with open(\"/home/claw/.openclaw/openclaw.json\", \"w\") as f:
    json.dump(cfg, f, indent=2)
"'
Read and write locally on the machine. Don’t trust SSH pipe integrity for config writes.
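For edits that do have to rewrite the file, the safer pattern is to write a temporary file and atomically rename it into place. A sketch of that pattern (the helper name is mine, not an OpenClaw utility):

```python
import json
import os
import tempfile


def write_config_atomically(cfg, path="/home/claw/.openclaw/openclaw.json"):
    """Serialize cfg to a temp file, then atomically rename into place.

    os.replace is atomic on POSIX filesystems, so a crash mid-write leaves
    the old config intact instead of a zero-byte file.
    """
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(cfg, f, indent=2)
            f.flush()
            os.fsync(f.fileno())       # make sure bytes hit the disk
        os.replace(tmp, path)          # atomic swap: all-or-nothing
    except BaseException:
        os.unlink(tmp)                 # clean up the partial temp file
        raise
```

With this, the worst a failed write can do is leave a stray `.tmp` file; the live config is never half-written.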
Wiring Ollama Into OpenClaw
A few things worth knowing about the integration that aren’t obvious from the docs:
Use the native Ollama API at http://127.0.0.1:11434, not the /v1 OpenAI-compatible endpoint. The /v1 path breaks tool calling. Set "api": "ollama" explicitly in the provider config. Set costs to zero so the usage tracker correctly shows $0 for local inference.
The fallback chain handles the rest automatically: if Ollama times out or fails, OpenClaw escalates to Haiku, then Sonnet. The config ended up looking like this:
{
"agents": {
"defaults": {
"model": {
"primary": "ollama/qwen2.5:7b-instruct-q4_0",
"fallbacks": ["anthropic/claude-haiku-4-5", "anthropic/claude-sonnet-4-6"]
},
"models": {
"ollama/qwen2.5:7b-instruct-q4_0": { "alias": "local" },
"anthropic/claude-haiku-4-5": { "alias": "haiku", "params": { "cacheRetention": "short" } },
"anthropic/claude-sonnet-4-6": { "alias": "sonnet", "params": { "cacheRetention": "short" } },
"anthropic/claude-opus-4-6": { "alias": "opus" }
}
}
},
"models": {
"providers": {
"ollama": {
"baseUrl": "http://127.0.0.1:11434",
"apiKey": "ollama-local",
"api": "ollama",
"models": [{
"id": "qwen2.5:7b-instruct-q4_0",
"name": "Qwen 2.5 7B Q4_0 (local GPU)",
"contextWindow": 16384,
"maxTokens": 4096,
"cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 }
}]
}
}
}
}
Model aliases make switching easy in chat — /model haiku, /model local, /model sonnet — without typing full model paths.
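Under the hood, that fallback chain is simple try-in-order escalation. A simplified sketch of the behavior (the function name and shape are mine, not OpenClaw internals):

```python
# Simplified model of primary -> fallbacks escalation, as configured above:
# local Qwen first, then Haiku, then Sonnet.

def complete_with_fallbacks(prompt, backends):
    """Try each (name, call) backend in order; return the first success."""
    errors = []
    for name, call in backends:
        try:
            return name, call(prompt)
        except Exception as exc:       # timeout, rate limit, refusal, etc.
            errors.append((name, exc))
    raise RuntimeError(f"all backends failed: {errors}")
```

The practical consequence: a local-model timeout costs you one failed attempt's latency, then the request lands on Haiku without any user intervention.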
What a 7B Model Actually Does
Here’s the honest part.
I gave Claw scoped access to my Nest thermostats via an explicit API token — access to the thermostats and nothing else, same principle as the locked-down claw service account on Arthur. When I asked Qwen to check the thermostat, it responded: “I don’t have access to your home systems, could you provide me with the current temperature?”
It didn’t attempt to use the tool that was sitting right there. When I pushed, it invented a temperature reading rather than admit it couldn’t help.
This is the gap. It’s not a small quality difference between Qwen and Claude — it’s the difference between an agent that can actually do things and a chatbot that talks about doing things. Simple conversation works fine. Anything requiring the model to reason about which tool to call, construct parameters correctly, and act on the result — it struggles or just doesn’t try. Multi-step tool use is largely out.
I kept it as the primary model anyway. For idle chat it’s free and the responses are fast enough. When I need Claw to actually do something — run a script, check a feed, interact with an API — I type /model haiku in Telegram. Eight seconds later I’m talking to Claude. /model local switches back.
The personality difference is real and it’s part of the setup now. Qwen is the default. Claude is on call.
Where the Costs Landed
| Config | Cost per message |
|---|---|
| Sonnet 4.6, 150k context | ~$0.03–0.10 |
| Haiku, 40k context, prompt caching | ~$0.001–0.003 |
| Local Qwen | $0.00 |
The electricity for the 1070 Ti running inference is somewhere around $0.001 per query. Effectively zero.
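That estimate is easy to sanity-check. Assuming the 1070 Ti draws its full 180W board power for an 8-second response at $0.15/kWh (all three numbers are assumptions), the figure comes out even lower:

```python
watts = 180           # GTX 1070 Ti board power, worst case
seconds = 8           # per-response inference time at the 16k context
price_per_kwh = 0.15  # assumed electricity rate

kwh = watts * seconds / 3600 / 1000
cost = kwh * price_per_kwh
print(f"${cost:.5f} per query")   # a small fraction of a tenth of a cent
```

Even padding generously for idle draw and the rest of the machine, it stays far below any API rate.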
Most idle conversation now costs nothing. Real work costs pennies via Haiku. Sonnet is there for when I need it.
The real leverage wasn’t the local model — it was the layered strategy. Prompt caching, context window management, model tiering with automatic fallbacks. That’s where the 30-40x cost reduction came from. The local model is the final layer, and it’s optional.
Four days in. Bill under control. Ask me again in a month.