Since Ollama v0.14 (January 2026), Claude Code can connect directly to local open-source models via the Anthropic Messages API compatibility layer — no proxy needed, no API costs. You set two environment variables, pull a model like Gemma 4 or Qwen3-Coder, and run claude --model gemma4. If your hardware is too weak for local inference, OpenRouter provides free-tier access to the same models in the cloud with the identical Claude Code workflow.
Claude Code is Anthropic’s terminal-based AI coding agent. It reads your codebase, edits files, runs commands, and manages multi-step tasks — all from your terminal. Until early 2026, using it required an Anthropic API key and paying per token. That changed on January 16, 2026, when Ollama shipped native Anthropic Messages API support in version 0.14.0.
Now there are two legitimate paths to running Claude Code without Anthropic bills: fully local with Ollama, or cloud-routed through OpenRouter’s free tier. Both use the same Claude Code binary, the same terminal workflow, the same tool-calling interface. The difference is where the large language model runs — on your GPU or on someone else’s.
This guide covers both paths end-to-end, including the real performance trade-offs that most tutorials skip. Because “free” has a cost — and you should know exactly what it is before committing.
How does Ollama’s Anthropic API compatibility actually work?
Ollama v0.14+ exposes a /v1/messages endpoint that mimics the Anthropic Messages API. When Claude Code sends a request expecting an Anthropic response, Ollama intercepts it and routes it to whatever local model you’ve configured. The translation happens transparently — Claude Code doesn’t know (or care) that it’s talking to Qwen3-Coder instead of Claude Sonnet.
The compatibility layer supports multi-turn conversations, streaming, system prompts, tool/function calling, vision (base64 images), and extended thinking blocks. What it does not support: the tool_choice parameter, prompt caching, the batches API, PDF inputs, and URL-based images. Token counts are approximations based on the local model’s tokenizer, not Anthropic’s.
In practice, the most important supported feature is tool calling. This is what makes Claude Code agentic — it’s how the agent reads files, writes code, and runs shell commands. Without tool calling, you get a chatbot. With it, you get an autonomous coding agent. Every model you choose for this setup must support tool calling, or Claude Code degrades to text-only mode.
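To make the compatibility layer concrete, here is a minimal sketch of a tool-enabled request as it would be posted to Ollama’s /v1/messages endpoint. The payload fields follow the Anthropic Messages format; the read_file tool definition is a hypothetical illustration, not Claude Code’s actual internal schema.

```python
import json

# A request in Anthropic Messages format, as Claude Code would send it to
# Ollama's /v1/messages endpoint (http://localhost:11434/v1/messages).
# The "read_file" tool below is a hypothetical example for illustration.
payload = {
    "model": "qwen3-coder",
    "max_tokens": 1024,
    "messages": [
        {"role": "user", "content": "Read src/main.py and summarize it."}
    ],
    "tools": [
        {
            "name": "read_file",
            "description": "Read a file from the project and return its contents.",
            "input_schema": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        }
    ],
}

body = json.dumps(payload)
# To send it for real (requires a running Ollama v0.14+):
#   curl http://localhost:11434/v1/messages \
#     -H "content-type: application/json" -d "$(cat payload.json)"
print(len(json.loads(body)["tools"]))  # → 1: the model sees one callable tool
```

When the model decides to call read_file, the response contains a tool_use block; Claude Code executes the tool and sends the result back in a follow-up message, which is what makes the loop agentic.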
Which models actually work well with Claude Code locally?
Not every model labeled “coding model” works well inside Claude Code. The agent requires solid tool calling, a context window of at least 32K tokens (64K+ recommended), and coherent multi-turn reasoning. Here’s what the landscape looks like as of April 2026:
| Model | Params (active) | Context | Min VRAM | Tool calling | Best for |
|---|---|---|---|---|---|
| Gemma 4 31B | 31B (31B) | 128K | ~20GB | Native | General coding + vision tasks |
| Gemma 4 26B MoE | 26B (3.8B) | 128K | ~16GB | Native | Speed on consumer GPUs |
| Qwen3-Coder 30B-A3B | 30B (3B) | 256K | ~16GB | Native | Massive context, light hardware |
| Qwen3-Coder 480B-A35B | 480B (35B) | 256K | ~24GB+ | Native | Maximum quality (needs serious HW) |
| GLM-4.7 (cloud) | MoE | 128K | Cloud only | Yes | Free Ollama cloud model |
The sweet spot for most developers in April 2026 is Gemma 4 26B MoE — it activates only 3.8B parameters per inference step despite its 26B total size, which means it pushes ~300 tokens/sec on a Mac Studio M2 Ultra and runs on 16GB VRAM GPUs. For context: Gemma 4 31B Dense scores 80.0% on LiveCodeBench v6 and posts a Codeforces Elo of 2150, a large jump over Gemma 3. That’s not “decent for a local model” — that’s competitive with frontier cloud APIs on pure coding tasks.
Qwen3-Coder is the other strong option. The 30B-A3B variant activates only 3B parameters and achieves performance comparable to models with 10-20x more active parameters, according to Alibaba’s benchmarks. Its 256K native context window is particularly valuable for Claude Code workflows that involve reading large codebases. Not sure which model your hardware can handle? whatmodelscanirun.com will tell you in seconds — select your GPU and it shows the largest model that fits comfortably, including recommended quantizations.
How do you set up Claude Code with Ollama? (Step-by-step)
The full local setup takes about 10 minutes. Here’s every step, no shortcuts.
Step 1: Install Ollama
Download from ollama.com for your OS (macOS, Windows, Linux). It installs as a background service on http://localhost:11434. Verify it’s running:
```bash
ollama --version
# Should return v0.14.0 or higher
```
Step 2: Install Claude Code
```
# macOS / Linux
curl -fsSL https://claude.ai/install.sh | bash

# Windows (PowerShell)
irm https://claude.ai/install.ps1 | iex
```
Step 3: Point Claude Code at Ollama
Add these two environment variables to your shell profile (~/.zshrc, ~/.bashrc, or ~/.config/fish/config.fish):
```bash
export ANTHROPIC_BASE_URL=http://localhost:11434
export ANTHROPIC_AUTH_TOKEN=ollama
```
Then reload your shell: source ~/.zshrc (or restart the terminal).
Why ANTHROPIC_AUTH_TOKEN=ollama?
Claude Code requires an auth token to start. Ollama accepts but ignores it. Setting it to any non-empty string (like “ollama”) satisfies the requirement without exposing real credentials.
Step 4: Pull a model
```bash
# Recommended: Gemma 4 (pick the variant for your GPU)
ollama pull gemma4       # 31B dense — needs ~20GB VRAM
ollama pull gemma4:e4b   # 26B MoE — runs on 16GB VRAM

# Alternative: Qwen3-Coder
ollama pull qwen3-coder  # 30B-A3B — 256K context, ~16GB VRAM
```
Step 5: Launch Claude Code
```bash
cd /path/to/your/project
claude --model gemma4
```
That’s it. You now have a fully local, zero-cost Claude Code agent. Your code never leaves your machine.
Claude Code sends large system prompts plus your codebase context. For best results, increase Ollama’s context window to at least 32K–128K tokens. Ollama’s cloud models run at full context length automatically. For local models, check Ollama’s context length documentation for how to override defaults.
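Why does context length cost so much memory? The KV cache grows linearly with context. A back-of-envelope sketch, using illustrative architecture numbers (a mid-size model with grouped-query attention, not Gemma 4’s actual configuration):

```python
# Back-of-envelope KV-cache cost of a long context window.
# All architecture numbers below are illustrative assumptions.
n_layers = 48
n_kv_heads = 8       # GQA: far fewer KV heads than query heads
head_dim = 128
bytes_per_value = 2  # fp16

def kv_cache_gib(context_tokens: int) -> float:
    # Factor of 2 = one K tensor and one V tensor per layer.
    total = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total / 2**30

print(f"{kv_cache_gib(8192):.1f} GiB")   # 1.5 GiB
print(f"{kv_cache_gib(65536):.1f} GiB")  # 12.0 GiB
```

The point: raising num_ctx from 8K to 64K multiplies the cache cost eightfold, on top of the model weights themselves — which is why consumer GPUs often cap out well below a model’s advertised context length.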
What if your hardware isn’t powerful enough? OpenRouter as a fallback
Not everyone has a 16GB+ VRAM GPU. If your machine can’t run Gemma 4 or Qwen3-Coder comfortably, OpenRouter gives you access to the same models running on cloud infrastructure — including a free tier of 29 models.
OpenRouter works as a model proxy: Claude Code talks to OpenRouter’s API (which speaks Anthropic Messages format), and OpenRouter routes the request to whatever model you’ve selected. The setup is almost identical to Ollama, just with different environment variables.
Step-by-step OpenRouter setup
Get a free API key at openrouter.ai/keys, then add these to your shell profile:
```bash
export OPENROUTER_API_KEY="your-openrouter-api-key-here"
export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_AUTH_TOKEN="$OPENROUTER_API_KEY"
export ANTHROPIC_API_KEY=""  # Must be explicitly empty
```
If you were previously logged into Claude Code with an Anthropic account, run /logout inside a Claude Code session first. Otherwise cached credentials will override your OpenRouter configuration. Also: ANTHROPIC_API_KEY must be empty — not missing, not unset, but explicitly set to an empty string.
Optionally, pin specific models for different task tiers:
```bash
export ANTHROPIC_DEFAULT_OPUS_MODEL="qwen/qwen3-coder-480b-a35b-instruct:free"
export ANTHROPIC_DEFAULT_SONNET_MODEL="google/gemma-4-31b-it:free"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="mistralai/mistral-small-3.1-24b-instruct:free"
```
Or for the simplest approach, use OpenRouter’s automatic free-model router — it picks the best available free model for each request:
```bash
claude --model openrouter/free
```
Verify it’s working by typing /status inside Claude Code — it should show your OpenRouter connection and the selected model.
OpenRouter free tier: what you actually get
As of April 2026, OpenRouter offers 29 free models from Google, Meta, Mistral, NVIDIA, and others. The strongest free coding options are Qwen3-Coder 480B (262K context, state-of-the-art agentic coding), DeepSeek R1 for reasoning-heavy tasks, and Devstral 2 for lighter work. Rate limits on the free tier are typically 20 requests per minute and 200 requests per day per model — enough for moderate coding sessions, but heavy users will hit walls.
Multiple community reports indicate that running more than ~50 requests with free models requires at least $10 in OpenRouter credits. The free tier is genuinely free for light usage, but expect to spend a few dollars for sustained daily work. Still dramatically cheaper than Anthropic’s API.
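If you want to stay under those limits proactively, a small client-side guard is easy to sketch. The 20/minute and 200/day figures come from the limits quoted above; OpenRouter enforces the real limits server-side, so this is purely local bookkeeping:

```python
import collections

class FreeTierGuard:
    """Client-side tracker for free-tier limits (20 req/min, 200 req/day,
    per the figures above). Local bookkeeping only; the provider enforces
    the real limits server-side."""

    def __init__(self, per_minute: int = 20, per_day: int = 200):
        self.per_minute = per_minute
        self.per_day = per_day
        self.timestamps = collections.deque()  # times of allowed requests

    def allow(self, now: float) -> bool:
        # Drop timestamps older than a day, then check both windows.
        while self.timestamps and now - self.timestamps[0] >= 86400:
            self.timestamps.popleft()
        last_minute = sum(1 for t in self.timestamps if now - t < 60)
        if last_minute >= self.per_minute or len(self.timestamps) >= self.per_day:
            return False
        self.timestamps.append(now)
        return True

guard = FreeTierGuard()
# 20 requests in the same second succeed, the 21st is throttled:
results = [guard.allow(now=0.0) for _ in range(21)]
print(results.count(True))    # 20
print(guard.allow(now=61.0))  # True — the minute window has passed
```

Wrapping outgoing requests in a check like this turns a hard 429 into a graceful pause.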
Ollama vs. OpenRouter vs. Anthropic API: how do they actually compare?
This is the section most guides skip. Here’s an honest comparison based on developer reports and benchmarks from early 2026:
| Dimension | Ollama (local) | OpenRouter (free tier) | Anthropic API (paid) |
|---|---|---|---|
| Cost | $0 (electricity only) | $0–$10/month | $50–$200+/month typical |
| Privacy | 100% local — nothing leaves your machine | Code sent to model provider | Code sent to Anthropic |
| Speed | 15–25 tok/s (consumer GPU), 60+ (M2 Ultra) | Variable, provider-dependent | 60–80 tok/s consistently |
| Edit accuracy | 70–80% (model-dependent) | 70–85% | ~98% (Claude Sonnet/Opus) |
| Multi-file reasoning | Degrades on complex tasks | Depends on model | Strongest available |
| Rate limits | None | 20 req/min, 200 req/day | Tier-dependent |
| Offline | Yes | No | No |
| Hardware needed | 16GB+ VRAM GPU | Any machine with internet | Any machine with internet |
The crucial metric is edit accuracy. Claude Code doesn’t just generate text — it makes targeted edits to specific lines in specific files across your project. Claude Sonnet 4 achieves ~98% accuracy on these edits. Local models drop to 70-80%, meaning roughly 1 in 4 edits may need manual correction. On single-file tasks, local models score within 85-90% of Claude. On multi-file reasoning — understanding how a change in auth.py affects middleware.py and test_auth.py — the gap widens significantly.
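Those percentages translate directly into review burden. A quick sanity check on a hypothetical session of 200 agent edits, using the accuracy figures above:

```python
# Expected manual corrections over a session of agent edits.
# 200 edits and the two accuracy figures are taken from the text above;
# the session size is a hypothetical example.
edits = 200
for label, accuracy in [("local model (~75%)", 0.75), ("Claude (~98%)", 0.98)]:
    misses = edits * (1 - accuracy)
    print(f"{label}: ~{misses:.0f} edits to review by hand")
# local model (~75%): ~50 edits to review by hand
# Claude (~98%): ~4 edits to review by hand
```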
This doesn’t mean local setups are useless. It means they’re excellent for certain workflows and weaker at others.
When is running locally actually the right choice?
Local Claude Code with Ollama genuinely shines in these scenarios:
Regulated environments and air-gapped systems. If you work with healthcare data (HIPAA), financial data, government contracts, or any codebase that legally cannot touch external servers, local is the only option. No API provider can guarantee what you need — but a model running on your own hardware can.
Privacy-sensitive codebases. Proprietary algorithms, unreleased products, trade secrets. If your company’s legal team would reject sending code to any third-party API, local Ollama is the answer.
Learning and experimentation. When you’re exploring Claude Code’s capabilities, testing prompts, or learning agentic workflows, burning through API credits is wasteful. Local models let you experiment freely with zero marginal cost.
Single-file tasks and quick edits. For straightforward tasks — refactoring a function, adding error handling, writing tests for one module — local models perform close to cloud APIs. The accuracy gap shows up mainly on complex, cross-file orchestration.
Offline development. On a plane, in a rural area, during an internet outage — Ollama keeps working. For developers who travel or work from locations with unreliable connectivity, this alone can be the deciding factor.
Many experienced developers use both: local Ollama for day-to-day single-file tasks and experimentation, then switch to Anthropic’s API (or OpenRouter with a paid model) for complex multi-file refactors and architecture-level changes. You can switch between them by changing two environment variables — no reinstallation needed.
What are the real limitations you should know about?
Here’s what the promotional tweets don’t tell you:
Hardware costs aren’t zero. A GPU that can run Gemma 4 31B costs $400–$1,500+ depending on whether you buy used or new. A Mac with 16GB+ unified memory runs $1,200+. If you’re paying $50/month for Anthropic’s API, it takes 8–30 months to break even on hardware. “Free” is only free if you already own the hardware.
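The break-even arithmetic is simple enough to sketch; the figures come from the paragraph above, and your actual costs will differ:

```python
# Months until local hardware pays for itself versus a paid API.
def breakeven_months(hardware_cost: float, monthly_api_cost: float) -> float:
    return hardware_cost / monthly_api_cost

print(breakeven_months(400, 50))   # 8.0  — used GPU vs. $50/month API
print(breakeven_months(1500, 50))  # 30.0 — high-end GPU vs. $50/month API
```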
Token generation speed matters more than benchmarks. Consumer hardware generates 15–25 tokens per second. Anthropic’s infrastructure pushes 60–80. When Claude Code needs to generate a 500-line file, that 3-4x speed difference translates to minutes of waiting vs. seconds. Over a full workday, those minutes compound.
Unsupported API features cause silent failures. The Ollama compatibility layer doesn’t support tool_choice, prompt caching, or the batches API. Claude Code uses some of these features internally. When they’re unavailable, the agent doesn’t crash — it silently falls back to less optimal behavior. You may not even notice the degradation until a complex task fails midway.
Model updates lag behind. When Anthropic releases a new Claude version, it’s available instantly through their API. Local models go through a slower pipeline: the open-source model ships, Ollama packages it, you pull it. There’s always a gap between frontier and local.
Context window inflation is misleading. A model may advertise 256K context, but actually using it requires proportionally more VRAM. On consumer hardware, you may be limited to 32K–64K in practice, even if the model theoretically supports more. Claude Code’s system prompt alone consumes several thousand tokens.
Decision flowchart: which setup is right for you?

- Code legally or contractually can’t leave your machine? → Ollama, local only.
- Have a 16GB+ VRAM GPU or Apple Silicon with 16GB+ unified memory? → Ollama with Gemma 4 26B MoE or Qwen3-Coder 30B-A3B.
- Hardware too weak, usage light? → OpenRouter free tier.
- Complex multi-file refactors where accuracy is non-negotiable? → Anthropic’s API, or OpenRouter with a paid model.
Advanced configuration tips that make a real difference
Increase context length for Ollama
By default, some Ollama models run with 4K–8K context windows. Claude Code needs much more. Create or edit a Modelfile to increase it:
```bash
# Create a custom model with 64K context
ollama create gemma4-64k -f - <<EOF
FROM gemma4
PARAMETER num_ctx 65536
EOF

# Use it with Claude Code
claude --model gemma4-64k
```
Project-level configuration (keeps your global shell clean)
Instead of setting environment variables globally, you can configure OpenRouter per-project in .claude/settings.local.json at your project root:
```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://openrouter.ai/api",
    "ANTHROPIC_AUTH_TOKEN": "your-openrouter-key",
    "ANTHROPIC_API_KEY": ""
  }
}
```
This way, one project can use local Ollama while another uses OpenRouter — no manual environment variable switching.
Use Ollama cloud models for free (no local GPU)
Ollama also offers cloud-hosted models that are free to use and run at full context length. Models like glm-4.7:cloud and minimax-m2.1:cloud run on Ollama’s infrastructure, not your hardware. Configure them like any other Ollama model — the environment variables stay the same.
What about alternatives: Claw-dev and other proxies?
Before Ollama added native Anthropic API support, developers built proxy tools to bridge the gap — Claude-code-router and Claw-dev are the best known. These intercept Claude Code’s API calls and re-route them to local models.
As of April 2026, these proxies are largely unnecessary for most users. Ollama’s native compatibility is simpler, more stable, and maintained by the Ollama team. The only remaining use case for proxies is if you need custom request transformation (rate limiting, logging, or routing different request types to different models) that Ollama doesn’t support natively.
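For that remaining use case, the heart of such a proxy is a request-transformation function that rewrites the payload before forwarding it. A minimal sketch, with hypothetical model names and routing rules:

```python
import json

# Hypothetical routing table: map requested Claude tiers to local models.
# Model names and the prefix-matching rule are illustrative assumptions.
ROUTES = {
    "claude-opus": "qwen3-coder:480b",  # heavy tasks -> biggest local model
    "claude-sonnet": "gemma4",
    "claude-haiku": "gemma4:e4b",       # light tasks -> fast MoE variant
}

def transform_request(raw_body: bytes) -> bytes:
    """Rewrite the model field and tag the request before forwarding."""
    payload = json.loads(raw_body)
    requested = payload.get("model", "")
    for prefix, local_model in ROUTES.items():
        if requested.startswith(prefix):
            payload["model"] = local_model
            break
    payload["metadata"] = {"proxied": True}  # marker for downstream logging
    return json.dumps(payload).encode()

out = json.loads(transform_request(b'{"model": "claude-sonnet-4", "max_tokens": 64}'))
print(out["model"])  # gemma4
```

A real proxy wraps this function in an HTTP server that forwards the transformed body to Ollama and streams the response back; the transformation step is where rate limiting, logging, or per-tier routing would live.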
What does all of this mean in practice?
The narrative of “Claude Code for free — no compromises” is marketing, not engineering. The real picture is more nuanced and, in many ways, more interesting.
What’s genuinely new is that the barrier to entry for agentic coding has collapsed. A year ago, experimenting with terminal-based AI agents required an API key and a credit card. Today, a developer with a decent GPU can run the same workflow — same tool calling, same file editing, same shell integration — at zero marginal cost. That’s a significant shift for education, experimentation, and privacy-sensitive work.
What hasn’t changed is that model quality still scales with compute. Claude Opus and Sonnet running on Anthropic’s infrastructure remain meaningfully better at complex, multi-file reasoning than any local alternative. The 70-80% edit accuracy of local models vs. 98% for Claude means the “free” setup creates more work for the developer on hard tasks — reviewing, correcting, and re-prompting where the paid version would have gotten it right the first time.
The pragmatic approach for most developers: start local, escalate when needed. Use Ollama for everyday single-file tasks, learning, and privacy-sensitive work. Switch to a paid API for the 20% of tasks where accuracy is non-negotiable — large refactors, unfamiliar codebases, architecture changes. Two environment variables, and you’re on the other backend. No reinstallation, no workflow change.
The tools are here. The models are good enough for real work. The choice isn’t binary — it’s a spectrum, and smart developers will use all of it.
FAQ
Is Claude Code with Ollama really free?

The software is free — Ollama is open-source and Claude Code doesn’t charge for the CLI tool itself. However, you need hardware capable of running a large language model locally (16GB+ VRAM GPU or Apple Silicon with 16GB+ unified memory). If you already own suitable hardware, the marginal cost is electricity only. If you need to buy hardware, factor in $400–$1,500+ for a capable GPU. Ollama’s cloud models and OpenRouter’s free tier are genuine zero-cost options, but come with rate limits.
Which local model works best with Claude Code?

As of April 2026, Gemma 4 26B MoE offers the best speed-to-quality ratio for most developers — it activates only 3.8B parameters per inference, runs on 16GB VRAM GPUs, and achieves competitive coding benchmarks. Qwen3-Coder 30B-A3B is the best choice if you need a massive context window (256K tokens). For developers with high-end hardware (24GB+ VRAM), Gemma 4 31B Dense or Qwen3-Coder 480B provide the strongest raw performance.
How much VRAM do you need?

The practical minimum is 16GB VRAM for models like Gemma 4 26B MoE or Qwen3-Coder 30B-A3B (both use Mixture-of-Experts to keep active parameters low). For the best experience with full-size dense models, 20–24GB is recommended. Apple Silicon Macs with 16GB+ unified memory also work well. Check whatmodelscanirun.com for GPU-specific recommendations.
Does the local setup work offline?

Yes. Once you’ve pulled the model with ollama pull (which requires internet), all subsequent inference runs entirely on your machine with no network connection needed. This makes it suitable for air-gapped environments, travel, and situations with unreliable internet. Note that Claude Code itself may check for updates on launch, but the core functionality works offline once the model is cached.
How close are local models to Claude in quality?

On single-file tasks — refactoring, writing tests, adding error handling — local models score within 85-90% of Claude. On complex multi-file reasoning, the gap widens: Claude achieves ~98% edit accuracy versus 70-80% for the best local alternatives. The difference is most noticeable on large codebases with many interdependent files. For many common development tasks, local models are entirely adequate.
What’s the difference between Ollama local and Ollama cloud models?

Ollama local models run on your GPU and never send data externally. Ollama cloud models (like glm-4.7:cloud) are hosted on Ollama’s infrastructure — free to use, but your prompts leave your machine. Both use the same environment variables and Claude Code configuration. Cloud models automatically run at full context length and don’t require a powerful GPU.
Can you switch between backends later?

Yes. The switch requires only changing two environment variables (ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN). You can even configure different backends per project using .claude/settings.local.json files, so one project uses local Ollama while another uses OpenRouter or Anthropic’s API — no manual switching needed.