Running LLMs Locally Is Quietly Becoming the Better Option for Home Devs

For the last two years, “use a local LLM for coding” was a recommendation with a quiet asterisk. The asterisk said: if you’re willing to accept noticeably worse results. That asterisk has gotten a lot smaller. The open-weights models that landed across late 2025 and early 2026 are good enough that, for a real chunk of the work a home developer does in a day, the question is no longer “can I get away with running this locally” but “why am I still paying per token for this.”

This post is about the stack that actually makes that swap practical, the open models worth pointing your harness at, and why I think the long-term direction for home use is local-first. Not because the cloud models will get worse, but because their economics are quietly going the wrong way for individuals.

The pieces, briefly

A working local-LLM setup has three layers, and it helps to think about them separately:

A runner that loads model weights, manages memory, and exposes them over an OpenAI-compatible API. (“OpenAI-compatible” means the runner speaks the same HTTP protocol as OpenAI’s API, so any tool built for OpenAI can point at this URL on your laptop instead and just work.) LM Studio is what I use. Ollama is the other obvious pick. Both will host any model you download in the standard local formats (GGUF, the universal one, or MLX, Apple’s faster-on-Apple-Silicon variant) and serve it on localhost.
A harness that turns a chat model into a coding agent: file editing, shell access, codebase context, multi-step tool use. This is where OpenCode and pi.dev live.
The models themselves, which you swap in and out depending on the task.

The thing that makes 2026 different from 2024 is that all three layers got good at the same time. You can mix and match without writing glue code.

The harnesses

OpenCode

OpenCode is the bigger, more featureful of the two. It is open-source, has racked up around 150K GitHub stars, and runs in three places (terminal, desktop app, and IDE extension), sharing sessions across them. It supports something like 75+ model providers through its Models.dev integration, including local OpenAI-compatible endpoints, so an LM Studio server slots in as just another provider. It explicitly does not store your code or context anywhere, which matters if you are pointing it at a private repo.

Feature-wise it leans full-stack: LSP integration so the model gets real language-server context, multi-session parallelism (you can have several agents working on the same project), and shareable session links for debugging or asking someone to look at what your agent did.

If you want a “Claude Code-style experience but BYO model,” this is the one.

pi.dev

pi.dev is the deliberate counter-pitch. It bills itself as a “minimal terminal coding harness,” built by Mario Zechner, and the design philosophy is on the opposite end from OpenCode. It is small, extension-based, and explicitly does not ship with most of the features that have piled into modern coding agents over the last year: no MCP plugin protocol, no sub-agents spawning sub-agents, no permission popups, no plan mode, no built-in todo tracking, no background shell tasks. The argument is that you should add those yourself via extensions if you want them, rather than inheriting opinionated defaults.

In return you get a smaller footprint, fast mid-session model switching (/model or Ctrl+L), tree-structured sessions you can rewind to any prior point, and 15+ provider integrations including local endpoints.

It is the right tool when you want the agent to do exactly what you tell it and nothing else, which, increasingly, is when I want a coding agent at all.

Where the others fit

A few honorable mentions worth knowing about:

Claude Code. Anthropic’s own CLI. Officially points at Claude models, but with a small proxy in front (LiteLLM, claude-code-router, or similar) you can redirect it at any OpenAI-compatible endpoint, including a local LM Studio or Ollama server. That makes it a viable third harness even when you are not using Claude itself, which is useful since its agent loop and tool-use behavior are excellent.
Aider. The elder statesman. Git-native (every change is a commit), works with anything OpenAI-compatible, very small surface area. Still excellent.
Cline / Continue. VS Code extensions, for people who want the agent inside the editor rather than a separate terminal.

Models worth pointing your harness at

The hard part of the local-LLM story used to be the model lineup. That is no longer true. The names below all live on Hugging Face (the de facto registry for open models) and download into LM Studio or Ollama with one click. Some current picks, by job:

Qwen 3.6-27B. The default I reach for. Released April 2026 and notable for landing 77.2% on SWE-bench Verified despite being a dense 27B. Reasonable VRAM footprint (~18 GB at Q4), broad language coverage, 256K native context. On a higher-RAM Apple Silicon machine, this is your daily driver.
DeepSeek V3.2. DeepSeek’s December 2025 release. A Mixture-of-Experts model with 37B active parameters that punches well above its active count for reasoning-heavy work. Algorithmic problems, data-pipeline logic, anything where the model needs to actually think rather than pattern-match. Big on disk, but the MoE design keeps inference cost closer to a mid-size dense model.
Llama 4 Scout. The context-window monster (10M tokens, 17B active / 109B total MoE). Useful when you need to feed it an entire mid-sized codebase in one shot. The headline number is large enough that you stop thinking about chunking strategies.
Codestral 25.08. Mistral’s coding model, tuned for fast inline completion rather than long-form agent work. 256K context, great for autocomplete-style integrations.
Qwen 3.5-9B. The quality-per-gigabyte champion for the lighter end. Comfortable on a 16–24 GB Mac and a great place to start if your machine isn’t a Pro tier.

The honest thing to say about all of these: none of them is quite GPT-5 or Claude Opus on the hardest end of the difficulty curve. They are noticeably closer than they were a year ago. For “refactor this file,” “write tests for this module,” “find the bug in this trace,” they are now in the same neighborhood as the frontier models, and they cost zero per token.

The one place cloud still wins

The honest concession is narrow: the absolute hardest reasoning still belongs to the frontier. When I have a gnarly architectural problem or a bug that needs a model to actually hold the whole system in its head, Claude Opus or GPT-5 are still the right call. That gap is closing on a quarterly cadence, but it is not closed.

Everything else has flipped. The 8-hour-a-day grind of “refactor this, write tests for that, find the bug in this trace” runs locally now, for the cost of electricity, on hardware that does not phone home, with no quota emails. The frontier providers have spent the last year quietly tightening usage caps and spawning new pricing tiers, and the trajectory is not subtle. Local doesn’t have a meter.

The workable pattern is a mix. Point the harness at a local model by default. Hit Ctrl+L and swap to Claude when the task earns it. One keystroke, no friction.

My setup

For what it’s worth, here is what I run on a daily basis:

MacBook Pro with M5 silicon and a generous amount of unified memory. The kind of machine that makes the 20–30B-parameter open models feel snappy.
LM Studio as the model runner, exposing an OpenAI-compatible endpoint on localhost.
OpenCode, pi.dev, and Claude Code as the three harnesses, depending on the job. OpenCode for full-feature work, pi.dev when I want a minimal footprint, and Claude Code (pointed either at Anthropic’s API for hard problems or at the local LM Studio endpoint via a proxy) when I want its agent loop with an open model behind it.

In a typical session I bounce between all three harnesses and several models without thinking about it. Most of the work happens locally. Claude (the actual hosted model) shows up when I have a hard problem and want a frontier second opinion. The bill at the end of the month is a small fraction of what it would be if I ran everything through the API.

Why I think this becomes the default

The thing nobody quite wants to say out loud about cloud LLM pricing is that the prices we got used to in 2024 were marketing prices, and the prices we are getting in 2026 are the actual prices. Usage caps have tightened. Per-token rates on the strongest models have crept up. “Pro” tiers have spawned “Max” tiers have spawned “Enterprise” tiers, each one repositioning the previous one as “the cheap one.”

That is not a conspiracy. Running these models is genuinely expensive, and someone has to pay for the GPUs. But it does mean that the equation for individual home developers is steadily shifting. The cost of running a top-tier cloud model 8 hours a day keeps going up. The cost of running a top-tier open model 8 hours a day is the same as running your laptop.

Meanwhile, the open models are getting better on a steady cadence. Every three to six months a new generation lands, the leaderboards reshuffle, and the gap to the frontier shrinks a little more. The hardware is getting cheaper and faster on roughly the same cadence. And the harnesses that connect the two are now mature enough that swapping models is a config change, not a project.

I do not think the frontier labs are going anywhere. I think they will keep building the absolute best models, and businesses with real budgets will keep paying for them. But for home developers, for the kind of work that fills the average day, I think the right answer five years from now is going to look a lot more like your laptop, an open model, and a small harness that gets out of your way than like a perpetual subscription to whatever cloud provider had the best benchmark this quarter.

The pieces to do that today already exist. You just have to assemble them.