Getting Started with Local LLMs on a Mac: OpenCode + LM Studio

If you have an Apple Silicon Mac with a reasonable amount of unified memory and you have ever flinched at a token bill, this post is for you. The goal here is narrow: get a real coding agent (OpenCode) talking to a real local model server (LM Studio) in about twenty minutes, and have it feel good enough that you actually keep using it.

I am going to skip the philosophy. I wrote a longer piece on why local-first is quietly winning for home developers; this one is the how.

What you are building

Two processes on your laptop talking over localhost:

LM Studio. Runs the model, exposes an OpenAI-compatible HTTP API on a local port.
OpenCode. A coding agent that points at that API instead of a cloud provider.

That is it. No proxies, no Docker, no API keys. The whole thing fits in a single arrow: OpenCode → http://localhost:1234/v1 → LM Studio → your GPU.

Step 1: Install LM Studio

Grab LM Studio from lmstudio.ai. It is a normal Mac app: drag it to Applications and launch it. On Apple Silicon it will automatically prefer MLX builds of models when available. MLX is Apple’s machine-learning framework tuned for Apple Silicon’s unified-memory GPUs, and on M-series Macs it is often noticeably faster than llama.cpp (the cross-platform alternative) for the same model weights.

A few settings worth knowing about up front:

Discover tab is where you download models. Search by name, pick a quantization (more on this below), hit download.
Developer tab → Local Server is where you start the OpenAI-compatible endpoint. Click Start Server. Default port is 1234. Leave Just-in-Time Model Loading on; it lets clients request any downloaded model by name and LM Studio will swap it in.
Settings → Runtime lets you confirm MLX is installed. On M-series Macs you want it.

Once the server is running you should be able to hit it from a terminal:

curl http://localhost:1234/v1/models

You will get back a JSON list of whatever models you have downloaded. If that works, the server half is done.

Step 2: Pick a model that fits your machine

This is the step people get wrong, so it is worth being concrete. The constraint on a Mac is unified memory, not “VRAM” in the PC sense; the GPU and CPU share the same pool. A rough rule: leave 8 GB for the OS and apps, and let the model use the rest.

Some sane starting points by RAM tier (all available in LM Studio’s Discover tab, all packaged as either MLX or GGUF, the two model file formats local runners care about):

16 GB Mac. Start with Qwen 3.5-4B at Q4 quantization. Around 3–4 GB resident, leaves plenty of headroom for your IDE and browser. Surprisingly capable for its footprint, and the most “it just works” pick for the lighter end. Codestral 25.08 is a good second option if you mostly want fast inline completion.
24–32 GB Mac. Qwen 3.5-9B at Q5 or Q6. Around 7–9 GB resident. Strong general coder, fast on M3/M4/M5, and the quality jump over the 4B-class models is the single biggest one in the lineup.
48–64 GB Mac. Qwen 3.6-27B at Q4 or Q5. Around 17–22 GB resident. This is the current sweet spot for daily-driver work on a typical “Pro” machine. Qwen’s April 2026 release lands at 77.2% on SWE-bench Verified, which is wild for a dense 27B. Pair it with DeepSeek V3.2 when you want a reasoning-heavy second opinion.
96 GB+ Mac. You can run Llama 4 Scout (17B active / 109B total, 10M-token context) comfortably and start playing with the larger MoE models like Qwen 3.6-35B-A3B. At this tier you stop thinking about the constraint and start thinking about which model you want.

If in doubt: download Qwen 3.6 (or 3.5 if you are on the lighter end) in whatever the largest quantization is that fits comfortably in your RAM minus 8 GB. It is the closest thing to a default right answer right now.

A note on quantization: Q4 through Q6 are the useful range. Q4 is smaller and faster, Q6 is closer to the original weights. The quality difference is real but smaller than people think; the size difference is large. Start at Q4, move up if you have headroom.

Step 3: Install OpenCode

OpenCode ships as a single binary. The fastest path on macOS:

brew install opencode

Or use the universal installer (curl -fsSL https://opencode.ai/install | bash), or npm install -g opencode-ai if you’d rather go through npm. Run opencode --version to confirm.

Step 4: Point OpenCode at LM Studio

OpenCode reads provider config from ~/.config/opencode/config.json. Create or edit that file:

{
  "provider": {
    "lmstudio": {
      "name": "LM Studio",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:1234/v1"
      },
      "models": {
        "qwen3.6-27b": { "name": "Qwen 3.6 27B (local)" },
        "deepseek-v3.2": { "name": "DeepSeek V3.2 (local)" }
      }
    }
  },
  "model": "lmstudio/qwen3.6-27b"
}

Two things to know:

The model IDs on the left (qwen3.6-27b, etc.) need to match what LM Studio reports in GET /v1/models. If OpenCode complains it cannot find the model, run that curl command and copy the exact ID.
The model field at the bottom is the default. You can change it mid-session inside OpenCode with /model.

No API key field is needed. LM Studio does not require one, and OpenCode will send a dummy value.

Step 5: Try it

In any project directory:

opencode

You should land in OpenCode’s TUI with the local model selected. Try something small first, like “summarize what this repo does” or “add a docstring to the main function in src/foo.py”. Watch LM Studio’s server log; you will see the requests come in and the tokens stream out.

If the first response is slow, that is the model loading into memory. Subsequent requests in the same session are fast.

What it actually feels like

A few honest observations from a couple of months of this being my default setup:

Latency is fine. First-token latency on a 27B model on M4/M5-class hardware is ~1 second; throughput is fast enough that you do not sit waiting for the agent to finish writing a function.
Tool use works. OpenCode’s file-edit, shell, and search tools all work over the OpenAI-compatible API as long as the model itself supports tool calling. Qwen 3.5/3.6, DeepSeek V3.2, and Llama 4 all do. Codestral is weaker here, so keep it for inline-completion-style work.
Context windows are bigger than you might expect. Qwen 3.6 ships with a 256K native context (extensible to ~1M with YaRN), and Llama 4 Scout advertises 10M tokens. The practical ceiling on a Mac is RAM, not the model’s stated window. Long contexts eat memory fast.
Battery is the real cost. Running a 20B+ model pegs the GPU. On a MacBook unplugged, you will feel it. Plug in for long sessions.

Lightweight alternatives worth knowing about

OpenCode + LM Studio is the path I would put a friend on first, but it is not the only one.

pi.dev. A much newer, deliberately minimal coding harness from Mario Zechner (of libGDX fame). It ships with just four core tools (read, write, edit, bash) and a ~300-word system prompt; everything else is a TypeScript extension you opt into. Install with npm install -g @mariozechner/pi-coding-agent. It speaks the same OpenAI-compatible protocol, so the LM Studio side of this guide is identical; only the harness changes. Worth a look if OpenCode feels heavy for what you are doing, or if you like the idea of a tool that does exactly what you tell it and nothing more.
Ollama. The other obvious model runner. CLI-first, dead simple to script (ollama run qwen3.5:27b), and exposes the same OpenAI-compatible API on http://localhost:11434/v1. If you are someone who would rather configure things from a terminal than click through an app, swap LM Studio out for Ollama and the rest of this guide works unchanged. LM Studio’s edge is the GUI for browsing/quantizing models and the slightly better MLX integration; Ollama’s edge is being scriptable and headless.

You can run both at the same time on different ports if you want to A/B them. They do not conflict.

Where to go from here

Once the basic loop is working, the interesting tweaks are:

Try a second model. Download a reasoning-heavy one (DeepSeek V3.2) alongside your daily driver and switch with /model when a task calls for it. The whole point of local is that swapping costs you nothing.
Wire up Claude Code as a third harness if you have an Anthropic key. Point it at the local LM Studio endpoint via a small proxy when you want its agent loop with an open model behind it. I covered the why of this in the previous post; the how is a 10-line LiteLLM config.
Move models off your boot drive. GGUF and MLX files are large. An external NVMe (the T9 is the one I use) keeps your internal SSD breathable and lets you carry a model library between machines.

That is the whole stack. Two installs, one config file, and a model download. The hardest part is picking which model to try first, and the honest answer there is start with Qwen 3.6 if you can fit it, 3.5 if you can’t.