I run my own LLMs. Not because I think the frontier APIs are bad, and not because I want privacy theatre. Because the interesting part of working with language models in 2026 is not the model. It is the discipline you wrap around the model. That discipline is far easier to build and inspect on a stack you fully own.

This post is about the three layers of that discipline as they exist on my desk today: model tiering, subagents with isolated context, and skills as a third layer that hides API plumbing behind one tool call. The hardware is interesting too, but the hardware is the boring part.

The local stack#

The actual machine is one consumer-grade Linux box with a single GPU and 128 GB of RAM. The interesting software lives in roughly four pieces.

  • llama.cpp as the inference engine. C++, runs anywhere, supports every quantization that matters.
  • llama-swap as the hot-swap router. Configured with a YAML file that maps model names to launch commands. When a request comes in for a model that is not currently loaded, llama-swap unloads the previous model, loads the requested one, proxies the request through. Hot swap takes 20-40 seconds for a 35B model on this hardware.
  • Open WebUI as the front door. Chat UI, OpenAI-compatible API gateway, RAG, function calling. Lives at chat.archworks.co, points its OPENAI_API_BASE at the local llama-swap.
  • OpenCode as the agentic CLI. Reads prompts, plans, calls tools, returns answers. The provider config points at the same Open WebUI endpoint, so it sees the same model menu.
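To make the routing concrete, here is a minimal sketch of what any client sees, assuming the `openai` Python package and the endpoint names from this post (the exact API path Open WebUI exposes may differ on your install):

```python
# Any OpenAI-compatible client works against this stack. The base_url and
# model names come from this post; the key is whatever Open WebUI issues.
from openai import OpenAI

client = OpenAI(
    base_url="https://chat.archworks.co/api",  # Open WebUI's API gateway
    api_key="sk-local-placeholder",            # Open WebUI-issued key
)

# Requesting a model that is not currently loaded triggers llama-swap's hot
# swap: unload the current model, launch the requested one, proxy through.
resp = client.chat.completions.create(
    model="qwen3.6-35b-a3b",  # the worker tier from the menu below
    messages=[{"role": "user", "content": "One sentence: what is MoE?"}],
)
print(resp.choices[0].message.content)
```

The model name in the request is the entire routing interface. Everything downstream of that string is llama-swap's problem.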

Around that sit two more services that earn their place: SearXNG for web search and Firecrawl for clean content extraction. Both are self-hosted, both run in Docker on the same box.

The model menu is small and deliberate. Four quantized models:

  • qwen3.5-4b - utility tier, page extraction, classification, anything mechanical
  • qwen3.6-27b - worker tier, focused subtasks, no thinking mode
  • qwen3.6-35b-a3b - worker tier, faster variant for high-throughput agent work
  • qwen3.5-122b-a10b - orchestrator tier, thinking enabled, big planning jobs

The letter math in an MoE name reads total params, then active params: a 122b-a10b holds 122 billion parameters but activates only 10 billion per token. That is the trick that makes consumer-hardware self-hosting work in 2026. Memory holds the whole model; compute touches only the experts the router picks.
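The arithmetic, with illustrative numbers (the bits-per-weight figure is an assumption, roughly a llama.cpp Q4 quantization with overhead):

```python
# Back-of-envelope MoE math; bits_per_weight is an assumed quantization level.
total_params = 122e9    # parameters held in memory
active_params = 10e9    # parameters touched per token
bits_per_weight = 4.5   # ~Q4 quantization, including overhead

memory_gb = total_params * bits_per_weight / 8 / 1e9
flops_per_token = 2 * active_params  # ~2 FLOPs per active param per token

print(f"weights in memory: ~{memory_gb:.0f} GB")                      # ~69 GB
print(f"decode compute:    ~{flops_per_token / 1e9:.0f} GFLOPs/token")  # ~20
```

Roughly speaking, a dense 122B model would need the same memory but about twelve times the per-token compute. The MoE buys big-model knowledge at small-model decode speed.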

The filing cabinet#

A naive way to use an LLM for research: hand it a question, let it search the web, scrape ten pages, dump the content into context, ask for a report. By page seven the model has forgotten what it read for page two. By the report it is confidently inventing quotes that did not appear in any of the sources.

This is not a small-model problem. Frontier models hallucinate under the same load. The shape of the problem is attention dilution: the model spreads its limited attention across everything in context, and the more noise you stuff in, the less attention each fact gets.

The fix is not bigger models. The fix is smaller contexts.

The mental model I keep coming back to is a filing cabinet. The bad way to research is one person, 50 documents, ten questions: one overloaded human reading and re-reading the same pile, mixing details across questions, forgetting what they read for which question. The good way is ten people, five documents each, one question each: focused readers writing one-page summaries, handing them to a senior analyst who synthesises.

That is what context discipline is. Many workers with clean desks. One orchestrator combining their findings.

Layer one: model tiering#

Different roles need different model capabilities.

| Tier | Model | Thinking | Purpose |
| --- | --- | --- | --- |
| Orchestrator | 122B MoE | enabled | plan, reason, synthesise |
| Worker | 35B MoE (fast variant) | disabled | search, extract, verify |
| Utility | 4B | disabled | page extraction, classification |

The orchestrator is the senior analyst. Receives the user's question, plans the breakdown, distributes subtasks, combines answers, checks for gaps, writes the final output. Thinking mode on, because reasoning is the orchestrator's job.

The worker is the focused reader. Gets one specific subtask with a small starting context. Does its retrieval, extracts the relevant facts with source URLs, returns a short distilled answer. Thinking mode off, because reasoning is not the worker's job and thinking burns context for internal monologue that does not improve search quality.

The utility tier exists for one job: turn a raw HTML page into the three sentences the worker actually wanted. A worker never reads a raw page. It calls a utility model with the page and a prompt like "extract any fine amounts and their legal basis", and gets back two hundred tokens of clean extracted facts. Five extracted pages cost a thousand tokens of worker context. Five raw pages would cost forty thousand. The compounding effect is the difference between fitting in context and not.

The total compute is the same. The quality is dramatically different.
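A minimal sketch of that extraction call, assuming a local OpenAI-compatible endpoint (the port and the helper name are illustrative, not part of the stack):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

def extract_facts(page_text: str, question: str) -> str:
    """Run a raw page through the utility tier; return a few hundred tokens."""
    resp = client.chat.completions.create(
        model="qwen3.5-4b",  # utility tier: cheap, fast, always resident
        messages=[
            {"role": "system",
             "content": "Extract only facts relevant to the question. "
                        "Quote figures exactly. Say NOT FOUND if absent."},
            {"role": "user",
             "content": f"Question: {question}\n\nPage:\n{page_text}"},
        ],
        max_tokens=250,
    )
    return resp.choices[0].message.content

# The worker keeps ~200 clean tokens instead of an 8,000-token raw page.
page = open("scraped_page.txt").read()  # placeholder for Firecrawl output
facts = extract_facts(page, "extract any fine amounts and their legal basis")
```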

Layer two: subagents#

Subagents are the mechanism that makes context isolation real. They are not a UI thing. They are a runtime split where each subtask runs in its own model invocation with its own context window.

The OpenCode shape:

You: "Research X"
         |
         v
+--Orchestrator agent (122B, thinking)--+
|                                       |
|  Breaks X into 8 questions            |
|  For each: spawn a worker subagent    |
|                                       |
+--+-------+-------+-------+-------+----+
   v       v       v       v       v
 worker  worker  worker  worker  worker
  (35B)   (35B)   (35B)   (35B)   (35B)
   |       |       |       |       |
   |       |  each: search, extract, verify
   |       |       |       |       |
   v       v       v       v       v
 answer  answer  answer  answer  answer

Orchestrator: receives 8 short answers (not 8 page dumps)
              cross-checks, writes report

Each worker starts with maybe 500 tokens of context: the question, a short system prompt. It does its work, hits maybe 5,000 tokens of accumulated context by the end, returns 200-400 tokens of distilled findings. The orchestrator never sees the workers' contexts. It sees the answers.
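OpenCode handles the spawning natively, but the runtime split reduces to something like this sketch (the helper names and the parallelism strategy are mine, not OpenCode internals):

```python
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

def ask(model: str, system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def worker(question: str) -> str:
    # Fresh context per call: the question and a short system prompt in,
    # a distilled, cited answer out. Nothing bleeds between workers.
    return ask("qwen3.6-35b-a3b",
               "Research the question. Cite source URLs. Be brief.",
               question)

# One swap to the orchestrator to plan, one to the worker tier to execute,
# one back to synthesise. The parallel worker phase amortises the swaps.
plan = ask("qwen3.5-122b-a10b",
           "Break this task into independent questions, one per line.",
           "Research X")
questions = [q.strip() for q in plan.splitlines() if q.strip()]
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    answers = list(pool.map(worker, questions))
report = ask("qwen3.5-122b-a10b",
             "Synthesise a report from these findings. Flag any gaps.",
             "\n\n".join(answers))
```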

The first time I instrumented this in OpenCode I expected the gain to be modest. It was not: the output quality jump was comparable to switching from a 35B model to a frontier API on the same prompt. Same compute, same hardware, dramatically different output. The discipline matters more than the model.

There are five rules that make subagents work in practice, and getting any of them wrong undoes the whole gain.

Orchestrated delegation. One strong model plans. Many fast models execute. The orchestrator does the cognitive work that needs the larger model; workers do the focused retrieval where a smaller model is sufficient and faster.

Context isolation. A worker's context never bleeds into another worker or the orchestrator. The orchestrator receives distilled answers, not raw scrapes. Anything that pollutes the orchestrator's context with worker noise defeats the whole pattern.

Extract, do not dump. Workers never put raw pages in their context. They call a utility-tier model to extract first.

Verification pipeline. A separate verifier agent checks claims against their sources after the workers are done. Visits the cited URL, compares the claim, returns CONFIRMED / DEBUNKED / NOT FOUND. This catches the case where a worker said "the fine is 726 EUR according to example.com" and example.com does not actually say 726 EUR.

Graceful degradation. When something fails (search rate-limited, page down, worker out of step budget), the system says so honestly. Empty result marked NOT VERIFIED is always better than a confident fabrication.

The last one is the discipline that takes the longest to internalise. A model that fills gaps from training data sounds confident. The whole point of building a research pipeline is to be more trustworthy than the model talking to itself, and that means an honest gap beats every silent guess.
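A minimal sketch of the verifier, assuming the self-hosted Firecrawl's scrape endpoint (the port and response shape may differ on your deployment):

```python
import requests
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")

def fetch_clean_text(url: str) -> str:
    # Assumption: self-hosted Firecrawl at :3002 with the v1 scrape API.
    r = requests.post("http://localhost:3002/v1/scrape",
                      json={"url": url, "formats": ["markdown"]}, timeout=30)
    if not r.ok:
        return ""
    return r.json().get("data", {}).get("markdown", "")

def verify(claim: str, source_url: str) -> str:
    page = fetch_clean_text(source_url)
    if not page:
        return "NOT VERIFIED"  # page down: say so honestly, never guess
    resp = client.chat.completions.create(
        model="qwen3.6-35b-a3b",
        messages=[
            {"role": "system",
             "content": "Does the page support the claim? Answer exactly one "
                        "of: CONFIRMED, DEBUNKED, NOT FOUND. Use only the page."},
            {"role": "user", "content": f"Claim: {claim}\n\nPage:\n{page[:6000]}"},
        ],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip()

print(verify("the fine is 726 EUR", "https://example.com"))
```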

Layer three: skills#

Subagents handle the "process discipline" side. Skills handle the "API surface" side. Together they let you build agents that touch real systems without drowning in plumbing.

A skill, in OpenCode terms, is a small documented capability with a clear contract. Three layers per skill:

  1. SKILL.md is the public face. It tells the agent when to invoke the skill, what arguments it takes, and what it returns. The agent reads this and decides.
  2. An execution layer that the skill orchestrator runs. Usually a thin script: a Python wrapper around an API, a shell command, a templated HTTP request. Returns structured output.
  3. A reference layer for stable identifiers the script needs but the agent should not have to learn: API IDs, custom field names, common endpoints. Lives next to SKILL.md, loaded only when the skill is invoked.

The win is that the agent's view of "send a Jira ticket" or "send an email" or "create a calendar event" is one tool call. The agent does not see the API client. It does not see the authentication. It does not see the retry logic. It sees a function that takes a small dict and returns a small dict. That is the entire mental model.
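Here is what an execution layer can look like: a hedged sketch of a Jira skill's script, with the stable identifiers that would live in the reference layer pulled to the top. All names, fields, and environment variables here are illustrative.

```python
import os
import requests

# Reference layer: stable identifiers the agent should never have to learn.
JIRA_BASE = "https://example.atlassian.net"
PROJECT_KEY = "OPS"

def create_ticket(args: dict) -> dict:
    """Small dict in, small dict out: {"summary", "description"} -> {"key", "url"}."""
    resp = requests.post(
        f"{JIRA_BASE}/rest/api/2/issue",
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_TOKEN"]),
        json={"fields": {
            "project": {"key": PROJECT_KEY},
            "summary": args["summary"],
            "description": args["description"],
            "issuetype": {"name": "Task"},
        }},
        timeout=30,
    )
    resp.raise_for_status()  # auth and failure handling stay out of the agent's view
    key = resp.json()["key"]
    return {"key": key, "url": f"{JIRA_BASE}/browse/{key}"}
```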

Mine has skills for the things I touch daily: Jira, email, calendar, Confluence, two different time-tracking systems, a tmux driver, a Joplin notes API, a few infrastructure adapters. Each is a folder under ~/.claude/skills/. Each opens with a short description so the agent knows when to reach for it. Each ends with one or two example invocations so the agent has a working template.

The payoff when I added skills was concrete: prompts dropped from "search for the Jira ticket about deploying X, paste in the ticket ID, here is my username, here is the API token format" to "create a Jira ticket about deploying X". Less plumbing in the agent's context. More attention left for the actual task.

What this changes in practice#

The local stack on my desk is around 200 GB of model weights on disk, served through a single GPU. The orchestrator at 122B holds maybe 90 GB in VRAM and system RAM combined. The worker tier at 35B holds 25 GB. The utility tier at 4B is 3 GB and stays resident. Switching between worker and orchestrator takes about 30 seconds, which the pipeline absorbs because most of an agent run is workers in parallel.

On a research task that used to take a frontier API thirty seconds of single-model thinking and produce a report with two confident fabrications, the local stack takes around four minutes and produces a report with verified citations. The four minutes is mostly worker latency. The two confident fabrications are gone.

This is the trade I want to make. Time I pay once. Trustworthiness I get every time after.

The fact that it runs on my hardware is not the point. It is the consequence. The point is that the discipline is mine, the failure modes are mine, the iteration is mine. When something hallucinates, I can read the worker's context. When something stalls, I can read the orchestrator's plan. When a skill is wrong, I can edit the markdown.

The stack does not need to be self-hosted to apply most of this. The five rules of subagent context discipline work against any API. Three-layer skills work against any agent runtime. Model tiering works on any provider that offers multiple model sizes. Self-hosting is what made the discipline visible to me. Building it from the bottom up is what taught me which parts matter.

Bigger models will keep arriving. The pattern around them will outlive any specific model.