# Three-layer GitOps on K3s, in production
The cluster I am writing about runs six nodes. Three control-plane, three workers. K3s in HA mode with embedded etcd. Calico for CNI. MetalLB for load balancers. Traefik for ingress. Longhorn for storage. Vault for secrets. ArgoCD orchestrates all of it. Nothing in the cluster was deployed by kubectl apply. Everything was deployed by a git push.
This is about the shape of that, why it works, and the parts that took the longest to get right.
## Three layers
The cluster has three categories of thing in it.
- Bootstrap. Operators and CRDs. Stuff that does not depend on anything else in the cluster. cert-manager, MetalLB, Longhorn, the database operators, the ArgoCD CRDs.
- Platform. Shared services that applications rely on. PostgreSQL clusters, MariaDB clusters, Vault HA, Harbor, Prometheus + Loki stack.
- Apps. Tenant workloads. They use the platform but do not provide it. An AWX instance, a Netbox instance, file browsers, anything user-facing.
These map to three directories in the repo:
```
argo/
  root-bootstrap.yaml    # sync-wave 0
  root-platform.yaml     # sync-wave 1
  root-apps.yaml         # sync-wave 2
bootstrap/               # Layer 0 Applications
platform/                # Layer 1 Applications
apps/                    # Layer 2 Applications
```
Each root is an ArgoCD Application pointing at the directory below it. Each directory contains child Application resources. This is the App-of-Apps pattern: one Application that creates more Applications.
Sync waves enforce ordering. ArgoCD will not start Layer 1 until Layer 0 is healthy. It will not start Layer 2 until Layer 1 is healthy. A fresh cluster comes up in the right order with one kubectl apply -f root-bootstrap.yaml and then nothing.
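As a sketch of what one of those roots looks like (the repo URL and paths here are placeholders, not copied from the repo):

```yaml
# argo/root-bootstrap.yaml -- sketch; repoURL is a placeholder
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-bootstrap
  namespace: argocd
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  project: default
  source:
    repoURL: https://git.example.com/infra/cluster.git  # placeholder
    targetRevision: master
    path: bootstrap          # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

root-platform.yaml and root-apps.yaml look the same with sync-wave "1" and "2" and their own paths.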
## What it buys you
The thing GitOps is selling is not "I can deploy with git". You can do that with a shell script. The thing it is actually selling is reconciliation. ArgoCD watches the cluster and watches the repo, and if they diverge, it fixes the cluster. Always. Automatically. With selfHeal: true and prune: true on every Application, manual changes to the cluster are not just discouraged, they are reverted within 180 seconds.
Three side effects of that.
- The repo is always the truth. If you want to know what is running, look at master, not at kubectl get all -A.
- Manual hotfixes are forbidden by physics. Edit a Deployment directly and ArgoCD reverts it. The only way to change the cluster is to change the repo.
- Cluster rebuild is kubectl apply -f root-bootstrap.yaml on a fresh K3s install. Forty minutes later you have the same cluster.
That last property is the one that justifies the work. Disaster recovery is not a runbook. It is one command.
## Two deployment patterns
ArgoCD supports more than one way to render manifests. The repo uses two, picked per app based on whether the app needs secrets.
Multi-source, no secrets. Pull the chart from the upstream Helm repo. Merge it with a values.yaml from the git repo. Optionally add a third source for plain manifests (network policies, resource quotas, anything the chart does not cover). One Application, three sources, no secret handling. Used for most bootstrap and platform apps.
```yaml
sources:
  - repoURL: https://traefik.github.io/charts
    chart: traefik
    targetRevision: 39.0.1
    helm:
      releaseName: traefik
      valueFiles:
        - $values/apps/helm/traefik/values.yaml
  - repoURL: <this repo>
    targetRevision: master
    ref: values
  - repoURL: <this repo>
    targetRevision: master
    path: apps/manifests/traefik
```
Umbrella chart + AVP, with secrets. A small local Chart.yaml declares the upstream chart as a dependency. A custom ArgoCD config management plugin called avp-helm renders this through a pipeline:
```
helm dependency update
  -> helm template
  -> sed (URL-decode AVP placeholders)
  -> argocd-vault-plugin generate
```
The sed step is necessary because Helm URL-encodes the <path:...> placeholders that argocd-vault-plugin recognises. Helm sees <path:kv/data/foo#bar> and serialises it as %3Cpath%3Akv/data/foo%23bar%3E. AVP cannot find its placeholders if they are URL-encoded. One regex decodes them. The pipeline runs at every sync, the rendered manifests get the real secret values, and the placeholders never touch the cluster.
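The decode step can be sketched like this. The exact set of characters Helm encodes, and therefore the repo's real regex, is an assumption here:

```shell
# URL-decode AVP placeholders after helm template (sketch, not the repo's exact sed)
echo 'password: %3Cpath%3Akv/data/foo%23bar%3E' \
  | sed -e 's/%3C/</g' -e 's/%3A/:/g' -e 's/%23/#/g' -e 's/%3E/>/g'
# -> password: <path:kv/data/foo#bar>
```

In the real pipeline this sits between helm template and argocd-vault-plugin generate, operating on the whole rendered stream.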
Used for: anything with secrets. Database credentials, OIDC client secrets, SMTP passwords, registry credentials. The values.yaml looks like:
```yaml
config:
  database:
    password: <path:kv/data/argocd/platform/myapp#db_password>
```
ArgoCD renders that to the real password at sync time. Vault is the source of truth for the secret. The repo never sees it.
## Vault, with the trick
Vault HA is hard to bootstrap, because Vault HA needs an unseal mechanism that exists before Vault HA does. You cannot unseal Vault with secrets stored in Vault.
The escape hatch is a second, smaller Vault instance running in Transit mode. Transit is an encryption API: it does not store the secrets you want to unseal, it encrypts and decrypts an unseal token on demand. The HA Vault uses the Transit Vault to auto-unseal itself.
Both live in the same namespace. Sync-wave 0 brings up the Transit Vault. Sync-wave 1 brings up the HA Vault, which auto-unseals against Transit and is immediately ready.
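On the HA Vault's side this is one stanza in its server config. A sketch, assuming the Transit Vault's service name, key name, and mount path (none of these are from the repo):

```hcl
# HA Vault server config fragment -- auto-unseal against the Transit Vault.
# Address, key_name, and mount_path are assumptions for illustration.
seal "transit" {
  address    = "http://vault-transit.vault-core.svc:8200"
  mount_path = "transit/"
  key_name   = "autounseal"
}
```

The token the HA Vault uses to call the Transit Vault needs only permission on that one transit key, nothing else.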
The Transit Vault has no application secrets in it. It has only the unseal key for the HA Vault. It is air-gapped from the rest of the platform by namespace network policies and Vault auth policy. Losing the Transit Vault loses the ability to start a fresh HA Vault from cold. Losing the HA Vault loses application secrets. Different blast radii.
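The network-policy half of that air gap can be sketched as a policy that admits only the HA Vault's pods to the Transit Vault. Namespace and pod labels here are assumptions:

```yaml
# Sketch: only the HA Vault may reach the Transit Vault on 8200.
# Namespace and label values are placeholders, not from the repo.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: transit-vault-ingress
  namespace: vault-core              # assumed namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: vault-transit   # assumed label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/instance: vault-ha   # assumed label
      ports:
        - protocol: TCP
          port: 8200
```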
## Namespaces tell you what something is
The convention is enforced because it makes everything else easier:
| Suffix | Purpose |
|---|---|
| -system | Cluster infrastructure operators (longhorn-system, metallb-system, traefik-system) |
| -core | Shared platform dependencies (cnpg-core, vault-core, monitoring-core) |
| (none) | Application namespaces (awx, netbox, filebrowser) |
You can tell from the namespace what tier something is in. You can tell from the namespace whether something is platform-shared or app-private. RBAC policies attach to suffix patterns and apply automatically to new namespaces in the same tier. New tenants drop into the no-suffix tier without having to fight RBAC for a week.
One app per namespace, always. Mixing two unrelated workloads in one namespace is how you accidentally let one app exfiltrate the other's secrets via a misapplied ServiceAccount.
## High availability is not a checkbox
Aggressive failover settings on K3s, because the default is "twelve minutes from node death to pod eviction" and that is not high availability:
| Parameter | Default | Tuned to | Effect |
|---|---|---|---|
| node-monitor-period | 5s | 2s | Node controller health check interval |
| node-monitor-grace-period | 40s | 16s | Time before node marked NotReady |
| default-not-ready-toleration-seconds | 300s | 30s | Pod eviction delay after NotReady |
| default-unreachable-toleration-seconds | 300s | 30s | Pod eviction delay after Unreachable |
Worst-case detection is around 46 seconds. Add pod startup time and you are back in service within about 90 seconds of a node hard-failing. That is the right shape for a small cluster where every node matters. Larger clusters can afford to be more conservative. This size cannot.
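On K3s these land in the server config as pass-through arguments. A sketch, using K3s's standard config file location:

```yaml
# /etc/rancher/k3s/config.yaml on server nodes -- sketch of the flag pass-through
kube-controller-manager-arg:
  - node-monitor-period=2s
  - node-monitor-grace-period=16s
kube-apiserver-arg:
  - default-not-ready-toleration-seconds=30
  - default-unreachable-toleration-seconds=30
```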
The other thing the defaults get wrong is etcd. Default heartbeat is 100 ms, default election timeout is 1000 ms. Both are fine on a quiet cluster. On a cluster with Longhorn's CRD churn (it writes a lot of state to etcd for every volume), the default election timeout occasionally fires under load and triggers a leader election under no real fault. Raised to 500 ms heartbeat and 5000 ms election timeout. The numbers are not magic. They are the lowest values at which spurious elections stopped happening.
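The etcd settings go through the same mechanism, assuming K3s's etcd-arg pass-through for the embedded etcd (values are in milliseconds):

```yaml
# /etc/rancher/k3s/config.yaml, server nodes -- embedded etcd tuning sketch
etcd-arg:
  - heartbeat-interval=500
  - election-timeout=5000
```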
Topology spread constraints on every multi-replica workload. maxSkew: 1, whenUnsatisfiable: DoNotSchedule, on kubernetes.io/hostname. Database replicas land on different nodes. Ingress replicas land on different nodes. Etcd lands on different nodes by definition. A single node failure removes at most one replica of anything.
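The constraint itself is a small pod-spec fragment. A sketch, with a placeholder app label:

```yaml
# Pod spec fragment -- one replica per node, hard requirement.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp   # placeholder label
```

DoNotSchedule rather than ScheduleAnyway because a co-located replica is not a degraded state, it is a silent loss of the redundancy the replica exists to provide.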
## What it does not solve
GitOps does not solve "what should be in the cluster". Decisions about which database to run for which app, what backups look like, which observability stack to deploy, all still happen outside of git. The repo is downstream of those decisions.
GitOps does not solve "how do I debug". kubectl logs, kubectl describe, kubectl get events, the standard toolset. ArgoCD watches what is deployed. It does not watch what is happening.
GitOps does not solve fast iteration on dev work. With the GitLab webhook a deploy is somewhere between five and thirty seconds end to end; without it you wait out the 180-second poll. Fine for production-shaped changes. Painful if you want to iterate on a Helm chart by tweaking values. Use a dev cluster with argocd app sync --force shortcuts, or use plain helm upgrade on a non-ArgoCD instance.
GitOps does not solve "the cluster is on fire". It is, in fact, exactly the wrong tool for that. When something is broken and you are not sure why, ArgoCD's self-healing actively works against you: every fix you try gets reverted. Disable auto-sync on the affected app, fix things by hand, commit the fix, re-enable.
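The incident-time dance, sketched with the argocd CLI (the app name is a placeholder):

```
argocd app set myapp --sync-policy none      # stop ArgoCD reverting your fixes
# ...debug and patch by hand...
# commit the real fix to the repo, then:
argocd app set myapp --sync-policy automated --auto-prune --self-heal
```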
## What it does
The combination of "every change is a git push", "ArgoCD always converges", "secrets live in Vault and never touch git", "three layers in dependency order", and "high availability is not a checkbox" gives you a cluster that, six months in, you can still operate.
The repo is around three hundred files. It deploys around forty distinct services. The cluster has been rebuilt from scratch twice for unrelated reasons. Each rebuild took under an hour, end to end, and the rebuild was always the same command. That is the property worth paying for.