
How I Manage 30+ Docker Services Without Losing My Mind

36 apps, 44 containers, one VPS. Here's the system.


Last updated: 2026-03-31 · Tested against Docker Engine v29.3.1 · Compose v2.35.1 · All metrics from the author's own production VPS.

The first time I watched a deploy break three unrelated services, I thought I'd made a mistake. The second time, I knew the problem wasn't the deploy. The problem was that I had no system. I was carrying the state of 20+ containers in my head, and heads aren't reliable at 11pm when a health alert fires.

So I built one.

Today I run 36 registered apps across 44 containers on a single VPS. Not all of them are active (a few are hibernated or parked), but all of them are tracked, monitored, and recoverable. Here's the exact setup.

What does the registry actually do?

Everything starts with state/apps.registry.json. This single file contains the canonical truth about each service: its name, port, Compose directory, healthcheck URL, database, sync mode, current status, and last deployed commit.

Nothing gets deployed without an entry. Nothing gets monitored without one. The registry is the contract between the deployment process, the health monitor, and the nightly maintenance script.
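For a sense of the shape, here's a hypothetical registry entry. The field names paraphrase the fields listed above; they are illustrative, not a verbatim copy of the actual schema:

```json
{
  "zerolink": {
    "port": 3003,
    "compose_dir": "/opt/apps/zerolink",
    "healthcheck": "/api/health",
    "database": "postgres",
    "sync_mode": "git-deploy",
    "status": "active",
    "last_deployed_commit": "a1b2c3d"
  }
}
```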

Port allocation follows a deliberate block structure. Production services live on 3001-3008. Staging occupies 3010-3019. MCP servers start at 3020. Infrastructure like ZeroMemory (3050-3051), Zero-Signals (3060-3061), and Ollama (11434) each have their own band. Readable at a glance: 3003 is always production, 3010 is always staging. No guessing.
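The band convention reduces to a tiny lookup. This is a sketch of the scheme as described; the upper bounds for bands the article doesn't spell out (like where the MCP range ends) are my assumptions:

```python
# Port bands per the convention above. The MCP upper bound is assumed.
PORT_BANDS = [
    (range(3001, 3009), "production"),
    (range(3010, 3020), "staging"),
    (range(3020, 3050), "mcp"),
    (range(3050, 3052), "zeromemory"),
    (range(3060, 3062), "zero-signals"),
    (range(11434, 11435), "ollama"),
]

def port_band(port: int) -> str:
    """Return the band a port belongs to, or 'unallocated'."""
    for band, name in PORT_BANDS:
        if port in band:
            return name
    return "unallocated"
```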

How do you choose a sync mode?

The registry tracks a sync_mode field for each service. I use three: git-deploy, live-edit, and image-only.

git-deploy is for apps where the source of truth is a remote GitHub repo. The VPS pulls on deploy. There should never be uncommitted local changes on a git-deploy service. If the nightly maintenance script finds any, it flags them as drift: something changed on the server without going through git first. Twelve of my apps use this mode.

live-edit is for services where I iterate directly on the VPS and push to git afterward. The nightly script auto-commits uncommitted changes with a timestamp message and pushes to origin. Nine use this mode, mostly infrastructure tools and internal dashboards where the feedback loop needs to be fast.

image-only is for services with no source repo on the VPS: n8n, Ollama, MinIO, Plane. They pull from container registries. Lifecycle management is docker pull and a restart. Simple.

The key is matching the mode to how the service actually gets updated, not how you wish it would get updated.
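The decision the nightly script makes per service boils down to one small function. This is a sketch of the logic as described above, not the author's actual script:

```python
def nightly_git_action(sync_mode: str, has_uncommitted: bool) -> str:
    """Decide what the nightly maintenance does with a service's working tree.

    Mirrors the three sync modes described above; the function shape
    and return values are illustrative.
    """
    if sync_mode == "image-only" or not has_uncommitted:
        return "nothing"          # no source repo on the VPS, or tree is clean
    if sync_mode == "live-edit":
        return "auto-commit"      # commit with a timestamp message, push to origin
    if sync_mode == "git-deploy":
        return "flag-drift"       # changed on the server without going through git
    raise ValueError(f"unknown sync_mode: {sync_mode}")
```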

How do you know when something breaks?

The health monitor runs as a containerized daemon using network_mode: host so it can reach ports bound to 127.0.0.1. It checks each registered service's healthcheck URL every 60 seconds. Container status gets the same interval. System resources (CPU load, memory, disk) check every 5 minutes. Backup freshness and manifest drift both check hourly.

Thresholds are specific, calibrated against what actually matters. Disk warns at 80% and escalates at 90%. Memory warns at 85% and escalates at 95%. Three service restarts raise a warning; ten trigger an alert. Endpoint failures take 2 consecutive misses before a warning fires and 5 before escalation.
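Those thresholds collapse into a single classifier. The numbers are the ones quoted above; the function itself is a sketch, not the monitor's code:

```python
def severity(metric: str, value: float) -> str:
    """Map a metric reading to ok/warn/critical using the thresholds above."""
    thresholds = {
        "disk_pct": (80, 90),
        "memory_pct": (85, 95),
        "restarts": (3, 10),
        "endpoint_misses": (2, 5),
    }
    warn, crit = thresholds[metric]
    if value >= crit:
        return "critical"
    if value >= warn:
        return "warn"
    return "ok"
```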

Alerts route to Discord and ntfy.sh, with a 15-minute cooldown to prevent storms. Routing is graduated: INFO (recovery, normal state) goes to Discord only, WARN adds ntfy.sh, CRITICAL adds a role mention so it actually wakes someone up.

Not every app has a /health endpoint. The registry tracks the actual path per service. ZeroLink uses /api/health, n8n uses /healthz, MinIO uses /minio/health/live. AutoGen Studio uses / because it has no dedicated check path. OpenClaw, currently hibernated, uses null and gets monitored by container status only. Map your actual endpoints, not the ones you assume exist.

How do you stop two operators breaking each other's work?

The blackboard protocol handles this. It lives in state/blackboard.md and has three sections: active tickets, locks, and the append-only update log.

Before any operator or agent takes action on the server, they acquire an atomic file lock using O_EXCL (fail if the file exists, POSIX-standard). The lock records the owner and a UTC timestamp and expires after 60 seconds. If you find a stale one, you delete it and proceed, but you document why. No agent may act without an uncontested entry in the locks section.
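The lock itself is only a few lines. This sketch follows the O_EXCL-plus-TTL rules described above; the JSON payload and file layout are assumptions:

```python
import json, os, time

LOCK_TTL = 60  # seconds, per the blackboard protocol

def acquire_lock(path: str, owner: str) -> bool:
    """Atomically create a lock file, failing if one already exists (O_EXCL).

    A stale lock (older than LOCK_TTL) is removed and re-acquired.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(path) as f:
            held = json.load(f)
        if time.time() - held["ts"] < LOCK_TTL:
            return False              # live lock held by someone else: back off
        os.unlink(path)               # stale: delete, document why, retry
        return acquire_lock(path, owner)
    with os.fdopen(fd, "w") as f:
        json.dump({"owner": owner, "ts": time.time()}, f)
    return True

def release_lock(path: str) -> None:
    os.unlink(path)
```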

Every significant operation gets logged to Recent Updates in what I call Gold Standard format: who, when, which ticket, the classification (DESTRUCTIVE, MIGRATION-GRADE, BACKUP-GRADE, FIX, or INSPECTION), and bullet-point evidence showing commands run, paths touched, and outcomes verified. History is append-only. Corrections go in as new entries prefixed with Correction:, not edits to the original.
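A hypothetical entry in that format (the operator name, ticket ID, and commands are invented for illustration):

```
2026-03-14 02:10 UTC — operator: claw-agent — ticket: OPS-112 — FIX
- Restarted zerolink after OOM kill: docker compose up -d (from the app's compose dir)
- Verified /api/health returned 200 at 02:12 UTC
- No config or volume changes
```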

When the Recent Updates section exceeds 50 entries, the oldest 40 get archived to a dated file and a breadcrumb stays in the main blackboard pointing to the archive. This keeps the active file readable without losing history.
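The rotation rule is simple enough to state as code. A sketch assuming entries are stored oldest-first:

```python
def rotate_updates(entries: list[str], limit: int = 50, batch: int = 40):
    """Once the list exceeds `limit`, peel off the oldest `batch` for archiving.

    Returns (kept, archived). Illustrative, not the author's script.
    """
    if len(entries) <= limit:
        return entries, []
    return entries[batch:], entries[:batch]
```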

The blackboard might sound like bureaucratic overhead. Running multiple agents and operators on the same server without it is worse. I learned that the hard way when an incident in early 2026 forced me to reconstruct what had changed across three sessions with no shared record. It took twice as long as it should have.

What runs automatically every night?

Nightly maintenance fires at 03:30 UTC. Ten steps, fully automated.

Step 1 prunes dangling images, exited containers (except anything named debug or keep), and dangling volumes. Named volumes stay. Build cache older than 7 days gets cleared.

Step 2 runs checks across all registered services, the same curl approach the monitor uses. HTTP 2xx-4xx counts as responding. 5xx is a server error. A connection timeout or refusal means the service is down. This gives me a nightly snapshot independent of the real-time daemon.

Step 3 checks backup freshness: anything older than 26 hours triggers a warning, older than 48 hours escalates.
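Step 3's thresholds, as a sketch:

```python
def backup_state(age_hours: float) -> str:
    """Backup freshness: warn past 26 hours, escalate past 48."""
    if age_hours > 48:
        return "critical"
    if age_hours > 26:
        return "warn"
    return "ok"
```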

Steps 4 and 5 log system resources and regenerate state/server.manifest.json from live Docker state. The manifest snapshot is what makes drift detection possible.

Step 6 cleans the workspace: removes content/ directories older than 7 days, with a safety check for code files before deleting anything.

Step 7 checks source control. Live-edit apps with uncommitted changes get auto-committed with a timestamp. Git-deploy apps with local changes get flagged as drift and logged.

Step 8 sends a Telegram summary. Steps 9 and 10 sync git and rotate old logs. The whole thing takes under two minutes.

How do you catch configuration drift?

The manifest snapshot from step 5 gets compared against state/server.manifest.last-approved.json, the baseline from the first snapshot run. The monitor also runs this comparison every hour.

A discrepancy means something changed in the live server state that isn't reflected in what was last approved: a new container appeared, a port changed, a healthcheck status flipped. Most changes are intentional (a new deploy). The protocol is to update the baseline afterward so the comparison stays meaningful.

The manifest itself is detailed: hostname and IP, OS version, kernel, hardware (12 cores, 23GB RAM, 697GB disk), engine version (Docker 29.3.1, Compose 2.35.1), all 44 running containers with their images, port mappings, status, and app directory. A full snapshot of server state at a point in time.

Running a diff between current and last-approved is often the fastest way to figure out what changed when something unexpected happens.
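The diff can be as simple as comparing two dicts keyed by container name. The schema here is illustrative, not the actual manifest format:

```python
def manifest_drift(current: dict, approved: dict) -> dict:
    """Compare a live manifest snapshot against the last-approved baseline.

    Returns container names that were added, removed, or changed.
    """
    added = sorted(set(current) - set(approved))
    removed = sorted(set(approved) - set(current))
    changed = sorted(
        name for name in set(current) & set(approved)
        if current[name] != approved[name]
    )
    return {"added": added, "removed": removed, "changed": changed}
```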

Flowchart

```mermaid
graph TD
    R[apps.registry.json] --> HM[Health Monitor]
    R --> NM[Nightly Maintenance]
    HM -->|every 60s| HTTP[HTTP health checks]
    HM -->|every 300s| SYS[System resources]
    HM -->|every 3600s| BF[Backup freshness]
    HM -->|every 3600s| MD[Manifest drift]
    NM -->|03:30 UTC| SNAP[Manifest snapshot]
    NM -->|03:30 UTC| DRIFT[Git drift check]
    NM -->|03:30 UTC| CLEAN[Docker cleanup]
    SNAP --> BASELINE[last-approved.json]
    MD --> BASELINE
    HM --> ALERTS[Discord + ntfy.sh]
    NM --> TG[Telegram summary]
```

Frequently Asked Questions

How do you handle rollbacks?

For git-deploy apps, rollback is a git checkout on the VPS followed by a docker compose up -d --build. The registry tracks the last deployed commit per app so I always know what's running. For image-only apps like n8n, it's pulling a specific tagged version and restarting.

What happens if the health monitor itself goes down?

The nightly check in step 2 is independent of the monitor daemon. It runs from the maintenance script directly. If the monitor container is down, I'd lose real-time alerting but the nightly sweep would still catch issues. The monitor container is also in the registry and gets checked like everything else.

Do you need separate VPSes for staging and production?

Not at this scale. Port block separation (production 3001-3008, staging 3010-3019) handles isolation. For strict compliance requirements, a separate VPS makes sense. For a self-hosted stack, port-based separation works fine.

How do you manage secrets across 36 apps?

Environment files per service, stored in the app directory on the VPS and never committed to the repo. The auto-sync cron explicitly excludes .env files. ZeroVault handles shared credentials across services that need them.

What's the biggest thing this system doesn't solve?

Deployment rollouts. Everything here assumes you have working containers. If a new build is broken, you're still reading docker compose logs like everyone else. The registry and monitoring catch the symptoms, but you still have to fix the code.


One rule underlies all of it: the server should never know something that isn't also in git or the registry. State changes get recorded. Operations get logged. Each night, automation validates what's running against what should be.

It's not zero-ops. But it's close enough that I can close my laptop and sleep without worrying about it. Most nights, anyway.

If the Claude Code side of this setup interests you, how I cut token usage per session covers the instruction architecture that makes operating this stack cheaper. The AI workflows zone has more on running agents against a self-hosted stack. And if you want to compare notes on fleet management, I'm on Twitter/X.
