# Content Pipeline Recovery Without Republishing the Post
A staged publish pipeline can fail after the post is already live. Here is how we recovered the run, avoided a duplicate publish, and still finished with a clean terminal state.
I can live with a pipeline failing before publish. That is annoying, but it is honest. The thing that makes you mutter at the terminal is when publish already happened, the URL is live, and your own controller still swears the run is blocked.
That was the hole in our content pipeline on April 3, 2026. The remote post existed. The route returned 200. And `approve` still treated the run as blocked by a failed publish.
This is the kind of bug that tells you whether your workflow is an actual system or just a pile of scripts holding hands.
## What broke when publish succeeded but the run still failed?
The short version: the remote publish worked, but the local verifier misread the MCP response shape and treated the published post as missing.
That sounds small. It was not. Once the verifier wrote a `failed` status into `publish-result.json`, every downstream consumer treated that one stale artifact as the truth.
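The root cause was a verifier that assumed one exact response envelope. A minimal sketch of the defensive alternative, assuming a hypothetical `extract_post` helper and key layout (the real MCP response shape is not shown in the post):

```python
# Hypothetical sketch: tolerant parsing of a publish response whose shape
# may vary (e.g. {"post": {...}} vs {"result": {"post": {...}}}).
# The helper name and key paths are assumptions, not the pipeline's real API.
from typing import Any, Optional

def extract_post(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the post payload regardless of which envelope wraps it."""
    for path in (("post",), ("result", "post"), ("data", "post")):
        node: Any = response
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:
            if isinstance(node, dict) and node.get("id"):
                return node
    return None  # genuinely missing, not just differently wrapped
```

A verifier built this way distinguishes "the post is not there" from "the post is there but wrapped differently," which is exactly the distinction that failed here.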
| State | What it meant before the fix | Why it was wrong |
|---|---|---|
| Remote post exists | The content is already live at the canonical URL | This is the ground truth that mattered most |
| The controller thinks publish failed | It reflected one local verification path, not the final external reality | Recovery logic treated the stale failure as a hard stop |
| The run never reaches terminal | The formal end of the pipeline was missing | |
That was the core smell. We had built a staged pipeline, but one artifact had too much authority after a partial failure.
There was a second problem sitting next to it. The pipeline contract, the CLI surface, and the controller routing had drifted apart. `config/pipeline-stages.json` declared stages like `text_reconciliation` that the controller and CLI did not route consistently.
## Why is republishing the wrong content pipeline recovery move?
Because retrying creation is not the same thing as recovering state.
If a remote create call already succeeded, the safe question is not "can I create it again?" The safe question is "what is already true, and how do I bring local state back into line with that?"
That is the whole difference between idempotent recovery and panic scripting. A clean recovery flow should:
- Check whether the canonical post already exists.
- Verify the live route and core content signals.
- Mark the local publish stage complete if the external system is already in the desired state.
- Continue into the remaining formal stages instead of pretending publish was the end.
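The checklist above can be sketched as a single recovery function. `remote_post_exists`, `verify_live_route`, and `retry_publish` are hypothetical stand-ins for the pipeline's real MCP client and controller calls:

```python
# Minimal sketch of idempotent publish recovery, under assumed helper names.
def recover_publish(state: dict, remote_post_exists, verify_live_route,
                    retry_publish) -> dict:
    if remote_post_exists(state["slug"]):
        if not verify_live_route(state["slug"]):
            raise RuntimeError("post exists but live route failed verification")
        # External system is already in the desired state: heal, don't re-create.
        state["publish"] = "complete"
    else:
        # No remote post: the failure was real, so clear it and retry normally.
        state.pop("publish_error", None)
        state["publish"] = retry_publish(state["slug"])
    return state
```

The key property is that the create call only ever runs on the branch where the world confirms nothing exists yet.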
Anything else is gambling with duplicates, collisions, or a second failure path layered over the first one. Ask me how I know. Actually, don't.
This is also why we stopped hardcoding stage routing in the controller. If stage execution rules live in the contract, the controller can recover from whatever stage is actually configured. If stage routing lives in three separate dictionaries and a bit of muscle memory, recovery becomes folklore.
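What "stage routing lives in the contract" looks like in miniature, assuming a simple JSON schema (the real `config/pipeline-stages.json` schema is not shown in the post, so the shape here is illustrative):

```python
# Hedged sketch: a single JSON contract owns stage order, and every tool
# derives routing from it instead of keeping its own dictionary.
import json

CONTRACT = json.loads("""
{"stages": [
  {"name": "write"}, {"name": "visuals"}, {"name": "text_reconciliation"},
  {"name": "approve"}, {"name": "publish"}, {"name": "live_qa"}
]}
""")

def next_stage(current: str) -> "str | None":
    names = [s["name"] for s in CONTRACT["stages"]]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None
```

When a stage is added or reordered, only the contract changes; the controller, CLI, and recovery tooling all see the same reality.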
## How does the recovery flow work now?
There were three parts to the fix.
First, stage execution became contract-driven. `config/pipeline-stages.json` is now the single source of stage routing, and both `stage_runner.py` and `orchestrate_pipeline.py` read it instead of carrying their own stage maps.
Second, we made the asset path real. The `write` stage now declares `ZCP:ASSET` and `ZCP:ASSET-REF` markers alongside an `asset-plan.json`, `visuals` records what was actually produced, and `text_reconciliation` resolves or strips only the marker-bound spans.
Third, recovery through `approve` became stage-aware. When `approve` finds the post already live, it heals the stale `publish` state, resumes the run into `live_qa`, and only `live_qa` marks the terminal `published` state.
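The tail of the pipeline reduces to a tiny transition table. A toy sketch of that constraint, under the stage names the post uses:

```python
# Toy transition table for the tail of the pipeline: publish feeds live_qa,
# and only live_qa may mark the terminal "published" state. A healed run
# re-enters at live_qa rather than re-running publish.
TRANSITIONS = {
    "publish": {"live_qa"},
    "live_qa": {"published"},  # terminal
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())
```

Making illegal jumps unrepresentable is what stops a recovery path from quietly skipping the formal end of the run.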
The diagram below shows the recovery path when the post is already live but local publish state is stale.
```mermaid
flowchart TD
    A["approve sees stale failed publish state"] --> B{"canonical post already live?"}
    B -->|yes| C["verify slug, zone, and live route"]
    C --> D["mark publish complete without second create"]
    D --> E["controller resumes into live_qa"]
    E --> F["run reaches terminal published"]
    B -->|no| G["clear stale failure state"]
    G --> H["retry publish through normal controller path"]
```

The recovery logic now does the boring, correct thing. If the post is already live, it heals local state and moves on. If it is not live, it clears the stale failed publish state and retries through the normal path. That is what good workflow recovery looks like: it asks what state the world is actually in, then it narrows the gap.
That principle is not unique to this pipeline. Stripe's idempotency docs are blunt about returning the previously stored result for an already-processed request instead of producing a second one (Stripe idempotent requests). AWS Step Functions frames orchestration as an explicit state machine with defined transitions and terminal states, which is exactly the mental model we wanted back in the controller (AWS Step Functions state machines).
## What should you harden in your own pipeline before this happens to you?
Start with the contract. If your controller, CLI, and recovery tooling do not all read the same stage metadata, you are one incident away from discovering which copy is lying.
Then harden the publish boundary:
| Guardrail | Why it matters |
|---|---|
| Contract-driven stage routing | Recovery works off configured reality, not old maps in code |
| Real publish reconciliation | Already-live posts can heal local state without duplicate create calls |
| Marker leakage gating | Planning markers cannot drift into final review, publish, or live QA |
| Formal terminal state | The run ends in a state you can verify from the outside |
| Stage attempt normalization | Sync-stage recovery still records a real attempt and a real artifact trail |
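The marker-leakage guardrail is cheap to enforce. A sketch, assuming the `ZCP:ASSET` marker family from the post (the exact marker grammar is an assumption):

```python
# Gate that blocks any text still carrying planning markers past the
# stages where they must be gone.
import re

MARKER = re.compile(r"ZCP:ASSET(?:-REF)?\b")

def assert_no_marker_leakage(text: str, stage: str) -> None:
    if stage in {"final_review", "publish", "live_qa"} and MARKER.search(text):
        raise ValueError(f"planning marker leaked into {stage}")
```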
I would also keep the text reconciliation rules painfully narrow. This is one of those places where "helpful AI" can get cute very quickly. We built it so the reconciler only edits marker-bound spans. That sounds restrictive because it is restrictive. The goal is not fluent rewriting. The goal is preserving trust.
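"Only edits marker-bound spans" has a mechanical meaning. A hedged sketch with an invented `<<ZCP:ASSET>>…<</ZCP:ASSET>>` delimiter syntax (the post does not show the real one): the substitution can only ever touch text between the markers.

```python
# Span-bound reconciliation: replace exactly one marker-bound span,
# leaving all surrounding prose byte-for-byte untouched.
import re

SPAN = re.compile(r"<<ZCP:ASSET>>(.*?)<</ZCP:ASSET>>", re.DOTALL)

def reconcile(draft: str, replacement: str) -> str:
    return SPAN.sub(replacement, draft, count=1)
```

Anything outside a span is structurally unreachable by the reconciler, which is the trust-preserving property the post is after.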
If you run any staged content system, the lesson is simple: do not treat a partial failure artifact as more authoritative than the external system you were trying to reach in the first place. Let the world tell you what already happened, then reconcile your controller back to that truth.
That is the same pattern we keep landing on in agent systems more broadly. The contract should own orchestration. Recovery should be idempotent. And terminal state should mean something you can verify from the outside.
If you are building the wider operational layer around this kind of workflow, the Agents guide is the broad map, while the `ai-workflows` and `vps-infra` writeups cover the narrower operational pieces.
Start with the recovery path before you need it. That is the cheaper day to write it.
## FAQ
- What should happen if the post is already live but the pipeline still says publish failed?
Treat the live post as the external source of truth. Verify the canonical slug, zone, and route, then heal the local publish state instead of firing a second create call.
- Why is republishing the wrong recovery move?
Because it solves the wrong problem. Once the remote post exists, you do not need another create request. You need state reconciliation and a clean path into the remaining formal stages.
- How do marker-aware assets fit into this?
They give the pipeline a precise editorial contract. The write stage declares where proof or diagram-dependent prose lives, visuals records what was actually produced, and text reconciliation can strip or block only the exact marker-bound span without touching the rest of the draft.
Tested with Claude Code v2.1.91