# Content Pipeline Recovery Without Republishing the Post
A staged publish pipeline can fail after the post is already live. Here is how we recovered the run, avoided a duplicate publish, and still finished with a clean terminal state.
I can live with a pipeline failing before publish. That is annoying, but it is honest. The thing that makes you mutter at the terminal is when publish already happened, the URL is live, and your own controller still swears the run is blocked.
That was the hole in our content pipeline on April 3, 2026. The remote post existed. The route returned 200. And `approve` still treated the run as blocked by a failed publish.
This is the kind of bug that tells you whether your workflow is an actual system or just a pile of scripts holding hands.
## What broke when publish succeeded but the run still failed?
The short version: the remote publish worked, but the local verifier misread the MCP response shape and treated the published post as missing.
That sounds small. It was not. Once the verifier wrote a `failed` status into `publish-result.json`, every downstream consumer treated that one stale artifact as the truth.
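The root cause was a verifier that assumed one exact response envelope. A minimal sketch of the defensive alternative, assuming a hypothetical `extract_post` helper and key layout (the real MCP response shape is not shown in the post):

```python
# Hypothetical sketch: tolerant parsing of a publish response whose shape
# may vary (e.g. {"post": {...}} vs {"result": {"post": {...}}}).
# The helper name and key paths are assumptions, not the pipeline's real API.
from typing import Any, Optional

def extract_post(response: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the post payload regardless of which envelope wraps it."""
    for path in (("post",), ("result", "post"), ("data", "post")):
        node: Any = response
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:
            if isinstance(node, dict) and node.get("id"):
                return node
    return None  # genuinely missing, not just differently wrapped
```

A verifier built this way distinguishes "the post is not there" from "the post is there but wrapped differently," which is exactly the distinction that failed here.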
| State | What it meant before the fix | Why it was wrong |
|---|---|---|
| Remote post exists | The content is already live at the canonical URL | This is the ground truth that mattered most |
| The controller thinks publish failed | It reflected one local verification path, not the final external reality | Recovery logic treated the stale failure as a hard stop |
| The run never reaches terminal | The formal end of the pipeline was missing | |
That was the core smell. We had built a staged pipeline, but one artifact had too much authority after a partial failure.
There was a second problem sitting next to it. The pipeline contract, the CLI surface, and the controller routing had drifted apart. `config/pipeline-stages.json` declared stages like `text_reconciliation` that the controller and CLI did not route consistently.
## Why is republishing the wrong content pipeline recovery move?
Because retrying creation is not the same thing as recovering state.
If a remote create call already succeeded, the safe question is not "can I create it again?" The safe question is "what is already true, and how do I bring local state back into line with that?"
That is the whole difference between idempotent recovery and panic scripting. A clean recovery flow should:
- Check whether the canonical post already exists.
- Verify the live route and core content signals.
- Mark the local publish stage complete if the external system is already in the desired state.
- Continue into the remaining formal stages instead of pretending publish was the end.
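The checklist above can be sketched as a single recovery function. `remote_post_exists`, `verify_live_route`, and `retry_publish` are hypothetical stand-ins for the pipeline's real MCP client and controller calls:

```python
# Minimal sketch of idempotent publish recovery, under assumed helper names.
def recover_publish(state: dict, remote_post_exists, verify_live_route,
                    retry_publish) -> dict:
    if remote_post_exists(state["slug"]):
        if not verify_live_route(state["slug"]):
            raise RuntimeError("post exists but live route failed verification")
        # External system is already in the desired state: heal, don't re-create.
        state["publish"] = "complete"
    else:
        # No remote post: the failure was real, so clear it and retry normally.
        state.pop("publish_error", None)
        state["publish"] = retry_publish(state["slug"])
    return state
```

The key property is that the create call only ever runs on the branch where the world confirms nothing exists yet.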
Anything else is gambling with duplicates, collisions, or a second failure path layered over the first one. Ask me how I know. Actually, don't.
This is also why we stopped hardcoding stage routing in the controller. If stage execution rules live in the contract, the controller can recover from whatever stage is actually configured. If stage routing lives in three separate dictionaries and a bit of muscle memory, recovery becomes folklore.
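What "stage routing lives in the contract" looks like in miniature, assuming a simple JSON schema (the real `config/pipeline-stages.json` schema is not shown in the post, so the shape here is illustrative):

```python
# Hedged sketch: a single JSON contract owns stage order, and every tool
# derives routing from it instead of keeping its own dictionary.
import json

CONTRACT = json.loads("""
{"stages": [
  {"name": "write"}, {"name": "visuals"}, {"name": "text_reconciliation"},
  {"name": "approve"}, {"name": "publish"}, {"name": "live_qa"}
]}
""")

def next_stage(current: str) -> "str | None":
    names = [s["name"] for s in CONTRACT["stages"]]
    i = names.index(current)
    return names[i + 1] if i + 1 < len(names) else None
```

When a stage is added or reordered, only the contract changes; the controller, CLI, and recovery tooling all see the same reality.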
## How does the recovery flow work now?
There were three parts to the fix.
First, stage execution became contract-driven. `config/pipeline-stages.json` is now the single source of stage routing, and both `stage_runner.py` and `orchestrate_pipeline.py` read it instead of carrying their own stage maps.
Second, we made the asset path real. The `write` stage now declares `ZCP:ASSET` and `ZCP:ASSET-REF` markers alongside an `asset-plan.json`, `visuals` records what was actually produced, and `text_reconciliation` resolves or strips only the marker-bound spans.
Third, recovery through `approve` became stage-aware. When `approve` finds the post already live, it heals the stale `publish` state, resumes the run into `live_qa`, and only `live_qa` marks the terminal `published` state.
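The tail of the pipeline reduces to a tiny transition table. A toy sketch of that constraint, under the stage names the post uses:

```python
# Toy transition table for the tail of the pipeline: publish feeds live_qa,
# and only live_qa may mark the terminal "published" state. A healed run
# re-enters at live_qa rather than re-running publish.
TRANSITIONS = {
    "publish": {"live_qa"},
    "live_qa": {"published"},  # terminal
}

def can_transition(src: str, dst: str) -> bool:
    return dst in TRANSITIONS.get(src, set())
```

Making illegal jumps unrepresentable is what stops a recovery path from quietly skipping the formal end of the run.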
The diagram below shows the recovery path when the post is already live but local publish state is stale.
```mermaid
flowchart TD
    A["approve sees stale failed publish state"] --> B{"canonical post already live?"}
    B -->|yes| C["verify slug, zone, and live route"]
    C --> D["mark publish complete without second create"]
    D --> E["controller resumes into live_qa"]
    E --> F["run reaches terminal published"]
    B -->|no| G["clear stale failure state"]
    G --> H["retry publish through normal controller path"]
```

The recovery logic now does the boring, correct thing. If the post is already live, it heals local state and moves on. If it is not live, it clears the stale failed publish state and retries through the normal path. That is what good workflow recovery looks like: it asks what state the world is actually in, then it narrows the gap.
That principle is not unique to this pipeline. Stripe's idempotency docs are blunt about returning the previously stored result for an already-processed request instead of producing a second one (Stripe idempotent requests). AWS Step Functions frames orchestration as an explicit state machine with defined transitions and terminal states, which is exactly the mental model we wanted back in the controller (AWS Step Functions state machines).
## What should you harden in your own pipeline before this happens to you?
Start with the contract. If your controller, CLI, and recovery tooling do not all read the same stage metadata, you are one incident away from discovering which copy is lying.
Then harden the publish boundary:
| Guardrail | Why it matters |
|---|---|
| Contract-driven stage routing | Recovery works off configured reality, not old maps in code |
| Real publish reconciliation | Already-live posts can heal local state without duplicate create calls |
| Marker leakage gating | Planning markers cannot drift into final review, publish, or live QA |
| Formal terminal state | The run ends in a state you can verify from the outside |
| Stage attempt normalization | Sync-stage recovery still records a real attempt and a real artifact trail |
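The marker-leakage guardrail is cheap to enforce. A sketch, assuming the `ZCP:ASSET` marker family from the post (the exact marker grammar is an assumption):

```python
# Gate that blocks any text still carrying planning markers past the
# stages where they must be gone.
import re

MARKER = re.compile(r"ZCP:ASSET(?:-REF)?\b")

def assert_no_marker_leakage(text: str, stage: str) -> None:
    if stage in {"final_review", "publish", "live_qa"} and MARKER.search(text):
        raise ValueError(f"planning marker leaked into {stage}")
```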
I would also keep the text reconciliation rules painfully narrow. This is one of those places where "helpful AI" can get cute very quickly. We built it so the reconciler only edits marker-bound spans. That sounds restrictive because it is restrictive. The goal is not fluent rewriting. The goal is preserving trust.
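"Only edits marker-bound spans" has a mechanical meaning. A hedged sketch with an invented `<<ZCP:ASSET>>…<</ZCP:ASSET>>` delimiter syntax (the post does not show the real one): the substitution can only ever touch text between the markers.

```python
# Span-bound reconciliation: replace exactly one marker-bound span,
# leaving all surrounding prose byte-for-byte untouched.
import re

SPAN = re.compile(r"<<ZCP:ASSET>>(.*?)<</ZCP:ASSET>>", re.DOTALL)

def reconcile(draft: str, replacement: str) -> str:
    return SPAN.sub(replacement, draft, count=1)
```

Anything outside a span is structurally unreachable by the reconciler, which is the trust-preserving property the post is after.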
If you run any staged content system, the lesson is simple: do not treat a partial failure artifact as more authoritative than the external system you were trying to reach in the first place. Let the world tell you what already happened, then reconcile your controller back to that truth.
That is the same pattern we keep landing on in agent systems more broadly. The contract should own orchestration. Recovery should be idempotent. And terminal state should mean something you can verify from the outside.
If you are building the wider operational layer around this kind of workflow, the Agents guide is the broad map, while the `ai-workflows` and `vps-infra` writeups cover the narrower operational pieces.
Start with the recovery path before you need it. That is the cheaper day to write it.
## FAQ
- What should happen if the post is already live but the pipeline still says publish failed?
Treat the live post as the external source of truth. Verify the canonical slug, zone, and route, then heal the local publish state instead of firing a second create call.
- Why is republishing the wrong recovery move?
Because it solves the wrong problem. Once the remote post exists, you do not need another create request. You need state reconciliation and a clean path into the remaining formal stages.
- How do marker-aware assets fit into this?
They give the pipeline a precise editorial contract. The write stage declares where proof or diagram-dependent prose lives, visuals records what was actually produced, and text reconciliation can strip or block only the exact marker-bound span without touching the rest of the draft.
Tested with Claude Code v2.1.91