Retries, compensating steps, and alerting when workflows fail.
Workflows fail in the real world: APIs time out, databases lock, humans go offline. Good error handling turns chaos into retries, compensations, and clear operator signals instead of silent data drift.
Classify transient versus permanent failures. Transient steps should retry with jitter; permanent failures should stop fast and surface a crisp error code.
Attach correlation IDs across steps so support can trace a single user action through every hop. Log HTTP status bodies at reduced verbosity to avoid leaking secrets.
Undo partial effects when a downstream step fails—especially for payments or external posts. Compensation may mean voiding an invoice, deleting a draft record, or sending a corrective webhook.
Route failures to on-call channels with runbook links. Include workflow name, step name, and last payload hash—not full PII—in the first line of the alert.