Context
Follow-up from #2816 (which added enable_watcher_reliable_retry). Surfaced during review of that PR.
Problem
The WATCHER producer backs up its per-node dbt statuses to an Airflow Variable so a retry can restore them. The Variable is deleted on a successful run, and a retry deletes it after restoring. But if the producer fails gracefully with no retries left (retries=0, or the final retry is exhausted), there is no cleanup path, so the backup Variable is orphaned — it accumulates in the metadata DB (and in an external secrets backend, as a stale secret) over time.
This is pre-existing — it has been the case since the per-node Variable backup was introduced in #2559, and affects both enable_watcher_reliable_retry=True (eager) and False (on-failure callback) modes equally. #2816 does not change this behaviour; this issue tracks fixing it separately.
When it happens
- Producer fails gracefully (e.g. a dbt model error).
- The producer has no further retries (retries=0, or try_number >= max_tries).
- The backup Variable was written (eagerly per-node, or once via the on-failure callback) and is never deleted.
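The conditions above can be modelled with a small simulation. This is only a sketch: the `variables` dict stands in for the Airflow Variable store, and `run_producer_attempt`, the `retries_left` argument, and the key format are hypothetical names for illustration, not the Cosmos implementation.

```python
from typing import Dict

# Hypothetical in-memory stand-in for the Airflow Variable store.
variables: Dict[str, str] = {}


def run_producer_attempt(key: str, succeeds: bool, retries_left: int) -> bool:
    """Model one producer attempt; return True if a backup is left orphaned."""
    # A retry restores the previous backup (if any) and then deletes it.
    variables.pop(key, None)
    if succeeds:
        # Successful run: no backup remains.
        return False
    # Graceful failure: per-node statuses are backed up for a later attempt.
    variables[key] = '{"model.a": "success", "model.b": "failed"}'
    # With no retries left, no later attempt exists to delete the backup.
    return retries_left == 0


key = "watcher_backup__my_dag__run_1"  # hypothetical key format
first = run_producer_attempt(key, succeeds=False, retries_left=1)
second = run_producer_attempt(key, succeeds=False, retries_left=0)
print(first, second, key in variables)
```

In this model the first failure is harmless because the next attempt consumes the backup. Only the final failure leaves the key behind.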
Proposed approaches
- Delete surviving producer backup Variables on DAG-run completion via the existing cosmos/listeners/dag_run_listener.py (on_dag_run_success/on_dag_run_failed). This is robust and also covers hard-kill orphans. It was earmarked in the original BOSS-439 plan.
- Or a lighter in-operator guard: on the final failed attempt, skip the on-failure write and delete any eager backup.
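A minimal sketch of the first approach, assuming backup keys can be identified by a per-run prefix. The prefix format, `BACKUP_PREFIX`, and `cleanup_producer_backups` are hypothetical. The store is a plain mapping here so the logic can be read in isolation; the real hooks would act on Airflow Variables instead.

```python
from typing import List, MutableMapping

BACKUP_PREFIX = "cosmos_watcher_backup"  # hypothetical naming scheme


def backup_key_prefix(dag_id: str, run_id: str) -> str:
    """Build the (assumed) prefix shared by all backups of one DAG run."""
    return f"{BACKUP_PREFIX}__{dag_id}__{run_id}__"


def cleanup_producer_backups(
    store: MutableMapping[str, str], dag_id: str, run_id: str
) -> List[str]:
    """Delete every backup key for the given DAG run; return the deleted keys."""
    prefix = backup_key_prefix(dag_id, run_id)
    # Collect first, then delete, to avoid mutating the mapping while iterating.
    stale = sorted(k for k in store if k.startswith(prefix))
    for k in stale:
        del store[k]
    return stale


# Intended to be called from both on_dag_run_success and on_dag_run_failed,
# so the backup is removed regardless of how the run ended.
store = {
    "cosmos_watcher_backup__my_dag__run_1__producer": "{}",
    "cosmos_watcher_backup__my_dag__run_2__producer": "{}",
    "unrelated_variable": "keep",
}
removed = cleanup_producer_backups(store, "my_dag", "run_1")
print(removed, sorted(store))
```

Running the cleanup at DAG-run level rather than inside the operator means it does not depend on the producer process surviving, which is why it could also cover hard-kill orphans.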
Long term
Superseded by #2771 (Airflow 3.3 Task & Asset Store), which removes the Variable-backup mechanism entirely.