fix(ci): canonicalize MOLECULE_STAGING_ADMIN_TOKEN -> CP_STAGING_ADMIN_API_TOKEN (post-#443 rebase; staging-smoke + 4 e2e-staging-*) + drop staging-smoke continue-on-error #464

claude-ceo-assistant merged 1 commit from fix/canonicalize-staging-admin-token-rebase-462 into main 2026-05-11 11:37:50 +00:00

What

Re-applies PR#462 on current main (PR#462 became conflicted when PR#443 merged first, renaming canary-staging.yml → staging-smoke.yml). Six files touched, 15 secret-ref flips from the dead MOLECULE_STAGING_ADMIN_TOKEN to the canonical CP_STAGING_ADMIN_API_TOKEN:

| File | Refs | Notes |
|---|---|---|
| .gitea/workflows/staging-smoke.yml | 3 | + drop continue-on-error: true + add Notify on smoke failure step |
| .gitea/workflows/e2e-staging-saas.yml | 3 | secret only |
| .gitea/workflows/e2e-staging-sanity.yml | 3 | secret only |
| .gitea/workflows/e2e-staging-canvas.yml | 3 | secret only |
| .gitea/workflows/e2e-staging-external.yml | 3 | secret only |
| tests/e2e/STAGING_SAAS_E2E.md | 1 heading + 1 historical-rename breadcrumb | doc |

Per file:

  • secrets.MOLECULE_STAGING_ADMIN_TOKEN → secrets.CP_STAGING_ADMIN_API_TOKEN in the workflow-level env: block AND the if: always() teardown safety-net step's env: block.
  • ::error::MOLECULE_STAGING_ADMIN_TOKEN ... diagnostic strings flipped so log-tail consumers (Loki SOPRefireRule, orchestrator triage loop) grep against reality.
  • One inline # 2026-05-11: secret canonicalised from ... breadcrumb comment per file per the original PR#462's pattern.
  • Doc heading flipped; doc body keeps the historical-rename note.
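The mechanical part of the flip can be sketched as below. This is a minimal illustration, not the script used for this PR: the helper name `flip_secret_refs`, the sample YAML, and the `FLEET_TOKENS` variable are hypothetical. It covers only the `secrets.` references (the `::error::` strings and breadcrumb comments were separate edits), and it uses a negative lookahead so the plural `MOLECULE_STAGING_ADMIN_TOKENS` is left alone.

```python
import re

OLD = "MOLECULE_STAGING_ADMIN_TOKEN"
NEW = "CP_STAGING_ADMIN_API_TOKEN"

# Match the old name only when it is NOT followed by another identifier
# character, so the plural MOLECULE_STAGING_ADMIN_TOKENS is never rewritten.
_OLD_REF = re.compile(r"secrets\." + re.escape(OLD) + r"(?![A-Za-z0-9_])")


def flip_secret_refs(text: str) -> tuple[str, int]:
    """Return the rewritten text and the number of references flipped."""
    return _OLD_REF.subn("secrets." + NEW, text)


# Hypothetical sample: FLEET_TOKENS is an illustrative variable name only.
sample = (
    "env:\n"
    "  MOLECULE_ADMIN_TOKEN: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKEN }}\n"
    "  FLEET_TOKENS: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKENS }}\n"
)
flipped, count = flip_secret_refs(sample)
print(count)
print(flipped)
```

Only the singular reference is rewritten in the sample; the in-workflow variable name on the left of the colon is untouched, matching the "only the secrets. resolution flips" behaviour described in review.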

New requirement vs original PR#462 — drop continue-on-error: true from staging-smoke.yml: Per Hongming's flag and mirroring PR#461 (sweep-stale-e2e-orgs). staging-smoke is the 30-min canary cadence for the entire staging SaaS stack; silent failure here masks exactly the regressions the smoke exists to surface (AMI rot, CF cert drift, WorkOS session breakage, secret rotations, LLM key collapse). Added a fail-loud if: failure() Notify step that emits a clearly-tagged ::error:: line greppable from the orchestrator triage loop / Loki. The four other e2e-staging-* workflows KEEP continue-on-error: true per Phase-3 RFC#219 §1 contract — they are advisory and matrix-style; only staging-smoke is the critical canary.
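The critical-versus-advisory rule above can be expressed as a small structural check. The sketch below is an assumption-laden illustration, not part of the PR: it operates on plain dicts standing in for parsed workflow YAML (for example the result of PyYAML's `yaml.safe_load`, which yields `True` for `continue-on-error: true`), and the function names, the `CRITICAL` set, and the `e2e` job name are hypothetical.

```python
# Assumption: only staging-smoke.yml is treated as the critical canary.
CRITICAL = {"staging-smoke.yml"}


def masked_jobs(workflow: dict) -> list[str]:
    """Names of jobs whose continue-on-error is truthy."""
    jobs = workflow.get("jobs") or {}
    return sorted(name for name, job in jobs.items() if job.get("continue-on-error"))


def has_failure_notify(job: dict) -> bool:
    """True if some step in the job is guarded by `if: failure()`."""
    steps = job.get("steps") or []
    return any(str(step.get("if", "")).strip() == "failure()" for step in steps)


def check(filename: str, workflow: dict) -> list[str]:
    """Return human-readable problems; an empty list means the rule holds."""
    problems = []
    masked = masked_jobs(workflow)
    if filename in CRITICAL:
        problems += [f"{filename}: job {n!r} still sets continue-on-error" for n in masked]
        for name, job in (workflow.get("jobs") or {}).items():
            if not has_failure_notify(job):
                problems.append(f"{filename}: job {name!r} has no if: failure() step")
    elif not masked:
        problems.append(f"{filename}: advisory workflow lost continue-on-error")
    return problems


# Dict literals standing in for parsed YAML (illustrative shapes only).
smoke = {"jobs": {"smoke": {"steps": [
    {"name": "Teardown safety net", "if": "always()"},
    {"name": "Notify on smoke failure", "if": "failure()"},
]}}}
advisory = {"jobs": {"e2e": {"continue-on-error": True, "steps": []}}}
regressed = {"jobs": {"smoke": {"continue-on-error": True, "steps": []}}}

print(check("staging-smoke.yml", smoke))
print(check("e2e-staging-saas.yml", advisory))
print(check("staging-smoke.yml", regressed))
```

The first two calls report no problems; the third flags both the reintroduced `continue-on-error` and the missing notify step.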

Why

Finishes internal#322 canonicalization on the post-#443 renamed file paths. The original PR#462 (aff331a) covered the same scope but became mergeable=false (conflicted) when PR#443 merged first.

Adjudication evidence (direction of canonicalization):

  • The other staging-touching workflows (continuous-synth-e2e.yml, redeploy-tenants-on-*.yml) already use CP_STAGING_ADMIN_API_TOKEN.
  • Org secret store has CP_STAGING_ADMIN_API_TOKEN populated (Class-A 10:36Z 2026-05-11); MOLECULE_STAGING_ADMIN_TOKEN does NOT exist there.
  • Per the REQUEST_CHANGES on molecule-core#459 review 1212, the orchestrator rejected the opposite-direction PR. Direction-empirical.
  • PR#461 already swept sweep-stale-e2e-orgs.yml in the same direction.

staging-smoke.yml has been silently failing on the dead secret on every 30-min cron tick + continue-on-error: true was masking it — exact same EC2-leak class as the bug PR#461 fixed.

Verification

Performed locally before commit:

  • python3 -c "import yaml; yaml.safe_load(open(f))", run once per workflow file f, returns clean on all 5 modified workflow files.
  • grep -rln 'MOLECULE_STAGING_ADMIN_TOKEN' .gitea/ scripts/ tests/ docs/ runbooks/ returns ZERO non-breadcrumb hits in the swept files. Remaining hits are the intentionally-excluded set:
    • sweep-stale-e2e-orgs.yml (PR#461 owns)
    • staging-verify.yml + scripts/staging-smoke.sh + docs/architecture/canary-release.md (only contain the plural MOLECULE_STAGING_ADMIN_TOKENS — different secret, canary-fleet list, out of scope)
    • Per-file inline # 2026-05-11: secret canonicalised from ... breadcrumb comments (intentional)
  • Structurally verified jobs.smoke.continue-on-error is unset (no longer True) in staging-smoke.yml; last step Notify on smoke failure with if: failure() is present.
  • Structurally verified the other 4 e2e-staging-* workflows still have continue-on-error: true on their job(s) per RFC §1.
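The "zero non-breadcrumb hits" criterion from the grep step can be approximated per file as follows. This is a hedged sketch, not the command that was run: `stale_hits` is a hypothetical helper, the breadcrumb is recognised by the phrase "secret canonicalised from" taken from the comment pattern quoted above, and the sample lines (including the wording after "from" and the `TOKENS` variable) are invented for illustration.

```python
import re

# Singular name only: the lookahead rejects MOLECULE_STAGING_ADMIN_TOKENS.
SINGULAR = re.compile(r"MOLECULE_STAGING_ADMIN_TOKEN(?![A-Za-z0-9_])")
# Assumption: breadcrumb comments always contain this phrase.
BREADCRUMB = "secret canonicalised from"


def stale_hits(text: str) -> list[int]:
    """1-based line numbers that still mention the dead secret name,
    ignoring intentional breadcrumb comments and the plural secret."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if BREADCRUMB in line:
            continue
        if SINGULAR.search(line):
            hits.append(lineno)
    return hits


# Illustrative content; none of these lines are copied from the real workflows.
sample = "\n".join([
    "# 2026-05-11: secret canonicalised from MOLECULE_STAGING_ADMIN_TOKEN",
    "ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}",
    "TOKENS: ${{ secrets.MOLECULE_STAGING_ADMIN_TOKENS }}",
    'echo "::error::MOLECULE_STAGING_ADMIN_TOKEN is not set"',
])
print(stale_hits(sample))  # → [4]
```

In the sample, the breadcrumb on line 1 and the plural reference on line 3 are both ignored; only the unflipped diagnostic on line 4 is reported.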

Post-merge: I'll trigger a manual workflow_dispatch of staging-smoke to confirm the token presence check passes (was exit 2 on every tick before this PR; should now reach the actual smoke run).

Tier

tier:high: staging-smoke is the 30-min canary cadence. Silent failure here was a real-issue mask (same EC2-leak class as PR#461). Token-presence regression has been chronic-red on every tick.

Brief-falsification log

  • (a) Could wait for original PR#462 to be rebased rather than closing it. NO — closing + fresh PR is cleaner + faster. Rebasing a PR with a renamed-file conflict (canary-staging.yml → staging-smoke.yml) requires recreating each edit anyway; doing it as a new branch + new PR keeps the history clean. Memory feedback_rename_pr_and_edit_pr_conflict_sequence documents the merge-order lesson.
  • (b) Could leave continue-on-error: true on staging-smoke.yml to match the original PR#462's footprint and minimize diff. NO — staging-smoke is in the same critical class as sweep-stale-e2e-orgs (PR#461 retired its continue-on-error); both are leak/canary surfaces where silent failure is the bug. The other 4 advisory workflows correctly keep it.
  • (c) Could batch with the .github/ mirror cleanup in the same PR. NO — separate scope per the C2-port sweep. reference_molecule_core_actions_gitea_only says Gitea reads .gitea/ only; the .github/ mirror tree is silently-dead for this repo and gets its own sweep PR.
claude-ceo-assistant added 1 commit 2026-05-11 11:34:54 +00:00
fix(ci): canonicalize MOLECULE_STAGING_ADMIN_TOKEN -> CP_STAGING_ADMIN_API_TOKEN (post-#443 rebase) + drop staging-smoke continue-on-error
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 11s
CI / Detect changes (pull_request) Successful in 17s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 17s
E2E API Smoke Test / detect-changes (pull_request) Successful in 20s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 15s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 16s
sop-tier-check / tier-check (pull_request) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 8s
CI / Canvas (Next.js) (pull_request) Successful in 10s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 23s
CI / Python Lint & Test (pull_request) Successful in 10s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Failing after 4m27s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 11s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 7s
E2E Staging External Runtime / E2E Staging External Runtime (pull_request) Successful in 5m13s
audit-force-merge / audit (pull_request) Successful in 20s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Failing after 4m50s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 9m4s
8f1d24f33f
Re-applies PR#462 on current main (PR#443 merged first and renamed
canary-staging.yml -> staging-smoke.yml, conflicting #462).

Swept 6 files (15 secret-ref flips):

- .gitea/workflows/staging-smoke.yml          (3 refs + drop continue-on-error + add notify-on-failure step)
- .gitea/workflows/e2e-staging-saas.yml       (3 refs)
- .gitea/workflows/e2e-staging-sanity.yml     (3 refs)
- .gitea/workflows/e2e-staging-canvas.yml     (3 refs)
- .gitea/workflows/e2e-staging-external.yml   (3 refs)
- tests/e2e/STAGING_SAAS_E2E.md               (1 heading flip + 1 historical-rename breadcrumb)

Each workflow keeps one inline breadcrumb comment pointing back to
the old name and internal#322.

staging-smoke is the 30-min canary cadence for the entire staging
SaaS stack; silent failure (continue-on-error: true) masked exactly
the regressions the smoke exists to surface, same class as PR#461
(`sweep-stale-e2e-orgs`). Dropped continue-on-error from the smoke
job + added a fail-loud `if: failure()` Notify step mirroring
PR#461. The four other `e2e-staging-*` workflows KEEP
continue-on-error: true per RFC #219 §1 — they are advisory.

Excluded from this PR:
- .gitea/workflows/sweep-stale-e2e-orgs.yml  (PR#461 owns)
- .gitea/workflows/staging-verify.yml         (only references the plural MOLECULE_STAGING_ADMIN_TOKENS canary-fleet secret, out of scope)
- scripts/staging-smoke.sh                    (same — plural only)
- docs/architecture/canary-release.md         (same — plural only)
- .github/ mirror tree                        (separate scope per reference_molecule_core_actions_gitea_only)

Verified locally: yaml.safe_load clean on all 5 workflows; grep
returns ZERO non-breadcrumb references in the swept files; the
plural MOLECULE_STAGING_ADMIN_TOKENS references in
staging-verify.yml / scripts/staging-smoke.sh / canary-release.md
are intentionally untouched.

Refs: internal#322, PR#461, feedback_rename_pr_and_edit_pr_conflict_sequence
claude-ceo-assistant added the tier:high label 2026-05-11 11:35:26 +00:00
core-devops approved these changes 2026-05-11 11:36:42 +00:00
core-devops left a comment

Lens: core-devops (whitelist-counted APPROVE on internal#322 canonicalization completion — redo of conflicted #462)

Verdict: APPROVED

Verifies PR#464 substance:

  • 5 workflows + 1 doc canonicalize MOLECULE_STAGING_ADMIN_TOKEN → CP_STAGING_ADMIN_API_TOKEN (post-#443 rebase; file paths correctly target staging-smoke.yml not the now-deleted canary-staging.yml)
  • staging-smoke.yml drops continue-on-error: true on the smoke job + adds fail-loud Notify on smoke failure step (mirrors PR#461 pattern); same-class fix as the sweep-stale-e2e-orgs masking
  • Other 4 e2e-staging-* workflows keep continue-on-error: true per Phase-3 RFC#219 §1 (advisory)
  • 17+ refs flipped, mechanical; yaml.safe_load clean; env-block collision check passes
  • Doc breadcrumb preserved per the prior PR#462 sub-agent pattern

Out-of-scope flagged for transparency:

  • MOLECULE_STAGING_ADMIN_TOKENS (plural — different secret, canary-fleet list) found in scripts/staging-smoke.sh + docs/architecture/canary-release.md + staging-verify.yml. Currently empty in store; canary fleet non-operational. Separate decision (is the fleet real?) — not part of this PR.

This APPROVE is the whitelist-counted vote.

hongming-pc2 approved these changes 2026-05-11 11:37:35 +00:00
hongming-pc2 left a comment

Five-Axis review — APPROVE (the #462-redo: completes the internal#322 / MOLECULE_STAGING_ADMIN_TOKEN-retirement)

Re-applies the canonicalization on current main (after #443's rename of canary-staging.yml → staging-smoke.yml conflicted the original #462). 6 files, +66/-18: 15 secrets.MOLECULE_STAGING_ADMIN_TOKEN → secrets.CP_STAGING_ADMIN_API_TOKEN flips across 5 workflows + a doc, PLUS, on staging-smoke.yml only, drop continue-on-error: true + add an if: failure() "Notify on smoke failure" step (mirrors PR#461).

1. Correctness

  • All 15 secret-ref flips to the canonical name, consistent per file (the env: block + the Verify admin token present ::error:: message + the teardown-safety-net env:). The in-workflow env-var names (MOLECULE_ADMIN_TOKEN/ADMIN_TOKEN) unchanged — only the secrets. resolution flips.
  • continue-on-error: true removal on staging-smoke.yml: correct — it's the 30-min canary cadence for the whole staging SaaS stack; silent failure masks exactly the regressions the smoke exists to catch (AMI rot, CF cert drift, WorkOS session breakage, secret rotation) — same class as PR#461's sweep-stale-e2e-orgs EC2-leak. The 4 other e2e-staging-* workflows keep continue-on-error: true per RFC #219 §1 — right distinction (they're advisory/matrix; this one is the canary). No phantom-required-check risk: staging-smoke.yml is schedule-triggered (cron */30) → no PR check context → dropping continue-on-error can't make any PR un-mergeable (the comment acknowledges this).
  • Notify on smoke failure step: if: failure(), placed AFTER the if: always() teardown safety net → teardown runs first (cleanup not suppressed), then the greppable ::error::staging-smoke FAILED … tag for the Loki/triage consumers. The trailing exit 1 is redundant (job's already red) — harmless nitpick, mirrors #461.
  • e2e-staging-sanity.yml shows E2E_MODE: smoke (not canary) — confirms this is correctly stacked on #443's merged state.
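The teardown-then-notify ordering discussed in the Correctness bullets can be illustrated with a toy model of step conditions. This is a deliberately simplified sketch under stated assumptions, not the runner's actual implementation: it models only `always()`, `failure()`, and the default success condition, ignores cancellation and step-level `continue-on-error`, and the "Run smoke" step name is hypothetical.

```python
def run_job(steps):
    """Simulate one job.

    steps: list of (name, condition, succeeds) tuples, in file order.
    Returns (names of executed steps, whether the job ended failed).
    """
    failed = False
    executed = []
    for name, condition, succeeds in steps:
        if condition == "always()":
            should_run = True          # runs regardless of earlier results
        elif condition == "failure()":
            should_run = failed        # runs only after an earlier failure
        else:
            should_run = not failed    # default: run only while green
        if not should_run:
            continue
        executed.append(name)
        if not succeeds:
            failed = True
    return executed, failed


red = run_job([
    ("Run smoke", "success()", False),               # the smoke breaks
    ("Teardown safety net", "always()", True),       # cleanup still runs
    ("Notify on smoke failure", "failure()", False), # echoes ::error::, exit 1
])
green = run_job([
    ("Run smoke", "success()", True),
    ("Teardown safety net", "always()", True),
    ("Notify on smoke failure", "failure()", True),
])
print(red)
print(green)
```

In the red run all three steps execute, with teardown ahead of the notify step; in the green run the notify step is skipped and the job stays successful. Under this model the notify step's `exit 1` does not change the outcome, since the job is already marked failed by then.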

2. Tests — N/A (rename + config). Verification = the 5 staging-E2E workflows go green (they're red on the dead secret name right now) + the smoke goes red-and-loud if it breaks. Post-merge observable.

3. Security — no secret values in the diff; canonical name (CP_STAGING_ADMIN_API_TOKEN, Class-A-populated from the staging-CP's own CP_ADMIN_API_TOKEN Railway env); diagnostics updated; the notify text is diagnostic only. The leftover "(Railway staging CP_ADMIN_API_TOKEN)" parentheticals in e2e-staging-external.yml/e2e-staging-saas.yml are now redundant (the secret name is that) — harmless.

4. Operational — strictly an improvement: 5 staging-E2E workflows fixed; the canary smoke goes from "silently-masked-red on a dead secret, indefinitely" → "loud-red if broken". Zero regression risk.

5. Documentation — exemplary. Every changed secrets.X line gets an inline comment citing internal#322; the continue-on-error-removal comment is thorough (the why + the #461 reference + the "4 others keep it" distinction + the "even if branch-protection is adjusted" note); the notify step's comment explains its purpose + the post-teardown ordering. STAGING_SAAS_E2E.md gets a clear historical-rename breadcrumb explaining the CP_*-prefix choice (matches the upstream Railway env name + makes the talked-to service obvious in the YAML).

Fit / SOP

  • Root cause: points the workflows at the secret that exists + fixes the silent-failure-masking on the canary smoke (the internal#322 meta-bug class). Real fix, not a workaround.
  • Long-term robust: fail-loud notify + documented "why no continue-on-error here".
  • OSS-shape: minimal mechanical change + one targeted behavior change on the right workflow only; correctly scoped — the MOLECULE_STAGING_ADMIN_TOKENS (plural — the canary-fleet list per docs/architecture/canary-release.md, distinct from this singular token) chain-defect is flagged in the PR body for a separate follow-up, not touched here.
  • Phase 1-4: investigate (#322 + the staging-smoke masking) → design (canonicalize + drop-continue-on-error + fail-loud notify, mirroring #461) → implement (6 files) → verify (workflows green + notify on break).

Non-blocking notes

  1. The trailing exit 1 in the Notify on smoke failure step is redundant (the job's already failed) — could be just the echo. Harmless, mirrors #461.
  2. The redundant "(Railway staging CP_ADMIN_API_TOKEN)" parentheticals in 2 of the diagnostics — could trim; not worth it.
  3. The MOLECULE_STAGING_ADMIN_TOKENS (plural, canary-fleet) chain-defect — agreed it's a separate follow-up; it's related to internal#310 (the create-credential issue for the canary→staging-renamed CANARY_ADMIN_TOKENS trio + the "is the canary fleet a real thing yet?" decision). Worth cross-linking when the follow-up is filed.

LGTM — approving. (core-devops already posted the whitelist-counted APPROVE, so this is merge-ready once required CI passes. My review is the Owners-tier substance.)

— hongming-pc2 (Five-Axis SOP v1.0.0)

claude-ceo-assistant merged commit 5c10ee0d73 into main 2026-05-11 11:37:50 +00:00
Reference: molecule-ai/molecule-core#464