From 2c6d534940354f43ab58069d34e5a301c6693961 Mon Sep 17 00:00:00 2001
From: hongming
Date: Mon, 11 May 2026 21:56:59 -0700
Subject: [PATCH] =?UTF-8?q?feat(scripts):=20codify=20ECR=20:staging-latest?=
 =?UTF-8?q?=20=E2=86=92=20:latest=20promote=20+=20tenant=20redeploy=20(clo?=
 =?UTF-8?q?ses=20#660)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaces the manual 4-step runbook in
`reference_manual_ecr_promote_procedure.md` with a single self-contained
script + 40 mock-driven e2e tests + a CI gate.

The script does the full chain end-to-end:

1. **PREFLIGHT**: AWS auth ok, source-tag exists, CP base reachable.
   Exits 1 with no mutations if anything's wrong.
2. **SNAPSHOT**: saves the current dest-tag manifest as
   `-prev-YYYYMMDD`. Idempotent: same UTC day re-runs are no-ops.
3. **PROMOTE**: copies `` manifest → `` via `aws ecr put-image` with the
   OCI image-index media type (preserves inner child-manifest digest per
   `reference_ecr_cross_account_digest_exact_mirror`).
4. **REDEPLOY**: per-tenant POST `/cp/admin/tenants//redeploy`. On
   HTTP 403 (stale tenant docker ECR auth, see
   `feedback_ec2_ecr_auth_12h_stale`) it SSM-refreshes the EC2's docker
   login and retries once.
5. **VERIFY**: per-tenant `/buildinfo` + `/health` probes. Failure here
   triggers auto-rollback.
6. **ROLLBACK** (on failure): re-promotes the rollback tag back to ``
   and redeploys the fleet. Exits 3 if rollback OK, 4 if not.

Every external call (aws/curl/ssm) is wrapped in a function with a
`--mock-dir` injection point so the tests can drive every branch without
touching real infrastructure.
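
The wrapper-with-injection-point idea described above could look roughly like the sketch below. It is illustrative only: `MOCK_DIR`, `run_ext`, `aws_cmd`, `curl_cmd`, and `calls.log` are hypothetical names, not taken from `scripts/promote-tenant-image.sh`, and the real script may route its mocks differently.

```shell
# Hypothetical sketch of a mock-injection wrapper; all names are illustrative.
# In a real script MOCK_DIR would presumably be set from --mock-dir.
MOCK_DIR="${MOCK_DIR:-}"

# run_ext NAME ARGS...: run an external tool, or its stand-in from MOCK_DIR.
run_ext() {
  local name="$1"
  shift
  if [[ -n "$MOCK_DIR" ]]; then
    # Record every call so a test can assert on call counts afterwards.
    printf '%s %s\n' "$name" "$*" >> "$MOCK_DIR/calls.log"
    if [[ -x "$MOCK_DIR/$name" ]]; then
      # An executable stand-in decides the output and the exit code.
      "$MOCK_DIR/$name" "$@"
      return $?
    fi
    # No stand-in provided: treat the call as a successful no-op.
    return 0
  fi
  # No mock dir: call the real binary, bypassing any same-named function.
  command "$name" "$@"
}

# Thin per-tool wrappers; the rest of the script would call only these.
aws_cmd()  { run_ext aws "$@"; }
curl_cmd() { run_ext curl "$@"; }
```

With this shape, a test drops an executable named `aws` or `curl` into the mock directory to script a failure, then inspects `calls.log` to confirm that no mutating call was made.
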

40 cases across 11 test groups:

- happy path (5 assertions on call counts + exit code)
- preflight failures with no mutations
- snapshot idempotency
- `--dry-run` skips all mutations
- 403 → SSM-refresh → retry path
- redeploy fail with vs without rollback (exit 3 vs 4)
- argument validation (missing/conflicting/unknown flags)
- date override for rollback tag naming
- empty source manifest detection
- verify-failure triggers rollback

Run with `bash scripts/test-promote-tenant-image.sh`. No live infra is
touched.

Two new steps in the existing `Shellcheck (E2E scripts)` job (a required
check on `main`), gated by the existing `scripts` change filter
(`scripts/`, `tests/e2e/`, `infra/scripts/`, or this workflow file
itself):

1. Run `scripts/test-promote-tenant-image.sh`, which fails CI if any of
   the 40 cases regresses.
2. Run `shellcheck --severity=warning` on the two files. The bulk
   shellcheck step intentionally excludes `scripts/` for legacy
   SC3040/SC3043 reasons; the explicit invocation here catches new
   regressions in the promote script without waiting on the bulk
   cleanup.

```
$ bash scripts/test-promote-tenant-image.sh
...
All 40 tests passed.
$ shellcheck --severity=warning scripts/promote-tenant-image.sh scripts/test-promote-tenant-image.sh
(clean)
```

- core#660: "Codify manual ECR promote operation as
  `scripts/promote-tenant-image.sh`" (tier:medium, core-devops)
- core#658: proper fix for the 12h-stale tenant ECR auth (this script
  ships the SSM-refresh workaround pending the credential-helper
  rollout).
- `reference_manual_ecr_promote_procedure.md` (memory): the manual
  procedure this script replaces.
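
For the `403 → SSM-refresh → retry` test group listed above, the retry-once logic can be sketched as follows. This is a simulation under stated assumptions: `post_redeploy`, `refresh_docker_login`, `redeploy_with_retry`, and `STATE_DIR` are hypothetical stand-ins, and the real script's function names and its actual curl/SSM calls are not shown in this patch.

```shell
# Hypothetical sketch; every function and variable name here is illustrative.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# Simulated control-plane POST: prints 403 until the tenant's docker login
# has been refreshed, then 200.
post_redeploy() {
  local tenant="$1"
  if [[ -e "$STATE_DIR/$tenant.fresh" ]]; then
    echo 200
  else
    echo 403
  fi
}

# Simulated SSM refresh of the tenant host's ECR docker login.
refresh_docker_login() {
  : > "$STATE_DIR/$1.fresh"
}

# Redeploy one tenant; on HTTP 403, refresh once and retry exactly once.
redeploy_with_retry() {
  local tenant="$1" status
  status="$(post_redeploy "$tenant")"
  if [[ "$status" == "403" ]]; then
    refresh_docker_login "$tenant" || return 1
    status="$(post_redeploy "$tenant")"
  fi
  # Succeed only on a 2xx status; a second 403 or any other code fails.
  [[ "$status" == 2?? ]]
}
```

Bounding the retry at one attempt keeps a persistent 403 from looping; in this sketch the caller sees a non-zero return and can decide whether to roll back.
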

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .gitea/workflows/ci.yml | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/.gitea/workflows/ci.yml b/.gitea/workflows/ci.yml
index 16560e92..84bd0d84 100644
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -383,6 +383,27 @@ jobs:
         run: |
           bash tests/e2e/test_model_slug.sh
 
+      - if: needs.changes.outputs.scripts == 'true'
+        name: Test ECR promote-tenant-image script (mock-driven, no live infra)
+        # Covers scripts/promote-tenant-image.sh — the codified
+        # :staging-latest → :latest ECR promote + tenant fleet redeploy
+        # closing molecule-ai/molecule-core#660. 40 mock-driven cases
+        # exercise every exit path (preflight, snapshot, promote, redeploy
+        # 403→SSM-refresh, verify, rollback). No live AWS/CP/SSM calls.
+        run: |
+          bash scripts/test-promote-tenant-image.sh
+
+      - if: needs.changes.outputs.scripts == 'true'
+        name: Shellcheck promote-tenant-image script
+        # scripts/ is excluded from the bulk shellcheck pass above (legacy
+        # SC3040/SC3043 cleanup pending). Run shellcheck explicitly on
+        # the promote script + its test harness so regressions there are
+        # caught by the required check.
+        run: |
+          shellcheck --severity=warning \
+            scripts/promote-tenant-image.sh \
+            scripts/test-promote-tenant-image.sh
+
   canvas-deploy-reminder:
     name: Canvas Deploy Reminder
     runs-on: ubuntu-latest
-- 
2.52.0