[handover] Add amazon-ecr-credential-helper to tenant AMI — fix 12h-stale-ECR-auth failure mode #658

Open
opened 2026-05-12 04:20:51 +00:00 by hongming · 1 comment
Owner

Discovered tonight (2026-05-12)

During manual promote validation (CP#135 RFC#279 dry-run), hongming tenant's redeploy 403'd on the first attempt:

Error response from daemon: unknown: failed to resolve reference "153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant:latest": 403 Forbidden

Root cause: /root/.docker/config.json on the tenant EC2 stores a static ECR auth token (12h TTL) baked at last docker login time. After expiry, all docker pull calls 403. amazon-ecr-credential-helper is NOT installed.

Worked around tonight by manually aws ecr get-login-password | docker login via SSM, then retried redeploy successfully.

Note: chloe-dong redeploy worked on first try because earlier in the session I'd already SSM'd a docker login refresh while debugging — coincidental, not by design.

Fix

Bake amazon-ecr-credential-helper into the tenant AMI + flip /root/.docker/config.json to use credHelpers:

{
  "credHelpers": {
    "153263036946.dkr.ecr.us-east-2.amazonaws.com": "ecr-login",
    "004947743811.dkr.ecr.us-east-2.amazonaws.com": "ecr-login"
  }
}

This auto-refreshes the ECR auth from the EC2 IAM role on every pull, eliminating the 12h-stale failure mode.

Scope

  • Update packer build for tenant AMI (workspace-server / platform-tenant)
  • Verify fresh AMI bake passes a docker pull :latest smoke test 13+h after bake
  • Roll new AMI into CP provisioner default
  • Existing live tenants: SSM rolling update to install helper + flip config.json (or accept attrition: next AMI rebake recovers)

Related

  • CP#135 (RFC#279 auto-update chain) — this is the EC2-side piece complementing the ECR-side auto-tag promotion
  • Memory: feedback_no_secrets_in_docker_cmd_args (cred-helper avoids leaking secrets in docker config)
  • Tonight's manual workaround SSM command:
    aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <acct>.dkr.ecr.us-east-2.amazonaws.com
## Discovered tonight (2026-05-12) During manual promote validation (CP#135 RFC#279 dry-run), hongming tenant's redeploy 403'd on the first attempt: ``` Error response from daemon: unknown: failed to resolve reference "153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/platform-tenant:latest": 403 Forbidden ``` Root cause: `/root/.docker/config.json` on the tenant EC2 stores a static ECR auth token (12h TTL) baked at last `docker login` time. After expiry, all `docker pull` calls 403. `amazon-ecr-credential-helper` is NOT installed. Worked around tonight by manually `aws ecr get-login-password | docker login` via SSM, then retried redeploy successfully. Note: chloe-dong redeploy worked on first try because earlier in the session I'd already SSM'd a `docker login` refresh while debugging — coincidental, not by design. ## Fix Bake `amazon-ecr-credential-helper` into the tenant AMI + flip `/root/.docker/config.json` to use credHelpers: ```json { "credHelpers": { "153263036946.dkr.ecr.us-east-2.amazonaws.com": "ecr-login", "004947743811.dkr.ecr.us-east-2.amazonaws.com": "ecr-login" } } ``` This auto-refreshes the ECR auth from the EC2 IAM role on every pull, eliminating the 12h-stale failure mode. ## Scope - Update packer build for tenant AMI (workspace-server / platform-tenant) - Verify fresh AMI bake passes a `docker pull :latest` smoke test 13+h after bake - Roll new AMI into CP provisioner default - Existing live tenants: SSM rolling update to install helper + flip config.json (or accept attrition: next AMI rebake recovers) ## Related - CP#135 (RFC#279 auto-update chain) — this is the EC2-side piece complementing the ECR-side auto-tag promotion - Memory: `feedback_no_secrets_in_docker_cmd_args` (cred-helper avoids leaking secrets in docker config) - Tonight's manual workaround SSM command: `aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <acct>.dkr.ecr.us-east-2.amazonaws.com`
triage-operator added the tier:medium label 2026-05-12 04:21:35 +00:00
hongming added tier:high and removed tier:medium labels 2026-05-12 04:25:00 +00:00
core-devops was assigned by hongming 2026-05-12 04:25:22 +00:00
Author
Owner

Cross-link to molecule-controlplane#135 (RFC#279 auto-update chain) — this issue is the EC2-side credential-helper half. CP#135 is the platform-orchestration half. Both must land for the publish→replicate→pin→redeploy chain to be fully automated; until then scripts/promote-tenant-image.sh (core#672, closes #660) is the manual fallback that ships the SSM-refresh workaround. Posted by orchestrator handover sweep 2026-05-12.

Cross-link to molecule-controlplane#135 (RFC#279 auto-update chain) — this issue is the EC2-side credential-helper half. CP#135 is the platform-orchestration half. Both must land for the publish→replicate→pin→redeploy chain to be fully automated; until then `scripts/promote-tenant-image.sh` (core#672, closes #660) is the manual fallback that ships the SSM-refresh workaround. Posted by orchestrator handover sweep 2026-05-12.
Sign in to join this conversation.
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#658