fix(ci): replace buildx with plain docker build+push (followup #173)

CI run #946 (post-#43) confirmed that `driver: docker` doesn't fix the ECR push 401 either: the buildx CLI inside the runner container talks to the operator-host docker daemon (mounted socket), but the daemon doesn't see the runner's ECR auth state, and the runner's buildx CLI doesn't attach the auth header in a way the daemon accepts.

Drop buildx + build-push-action entirely. Plain `docker build` + `docker push` from the runner container works because both use the same docker socket and the same runner-container config.json (populated by `aws ecr get-login-password | docker login` from amazon-ecr-login).

Trade-off: we lose multi-arch support. We only ship linux/amd64 tenant images today, so this is fine. If multi-arch becomes a requirement later, we can revisit (likely with `docker buildx create --driver=remote` pointing at an external buildkit, but that's substantial infra work; not worth it for a single-arch shop).

Closes #173 (fourth piece, and hopefully last; this matches the operator-host manual approach exactly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
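The commit message above relies on the docker CLI reading registry credentials from its own config.json. A minimal preflight sketch of that dependency, assuming the default config location (`${DOCKER_CONFIG:-$HOME/.docker}/config.json`). `has_registry_auth` and `preflight` are hypothetical names that are not part of this change, and the check is a rough text match rather than a JSON parse.

```shell
set -eu

# Hypothetical helper: returns 0 if the config file mentions the registry host.
# This is a plain fixed-string match on the quoted host, not a JSON parse.
has_registry_auth() {
  local config_file="$1"
  local registry="$2"
  [ -f "${config_file}" ] && grep -Fq "\"${registry}\"" "${config_file}"
}

# Hypothetical preflight: fail with a readable message if the runner's CLI
# config has no entry for the registry we are about to push to.
preflight() {
  local registry="$1"
  local config_file="${DOCKER_CONFIG:-${HOME}/.docker}/config.json"
  if has_registry_auth "${config_file}" "${registry}"; then
    echo "auth entry found for ${registry}"
  else
    echo "no auth entry for ${registry} in ${config_file}" >&2
    return 1
  fi
}
```

Run before `docker push`, a check like this would report a missing login as a readable error instead of a 401 from the registry.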
1 changed file with 70 additions and 55 deletions
@@ -94,8 +94,19 @@ jobs:
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2

-      - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@4d04d5d9486b7bd6fa91e7baf45bbb4f8b9deedd # v4.0.0
+      # docker/setup-buildx-action removed (Task #173, 2026-05-07).
+      # Reason: on Gitea Actions, neither buildx driver works for our
+      # mounted-docker-socket runner topology:
+      #   - docker-container driver: spawns a buildkit container that
+      #     doesn't share the host's ECR auth (401 Unauthorized on push)
+      #   - docker driver: delegates to the operator-host docker daemon,
+      #     which doesn't see the runner container's ECR auth either
+      # Plain `docker build` + `docker push` from the runner container
+      # works because both use the same docker socket + the runner's
+      # config.json (populated by `aws ecr get-login-password | docker
+      # login` in the next step). Buildx's value here was only multi-arch
+      # builds, but we only ship linux/amd64 tenant images, so the
+      # complexity isn't earning anything.

      - name: Compute tags
        id: tags
@@ -177,58 +188,62 @@ jobs:
      # that gap. Earlier 2026-04-24 incident: a static :staging-<sha> pin
      # drifted 10 days behind staging — same class of bug, different
      # mechanism.
-      - name: Build & push platform image to GHCR (staging-<sha> + staging-latest)
-        uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
-        with:
-          context: .
-          file: ./workspace-server/Dockerfile
-          platforms: linux/amd64
-          push: true
-          tags: |
-            ${{ env.IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
-            ${{ env.IMAGE_NAME }}:staging-latest
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
-          # GIT_SHA bakes into the Go binary via -ldflags so /buildinfo
-          # returns it at runtime — see Dockerfile + buildinfo/buildinfo.go.
-          # This is the same value as the OCI revision label below; passing
-          # it twice is intentional, the OCI label is for registry tooling
-          # while /buildinfo is for the redeploy verification step.
-          build-args: |
-            GIT_SHA=${{ github.sha }}
-          labels: |
-            org.opencontainers.image.source=https://github.com/${{ github.repository }}
-            org.opencontainers.image.revision=${{ github.sha }}
-            org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify
+      # Build + push platform image with plain `docker` (no buildx).
+      # GIT_SHA bakes into the Go binary via -ldflags so /buildinfo
+      # returns it at runtime — see Dockerfile + buildinfo/buildinfo.go.
+      # The OCI revision label below carries the same value for registry
+      # tooling; the duplication is intentional.
+      - name: Build & push platform image to ECR (staging-<sha> + staging-latest)
+        env:
+          IMAGE_NAME: ${{ env.IMAGE_NAME }}
+          TAG_SHA: staging-${{ steps.tags.outputs.sha }}
+          TAG_LATEST: staging-latest
+          GIT_SHA: ${{ github.sha }}
+          REPO: ${{ github.repository }}
+        run: |
+          set -euo pipefail
+          docker build \
+            --file ./workspace-server/Dockerfile \
+            --build-arg GIT_SHA="${GIT_SHA}" \
+            --label "org.opencontainers.image.source=https://github.com/${REPO}" \
+            --label "org.opencontainers.image.revision=${GIT_SHA}" \
+            --label "org.opencontainers.image.description=Molecule AI platform (Go API server) — pending canary verify" \
+            --tag "${IMAGE_NAME}:${TAG_SHA}" \
+            --tag "${IMAGE_NAME}:${TAG_LATEST}" \
+            .
+          docker push "${IMAGE_NAME}:${TAG_SHA}"
+          docker push "${IMAGE_NAME}:${TAG_LATEST}"

-      - name: Build & push tenant image to GHCR (staging-<sha> + staging-latest)
-        uses: docker/build-push-action@bcafcacb16a39f128d818304e6c9c0c18556b85f # v7.1.0
-        with:
-          context: .
-          file: ./workspace-server/Dockerfile.tenant
-          platforms: linux/amd64
-          push: true
-          tags: |
-            ${{ env.TENANT_IMAGE_NAME }}:staging-${{ steps.tags.outputs.sha }}
-            ${{ env.TENANT_IMAGE_NAME }}:staging-latest
-          cache-from: type=gha
-          cache-to: type=gha,mode=max
-          # Canvas uses same-origin fetches. The tenant Go platform
-          # reverse-proxies /cp/* to the SaaS CP via its CP_UPSTREAM_URL
-          # env; the tenant's /canvas/viewport, /approvals/pending,
-          # /org/templates etc. live on the tenant platform itself.
-          # Both legs share one origin (the tenant subdomain) so
-          # PLATFORM_URL="" forces canvas to fetch paths as relative,
-          # which land same-origin.
-          #
-          # Self-hosted / private-label deployments override this at
-          # build time with a specific backend (e.g. local dev:
-          # NEXT_PUBLIC_PLATFORM_URL=http://localhost:8080).
-          build-args: |
-            NEXT_PUBLIC_PLATFORM_URL=
-            GIT_SHA=${{ github.sha }}
-          labels: |
-            org.opencontainers.image.source=https://github.com/${{ github.repository }}
-            org.opencontainers.image.revision=${{ github.sha }}
-            org.opencontainers.image.description=Molecule AI tenant platform + canvas — pending canary verify
+      # Canvas uses same-origin fetches. The tenant Go platform
+      # reverse-proxies /cp/* to the SaaS CP via its CP_UPSTREAM_URL
+      # env; the tenant's /canvas/viewport, /approvals/pending,
+      # /org/templates etc. live on the tenant platform itself.
+      # Both legs share one origin (the tenant subdomain) so
+      # PLATFORM_URL="" forces canvas to fetch paths as relative,
+      # which land same-origin.
+      #
+      # Self-hosted / private-label deployments override this at
+      # build time with a specific backend (e.g. local dev:
+      # NEXT_PUBLIC_PLATFORM_URL=http://localhost:8080).
+      - name: Build & push tenant image to ECR (staging-<sha> + staging-latest)
+        env:
+          TENANT_IMAGE_NAME: ${{ env.TENANT_IMAGE_NAME }}
+          TAG_SHA: staging-${{ steps.tags.outputs.sha }}
+          TAG_LATEST: staging-latest
+          GIT_SHA: ${{ github.sha }}
+          REPO: ${{ github.repository }}
+        run: |
+          set -euo pipefail
+          docker build \
+            --file ./workspace-server/Dockerfile.tenant \
+            --build-arg NEXT_PUBLIC_PLATFORM_URL= \
+            --build-arg GIT_SHA="${GIT_SHA}" \
+            --label "org.opencontainers.image.source=https://github.com/${REPO}" \
+            --label "org.opencontainers.image.revision=${GIT_SHA}" \
+            --label "org.opencontainers.image.description=Molecule AI tenant platform + canvas — pending canary verify" \
+            --tag "${TENANT_IMAGE_NAME}:${TAG_SHA}" \
+            --tag "${TENANT_IMAGE_NAME}:${TAG_LATEST}" \
+            .
+          docker push "${TENANT_IMAGE_NAME}:${TAG_SHA}"
+          docker push "${TENANT_IMAGE_NAME}:${TAG_LATEST}"
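
The two build steps in the diff above differ only in image name, Dockerfile, and build args. A sketch of how that shared shape could be factored into one shell function. `build_and_push` and the `RUN` dry-run variable are hypothetical and not part of this change; the `staging-latest` tag is taken from the diff.

```shell
set -eu

# Hypothetical helper: build one image with both staging tags, then push each.
# Arguments: <image> <dockerfile> <sha-tag> [extra docker build flags...]
# Set RUN=echo to print the commands instead of executing them.
build_and_push() {
  local image="$1"
  local dockerfile="$2"
  local sha_tag="$3"
  shift 3
  ${RUN:-} docker build \
    --file "${dockerfile}" \
    "$@" \
    --tag "${image}:${sha_tag}" \
    --tag "${image}:staging-latest" \
    .
  ${RUN:-} docker push "${image}:${sha_tag}"
  ${RUN:-} docker push "${image}:staging-latest"
}
```

With `RUN=echo` the function only prints the `docker` commands, which makes the composed flags easy to check before anything reaches the daemon.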
Author	SHA1	Message	Date
devops-engineer	43e2d24c5b	fix(ci): replace buildx with plain docker build+push (followup #173)	2026-05-07 13:43:50 -07:00
claude-ceo-assistant	0b840df563	fix(ci): use docker driver for buildx + drop type=gha cache (#43) Closes #173 (third and final piece). Pairs with #38 and #41.	2026-05-07 20:36:01 +00:00
devops-engineer	bee4f9ea79	fix(ci): use docker driver for buildx + drop type=gha cache (followup #173) PR #38 + #41 fixed the Dockerfile-side clone issue. CI run #893 then revealed two Gitea-Actions-specific issues with the unchanged buildx config: 1. `failed to push: 401 Unauthorized` to ECR. Root cause: default buildx driver `docker-container` spawns a buildkit container that doesn't share the host's `~/.docker/config.json`, so the ECR auth set up by amazon-ecr-login doesn't reach the push. Fix: pin `driver: docker` so buildx delegates to the host daemon, which already has the ECR creds. 2. `dial tcp ...:41939: i/o timeout` on `_apis/artifactcache/cache`. Root cause: `cache-from/cache-to: type=gha` is GitHub-specific; Gitea Actions has no compatible artifact-cache backend, so every cache lookup fails after a 30s timeout. Fix: remove the cache-* options. Cold-build cost is <10min for 37-repo clone + Go/Node compile, acceptable. Could revisit with type=registry inline cache later if rebuilds get painful. With this + #38/#41, the workflow should run end-to-end on Gitea Actions: pre-clone -> docker build (host daemon) -> ECR push. Closes #173 (third and final piece). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:35:07 -07:00
claude-ceo-assistant	c1e32ff4a7	Merge pull request 'fix(test): drain coalesceRestart goroutines before t.Cleanup (Class H, #170)' (#39) from fix/170-goroutine-bleed-test-isolation into main	2026-05-07 20:27:08 +00:00
claude-ceo-assistant	bac04dc278	fix(ci): apply pre-clone fix to platform Dockerfile too (#41) Closes #173; followup to #38.	2026-05-07 20:23:33 +00:00
Hongming Wang	694c05552b	fix(test): drain coalesceRestart goroutines before t.Cleanup (Class H, #170) TestPooledWithEICTunnel_PreservesFnErr (and any sqlmock-using neighbour test) was at risk of inheriting stale INSERT calls from a previous test's coalesceRestart goroutine that survived its t.Cleanup boundary. The production callsite shape is `go h.RestartByID(...)` from a2a_proxy.go, a2a_proxy_helpers.go and main.go. When that goroutine's runRestartCycle panics, coalesceRestart's deferred recover swallows it to keep the platform process alive, but in tests nothing waits for the goroutine to fully exit. If it's still draining LogActivity-shaped work after the test returns, those INSERTs land in the next test's sqlmock connection as kind=DELEGATION_FAILED / kind=WORKSPACE_PROVISION_FAILED, surfacing as "INSERT-not-expected". Fix: introduce drainCoalesceGoroutine(t, wsID, cycle) test helper that spawns coalesceRestart on a goroutine (matching production) and registers a t.Cleanup with sync.WaitGroup.Wait so the test can't declare itself done while a goroutine is still alive. Convert TestCoalesceRestart_PanicInCycleClearsState to use the helper (previously it called coalesceRestart synchronously, which never exercised the production goroutine-survival contract). Add TestCoalesceRestart_DrainHelperWaitsForGoroutineExit as the regression guard: cycle blocks 150ms then panics; the test asserts t.Run elapsed >= 150ms (proving the Wait barrier engaged) AND the deferred close ran (proving the panic-recovery defer chain executed) AND state.running was cleared. Verified the assertion is real by mutation-testing: removing t.Cleanup(wg.Wait) makes this test FAIL deterministically with elapsed <300µs. Per saved memory feedback_assert_exact_not_substring: the regression test asserts an exact-shape contract (elapsed >= blockFor) rather than a substring-in-output, so it discriminates between "drain works" and "drain skipped". Per Phase 3: 10/10 race-detector runs pass for all TestCoalesceRestart_* tests. Full ./internal/handlers/... suite green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 13:04:57 -07:00
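
Commit bee4f9ea79 in the table above mentions revisiting registry-backed caching if cold builds become painful. One possible shape with plain `docker build`, assuming BuildKit is the active builder and that the `staging-latest` image in the registry was itself built with inline cache metadata. `build_with_inline_cache` and the `RUN` dry-run variable are hypothetical names, and this has not been validated against the Gitea runner topology described here.

```shell
set -eu

# Hypothetical helper: build with BuildKit inline cache metadata embedded in
# the image, reusing layers from the previously pushed staging-latest tag.
# Arguments: <image> <dockerfile> <sha-tag>
# Set RUN=echo to print the command instead of executing it.
build_with_inline_cache() {
  local image="$1"
  local dockerfile="$2"
  local sha_tag="$3"
  ${RUN:-} docker build \
    --file "${dockerfile}" \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from "${image}:staging-latest" \
    --tag "${image}:${sha_tag}" \
    --tag "${image}:staging-latest" \
    .
}
```

The cache source is the image that was just pushed, so this needs no separate cache backend; whether it helps depends on how much of the build is layer-cacheable.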