Compare commits

..

26 Commits

Author SHA1 Message Date
devops-engineer 47c0a8c903 fix(eic-tunnel-pool): port race fix b6646910 from staging
Sister-agent investigation of PR #99's Platform (Go) regression identified
that PR #99's head pre-dates b6646910 (fix(eic-tunnel-pool): capture
poolJanitorInterval at pool construction).

Without this commit, TestPooledWithEICTunnel_PanicPoisonsEntry fails under
-race on main when the cherry-pick lands. Including it in this cherry-pick
to keep main Platform (Go) green.

Verified locally by sister agent: 3-of-3 reproducible race FAIL without fix,
3-of-3 PASS with fix. 2-of-2 full handlers go test -race PASS with this
file from staging tip.

Co-authored-by: Claude (orchestrator)
2026-05-08 02:26:52 +00:00
devops-engineer 570f456436 fix(ci): port 3 verified-green CI fixes from staging to main
Files copied from staging tip (a4ab623b):
- canvas/vitest.config.ts          (vitest testTimeout 5000→30000ms on CI; was PR #97 unblocking 4 canvas-test PRs)
- .github/workflows/handlers-postgres-integration.yml  (parallel-safe pg containers; was PR #98 unblocking #84)
- .github/workflows/e2e-api.yml    (parallel-safe pg+redis; was PR #100 unblocking #84 + #99)
- docs/runbooks/handlers-postgres-integration-port-collision.md (new — substrate runbook)

Why a separate cherry-pick PR rather than promote-staging-to-main: PR #99 (full
staging→main promote) hit a Platform (Go) sqlmock regression in some other
staging commit (under Phase 1 investigation by sister agent a283d938). To
unblock prod with the verified-green CI fixes WITHOUT carrying the Go-test
regression, port these 3 workflow/config files surgically.

Verified clean: workflow-YAML + vitest config + runbook only — zero Go code
touched, so the Platform (Go) failure on PR #99 cannot apply here.

Co-authored-by: Claude (orchestrator)
2026-05-08 02:19:01 +00:00
claude-ceo-assistant 068c968206 docs(hermes): hermes-agent fork moved to Gitea (#90)
Doc update reflecting #160 hermes-agent migration. Approved by security-auditor.
2026-05-08 01:17:03 +00:00
claude-ceo-assistant 97c042f666 Merge branch 'main' into fix/hermes-agent-doc-gitea-migration 2026-05-08 00:54:30 +00:00
claude-ceo-assistant 3d6303afcc fix(ci): rewrite retarget-main-to-staging for Gitea REST API (#79)
Closes #74. Approved by security-auditor.
2026-05-08 00:26:27 +00:00
claude-ceo-assistant 3fcaa1fcc5 Merge branch 'main' into fix/hermes-agent-doc-gitea-migration 2026-05-08 00:21:17 +00:00
claude-ceo-assistant 6c823cf673 Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest 2026-05-08 00:20:49 +00:00
claude-ceo-assistant 4193d54852 fix(ci): pin actions/upload-artifact + download-artifact to @v3 (#89)
Closes #210. Unblocks 5 stuck PRs (#53/#54/#69/#71/#76/#81). Approved by security-auditor.
2026-05-08 00:20:00 +00:00
devops-engineer 0bcf195fbc docs(hermes): hermes-agent fork moved to Gitea (post-suspension)
The `HongmingWang-Rabbit/hermes-agent` fork is no longer reachable on
github.com (account suspended 2026-05-06). The patched fork now lives
at https://git.moleculesai.app/molecule-ai/hermes-agent. Same SHAs,
same branches — pure URL flip.

See molecule-ai/internal#72 for the github.com fork shell decision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:57:57 -07:00
devops-engineer 8885f7cd12 fix(ci): pin actions/upload-artifact + download-artifact to @v3 for Gitea compatibility
actions/upload-artifact@v4+ and download-artifact@v4+ use the GHES 3.10+
artifact protocol that Gitea Actions (act_runner v0.6 / Gitea 1.22.x)
does NOT implement. Failure cite from PR #54 run 1325 jobs/2:

  ::error::@actions/artifact v2.0.0+, upload-artifact@v4+ and
  download-artifact@v4+ are not currently supported on GHES.

Pinned all 3 references to v3.2.2 (latest v3) at SHA-pinned form for
supply-chain hygiene, matching the existing `uses:` style in this repo.
Affected workflows:
  - ci.yml (Canvas Next.js coverage upload, blocks `CI / Canvas (Next.js)`
    required check on every PR — was the merge-queue blocker for #53,
    #54, #69, #71, #76, #81)
  - e2e-staging-canvas.yml (Playwright report + screenshots on failure)

No download-artifact callers in the repo, so v3-pin doesn't compose-break
anywhere. Drop these pins post-Gitea-1.23+ when the v4 artifact protocol
ships, or migrate to a Gitea-native action.

Closes #210.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:54:44 -07:00
claude-ceo-assistant cdbf28fd76 ci(canary): synthetic-check cron for AUTO_SYNC_TOKEN rotation drift (#77)
6h cron probes auth + scope + git-push --dry-run. Closes #72. Approved by security-auditor.
2026-05-07 23:45:25 +00:00
claude-ceo-assistant 4b82db72a7 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 23:44:22 +00:00
claude-ceo-assistant 07bd91e436 fix(ci): replace gh run list with Gitea commit-status query (#83)
Class F of #75 sweep. /commits/{sha}/statuses replaces unavailable workflow-runs API. 4 mapping buckets verified against synthetic+real Gitea data. Approved by security-auditor.
2026-05-07 23:44:21 +00:00
claude-ceo-assistant ed0874504e Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses 2026-05-07 23:44:00 +00:00
claude-ceo-assistant fa27611e9c Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest 2026-05-07 23:02:10 +00:00
claude-ceo-assistant ca644134f2 Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest 2026-05-07 22:54:31 +00:00
claude-ceo-assistant d81fb98163 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 22:53:32 +00:00
claude-ceo-assistant 4d5c9a6646 Merge branch 'main' into fix/issue75-class-F-gh-run-list-to-statuses 2026-05-07 22:53:26 +00:00
claude-ceo-assistant 2b3a8f2e4d Merge branch 'main' into fix/196-retarget-main-to-staging-gitea-rest 2026-05-07 22:52:35 +00:00
claude-ceo-assistant 85140f1c72 Merge branch 'main' into fix/issue-72-auto-sync-token-canary-v2 2026-05-07 22:40:56 +00:00
devops-engineer 5b3ce5c818 fix(ci): replace gh run list with Gitea commit-status query (#75 class F)
Part of the post-#66 sweep to remove `gh` CLI dependencies that fail
silently against Gitea. Class F covers `gh run list --workflow=X
--commit=SHA` shapes — querying whether a specific workflow ran (and
how it finished) for a specific SHA.

Why this is the only call site in class F:

`gh run list` hits GitHub's `/repos/.../actions/runs` REST endpoint.
Gitea exposes ZERO endpoints under `/repos/.../actions/runs` —
verified 2026-05-07 via swagger inspection: only secrets, variables,
and runner-registration tokens live under /actions/. There's no way
to query workflow run state via the Gitea v1 API directly.

However, every Gitea Actions job DOES emit a commit status with
`context = "<Workflow Name> / <Job Name> (<event>)"` (verified
2026-05-07 by reading /repos/.../commits/{sha}/statuses on a recent
main SHA). That surface is exactly what we need: each workflow run
leg is one status row, the aggregate state encodes the run outcome,
and Gitea exposes it under `/api/v1/repos/.../commits/{sha}/statuses`
which IS available.

Affected:

`auto-promote-on-e2e.yml` (lines 172-180):
  Old: `gh run list --workflow e2e-staging-saas.yml --commit $SHA
       --json status,conclusion --jq ...` returning a 5-bucket string
       like `completed/success` | `in_progress/none` | `none/none` |
       `completed/failure` | `completed/cancelled`.
  New: `curl /api/v1/repos/.../commits/$SHA/statuses` + jq filter on
       contexts whose name starts with
       `"E2E Staging SaaS (full lifecycle) /"`. Mapping:
         0 matched contexts          → "none/none"      (E2E paths-
                                                          filtered out
                                                          — same as
                                                          before)
         any context = pending       → "in_progress/none" (defer)
         any context = error|failure → "completed/failure" (abort)
         all contexts = success      → "completed/success" (proceed)
  The `completed/cancelled` arm of the case statement becomes
  unreachable: Gitea status API doesn't expose a `cancelled` state
  (it has success/failure/error/pending/warning), so per-SHA
  concurrency cancellations now surface as `failure` and are handled
  by the failure branch. Documented in-place; the cancelled arm is
  kept as defense-in-depth for any future dual-host operation.

Verification:

- Live curl against the current main SHA returns `none/none` (E2E
  was paths-filtered for that change set — expected).
- Synthetic-input jq tests verify all four mapping buckets:
    no contexts                 → "none/none"
    one context = pending       → "in_progress/none"
    success + success           → "completed/success"
    success + failure           → "completed/failure"
- YAML syntax validates.

Token: continues to use act_runner's GITHUB_TOKEN (per-run, repo
read scope). The `/commits/{sha}/statuses` endpoint is repo-scoped,
no extra perms needed.

Closes part of #75. Master tracking issue at #75; companion PRs:
#80 (class A — `gh pr ...`), #81 (class D — `gh api ...`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:38:57 -07:00
claude-ceo-assistant e4e1bf4080 ci(canary): annotate EXPECTED_PERSONA dual-update constraint
Hostile-self-review weakest-spot #2: if the devops-engineer persona
is ever renamed, the canary will go red even if everything else is
fine. Add an inline comment pointing the next editor at both files
that must update together (auto-sync-main-to-staging.yml's git
config + this canary's EXPECTED_PERSONA + the staging branch
protection's push_whitelist_usernames).

No behaviour change — comment-only.
2026-05-07 15:35:22 -07:00
claude-ceo-assistant 62629eda4a ci(canary): rewrite Probe 3 to actually validate auth (NOP push --dry-run)
While verifying Phase 4, found a real flaw in Probe 3 (`git ls-remote
refs/heads/staging`). On a public repo (which molecule-core is), Gitea
falls back to anonymous read on bad auth, so `ls-remote` succeeds even
with a junk token. The probe was therefore green-lighting rotated
tokens — false-green, the worst possible canary failure mode.

Rewritten to use `git push --dry-run` of the current staging SHA back
to `refs/heads/staging`:

- Push always authenticates (auth-gated on smart-protocol handshake,
  before the dry-run can compute the empty-diff).
- NOP by construction: pushing the current tip back to itself is
  "Everything up-to-date" with exit 0.
- Bad token → "Authentication failed", exit 128.
- Doesn't reach pre-receive (where branch-protection authz runs), so
  scope is "auth only" — matches the design intent (failure mode B);
  authz already covered daily by branch-protection-drift.yml.

Implementation note: `git push` requires a local repo. Spinning up a
fresh `git init` in a tempdir (~1KB, ~50ms) instead of pulling the
full repo via actions/checkout — actions/checkout would clone
~hundreds of MB for what amounts to "a place to run git from."

Local mutation tests pass:
- Real token: "Everything up-to-date" exit 0
- Junk token: "Authentication failed" exit 128 with actionable
  ::error:: messages pointing at the runbook

Header comment + runbook step-mapping updated to reflect new probe
shape. Refs: #72
2026-05-07 15:34:34 -07:00
devops-engineer fab65c78d6 fix(ci): rewrite retarget-main-to-staging for Gitea REST API
Root cause: same as #65/#73 — gh CLI calls Gitea GraphQL
(/api/graphql) which returns HTTP 405. Specifically:
- gh api -X PATCH /pulls/{N} sometimes works but is flaky on
  Gitea (depends on gh's host-resolution layer)
- gh pr close / gh pr comment route through GraphQL → 405

Fix: replace all gh calls with direct curl REST calls to Gitea:
- PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body
  {"base": "staging"} — retarget the PR base
- POST /api/v1/repos/{owner}/{repo}/issues/{index}/comments —
  post the explainer comment (PRs are issues in Gitea, comments
  share the issue endpoint)
- PATCH /api/v1/repos/{owner}/{repo}/pulls/{index} body
  {"state": "closed"} — close redundant PR for #1884 case

Identity: switch from secrets.GITHUB_TOKEN (per-job ephemeral,
narrow scope on Gitea) to secrets.AUTO_SYNC_TOKEN (devops-engineer
persona). Same persona used by auto-sync (#66) and auto-promote
(#78). Per feedback_per_agent_gitea_identity_default. PR-edit and
comment do not need branch-protection bypass.

Curl-status-capture pattern hardened per
feedback_curl_status_capture_pollution: http_code via -w to its
own scalar, body to a tempfile, set +e/-e bracket so curl's
non-zero-on-4xx doesn't pollute the script's exit chain.

Header comment block fully rewritten with 4 failure-mode runbooks
(A: 422 dup-base, B: token rotated, C: PR deleted, D: filter
mis-fire) per PR #66/#78's pattern.

Refs: #65, #74, #196, PR #66 + #78 (canonical reference)
Closes #74

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 15:28:26 -07:00
claude-ceo-assistant 0cef033a6a ci(canary): route curl -w to tempfile to satisfy status-capture lint
The two API probes used the unsafe shape rejected by
lint-curl-status-capture.yml (per feedback_curl_status_capture_pollution):

  status=$(curl ... -w '%{http_code}' ... || echo "000")

When curl exits non-zero (transport error, --fail-with-body 4xx/5xx),
the `-w` already wrote a code; the `|| echo "000"` then APPENDS another
"000", yielding "000000" or "409000" — passes shape checks while looking
right.

Switch to the canonical safe shape (set +e + tempfile + cat):

  set +e
  curl ... -w '%{http_code}' >code_file 2>/dev/null
  set -e
  status=$(cat code_file 2>/dev/null || true)
  [ -z "$status" ] && status="000"

Inline comment in both probe steps explains the lint constraint so
the next editor doesn't re-introduce the bad pattern.

Refs: #72, lint failure on PR #77 (1/22 red → 22/22 expected green)
2026-05-07 15:26:22 -07:00
claude-ceo-assistant bfc393c065 ci: add AUTO_SYNC_TOKEN rotation drift canary (#72)
Adds a 6h-cron synthetic check that fires the auth surface used by
auto-sync-main-to-staging.yml (PR #66) and emits a red workflow
status when AUTO_SYNC_TOKEN has drifted out of validity. Closes
hostile-self-review weakest-spot #3 from PR #66 (token-rotation
detection latency).

Read-only verification — no writes, no synthetic merge commits, no
canary branch noise. Three probes:
  1. GET /api/v1/user → token authenticates as devops-engineer
  2. GET /api/v1/repos/molecule-ai/molecule-core → read:repository scope
  3. git ls-remote refs/heads/staging → exact HTTPS auth path used by
     actions/checkout in the real auto-sync workflow

Hard-fail on missing AUTO_SYNC_TOKEN secret on both schedule and
workflow_dispatch — per feedback_schedule_vs_dispatch_secrets_hardening,
a silent soft-skip would make the canary itself drift-invisible (the
sweep-cf-orphans #2088 lesson). Operator runbook in workflow header.

Token reuse: same AUTO_SYNC_TOKEN as the workflow under monitor; no
new credential introduced. Read-only paths only.

Refs: #72, hostile-self-review #66
2026-05-07 15:23:03 -07:00
12 changed files with 2043 additions and 132 deletions
+72 -34
View File
@@ -154,30 +154,71 @@ jobs:
exit 0
fi
# Upstream is publish-workspace-server-image. Check E2E state.
# The jq filter must defend against TWO empty cases that gh
# CLI emits indistinguishably:
# 1. gh exits non-zero (network blip, auth issue) → handled
# by the `|| echo "none/none"` fallback below.
# 2. gh exits zero but returns `[]` (no E2E run on this
# main SHA — the common case for canvas-only / cmd-only
# / sweep-only changes whose paths don't trigger E2E).
# Without `(.[0] // {})`, jq sees `null` and emits
# "null/none" — which the case statement below has no
# branch for, so it falls into *) → exit 1.
# Surfaced 2026-04-30 the first time the App-token chain
# (#2389) actually fired auto-promote-on-e2e from a publish
# upstream — every prior run was E2E-upstream which
# short-circuits before this gate.
RESULT=$(gh run list \
--repo "$REPO" \
--workflow e2e-staging-saas.yml \
--branch main \
--commit "$SHA" \
--limit 1 \
--json status,conclusion \
--jq '(.[0] // {}) | "\(.status // "none")/\(.conclusion // "none")"' \
2>/dev/null || echo "none/none")
# Upstream is publish-workspace-server-image. Check E2E state
# for the same SHA via Gitea's commit-status API.
#
# GitHub-era this was `gh run list --workflow=X --commit=SHA
# --json status,conclusion` returning either `[]` (no run on
# this SHA) or `[{status, conclusion}]` (the run's state).
# Gitea has NO workflow-runs API at all — `/api/v1/repos/.../
# actions/runs` returns 404 (verified 2026-05-07, issue #75).
# However Gitea Actions DOES emit a commit status per workflow
# job, with `context = "<Workflow Name> / <Job Name> (<event>)"`,
# which is exactly what we need: each E2E run leg becomes one
# status row on the SHA, and the aggregate state encodes the
# run's outcome.
#
# Mapping:
# 0 matched contexts → "none/none" (E2E paths-
# filtered
# out — same
# semantic
# as before)
# any context = pending → "in_progress/none" (defer)
# any context = error|failure → "completed/failure" (abort)
# all contexts = success → "completed/success" (proceed)
#
# The "completed/cancelled" and "completed/timed_out" buckets
# don't have direct Gitea analogs (Gitea statuses are
# success / failure / error / pending / warning). Per-SHA
# concurrency cancellation surfaces as `error` on Gitea, which
# we map to "completed/failure" rather than "completed/cancelled"
# — losing the soft-defer semantic of the cancelled bucket on
# this fleet. Tradeoff: the staleness alarm (auto-promote-stale-
# alarm.yml) still catches a stuck :latest within 4h, and a
# legitimate cancel is rare enough that aborting + manual
# re-dispatch is acceptable. If we measure cancel frequency
# > 1/week, revisit by reading the run-step-summary text via
# a follow-up script.
#
# Network or auth blips collapse to "none/none" via the curl
# `|| true` fallback, matching the pre-Gitea behaviour where
# an empty list also degenerated to none/none.
GITEA_API_URL="${GITHUB_SERVER_URL:-https://git.moleculesai.app}/api/v1"
STATUSES_JSON=$(curl --fail-with-body -sS \
-H "Authorization: token ${GH_TOKEN}" \
-H "Accept: application/json" \
"${GITEA_API_URL}/repos/${REPO}/commits/${SHA}/statuses?limit=100" \
2>/dev/null || echo "[]")
RESULT=$(printf '%s' "$STATUSES_JSON" | jq -r '
# Filter to E2E Staging SaaS (full lifecycle) statuses.
# Match by leading workflow-name prefix so the "<job>
# (<event>)" tail is irrelevant. Gitea emits the workflow
# name verbatim from the YAML `name:` field.
[.[] | select(.context | startswith("E2E Staging SaaS (full lifecycle) /"))] as $rows
| if ($rows | length) == 0 then
"none/none"
elif any($rows[]; .status == "pending") then
"in_progress/none"
elif any($rows[]; .status == "failure" or .status == "error") then
"completed/failure"
elif all($rows[]; .status == "success") then
"completed/success"
else
# Mixed / unknown — fall through to *) bucket below.
"completed/" + ($rows[0].status // "unknown")
end
' 2>/dev/null || echo "none/none")
echo "E2E Staging SaaS for ${SHA:0:7}: $RESULT"
@@ -199,16 +240,13 @@ jobs:
exit 1
;;
completed/cancelled)
# cancelled ≠ failure. Per-SHA concurrency cancels older E2E
# runs when a newer push lands (memory:
# feedback_concurrency_group_per_sha) — the newer SHA will
# have its own E2E + promote chain. Treat the same as
# in_progress: defer without aborting, let the next E2E run
# promote when it lands.
#
# Caught 2026-05-05 02:03 on sha 31f9a5e — auto-promote
# blocked the whole chain because this case fell through to
# exit 1 instead of clean defer.
# GitHub-era only: cancelled ≠ failure. Gitea statuses
# don't expose a "cancelled" state — a per-SHA concurrency
# cancellation surfaces as `failure` or `error` on Gitea
# and is now handled by the failure branch above. This
# arm is kept for backwards compatibility / dual-host
# operation (if we ever add a non-Gitea fallback) but
# under the post-#75 flow it's unreachable.
echo "proceed=false" >> "$GITHUB_OUTPUT"
{
echo "## ⏭ Auto-promote deferred — E2E Staging SaaS was cancelled"
+404
View File
@@ -0,0 +1,404 @@
name: Auto-sync canary — AUTO_SYNC_TOKEN rotation drift
# Synthetic health check for the AUTO_SYNC_TOKEN secret consumed by
# auto-sync-main-to-staging.yml (PR #66) and publish-workspace-server-image.yml.
#
# ============================================================
# Why this workflow exists
# ============================================================
#
# PR #66 fixed auto-sync (replaced GitHub-era `gh pr create` — which
# 405s on Gitea's GraphQL endpoint — with a direct git push from the
# `devops-engineer` persona's `AUTO_SYNC_TOKEN`). Hostile self-review
# weakest spot #3 of that PR:
#
# "Token rotation silently breaks auto-sync. If AUTO_SYNC_TOKEN is
# rotated without updating the repo secret, every push to main
# fails red on the auto-sync push step. The workflow surfaces the
# failure mode in the step summary (failure mode B in the header),
# but there's no proactive monitoring."
#
# Detection latency under the status quo: rotation is only caught on
# the next push to `main`. During quiet periods (no main push for
# many hours) the staging-superset-of-main invariant silently breaks.
#
# This workflow closes the gap: every 6 hours, it fires the auth
# surface that auto-sync depends on and emits a red workflow status
# if AUTO_SYNC_TOKEN has drifted out of validity.
#
# ============================================================
# What this checks (Option B — read-only verify)
# ============================================================
#
# 1. `GET /api/v1/user` against Gitea with the token → validates the
# token authenticates AND resolves to `devops-engineer` (catches
# the case where the token was regenerated under a different
# persona by mistake).
# 2. `GET /api/v1/repos/molecule-ai/molecule-core` with the token →
# validates the token has `read:repository` scope on this repo
# (the v2 scope contract — see saved memory
# `reference_persona_token_v2_scope`).
# 3. `git push --dry-run` of the current staging SHA back to
# `refs/heads/staging` via `https://oauth2:<token>@<gitea>/...`
# → validates the EXACT HTTPS basic-auth path that
# `actions/checkout` + `git push origin staging` use inside
# auto-sync-main-to-staging.yml. NOP by construction (push the
# current tip to itself = "Everything up-to-date"); auth is
# checked at the smart-protocol handshake BEFORE the empty-diff
# computation, so bad token → exit 128 with "Authentication
# failed". `git ls-remote` is NOT used here because Gitea
# falls back to anonymous read on public repos and would
# silently green-light a rotated token.
#
# Each step exits non-zero with an actionable error message if it
# fails. The workflow status itself is the operator-facing surface.
#
# ============================================================
# What this does NOT check (intentional)
# ============================================================
#
# - **Branch-protection authz** (failure mode C in auto-sync header):
# would require an actual write to staging. Already monitored by
# `branch-protection-drift.yml` daily. Don't duplicate.
# - **Conflict resolution** (failure mode A): a real conflict is data-
# driven, not auth-driven; can't synthesise it without polluting
# staging. Already surfaces immediately on the next main push.
# - **Concurrency** (failure mode D): handled by workflow concurrency
# group on auto-sync, not a credential issue.
#
# ============================================================
# Why Option B (read-only) and not the alternatives
# ============================================================
#
# Considered + rejected (see issue #72 for full write-up):
#
# - **Option A — full auto-sync on schedule**: every run creates a
# no-op merge commit on staging when main hasn't advanced. 4 noise
# commits/day. And races the real `push:` trigger when main has
# advanced. Rejected.
#
# - **Option C — push to dedicated `auto-sync-canary` branch**: would
# exercise authz too, but adds branch noise on Gitea AND requires
# maintaining a second branch protection (or expanding staging's
# whitelist to a junk branch). Authz already covered by
# `branch-protection-drift.yml`. Rejected.
#
# Prior art for the chosen Option B shape:
# - Cloudflare's `/user/tokens/verify` endpoint (read-only auth
# probe explicitly designed for credential canaries).
# - AWS Secrets Manager rotation Lambda's `testSecret` step (auth
# probe before promoting AWSPENDING → AWSCURRENT).
# - HashiCorp Vault's `vault token lookup` for renewal canaries.
#
# ============================================================
# Operator runbook — what to do when this workflow goes RED
# ============================================================
#
# 1. **Identify which step failed**:
# - Step "Verify token authenticates as devops-engineer" red →
# token is invalid OR resolves to wrong persona.
# - Step "Verify token has repo read scope" red → token valid but
# stripped of `read:repository` scope (or repo perms changed).
# - Step "Verify git HTTPS auth path via no-op dry-run push to
# staging" red → token rotated/revoked OR Gitea git-HTTPS
# surface is broken (rare). Auth check happens on the
# smart-protocol handshake, separate from the API path.
#
# 2. **Re-issue the token** on the operator host:
# ```
# ssh root@5.78.80.188 'docker exec --user git molecule-gitea-1 \
# gitea admin user generate-access-token \
# --username devops-engineer \
# --token-name persona-devops-engineer-vN \
# --scopes "read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc"'
# ```
# Update `/etc/molecule-bootstrap/agent-secrets.env` in place
# (per `feedback_unified_credentials_file`). The previous token
# file lands at `.bak.<date>`.
#
# 3. **Update the repo Actions secret** at:
# Settings → Secrets and variables → Actions → AUTO_SYNC_TOKEN
# Paste the new token. (Don't echo it in chat — but per
# `feedback_passwords_in_chat_are_burned`, a paste in a 1:1
# Claude session is within trust boundary.)
#
# 4. **Re-run this canary** via workflow_dispatch. Confirm GREEN.
#
# 5. **Backfill any missed main → staging syncs** by re-running
# `auto-sync-main-to-staging.yml` from its workflow_dispatch
# surface, OR by pushing an empty commit to main (if you'd
# rather force a real trigger).
#
# ============================================================
# Security notes
# ============================================================
#
# - Token usage: read-only (`GET /api/v1/user`, `GET /api/v1/repos/...`,
# `git ls-remote`). No write paths. Same blast-radius profile as
# `actions/checkout` on a public repo.
# - The token NEVER appears in logs: every `curl` uses a header
# variable, never inline; the `git ls-remote` URL builds the
# `oauth2:$TOKEN@host` form into a single env var that's not
# echoed. GitHub Actions secret-masking covers anything that does
# slip through.
# - No new token introduced — same `AUTO_SYNC_TOKEN` the workflow
# under monitor uses. Per least-privilege we deliberately do NOT
# broaden scope for the canary.
on:
schedule:
# Every 6 hours at :17 (offsets the cron herd at :00). Justification
# from issue #72: cheap to run (~5s wall-clock, no quota), 3h average
# detection latency, 6h max. 1h would be 24× the runs for marginal
# benefit; daily would be 6× longer latency and worse than status
# quo on a quiet-main day.
- cron: '17 */6 * * *'
workflow_dispatch:
# No concurrency group needed — the canary is read-only and idempotent.
# Two parallel runs (e.g. operator dispatch during a scheduled tick) are
# harmless: same result, doubled HTTPS calls, no shared state.
permissions:
contents: read
jobs:
verify-token:
name: Verify AUTO_SYNC_TOKEN validity
runs-on: ubuntu-latest
# 2 min surfaces hangs (Gitea API stall, DNS issue) within one
# cron interval. Realistic worst case is ~10s: 2 curls + 1 git
# ls-remote, each capped by the explicit timeouts below.
timeout-minutes: 2
env:
# Pinned in env so individual steps can read it without
# repeating the secret reference. GitHub masks the value in
# logs automatically.
AUTO_SYNC_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
# MUST stay in sync with auto-sync-main-to-staging.yml's
# `git config user.name "devops-engineer"` line. Renaming the
# devops-engineer persona requires updating both files (and
# the staging branch protection's `push_whitelist_usernames`).
EXPECTED_PERSONA: devops-engineer
GITEA_HOST: git.moleculesai.app
REPO_PATH: molecule-ai/molecule-core
steps:
- name: Verify AUTO_SYNC_TOKEN secret is configured
# Schedule-vs-dispatch behaviour split, per
# `feedback_schedule_vs_dispatch_secrets_hardening`:
#
# - schedule: hard-fail when the secret is missing. The
# whole point of the canary is to surface drift; soft-
# skipping on missing-secret would make the canary
# itself drift-invisible (sweep-cf-orphans #2088 lesson).
# - workflow_dispatch: hard-fail too — there's no scenario
# where an operator wants this canary to silently no-op.
# The workflow has no other ad-hoc utility; if you ran
# it, you wanted the answer.
run: |
if [ -z "${AUTO_SYNC_TOKEN}" ]; then
echo "::error::AUTO_SYNC_TOKEN secret is not set on this repo." >&2
echo "::error::Set it at Settings → Secrets and variables → Actions." >&2
echo "::error::Without it, auto-sync-main-to-staging.yml will fail every push to main." >&2
exit 1
fi
echo "AUTO_SYNC_TOKEN is configured (value masked)."
- name: Verify token authenticates as ${{ env.EXPECTED_PERSONA }}
# Calls Gitea's `/api/v1/user` — the canonical
# auth-probe-with-no-side-effects endpoint (mirrors
# Cloudflare's /user/tokens/verify).
#
# Failure surfaces:
# - HTTP 401: token invalid (rotated, revoked, or never
# correctly registered).
# - HTTP 200 but username != devops-engineer: token was
# regenerated under the wrong persona — this would let
# auth pass but commit attribution would be wrong, and
# branch-protection authz would fail because only
# `devops-engineer` is whitelisted.
run: |
set -euo pipefail
response_file="$(mktemp)"
code_file="$(mktemp)"
# `--max-time 30`: full call ceiling. `--connect-timeout 10`:
# DNS + TCP. `-w "%{http_code}"` routed to a tempfile so curl's
# exit code can't pollute the captured status — see
# feedback_curl_status_capture_pollution + the
# `lint-curl-status-capture.yml` gate that rejects the unsafe
# `$(curl ... || echo "000")` shape.
set +e
curl -sS -o "$response_file" \
--max-time 30 --connect-timeout 10 \
-w "%{http_code}" \
-H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-H "Accept: application/json" \
"https://${GITEA_HOST}/api/v1/user" >"$code_file" 2>/dev/null
set -e
status=$(cat "$code_file" 2>/dev/null || true)
[ -z "$status" ] && status="000"
if [ "$status" != "200" ]; then
echo "::error::Token rotation suspected: GET /api/v1/user returned HTTP $status (expected 200)." >&2
echo "::error::Likely cause: AUTO_SYNC_TOKEN has been rotated/revoked on Gitea but the repo Actions secret was not updated." >&2
echo "::error::Runbook: see header comment of this workflow file." >&2
# Print response body but redact anything that looks like a token.
sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
exit 1
fi
username=$(python3 -c "import json,sys; print(json.load(open(sys.argv[1])).get('login',''))" "$response_file")
if [ "$username" != "${EXPECTED_PERSONA}" ]; then
echo "::error::Token resolves to user '$username', expected '${EXPECTED_PERSONA}'." >&2
echo "::error::AUTO_SYNC_TOKEN must be the devops-engineer persona PAT (not founder PAT, not another persona)." >&2
echo "::error::Auto-sync push will fail because only 'devops-engineer' is whitelisted on staging branch protection." >&2
exit 1
fi
echo "Token authenticates as: $username ✓"
- name: Verify token has repo read scope
# `GET /api/v1/repos/<owner>/<repo>` requires `read:repository`
# on the persona's v2 scope contract. If the scope was
# narrowed/dropped on rotation we catch it here, before the
# next main push reveals it via a checkout failure.
run: |
set -euo pipefail
response_file="$(mktemp)"
code_file="$(mktemp)"
# See first probe step for the rationale on the tempfile-routed
# `-w "%{http_code}"` pattern — the unsafe `|| echo "000"` shape
# is rejected by lint-curl-status-capture.yml.
set +e
curl -sS -o "$response_file" \
--max-time 30 --connect-timeout 10 \
-w "%{http_code}" \
-H "Authorization: token ${AUTO_SYNC_TOKEN}" \
-H "Accept: application/json" \
"https://${GITEA_HOST}/api/v1/repos/${REPO_PATH}" >"$code_file" 2>/dev/null
set -e
status=$(cat "$code_file" 2>/dev/null || true)
[ -z "$status" ] && status="000"
if [ "$status" != "200" ]; then
echo "::error::Token lacks read:repository scope on ${REPO_PATH}: HTTP $status." >&2
echo "::error::Auto-sync's actions/checkout step will fail with this token." >&2
echo "::error::Re-issue with v2 scope contract: read:repository,write:repository,read:user,read:organization,read:issue,write:issue,read:notification,read:misc" >&2
sed -E 's/[A-Fa-f0-9]{32,}/<redacted>/g' "$response_file" >&2 || true
exit 1
fi
echo "Token has read:repository on ${REPO_PATH} ✓"
- name: Verify git HTTPS auth path via no-op dry-run push to staging
# Final probe: exercise the EXACT auth path that
# `actions/checkout` + `git push origin staging` use in
# auto-sync-main-to-staging.yml. Gitea's API and git-HTTPS
# surfaces share the token-lookup code path internally but
# the wire-level error shapes differ — historically (#173)
# the API path was healthy while git-HTTPS rejected, so
# checking only the API would have given false-green.
#
# IMPORTANT: `git ls-remote` on a public repo (which
# molecule-core is) succeeds even with a junk token because
# Gitea falls back to anonymous-read. `ls-remote` therefore
# CANNOT validate auth on this surface. We use
# `git push --dry-run` instead — push is auth-gated even on
# public repos.
#
# NOP shape: read the current staging SHA via authenticated
# ls-remote (the SHA itself is public; auth is incidental
# here, used only to colocate the discovery in one step), then
# `git push --dry-run <SHA>:refs/heads/staging`. Pushing the
# current tip back to itself is "Everything up-to-date" with
# exit 0 when auth succeeds. With a bad token Gitea returns
# HTTP 401 in the smart-protocol handshake and git exits 128
# with "Authentication failed".
#
# The dry-run never reaches Gitea's pre-receive hook (which
# is where branch-protection authz runs), so this probe does
# not validate failure mode C. That's intentional —
# branch-protection-drift.yml owns authz monitoring; this
# canary owns auth.
env:
# Don't hang waiting for password prompt if auth fails on a
# terminal-attached run. (In Actions there's no terminal,
# but the env-var hardens against an interactive runner
# config.)
GIT_TERMINAL_PROMPT: "0"
run: |
set -euo pipefail
# Token is in $AUTO_SYNC_TOKEN (job-level env). Compose the
# URL as a local var that's never echoed.
url="https://oauth2:${AUTO_SYNC_TOKEN}@${GITEA_HOST}/${REPO_PATH}"
# Step a: read current staging SHA. ~1KB; auth-gated only
# on private repos but always works on public — used here
# only to discover the SHA, not to validate auth.
staging_ref=$(timeout 30s git ls-remote --refs "$url" refs/heads/staging 2>&1) || {
redacted=$(echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
echo "::error::ls-remote against staging failed (network/DNS issue):" >&2
echo "$redacted" >&2
exit 1
}
if ! echo "$staging_ref" | grep -qE '^[0-9a-f]{40}[[:space:]]+refs/heads/staging$'; then
echo "::error::ls-remote returned unexpected shape:" >&2
echo "$staging_ref" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g" >&2
exit 1
fi
staging_sha=$(echo "$staging_ref" | awk '{print $1}')
# Step b: spin up an ephemeral local repo. `git push` always
# requires a local repo even when pushing a remote SHA that
# isn't in the local object DB (the protocol negotiates and
# discovers we don't need to send any objects). We don't use
# `actions/checkout` for this — it would clone the whole
# repo (~hundreds of MB) for what's essentially `git init`.
tmp_repo="$(mktemp -d)"
trap 'rm -rf "$tmp_repo"' EXIT
git -C "$tmp_repo" init -q
# Author config required for any git operation; values are
# arbitrary because nothing gets committed here.
git -C "$tmp_repo" config user.email canary@auto-sync.local
git -C "$tmp_repo" config user.name auto-sync-canary
# Step c: dry-run push the current staging SHA back to
# staging. NOP by construction — the remote tip equals the
# SHA we're pushing, so "Everything up-to-date" is the
# success path.
#
# Authentication is checked at the smart-protocol handshake,
# BEFORE the dry-run can compute an empty diff. Bad token
# → "Authentication failed", exit 128. Good token → exit 0.
set +e
push_out=$(timeout 30s git -C "$tmp_repo" push --dry-run "$url" "${staging_sha}:refs/heads/staging" 2>&1)
push_rc=$?
set -e
if [ "$push_rc" -ne 0 ]; then
redacted=$(echo "$push_out" | sed -E "s|oauth2:[^@]+@|oauth2:<redacted>@|g")
echo "::error::Token rotation suspected: git push --dry-run against staging failed via the AUTO_SYNC_TOKEN HTTPS auth path (exit $push_rc)." >&2
echo "::error::This is the EXACT auth path that actions/checkout + git push use in auto-sync-main-to-staging.yml." >&2
echo "::error::Likely cause: AUTO_SYNC_TOKEN was rotated/revoked on Gitea but the repo Actions secret was not updated. Runbook: see header." >&2
echo "$redacted" >&2
exit 1
fi
echo "git HTTPS auth path: NOP push --dry-run to staging → ${staging_sha:0:8} ✓"
- name: Summarise canary result
# Everything passed — surface a green summary. (Failures
# already wrote ::error:: lines and exited above; if we got
# here, all three probes passed.)
run: |
{
echo "## Auto-sync canary: GREEN"
echo ""
echo "AUTO_SYNC_TOKEN is healthy:"
echo "- Authenticates as \`${EXPECTED_PERSONA}\` ✓"
echo "- Has \`read:repository\` scope on \`${REPO_PATH}\` ✓"
echo "- Git HTTPS auth path: no-op dry-run push to \`refs/heads/staging\` succeeds ✓"
echo ""
echo "Auto-sync main → staging will succeed on the next push to main."
echo "If this canary ever goes RED, see the runbook in this workflow's header."
} >> "$GITHUB_STEP_SUMMARY"
+7 -1
View File
@@ -235,7 +235,13 @@ jobs:
run: npx vitest run --coverage
- name: Upload coverage summary as artifact
if: needs.changes.outputs.canvas == 'true' && always()
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
# Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
# the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
# implement, surfacing as `GHESNotSupportedError: @actions/artifact
# v2.0.0+, upload-artifact@v4+ and download-artifact@v4+ are not
# currently supported on GHES`. Drop this pin when Gitea ships
# the v4 protocol (tracked: post-Gitea-1.23 followup).
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
with:
name: canvas-coverage-${{ github.run_id }}
path: canvas/coverage/
+123 -7
View File
@@ -12,6 +12,59 @@ name: E2E API Smoke Test
# spending CI cycles. See the in-job comment on the `e2e-api` job for
# why this is one job (not two-jobs-sharing-name) and the 2026-04-29
# PR #2264 incident that drove the consolidation.
#
# Parallel-safety (Class B Hongming-owned CICD red sweep, 2026-05-08)
# -------------------------------------------------------------------
# Same substrate hazard as PR #98 (handlers-postgres-integration). Our
# Gitea act_runner runs with `container.network: host` (operator host
# `/opt/molecule/runners/config.yaml`), which means:
#
# * Two concurrent runs both try to bind their `-p 15432:5432` /
# `-p 16379:6379` host ports — the second postgres/redis FATALs
# with `Address in use` and `docker run` returns exit 125 with
# `Conflict. The container name "/molecule-ci-postgres" is already
# in use by container ...`. Verified in run a7/2727 on 2026-05-07.
# * The fixed container names `molecule-ci-postgres` / `-redis` (the
# pre-fix shape) collide on name AS WELL AS port. The cleanup-with-
# `docker rm -f` at the start of the second job KILLS the first
# job's still-running postgres/redis.
#
# Fix shape (mirrors PR #98's bridge-net pattern, adapted because
# platform-server is a Go binary on the host, not a containerised
# step):
#
# 1. Unique container names per run:
# pg-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
# redis-e2e-api-${RUN_ID}-${RUN_ATTEMPT}
# `${RUN_ID}-${RUN_ATTEMPT}` is unique even across reruns of the
# same run_id.
# 2. Ephemeral host port per run (`-p 0:5432`), then read the actual
# bound port via `docker port` and export DATABASE_URL/REDIS_URL
# pointing at it. No fixed host-port → no port collision.
# 3. `127.0.0.1` (NOT `localhost`) in URLs — IPv6 first-resolve was
# the original flake fixed in #92 and the script's still IPv6-
# enabled.
# 4. `if: always()` cleanup so containers don't leak when test steps
# fail.
#
# Issue #94 items #2 + #3 (also fixed here):
# * Pre-pull `alpine:latest` so the platform-server's provisioner
# (`internal/handlers/container_files.go`) can stand up its
# ephemeral token-write helper without a daemon.io round-trip.
# * Create `molecule-monorepo-net` bridge network if missing so the
# provisioner's container.HostConfig {NetworkMode: ...} attach
# succeeds.
# Item #1 (timeouts) — evidence on recent runs (77/3191, ae/4270, 0e/
# 2318) shows Postgres ready in 3s, Redis in 1s, Platform in 1s when
# they DO come up. Timeouts are not the bottleneck; not bumped.
#
# Item explicitly NOT fixed here: failing test `Status back online`
# fails because the platform's langgraph workspace template image
# (ghcr.io/molecule-ai/workspace-template-langgraph:latest) returns
# 403 Forbidden post-2026-05-06 GitHub org suspension. That is a
# template-registry resolution issue (ADR-002 / local-build mode) and
# belongs in a separate change that touches workspace-server, not
# this workflow file.
on:
push:
@@ -78,11 +131,14 @@ jobs:
runs-on: ubuntu-latest
timeout-minutes: 15
env:
DATABASE_URL: postgres://dev:dev@localhost:15432/molecule?sslmode=disable
REDIS_URL: redis://localhost:16379
# Unique per-run container names so concurrent runs on the host-
# network act_runner don't collide on name OR port.
# `${RUN_ID}-${RUN_ATTEMPT}` stays unique across reruns of the
# same run_id. PORT is set later (after docker port lookup) since
# we let Docker assign an ephemeral host port.
PG_CONTAINER: pg-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
REDIS_CONTAINER: redis-e2e-api-${{ github.run_id }}-${{ github.run_attempt }}
PORT: "8080"
PG_CONTAINER: molecule-ci-postgres
REDIS_CONTAINER: molecule-ci-redis
steps:
- name: No-op pass (paths filter excluded this commit)
if: needs.detect-changes.outputs.api != 'true'
@@ -97,11 +153,53 @@ jobs:
go-version: 'stable'
cache: true
cache-dependency-path: workspace-server/go.sum
- name: Pre-pull alpine + ensure provisioner network (Issue #94 items #2 + #3)
if: needs.detect-changes.outputs.api == 'true'
run: |
# Provisioner uses alpine:latest for ephemeral token-write
# containers (workspace-server/internal/handlers/container_files.go).
# Pre-pull so the first provision in test_api.sh doesn't race
# the daemon's pull cache. Idempotent — `docker pull` is a no-op
# when the image is already present.
docker pull alpine:latest >/dev/null
# Provisioner attaches workspace containers to
# molecule-monorepo-net (workspace-server/internal/provisioner/
# provisioner.go::DefaultNetwork). The bridge already exists on
# the operator host's docker daemon — `network create` is
# idempotent via `|| true`.
docker network create molecule-monorepo-net >/dev/null 2>&1 || true
echo "alpine:latest pre-pulled; molecule-monorepo-net ensured."
- name: Start Postgres (docker)
if: needs.detect-changes.outputs.api == 'true'
run: |
# Defensive cleanup — only matches THIS run's container name,
# so it cannot kill a sibling run's postgres. (Pre-fix the
# name was static and this rm hit other runs' containers.)
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
docker run -d --name "$PG_CONTAINER" -e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule -p 15432:5432 postgres:16
# `-p 0:5432` requests an ephemeral host port; we read it back
# below and export DATABASE_URL.
docker run -d --name "$PG_CONTAINER" \
-e POSTGRES_USER=dev -e POSTGRES_PASSWORD=dev -e POSTGRES_DB=molecule \
-p 0:5432 postgres:16 >/dev/null
# Resolve the host-side port assignment. `docker port` prints
# `0.0.0.0:NNNN` (and on host-net runners may also print an
# IPv6 line — take the first IPv4 line).
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$PG_PORT" ]; then
# Fallback: any first line. Some Docker versions print only
# one line.
PG_PORT=$(docker port "$PG_CONTAINER" 5432/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$PG_PORT" ]; then
echo "::error::Could not resolve host port for $PG_CONTAINER"
docker port "$PG_CONTAINER" 5432/tcp || true
docker logs "$PG_CONTAINER" || true
exit 1
fi
# 127.0.0.1 (NOT localhost) — IPv6 first-resolve flake (#92).
echo "PG_PORT=${PG_PORT}" >> "$GITHUB_ENV"
echo "DATABASE_URL=postgres://dev:dev@127.0.0.1:${PG_PORT}/molecule?sslmode=disable" >> "$GITHUB_ENV"
echo "Postgres host port: ${PG_PORT}"
for i in $(seq 1 30); do
if docker exec "$PG_CONTAINER" pg_isready -U dev >/dev/null 2>&1; then
echo "Postgres ready after ${i}s"
@@ -116,7 +214,20 @@ jobs:
if: needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$REDIS_CONTAINER" 2>/dev/null || true
docker run -d --name "$REDIS_CONTAINER" -p 16379:6379 redis:7
docker run -d --name "$REDIS_CONTAINER" -p 0:6379 redis:7 >/dev/null
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | awk -F: '/^0\.0\.0\.0:/ {print $2; exit}')
if [ -z "$REDIS_PORT" ]; then
REDIS_PORT=$(docker port "$REDIS_CONTAINER" 6379/tcp | head -1 | awk -F: '{print $NF}')
fi
if [ -z "$REDIS_PORT" ]; then
echo "::error::Could not resolve host port for $REDIS_CONTAINER"
docker port "$REDIS_CONTAINER" 6379/tcp || true
docker logs "$REDIS_CONTAINER" || true
exit 1
fi
echo "REDIS_PORT=${REDIS_PORT}" >> "$GITHUB_ENV"
echo "REDIS_URL=redis://127.0.0.1:${REDIS_PORT}" >> "$GITHUB_ENV"
echo "Redis host port: ${REDIS_PORT}"
for i in $(seq 1 15); do
if docker exec "$REDIS_CONTAINER" redis-cli ping 2>/dev/null | grep -q PONG; then
echo "Redis ready after ${i}s"
@@ -135,13 +246,15 @@ jobs:
if: needs.detect-changes.outputs.api == 'true'
working-directory: workspace-server
run: |
# DATABASE_URL + REDIS_URL exported by the start-postgres /
# start-redis steps point at this run's per-run host ports.
./platform-server > platform.log 2>&1 &
echo $! > platform.pid
- name: Wait for /health
if: needs.detect-changes.outputs.api == 'true'
run: |
for i in $(seq 1 30); do
if curl -sf http://localhost:8080/health > /dev/null; then
if curl -sf http://127.0.0.1:8080/health > /dev/null; then
echo "Platform up after ${i}s"
exit 0
fi
@@ -185,6 +298,9 @@ jobs:
kill "$(cat workspace-server/platform.pid)" 2>/dev/null || true
fi
- name: Stop service containers
# always() so containers don't leak when test steps fail. The
# cleanup is best-effort: if the container is already gone
# (e.g. concurrent rerun race), don't fail the job.
if: always() && needs.detect-changes.outputs.api == 'true'
run: |
docker rm -f "$PG_CONTAINER" 2>/dev/null || true
+7 -2
View File
@@ -139,7 +139,11 @@ jobs:
- name: Upload Playwright report on failure
if: failure() && needs.detect-changes.outputs.canvas == 'true'
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
# Pinned to v3 for Gitea act_runner v0.6 compatibility — v4+ uses
# the GHES 3.10+ artifact protocol that Gitea 1.22.x does NOT
# implement (see ci.yml upload step for the canonical error
# cite). Drop this pin when Gitea ships the v4 protocol.
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
with:
name: playwright-report-staging
path: canvas/playwright-report-staging/
@@ -147,7 +151,8 @@ jobs:
- name: Upload screenshots on failure
if: failure() && needs.detect-changes.outputs.canvas == 'true'
uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1
# Pinned to v3 for Gitea act_runner v0.6 compatibility (see above).
uses: actions/upload-artifact@c6a366c94c3e0affe28c06c8df20a878f24da3cf # v3.2.2
with:
name: playwright-screenshots
path: canvas/test-results/
@@ -14,12 +14,42 @@ name: Handlers Postgres Integration
# self-review caught it took 2 minutes to set up and would have caught
# the bug at PR-time.
#
# This job spins a Postgres service container, applies the migration,
# and runs `go test -tags=integration` against a live DB. Required
# check on staging branch protection — backend handler PRs cannot
# merge without a real-DB regression gate.
# Why this workflow does NOT use `services: postgres:` (Class B fix)
# ------------------------------------------------------------------
# Our act_runner config has `container.network: host` (operator host
# /opt/molecule/runners/config.yaml), which act_runner applies to BOTH
# the job container AND every service container. With host-net, two
# concurrent runs of this workflow both try to bind 0.0.0.0:5432 — the
# second postgres FATALs with `could not create any TCP/IP sockets:
# Address in use`, and Docker auto-removes it (act_runner sets
# AutoRemove:true on service containers). By the time the migrations
# step runs `psql`, the postgres container is gone, hence
# `Connection refused` then `failed to remove container: No such
# container` at cleanup time.
#
# Cost: ~30s job (postgres pull from GH cache + go build + 4 tests).
# Per-job `container.network` override is silently ignored by
# act_runner — `--network and --net in the options will be ignored.`
# appears in the runner log. Documented constraint.
#
# So we sidestep `services:` entirely. The job container still uses
# host-net (inherited from runner config; required for cache server
# discovery on the bridge IP 172.18.0.17:42631). We launch a sibling
# postgres on the existing `molecule-monorepo-net` bridge with a
# UNIQUE name per run — `pg-handlers-${RUN_ID}-${RUN_ATTEMPT}` — and
# read its bridge IP via `docker inspect`. A host-net job container
# can reach a bridge-net container directly via the bridge IP (verified
# manually on operator host 2026-05-08).
#
# Trade-offs vs. the original `services:` shape:
# + No host-port collision; N parallel runs share the bridge cleanly
# + `if: always()` cleanup runs even on test-step failure
# - One more step in the workflow (+~3 lines)
# - Requires `molecule-monorepo-net` to exist on the operator host
# (it does; declared in docker-compose.yml + docker-compose.infra.yml)
#
# Class B Hongming-owned CICD red sweep, 2026-05-08.
#
# Cost: ~30s job (postgres pull from cache + go build + 4 tests).
on:
push:
@@ -59,20 +89,14 @@ jobs:
name: Handlers Postgres Integration
needs: detect-changes
runs-on: ubuntu-latest
services:
postgres:
image: postgres:15-alpine
env:
POSTGRES_PASSWORD: test
POSTGRES_DB: molecule
ports:
- 5432:5432
# GHA spins this with --health-cmd built in for postgres images.
options: >-
--health-cmd pg_isready
--health-interval 5s
--health-timeout 5s
--health-retries 10
env:
# Unique name per run so concurrent jobs don't collide on the
# bridge network. ${RUN_ID}-${RUN_ATTEMPT} is unique even across
# workflow_dispatch reruns of the same run_id.
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
# Bridge network already exists on the operator host (declared
# in docker-compose.yml + docker-compose.infra.yml).
PG_NETWORK: molecule-monorepo-net
defaults:
run:
working-directory: workspace-server
@@ -89,16 +113,57 @@ jobs:
with:
go-version: 'stable'
- if: needs.detect-changes.outputs.handlers == 'true'
name: Start sibling Postgres on bridge network
working-directory: .
run: |
# Sanity: the bridge network must exist on the operator host.
# Hard-fail loud if it doesn't — easier to spot than a silent
# auto-create that diverges from the rest of the stack.
if ! docker network inspect "${PG_NETWORK}" >/dev/null 2>&1; then
echo "::error::Bridge network '${PG_NETWORK}' missing on operator host. Re-run docker-compose.infra.yml or check ops handbook."
exit 1
fi
# If a stale container with the same name exists (rerun on
# the same run_id), wipe it first.
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
docker run -d \
--name "${PG_NAME}" \
--network "${PG_NETWORK}" \
--health-cmd "pg_isready -U postgres" \
--health-interval 5s \
--health-timeout 5s \
--health-retries 10 \
-e POSTGRES_PASSWORD=test \
-e POSTGRES_DB=molecule \
postgres:15-alpine >/dev/null
# Read back the bridge IP. Always present immediately after
# `docker run -d` for bridge networks.
PG_HOST=$(docker inspect "${PG_NAME}" \
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
if [ -z "${PG_HOST}" ]; then
echo "::error::Could not resolve PG_HOST for ${PG_NAME} on ${PG_NETWORK}"
docker logs "${PG_NAME}" || true
exit 1
fi
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
echo "INTEGRATION_DB_URL=postgres://postgres:test@${PG_HOST}:5432/molecule?sslmode=disable" >> "$GITHUB_ENV"
echo "Started ${PG_NAME} at ${PG_HOST}:5432"
- if: needs.detect-changes.outputs.handlers == 'true'
name: Apply migrations to Postgres service
env:
PGPASSWORD: test
run: |
# Wait for postgres to actually accept connections (the
# GHA --health-cmd is best-effort but psql can still race).
# Wait for postgres to actually accept connections. Docker's
# health-cmd handles container-side readiness, but the wire
# to the bridge IP is best-tested with pg_isready directly.
for i in {1..15}; do
if pg_isready -h localhost -p 5432 -U postgres -q; then break; fi
echo "waiting for postgres..."; sleep 2
if pg_isready -h "${PG_HOST}" -p 5432 -U postgres -q; then break; fi
echo "waiting for postgres at ${PG_HOST}:5432..."; sleep 2
done
# Apply every .up.sql in lexicographic order with
@@ -131,7 +196,7 @@ jobs:
# not fine once a cross-table atomicity test came in.
set +e
for migration in $(ls migrations/*.sql 2>/dev/null | grep -v '\.down\.sql$' | sort); do
if psql -h localhost -U postgres -d molecule -v ON_ERROR_STOP=1 \
if psql -h "${PG_HOST}" -U postgres -d molecule -v ON_ERROR_STOP=1 \
-f "$migration" >/dev/null 2>&1; then
echo "✓ $(basename "$migration")"
else
@@ -145,7 +210,7 @@ jobs:
# fail if any didn't land — that would be a real regression we
# want loud.
for tbl in delegations workspaces activity_logs pending_uploads; do
if ! psql -h localhost -U postgres -d molecule -tA \
if ! psql -h "${PG_HOST}" -U postgres -d molecule -tA \
-c "SELECT 1 FROM information_schema.tables WHERE table_name = '$tbl'" \
| grep -q 1; then
echo "::error::$tbl table missing after migration replay — handler integration tests would be meaningless"
@@ -156,16 +221,32 @@ jobs:
- if: needs.detect-changes.outputs.handlers == 'true'
name: Run integration tests
env:
INTEGRATION_DB_URL: postgres://postgres:test@localhost:5432/molecule?sslmode=disable
run: |
# INTEGRATION_DB_URL is exported by the start-postgres step;
# points at the per-run bridge IP, not 127.0.0.1, so concurrent
# workflow runs don't fight over a host-net 5432 port.
go test -tags=integration -timeout 5m -v ./internal/handlers/ -run "^TestIntegration_"
- if: needs.detect-changes.outputs.handlers == 'true' && failure()
- if: failure() && needs.detect-changes.outputs.handlers == 'true'
name: Diagnostic dump on failure
env:
PGPASSWORD: test
run: |
echo "::group::delegations table state"
psql -h localhost -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
echo "::group::postgres container status"
docker ps -a --filter "name=${PG_NAME}" --format '{{.Status}} {{.Names}}' || true
docker logs "${PG_NAME}" 2>&1 | tail -50 || true
echo "::endgroup::"
echo "::group::delegations table state"
psql -h "${PG_HOST}" -U postgres -d molecule -c "SELECT * FROM delegations LIMIT 50;" || true
echo "::endgroup::"
- if: always() && needs.detect-changes.outputs.handlers == 'true'
name: Stop sibling Postgres
working-directory: .
run: |
# always() so containers don't leak when migrations or tests
# fail. The cleanup is best-effort: if the container is
# already gone (e.g. concurrent rerun race), don't fail the job.
docker rm -f "${PG_NAME}" >/dev/null 2>&1 || true
echo "Cleaned up ${PG_NAME}"
+227 -56
View File
@@ -1,16 +1,99 @@
name: Retarget main PRs to staging
# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first workflow, no
# exceptions"). When a bot opens a PR against main, retarget it to staging
# automatically and leave an explanatory comment. Human CEO-authored PRs (the
# staging→main promotion PR, etc.) are left alone — they're the authorised
# exception to the rule.
# Mechanical enforcement of SHARED_RULES rule 8 ("Staging-first
# workflow, no exceptions"). When a bot opens a PR against `main`,
# retarget it to `staging` automatically and leave an explanatory
# comment. Human / CEO-authored PRs (the staging→main promotion
# PRs, etc.) are left alone — they're the authorised exception
# to the rule.
#
# Why an Action instead of only a prompt rule: prompt rules depend on every
# role's system-prompt.md staying in sync. Today 5 of 8 engineer roles
# (core-be, core-fe, app-fe, app-qa, devops-engineer) don't have the
# staging-first section — the bot keeps opening PRs to main. An Action
# enforces the invariant regardless of prompt drift.
# ============================================================
# What this workflow does
# ============================================================
#
# On `pull_request_target` opened/reopened against `main`:
# 1. If the PR head is `staging`, skip (the auto-promote PRs
# MUST stay base=main).
# 2. If the PR author is a bot, retarget the PR base to
# `staging` via Gitea REST `PATCH /pulls/{N}` body
# `{"base":"staging"}`.
# 3. If the retarget returns 422 "pull request already exists
# for base branch 'staging'" (issue #1884 case: another PR
# on the same head already targets staging), close the
# now-redundant main-PR via Gitea REST instead of failing
# red.
# 4. Post an explainer comment on the retargeted PR via
# Gitea REST `POST /issues/{N}/comments`.
#
# ============================================================
# Why Gitea REST (and not `gh api / gh pr close / gh pr comment`)
# ============================================================
#
# Pre-2026-05-06 this workflow used `gh api -X PATCH "repos/{owner}/{repo}/pulls/{N}" -f base=staging`
# plus `gh pr close` and `gh pr comment`. After the GitHub→Gitea
# cutover those calls fail because:
#
# - `gh` CLI defaults to `api.github.com`. Even with `GH_HOST`
# pointing at Gitea, `gh pr close / comment` route through
# GraphQL (`/api/graphql`) which Gitea does not expose.
# Empirical: every `gh pr *` call returns
# `HTTP 405 Method Not Allowed (https://git.moleculesai.app/api/graphql)`
# — same root cause as #65 (auto-sync, fixed in PR #66) and
# #73/#195 (auto-promote, fixed in PR #78).
# - `gh api -X PATCH /pulls/{N}` happens to use a REST path
# that Gitea also has, but the `gh` host-resolution layer
# and pagination/retry logic don't always hit Gitea cleanly,
# and the cost of switching to direct `curl` is one extra
# line of code.
#
# So this workflow uses direct `curl` calls to Gitea REST. No
# `gh` CLI dependency, no GraphQL, no flaky host-resolution.
#
# ============================================================
# Identity + token (anti-bot-ring per saved-memory
# `feedback_per_agent_gitea_identity_default`)
# ============================================================
#
# Pre-fix this workflow used the per-job ephemeral
# `secrets.GITHUB_TOKEN`. On Gitea Actions that token has
# narrow scope and unpredictable cross-PR write capability.
#
# Post-fix: `secrets.AUTO_SYNC_TOKEN` (the `devops-engineer`
# Gitea persona). Same persona used by `auto-sync-main-to-staging.yml`
# (PR #66) and `auto-promote-staging.yml` (PR #78). Token scope:
# `push: true` repo write, sufficient for PR-edit + close + comment.
#
# Why this token does NOT need branch-protection bypass:
# patching a PR's base ref is a PR-level operation that does not
# require push perms on either branch (the PR's own commits stay
# put; only the metadata changes).
#
# ============================================================
# Failure modes & operational notes
# ============================================================
#
# A — PATCH base→staging returns 422 "pull request already exists"
# (issue #1884 case):
# - Detected by string-match on response body. Workflow
# falls through to closing the now-redundant main-PR
# (Gitea REST `PATCH /pulls/{N}` with `state: closed`)
# and posts an explanation comment. Step summary surfaces.
#
# B — `AUTO_SYNC_TOKEN` rotated / wrong scope:
# - First REST call returns 401/403. Step summary surfaces.
# Re-issue token from `~/.molecule-ai/personas/` on the
# operator host and update repo Actions secret.
#
# C — PR was deleted between trigger and run:
# - REST call returns 404. Workflow exits 0 with a notice
# (the rule was already enforced or the PR is gone).
#
# D — author is not actually a bot but the filter mis-fires:
# - Filter is conservative: only triggers on
# `user.type == 'Bot'`, `login` ends with `[bot]`, or
# known bot logins (`molecule-ai[bot]`, `app/molecule-ai`).
# Human PRs slip through unaffected. If a NEW bot login
# starts shipping main-PRs, add it to the filter.
on:
pull_request_target:
@@ -24,16 +107,16 @@ jobs:
retarget:
name: Retarget to staging
runs-on: ubuntu-latest
# Only fire for bot-authored PRs. Human CEO PRs (staging→main promotion)
# are intentional and pass through.
# Only fire for bot-authored PRs. Human CEO PRs (staging→main
# promotion) are intentional and pass through.
#
# Head-ref guard: never retarget a PR whose head IS `staging` — those
# are the auto-promote staging→main PRs (opened by molecule-ai[bot]
# since #2586 switched to an App token, which now passes the bot
# filter below). Retargeting head=staging onto base=staging fails
# with HTTP 422 "no new commits between base 'staging' and head
# 'staging'", which used to surface as a noisy red workflow run on
# every auto-promote (caught 2026-05-03 on PR #2588).
# Head-ref guard: never retarget a PR whose head IS `staging`
# — those are the auto-promote staging→main PRs (opened by
# `devops-engineer` since PR #78 / #195 fix). Retargeting
# head=staging onto base=staging fails with HTTP 422 "no new
# commits between base 'staging' and head 'staging'", which
# would surface as a noisy red workflow run on every
# auto-promote (caught 2026-05-03 on the GitHub-era PR #2588).
if: >-
github.event.pull_request.head.ref != 'staging'
&& (
@@ -41,65 +124,153 @@ jobs:
|| endsWith(github.event.pull_request.user.login, '[bot]')
|| github.event.pull_request.user.login == 'app/molecule-ai'
|| github.event.pull_request.user.login == 'molecule-ai[bot]'
|| github.event.pull_request.user.login == 'devops-engineer'
)
steps:
- name: Retarget PR base to staging
- name: Retarget PR base to staging via Gitea REST
id: retarget
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
# Issue #1884: when the bot opens a PR against main and there's
# already another PR on the same head branch targeting staging,
# GitHub's PATCH /pulls returns 422 with
# "A pull request already exists for base branch 'staging' …".
# The retarget can't proceed — but the right response is to
# close the now-redundant main-PR, not to fail the workflow
# noisily. Detect that specific 422 and close instead.
# Issue #1884 case: when the bot opens a PR against main
# and there's already another PR on the same head branch
# targeting staging, Gitea's PATCH returns 422 with a
# body mentioning "pull request already exists for base
# branch 'staging'" (the Gitea message wording is
# slightly different from GitHub's; the substring match
# below covers both for forward/back compat).
# The retarget can't proceed — but the right response is
# to close the now-redundant main-PR, not to fail the
# workflow noisily. Detect that specific 422 and close
# instead.
run: |
set +e
set -euo pipefail
API="${GITEA_HOST}/api/v1/repos/${REPO}"
AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
echo "Retargeting PR #${PR_NUMBER} (author: ${PR_AUTHOR}) from main → staging"
PATCH_OUTPUT=$(gh api -X PATCH \
"repos/${{ github.repository }}/pulls/${PR_NUMBER}" \
-f base=staging \
--jq '.base.ref' 2>&1)
PATCH_EXIT=$?
# Curl-status-capture pattern per `feedback_curl_status_capture_pollution`:
# http_code via -w to its own scalar, body to a tempfile, set +e/-e
# bracket so curl's non-zero-on-4xx doesn't pollute the script's exit chain.
BODY_FILE=$(mktemp)
REQ='{"base":"staging"}'
set +e
STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X PATCH -d "${REQ}" \
-o "${BODY_FILE}" -w "%{http_code}" \
"${API}/pulls/${PR_NUMBER}")
CURL_RC=$?
set -e
if [ "$PATCH_EXIT" -eq 0 ]; then
echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
exit 0
if [ "${CURL_RC}" -ne 0 ]; then
echo "::error::curl PATCH failed (rc=${CURL_RC})"
rm -f "${BODY_FILE}"
exit 1
fi
if [ "${STATUS}" = "201" ] || [ "${STATUS}" = "200" ]; then
NEW_BASE=$(jq -r '.base.ref // "?"' < "${BODY_FILE}")
rm -f "${BODY_FILE}"
if [ "${NEW_BASE}" = "staging" ]; then
echo "::notice::Retargeted PR #${PR_NUMBER} → staging"
echo "outcome=retargeted" >> "$GITHUB_OUTPUT"
exit 0
fi
echo "::error::PATCH returned ${STATUS} but base.ref is '${NEW_BASE}', not 'staging'"
exit 1
fi
# Specifically match the 422 duplicate-base/head error so
# any OTHER PATCH failure (auth, deleted PR, etc.) still
# surfaces as a real workflow failure.
if echo "$PATCH_OUTPUT" | grep -q "pull request already exists for base branch 'staging'"; then
BODY=$(cat "${BODY_FILE}" || true)
rm -f "${BODY_FILE}"
if [ "${STATUS}" = "422" ] && echo "${BODY}" | grep -qE "(pull request already exists for base branch 'staging'|already exists.*base.*staging)"; then
echo "::notice::PR #${PR_NUMBER}: duplicate target-staging PR exists on same head — closing this main-PR as redundant."
gh pr close "$PR_NUMBER" \
--repo "${{ github.repository }}" \
--comment "[retarget-bot] Closing — another PR on the same head branch already targets \`staging\`. This PR is redundant. See issue #1884 for the rationale."
echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
exit 0
# Close the now-redundant main-PR via Gitea REST
# (PATCH state=closed). Post comment explaining
# rationale BEFORE close so the comment lands on the
# PR (commenting on a closed PR works on Gitea, but
# historically caused notification ordering surprises).
CLOSE_BODY_FILE=$(mktemp)
CMT_REQ=$(jq -n '{body:"[retarget-bot] Closing — another PR on the same head branch already targets `staging`. This PR is redundant. See issue #1884 for the rationale."}')
set +e
CMT_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X POST -d "${CMT_REQ}" \
-o "${CLOSE_BODY_FILE}" -w "%{http_code}" \
"${API}/issues/${PR_NUMBER}/comments")
set -e
if [ "${CMT_STATUS}" != "201" ]; then
echo "::warning::dup-close comment POST returned ${CMT_STATUS}; continuing to close anyway"
cat "${CLOSE_BODY_FILE}" | head -c 300 || true
fi
rm -f "${CLOSE_BODY_FILE}"
CLOSE_REQ='{"state":"closed"}'
CLOSE_RESP=$(mktemp)
set +e
CL_STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X PATCH -d "${CLOSE_REQ}" \
-o "${CLOSE_RESP}" -w "%{http_code}" \
"${API}/pulls/${PR_NUMBER}")
set -e
if [ "${CL_STATUS}" = "201" ] || [ "${CL_STATUS}" = "200" ]; then
echo "::notice::Closed PR #${PR_NUMBER} as redundant"
echo "outcome=closed-as-duplicate" >> "$GITHUB_OUTPUT"
rm -f "${CLOSE_RESP}"
exit 0
fi
echo "::error::Failed to close redundant PR: HTTP ${CL_STATUS}"
cat "${CLOSE_RESP}" | head -c 300 || true
rm -f "${CLOSE_RESP}"
exit 1
fi
echo "::error::Retarget PATCH failed and was NOT a duplicate-base error:"
echo "$PATCH_OUTPUT" >&2
echo "::error::Retarget PATCH failed and was NOT a duplicate-base error: HTTP ${STATUS}"
echo "${BODY}" | head -c 500 >&2
exit 1
- name: Post explainer comment
if: steps.retarget.outputs.outcome == 'retargeted'
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
GITEA_TOKEN: ${{ secrets.AUTO_SYNC_TOKEN }}
GITEA_HOST: ${{ vars.GITEA_HOST || 'https://git.moleculesai.app' }}
REPO: ${{ github.repository }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
gh pr comment "$PR_NUMBER" \
--repo "${{ github.repository }}" \
--body "$(cat <<'BODY'
[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.
set -euo pipefail
**Why:** per [SHARED_RULES rule 8](https://github.com/molecule-ai/molecule-ai-org-template-molecule-dev/blob/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.
API="${GITEA_HOST}/api/v1/repos/${REPO}"
AUTH=(-H "Authorization: token ${GITEA_TOKEN}" -H "Accept: application/json")
**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.
# PR comments live on the issue endpoint in Gitea
# (PRs ARE issues — same endpoint, different sub-resources
# for diffs/files/etc.). The body uses jq to safely
# encode the multi-line markdown without shell-quote
# nightmares.
REQ=$(jq -n '{body:"[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.\n\n**Why:** per [SHARED_RULES rule 8](https://git.moleculesai.app/molecule-ai/molecule-ai-org-template-molecule-dev/src/branch/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.\n\n**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.\n\n**If this PR is the CEO`s staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted, head=staging is also exempted). If you see this comment on your CEO PR, that`s a bug — please tag @hongmingwang."}')
**If this PR is the CEO's staging→main promotion:** the Action skipped you (only bot-authored PRs are retargeted). If you see this comment on your CEO PR, that's a bug — please tag @HongmingWang-Rabbit.
BODY
)"
BODY_FILE=$(mktemp)
set +e
STATUS=$(curl -sS "${AUTH[@]}" -H "Content-Type: application/json" \
-X POST -d "${REQ}" \
-o "${BODY_FILE}" -w "%{http_code}" \
"${API}/issues/${PR_NUMBER}/comments")
set -e
if [ "${STATUS}" = "201" ]; then
echo "::notice::Posted explainer comment on PR #${PR_NUMBER}"
else
echo "::warning::Failed to post explainer (HTTP ${STATUS}) — retarget itself succeeded"
cat "${BODY_FILE}" | head -c 300 || true
fi
rm -f "${BODY_FILE}"
+26
View File
@@ -7,6 +7,32 @@ export default defineConfig({
test: {
environment: 'node',
exclude: ['e2e/**', 'node_modules/**', '**/dist/**'],
// CI-conditional test timeout (issue #96).
//
// Vitest's 5000ms default is too tight for the first test in any
// file under our CI shape: `npx vitest run --coverage` on the
// self-hosted Gitea Actions Docker runner. The cold-start cost
// (v8 coverage instrumentation init + JSDOM bootstrap + module-
// graph import for @/components/* and @/lib/* + first React
// render) consistently consumes 5-7 seconds for the first
// synchronous test in heavyweight component files
// (ActivityTab.test.tsx, CreateWorkspaceDialog.test.tsx,
// ConfigTab.provider.test.tsx) — even though every subsequent
// test in the same file completes in 100-1500ms.
//
// Empirically the worst observed first-test was 6453ms in a
// single file (CreateWorkspaceDialog). 30000ms gives ~5x
// headroom over that on CI; we still keep 5000ms locally so
// genuine waitFor races / hung promises stay sensitive in dev.
//
// Same vitest pattern documented at:
// https://vitest.dev/config/testtimeout
// https://vitest.dev/guide/coverage#profiling-test-performance
//
// Per-test duration is still emitted to the CI log; if a test
// ever silently approaches 25-30s under this raised ceiling that
// will surface as a duration regression and we revisit.
testTimeout: process.env.CI ? 30000 : 5000,
// Coverage is instrumented but NOT yet a CI gate — first land
// observability so we can see the baseline, then dial in
// thresholds + a hard gate in a follow-up PR (#1815). Today's
@@ -58,8 +58,11 @@ green — proves wire shape end-to-end against a real `hermes gateway run`
subprocess + stub OpenAI-compat LLM. Caught + fixed a real `KeyError`
in upstream `hermes_cli/tools_config.py` (PLATFORMS dict lookup
crashed on plugin platforms) — fix on the patched fork branch
(`HongmingWang-Rabbit/hermes-agent` `feat/platform-adapter-plugins`,
commit `18e4849e`). Upstream PR #18775 OPEN; CONFLICTING with main.
(`molecule-ai/hermes-agent` `feat/platform-adapter-plugins`, commit
`18e4849e`, hosted on Gitea at
`https://git.moleculesai.app/molecule-ai/hermes-agent` — moved from the
suspended `github.com/HongmingWang-Rabbit/hermes-agent`, see
`molecule-ai/internal#72`). Upstream PR #18775 OPEN; CONFLICTING with main.
Not on critical path for our platform — patched fork is what the
workspace image installs.
@@ -0,0 +1,137 @@
# Runbook — Handlers Postgres Integration port-collision substrate
**Status:** Resolved 2026-05-08 (PR for class B Hongming-owned CICD red sweep).
## Symptom
`Handlers Postgres Integration` workflow fails on staging push and PRs.
Step `Apply migrations to Postgres service` shows:
```
psql: error: connection to server at "127.0.0.1", port 5432 failed: Connection refused
```
Job-cleanup step further down logs:
```
Cleaning up services for job Handlers Postgres Integration
failed to remove container: Error response from daemon: No such container: <id>
```
…confirming the postgres service container was already gone before
cleanup ran.
## Root cause
Our Gitea act_runner (operator host `5.78.80.188`,
`/opt/molecule/runners/config.yaml`) sets:
```yaml
container:
network: host
```
…which act_runner applies to BOTH the job container AND every
`services:` container in a workflow. Multiple workflow instances
running concurrently across the 16 parallel runners each try to bind
postgres on `0.0.0.0:5432`. The first wins; subsequent instances exit
immediately with:
```
LOG: could not bind IPv4 address "0.0.0.0": Address in use
HINT: Is another postmaster already running on port 5432?
FATAL: could not create any TCP/IP sockets
```
act_runner sets `AutoRemove:true` on service containers, so Docker
garbage-collects them as soon as they exit. By the time the migrations
step runs `pg_isready` / `psql`, the container is gone and connection
refused.
Reproduction (operator host):
```bash
docker run --rm -d --name pg-A --network host \
-e POSTGRES_PASSWORD=test postgres:15-alpine
docker run -d --name pg-B --network host \
-e POSTGRES_PASSWORD=test postgres:15-alpine
docker logs pg-B # FATAL: could not create any TCP/IP sockets
```
## Why per-job override doesn't work
The natural fix — per-job `container.network` override — is silently
ignored by act_runner. The runner log emits:
```
--network and --net in the options will be ignored.
```
This is a documented act_runner constraint: container network is a
runner-wide setting, not per-job. Source: gitea/act_runner config docs
+ vegardit/docker-gitea-act-runner issue #7.
Flipping the global `container.network` to `bridge` would break every
other workflow in the repo (cache server discovery,
`molecule-monorepo-net` peer access during integration tests, etc.) —
unacceptable blast radius for a per-test bug.
## Fix shape
`handlers-postgres-integration.yml` no longer uses `services: postgres:`.
It launches a sibling postgres container manually on the existing
`molecule-monorepo-net` bridge network with a per-run unique name:
```yaml
env:
PG_NAME: pg-handlers-${{ github.run_id }}-${{ github.run_attempt }}
PG_NETWORK: molecule-monorepo-net
steps:
- name: Start sibling Postgres on bridge network
run: |
docker run -d --name "${PG_NAME}" --network "${PG_NETWORK}" \
...
postgres:15-alpine
PG_HOST=$(docker inspect "${PG_NAME}" \
--format "{{(index .NetworkSettings.Networks \"${PG_NETWORK}\").IPAddress}}")
echo "PG_HOST=${PG_HOST}" >> "$GITHUB_ENV"
# … migrations + tests use ${PG_HOST}, not 127.0.0.1 …
- if: always() && …
name: Stop sibling Postgres
run: docker rm -f "${PG_NAME}" || true
```
The host-net job container can reach a bridge-net container via the
bridge IP directly (verified manually, 2026-05-08). Two parallel runs
use different names + different bridge IPs — no collision.
## Future-proofing
Other workflows that hit the same shape (any `services:` with a
fixed-port image) will exhibit the same failure mode under
host-network runner config. Translate using this same pattern:
1. Drop the `services:` block.
2. Use `${{ github.run_id }}-${{ github.run_attempt }}` for unique
container name.
3. Launch on `molecule-monorepo-net` (already trusted bridge in
`docker-compose.infra.yml`).
4. Read back the bridge IP via `docker inspect` and export as a step env.
5. `if: always()` cleanup step at the end.
If the count of such workflows grows, factor into a composite action
(`./.github/actions/sibling-postgres`) so the substrate logic lives
in one place.
## Related
- Issue #88 (closed by #92): localhost → 127.0.0.1 fix that unmasked
this collision; the IPv6 fix is correct, port collision is the new
layer.
- Issue #94 created `molecule-monorepo-net` + `alpine:latest` as
prereqs.
- Saved memory `feedback_act_runner_github_server_url` documents
another act_runner-vs-GHA divergence (server URL).
@@ -0,0 +1,457 @@
package handlers
// eic_tunnel_pool.go — refcounted pool for EIC SSH tunnels keyed on
// instanceID. Reuses one tunnel across N file ops, amortising the
// ssh-keygen + SendSSHPublicKey + open-tunnel + waitForPort cost
// (~3-5s) over multiple cats/finds (~50-200ms each).
//
// Origin: core#11 — canvas detail-panel config + filesystem load
// took ~20s. ConfigTab fans out 4 GETs serially; the slowest is
// /files/config.yaml which dispatches to readFileViaEIC. Without a
// pool, every readFileViaEIC + listFilesViaEIC + writeFileViaEIC +
// deleteFileViaEIC pays the full setup cost even when fired
// back-to-back on the same workspace EC2.
//
// The pool keeps one eicSSHSession alive per instanceID for up to
// poolTTL. SendSSHPublicKey grants a 60s key validity, so poolTTL
// must stay strictly below that to avoid serving requests on a
// just-expired key. We default to 50s with a 10s safety margin.
//
// Concurrency model:
//
// - Single mutex guards the entries map.
// - Slow path (tunnel setup) runs OUTSIDE the lock, gated by an
// "intent" placeholder so concurrent acquires for the same
// instanceID don't both build a tunnel — the loser drops its
// setup and uses the winner's.
// - Refcount on each entry; eviction blocked while refcount > 0.
// - Janitor goroutine sweeps every poolJanitorInterval, drops
// entries where refcount == 0 && expiresAt < now.
//
// Test injection:
//
// - poolSetupTunnel is a package-level var so tests can swap the
// slow path for a counting stub. Production wires it to
// realWithEICTunnel-style setup.
// - withEICTunnel (the public, single-shot API) is also a var
// (already, see template_files_eic.go). It's rebound here to
// pooledWithEICTunnel which routes through globalEICTunnelPool.
// - Tests that need single-shot behaviour can set poolTTL = 0,
// which makes pooledWithEICTunnel fall through to the underlying
// setup directly (no pool entry kept).
import (
"context"
"fmt"
"sync"
"time"
)
// poolTTL is the maximum age of a pooled tunnel. Must be strictly
// less than the SendSSHPublicKey grant window (60s) so we never
// serve a request through a key that's about to expire mid-op.
//
// Configurable via init-time wiring (see initEICTunnelPool); not a
// const so tests can pin TTL=0 (disable pooling) or TTL=50ms (drive
// eviction tests).
var poolTTL = 50 * time.Second
// poolJanitorInterval is how often the janitor goroutine sweeps for
// expired idle entries. Tighter than poolTTL so eviction is timely;
// loose enough that the goroutine doesn't burn CPU.
var poolJanitorInterval = 10 * time.Second
// poolMaxEntries caps simultaneous instanceIDs the pool tracks.
// Beyond this, new acquires evict the LRU entry. Defends against a
// pathological caller (e.g. a sweep over hundreds of workspace
// EC2s) from leaking unbounded tunnel processes. 32 is a generous
// ceiling for the canvas use case (one human navigates ≤ ~5
// workspaces at a time).
var poolMaxEntries = 32
// poolSetupTunnel is the slow-path tunnel constructor. Wrapped in a
// var so tests can inject a counter stub. Returns a session and a
// cleanup function (closes the open-tunnel subprocess + scrubs the
// ephemeral keydir). nil session + non-nil err means setup failed
// and there is nothing to clean up.
//
// Production wiring lives in eic_tunnel_pool_setup.go (a thin shim
// over the existing realWithEICTunnel logic).
var poolSetupTunnel = func(ctx context.Context, instanceID string) (
sess eicSSHSession, cleanup func(), err error) {
return setupRealEICTunnel(ctx, instanceID)
}
// pooledTunnel is one entry in the pool. session is shared by N
// concurrent fn calls; cleanup runs once when refcount returns to
// zero AND the entry is past expiresAt or evicted.
//
// lastUsed tracks the most recent acquire time for LRU bookkeeping
// (overflow eviction). expiresAt is set at construction and not
// extended on use — a tunnel cannot live past poolTTL even if it's
// hot, because the underlying SendSSHPublicKey grant expires.
type pooledTunnel struct {
session eicSSHSession
cleanup func()
expiresAt time.Time
lastUsed time.Time
refcount int
poisoned bool // true if a fn returned a tunnel-fatal error; do not reuse
}
// eicTunnelPool is the package-level pool. Single instance lives
// in globalEICTunnelPool; constructor runs lazily on first acquire.
type eicTunnelPool struct {
mu sync.Mutex
entries map[string]*pooledTunnel
// pendingSetups guards concurrent setup for the same instanceID.
// First acquirer takes the slot; later ones wait on the channel.
pendingSetups map[string]chan struct{}
stopJanitor chan struct{}
// janitorInterval is captured at pool construction from the
// package-level poolJanitorInterval var. Captured (not re-read on
// every tick) so a test that swaps the package var via t.Cleanup
// after a global pool's janitor is already running can't race
// with that goroutine's ticker read. The global pool is created
// lazily once per process via sync.Once; before this capture
// landed, every test that touched poolJanitorInterval after the
// global pool's first-touch raced the janitor (caught by -race
// on staging tip 249dbc6a — TestPooledWithEICTunnel_PanicPoisonsEntry).
// Tests still get the new value on a freshPool() because they
// set the package var BEFORE calling newEICTunnelPool().
janitorInterval time.Duration
}
var (
globalEICTunnelPool *eicTunnelPool
globalEICTunnelPoolOnce sync.Once
)
// getEICTunnelPool returns the singleton pool, lazy-initialising on
// first call. Idempotent.
func getEICTunnelPool() *eicTunnelPool {
globalEICTunnelPoolOnce.Do(func() {
globalEICTunnelPool = newEICTunnelPool()
go globalEICTunnelPool.janitor()
})
return globalEICTunnelPool
}
// newEICTunnelPool constructs an empty pool. Exported so tests can
// build isolated pools without sharing the singleton.
//
// Captures poolJanitorInterval at construction time so the janitor
// goroutine doesn't race with t.Cleanup-driven swaps of the package
// var. See the janitorInterval field comment for the failure mode.
func newEICTunnelPool() *eicTunnelPool {
return &eicTunnelPool{
entries: map[string]*pooledTunnel{},
pendingSetups: map[string]chan struct{}{},
stopJanitor: make(chan struct{}),
janitorInterval: poolJanitorInterval,
}
}
// acquire returns a usable session for instanceID. If a healthy entry
// exists, refcount++ and return it. If a setup is in flight for the
// same instanceID, wait for it. Otherwise build one (slow path).
//
// done() must be called by the caller when the op finishes. It
// decrements refcount and triggers cleanup if the entry is past
// TTL or poisoned and refcount==0.
//
// Errors from the slow path propagate; pool state is not modified
// for failed setups (no poisoned entry created — that's only for
// fn-returned errors on a previously-good session).
func (p *eicTunnelPool) acquire(ctx context.Context, instanceID string) (
sess eicSSHSession, done func(poisoned bool), err error) {
if poolTTL <= 0 {
// Pool disabled (TTL=0 mode for tests / opt-out). Fall
// through to a direct setup with caller-driven cleanup.
s, cleanup, err := poolSetupTunnel(ctx, instanceID)
if err != nil {
return eicSSHSession{}, nil, err
}
return s, func(_ bool) { cleanup() }, nil
}
for {
p.mu.Lock()
if pt, ok := p.entries[instanceID]; ok && !pt.poisoned && pt.expiresAt.After(time.Now()) {
pt.refcount++
pt.lastUsed = time.Now()
p.mu.Unlock()
return pt.session, p.releaser(instanceID, pt), nil
}
// Either no entry, expired entry, or poisoned entry. If a
// setup is already in flight, wait and retry.
if pending, ok := p.pendingSetups[instanceID]; ok {
p.mu.Unlock()
select {
case <-pending:
continue // re-check the entries map
case <-ctx.Done():
return eicSSHSession{}, nil, ctx.Err()
}
}
// Drop expired/poisoned entry now (we'll cleanup outside
// the lock — the entry is unreferenced or we'd not be here).
var oldCleanup func()
if pt, ok := p.entries[instanceID]; ok {
if pt.refcount == 0 {
oldCleanup = pt.cleanup
delete(p.entries, instanceID)
}
}
// Reserve the setup slot.
signal := make(chan struct{})
p.pendingSetups[instanceID] = signal
p.mu.Unlock()
if oldCleanup != nil {
go oldCleanup()
}
// Slow path: build a new tunnel. Anything that goes wrong
// here cleans up the pendingSetups slot and propagates to
// the caller without leaving the pool in a state where the
// next acquire blocks waiting on a signal that never fires.
newSess, cleanup, setupErr := poolSetupTunnel(ctx, instanceID)
p.mu.Lock()
delete(p.pendingSetups, instanceID)
close(signal)
if setupErr != nil {
p.mu.Unlock()
return eicSSHSession{}, nil, fmt.Errorf("eic tunnel setup: %w", setupErr)
}
// Enforce LRU bound BEFORE inserting so we don't briefly
// exceed the cap even by one entry.
p.evictLRUIfFullLocked(instanceID)
pt := &pooledTunnel{
session: newSess,
cleanup: cleanup,
expiresAt: time.Now().Add(poolTTL),
lastUsed: time.Now(),
refcount: 1,
}
p.entries[instanceID] = pt
p.mu.Unlock()
return pt.session, p.releaser(instanceID, pt), nil
}
}
// releaser returns a closure that decrements refcount and triggers
// cleanup if (a) the entry is past TTL or (b) the caller signalled
// poison. Idempotent against double-release (decrements once via the
// captured pt; pool entry may have been replaced by then).
func (p *eicTunnelPool) releaser(instanceID string, pt *pooledTunnel) func(poisoned bool) {
released := false
return func(poisoned bool) {
p.mu.Lock()
defer p.mu.Unlock()
if released {
return
}
released = true
pt.refcount--
if poisoned {
pt.poisoned = true
}
// Evict immediately if poisoned-and-idle OR expired-and-idle.
// Hot entries (refcount > 0) defer eviction to the last release.
if pt.refcount == 0 && (pt.poisoned || pt.expiresAt.Before(time.Now())) {
// If the entry in the map is still us, remove it.
if cur, ok := p.entries[instanceID]; ok && cur == pt {
delete(p.entries, instanceID)
}
go pt.cleanup()
}
}
}
// evictLRUIfFullLocked drops the least-recently-used IDLE entry
// when the pool is at capacity. Caller must hold p.mu. The new
// instanceID about to be inserted is excluded so we don't evict
// ourselves. If no idle entries exist, no eviction happens — the
// new entry will push us above the soft cap until something releases.
func (p *eicTunnelPool) evictLRUIfFullLocked(skipInstance string) {
if len(p.entries) < poolMaxEntries {
return
}
var oldestKey string
var oldest *pooledTunnel
for k, pt := range p.entries {
if k == skipInstance {
continue
}
if pt.refcount > 0 {
continue
}
if oldest == nil || pt.lastUsed.Before(oldest.lastUsed) {
oldestKey = k
oldest = pt
}
}
if oldest == nil {
return // every entry is in use; no eviction possible
}
delete(p.entries, oldestKey)
go oldest.cleanup()
}
// janitor periodically scans for entries that are idle AND expired,
// closing their tunnels. Runs forever (per pool lifetime); cancelled
// by close(p.stopJanitor) for tests that build short-lived pools.
//
// Reads p.janitorInterval (captured at construction) instead of the
// package-level poolJanitorInterval — see janitorInterval field comment.
func (p *eicTunnelPool) janitor() {
t := time.NewTicker(p.janitorInterval)
defer t.Stop()
for {
select {
case <-t.C:
p.sweep()
case <-p.stopJanitor:
return
}
}
}
// sweep is one janitor pass. Drops idle expired entries.
func (p *eicTunnelPool) sweep() {
p.mu.Lock()
now := time.Now()
var toClose []func()
for k, pt := range p.entries {
if pt.refcount == 0 && pt.expiresAt.Before(now) {
toClose = append(toClose, pt.cleanup)
delete(p.entries, k)
}
}
p.mu.Unlock()
for _, c := range toClose {
go c()
}
}
// stop terminates the janitor and closes all idle entries. Hot
// (refcount > 0) entries are NOT force-closed — callers running
// against them would see a use-after-free. In practice stop is only
// called by tests that have already drained their callers.
func (p *eicTunnelPool) stop() {
close(p.stopJanitor)
p.mu.Lock()
defer p.mu.Unlock()
for k, pt := range p.entries {
if pt.refcount == 0 {
go pt.cleanup()
delete(p.entries, k)
}
}
}
// pooledWithEICTunnel is the pool-backed replacement for
// realWithEICTunnel. The signature matches `var withEICTunnel`
// exactly so the rebind (in initEICTunnelPool) is a drop-in.
//
// Errors from `fn` itself are forwarded to the caller AND mark the
// pool entry as poisoned, so the next acquire builds a fresh
// tunnel. This catches the case where the workspace EC2 was
// restarted out-of-band (tunnel still appears alive locally but
// every cat/find errors out).
func pooledWithEICTunnel(ctx context.Context, instanceID string,
fn func(s eicSSHSession) error) error {
pool := getEICTunnelPool()
sess, done, err := pool.acquire(ctx, instanceID)
if err != nil {
return err
}
// poisoned defaults to true so a panic from fn poisons the
// entry on the way through the deferred release. Without the
// defer, a panicking fn would leak refcount=1 forever and
// permanently block eviction of this entry. The fn-error path
// resets poisoned to its real classification before return.
poisoned := true
defer func() { done(poisoned) }()
fnErr := fn(sess)
poisoned = fnErrIndicatesTunnelFault(fnErr)
return fnErr
}
// fnErrIndicatesTunnelFault returns true for fn errors whose nature
// suggests the underlying tunnel is no longer reusable (auth gone,
// network gone, ssh process dead). Returning true poisons the pool
// entry so the next acquire builds fresh.
//
// Conservative: only marks tunnel-faulty for clearly tunnel-level
// failures (connection refused, broken pipe, ssh exit-status from
// fatal-channel signals). A `cat` returning os.ErrNotExist on a
// missing file is NOT a tunnel fault — that's the file path being
// wrong, the tunnel is fine.
func fnErrIndicatesTunnelFault(err error) bool {
if err == nil {
return false
}
msg := err.Error()
// stderr substrings produced by ssh when the tunnel is broken.
for _, marker := range []string{
"connection refused",
"connection closed",
"broken pipe",
"Connection reset by peer",
"kex_exchange_identification",
"port forwarding failed",
"Permission denied",
"Authentication failed",
} {
if containsCaseInsensitive(msg, marker) {
return true
}
}
return false
}
// containsCaseInsensitive avoids importing strings just for this
// (the file already needs ssh stderr matching elsewhere — this
// keeps the helper local to avoid a cross-file dependency).
func containsCaseInsensitive(s, substr string) bool {
if len(substr) > len(s) {
return false
}
// Manual lowercase compare loop; ssh error markers are ASCII so
// no need for unicode-aware folding.
low := func(b byte) byte {
if b >= 'A' && b <= 'Z' {
return b + 32
}
return b
}
for i := 0; i+len(substr) <= len(s); i++ {
match := true
for j := 0; j < len(substr); j++ {
if low(s[i+j]) != low(substr[j]) {
match = false
break
}
}
if match {
return true
}
}
return false
}
// initEICTunnelPool rebinds the package-level withEICTunnel var to
// the pooled implementation. Called once at package init via the
// init() in eic_tunnel_pool_setup.go (split file so the rebind
// itself is testable without dragging in the production setup
// shim's exec/aws dependencies).
func initEICTunnelPool() {
withEICTunnel = pooledWithEICTunnel
}
@@ -0,0 +1,467 @@
package handlers
// eic_tunnel_pool_test.go — tests for the refcounted EIC tunnel pool
// added in core#11. Stubs poolSetupTunnel with a counter so the
// tests don't fork ssh-keygen / aws subprocesses.
//
// Per memory feedback_assert_exact_not_substring: each test pins
// exact expected counts (not "at least N") so a regression that
// silently double-sets-up surfaces here.
import (
"context"
"errors"
"sync"
"sync/atomic"
"testing"
"time"
)
// withPoolSetupStub swaps poolSetupTunnel for a counting fake that
// returns a sentinel session and a cleanup func that records its
// invocation. Restores on test cleanup.
//
// setupSignal blocks each setup until released — for concurrent-
// acquire tests where we want to gate setup completion.
func withPoolSetupStub(t *testing.T) (
setupCount *int64, cleanupCount *int64, restore func(), unblock func()) {
t.Helper()
prev := poolSetupTunnel
prevTTL := poolTTL
prevJanitor := poolJanitorInterval
var sc, cc int64
setupCount, cleanupCount = &sc, &cc
gate := make(chan struct{}, 1)
gate <- struct{}{} // allow the first setup through immediately
unblock = func() { gate <- struct{}{} }
poolSetupTunnel = func(ctx context.Context, instanceID string) (
eicSSHSession, func(), error) {
select {
case <-gate:
case <-ctx.Done():
return eicSSHSession{}, nil, ctx.Err()
}
atomic.AddInt64(&sc, 1)
sess := eicSSHSession{
instanceID: instanceID,
osUser: "ubuntu",
localPort: 10000 + int(atomic.LoadInt64(&sc)),
keyPath: "/tmp/molecule-eic-test-" + instanceID,
}
cleanup := func() { atomic.AddInt64(&cc, 1) }
return sess, cleanup, nil
}
restore = func() {
poolSetupTunnel = prev
poolTTL = prevTTL
poolJanitorInterval = prevJanitor
}
t.Cleanup(restore)
return
}
// freshPool returns an isolated pool (NOT the global) so tests run
// independently. Stops the janitor on cleanup.
func freshPool(t *testing.T) *eicTunnelPool {
t.Helper()
p := newEICTunnelPool()
t.Cleanup(p.stop)
return p
}
// TestEICTunnelPool_FourOpsAmortise pins the core invariant: four
// sequential acquire/release cycles on the same instanceID share
// ONE underlying tunnel setup. Mutation: delete the cache hit branch
// in acquire() → setupCount goes 1 → 4 → test fails.
func TestEICTunnelPool_FourOpsAmortise(t *testing.T) {
setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
// Refill gate after each setup so concurrent stubs aren't blocked
// (we want every test to be able to set up if it needs to).
t.Cleanup(func() { /* no-op; defer is enough */ })
poolTTL = 50 * time.Second
pool := freshPool(t)
ctx := context.Background()
for i := 0; i < 4; i++ {
sess, done, err := pool.acquire(ctx, "i-test-1")
if err != nil {
t.Fatalf("op %d: acquire: %v", i, err)
}
if sess.instanceID != "i-test-1" {
t.Fatalf("op %d: session has wrong instanceID: %q", i, sess.instanceID)
}
done(false)
}
if got := atomic.LoadInt64(setupCount); got != 1 {
t.Errorf("expected exactly 1 tunnel setup across 4 ops, got %d", got)
}
if got := atomic.LoadInt64(cleanupCount); got != 0 {
t.Errorf("expected 0 cleanups while entry is hot (TTL=50s), got %d", got)
}
}
// TestEICTunnelPool_DifferentInstancesDoNotShare pins that two
// different instanceIDs each get their own tunnel — the pool is
// keyed on instanceID, not a single global slot.
func TestEICTunnelPool_DifferentInstancesDoNotShare(t *testing.T) {
setupCount, _, _, unblock := withPoolSetupStub(t)
poolTTL = 50 * time.Second
pool := freshPool(t)
ctx := context.Background()
// First instance setup uses the initial gate slot.
_, doneA, err := pool.acquire(ctx, "i-a")
if err != nil {
t.Fatalf("acquire A: %v", err)
}
doneA(false)
// Second instance needs a new slot through the gate.
unblock()
_, doneB, err := pool.acquire(ctx, "i-b")
if err != nil {
t.Fatalf("acquire B: %v", err)
}
doneB(false)
if got := atomic.LoadInt64(setupCount); got != 2 {
t.Errorf("expected 2 setups (one per instance), got %d", got)
}
}
// TestEICTunnelPool_TTLEviction: a short TTL forces the second op
// to build a fresh tunnel after the first expires.
func TestEICTunnelPool_TTLEviction(t *testing.T) {
setupCount, cleanupCount, _, unblock := withPoolSetupStub(t)
poolTTL = 50 * time.Millisecond
poolJanitorInterval = 1 * time.Second // keep janitor away
pool := freshPool(t)
ctx := context.Background()
_, done, err := pool.acquire(ctx, "i-ttl")
if err != nil {
t.Fatalf("acquire 1: %v", err)
}
done(false)
time.Sleep(80 * time.Millisecond) // past TTL
unblock() // allow next setup
_, done, err = pool.acquire(ctx, "i-ttl")
if err != nil {
t.Fatalf("acquire 2: %v", err)
}
done(false)
if got := atomic.LoadInt64(setupCount); got != 2 {
t.Errorf("expected 2 setups (TTL eviction between), got %d", got)
}
// First entry should have been cleaned up when the second
// acquire evicted it on the slow path. Cleanup runs in a
// goroutine; poll briefly for it to land.
deadline := time.Now().Add(500 * time.Millisecond)
for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
time.Sleep(5 * time.Millisecond)
}
if got := atomic.LoadInt64(cleanupCount); got < 1 {
t.Errorf("expected ≥1 cleanup (first entry evicted), got %d", got)
}
}
// TestEICTunnelPool_FailureInvalidates pins the poison-on-fault
// behavior — fn returning a tunnel-fatal error marks the entry
// unusable so the next acquire builds fresh.
func TestEICTunnelPool_FailureInvalidates(t *testing.T) {
setupCount, _, _, unblock := withPoolSetupStub(t)
poolTTL = 50 * time.Second
pool := freshPool(t)
ctx := context.Background()
_, done, err := pool.acquire(ctx, "i-fault")
if err != nil {
t.Fatalf("acquire 1: %v", err)
}
done(true) // signal poison
unblock() // let the next setup through
_, done, err = pool.acquire(ctx, "i-fault")
if err != nil {
t.Fatalf("acquire 2: %v", err)
}
done(false)
if got := atomic.LoadInt64(setupCount); got != 2 {
t.Errorf("expected 2 setups (poison forced rebuild), got %d", got)
}
}
// TestEICTunnelPool_ConcurrentAcquireSingleSetup pins that N
// concurrent acquires for the same instanceID before any release
// only trigger ONE tunnel setup — the rest wait via pendingSetups.
//
// Without this guard each concurrent acquire would spawn its own
// tunnel and the loser-cleanup would still leak refcount. Mutation:
// delete the pendingSetups gate → setupCount goes 1 → N → fails.
func TestEICTunnelPool_ConcurrentAcquireSingleSetup(t *testing.T) {
setupCount, _, _, _ := withPoolSetupStub(t)
// Pause setup so all goroutines pile into the pending slot.
prev := poolSetupTunnel
gate := make(chan struct{})
poolSetupTunnel = func(ctx context.Context, instanceID string) (
eicSSHSession, func(), error) {
<-gate
atomic.AddInt64(setupCount, 1)
return eicSSHSession{instanceID: instanceID}, func() {}, nil
}
t.Cleanup(func() { poolSetupTunnel = prev })
poolTTL = 50 * time.Second
pool := freshPool(t)
ctx := context.Background()
const N = 8
type result struct {
done func(bool)
err error
}
results := make(chan result, N)
var startWg sync.WaitGroup
startWg.Add(N)
for i := 0; i < N; i++ {
go func() {
startWg.Done()
_, done, err := pool.acquire(ctx, "i-concurrent")
results <- result{done, err}
}()
}
startWg.Wait()
// give all N goroutines time to enter pool.acquire
time.Sleep(20 * time.Millisecond)
close(gate)
for i := 0; i < N; i++ {
r := <-results
if r.err != nil {
t.Fatalf("acquire %d: %v", i, r.err)
}
r.done(false)
}
if got := atomic.LoadInt64(setupCount); got != 1 {
t.Errorf("expected 1 setup across %d concurrent acquires, got %d", N, got)
}
}
// TestEICTunnelPool_TTLZeroDisablesPooling pins the escape hatch:
// poolTTL=0 means every acquire goes straight through to setup +
// cleanup, no entry kept. Useful for tests / opt-out.
func TestEICTunnelPool_TTLZeroDisablesPooling(t *testing.T) {
setupCount, cleanupCount, _, unblock := withPoolSetupStub(t)
poolTTL = 0
pool := freshPool(t)
ctx := context.Background()
_, done, err := pool.acquire(ctx, "i-ttlzero")
if err != nil {
t.Fatalf("acquire 1: %v", err)
}
done(false)
unblock()
_, done, err = pool.acquire(ctx, "i-ttlzero")
if err != nil {
t.Fatalf("acquire 2: %v", err)
}
done(false)
if got := atomic.LoadInt64(setupCount); got != 2 {
t.Errorf("expected 2 setups with TTL=0 (pool disabled), got %d", got)
}
if got := atomic.LoadInt64(cleanupCount); got != 2 {
t.Errorf("expected 2 cleanups with TTL=0 (each release closes), got %d", got)
}
}
// TestEICTunnelPool_LRUEvictionAtCap pins the LRU defence: when the
// pool reaches poolMaxEntries, a new acquire for an unseen
// instanceID evicts the LRU idle entry instead of growing unbounded.
func TestEICTunnelPool_LRUEvictionAtCap(t *testing.T) {
setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
prev := poolMaxEntries
poolMaxEntries = 2
t.Cleanup(func() { poolMaxEntries = prev })
poolTTL = 50 * time.Second
// Replace stub with one that doesn't gate so we can fill quickly.
poolSetupTunnel = func(ctx context.Context, instanceID string) (
eicSSHSession, func(), error) {
atomic.AddInt64(setupCount, 1)
return eicSSHSession{instanceID: instanceID}, func() {
atomic.AddInt64(cleanupCount, 1)
}, nil
}
pool := freshPool(t)
ctx := context.Background()
for _, id := range []string{"i-1", "i-2"} {
_, done, err := pool.acquire(ctx, id)
if err != nil {
t.Fatalf("acquire %s: %v", id, err)
}
done(false)
}
// Both entries idle, pool at cap.
_, done, err := pool.acquire(ctx, "i-3")
if err != nil {
t.Fatalf("acquire i-3: %v", err)
}
done(false)
// Wait for the goroutine'd cleanup of the evicted entry.
deadline := time.Now().Add(500 * time.Millisecond)
for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
time.Sleep(10 * time.Millisecond)
}
if got := atomic.LoadInt64(setupCount); got != 3 {
t.Errorf("expected 3 setups (one per unique instance), got %d", got)
}
if got := atomic.LoadInt64(cleanupCount); got < 1 {
t.Errorf("expected ≥1 cleanup (LRU eviction), got %d", got)
}
}
// TestEICTunnelPool_PoisonedClassification pins the heuristic that
// distinguishes tunnel-fatal errors (poison the entry) from
// app-level errors (file not found, validation) that should NOT
// invalidate the tunnel.
func TestEICTunnelPool_PoisonedClassification(t *testing.T) {
cases := []struct {
name string
err error
want bool
}{
{"nil", nil, false},
{"file not found", errors.New("os: file does not exist"), false},
{"validation", errors.New("invalid path: must be relative"), false},
{"connection refused", errors.New("ssh: connect to host: connection refused"), true},
{"connection refused upper", errors.New("Connection Refused"), true},
{"broken pipe", errors.New("write tunnel: broken pipe"), true},
{"permission denied", errors.New("Permission denied (publickey)"), true},
{"auth failed", errors.New("Authentication failed"), true},
{"connection reset", errors.New("Connection reset by peer"), true},
{"port forward", errors.New("port forwarding failed"), true},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
got := fnErrIndicatesTunnelFault(tc.err)
if got != tc.want {
t.Errorf("fnErrIndicatesTunnelFault(%v) = %v, want %v",
tc.err, got, tc.want)
}
})
}
}
// TestEICTunnelPool_RefcountBlocksEviction pins that an entry past
// TTL is NOT evicted while a caller still holds it — preventing
// use-after-free in the holder.
func TestEICTunnelPool_RefcountBlocksEviction(t *testing.T) {
setupCount, cleanupCount, _, _ := withPoolSetupStub(t)
poolTTL = 30 * time.Millisecond
poolJanitorInterval = 5 * time.Millisecond
pool := freshPool(t)
ctx := context.Background()
_, done, err := pool.acquire(ctx, "i-hold")
if err != nil {
t.Fatalf("acquire: %v", err)
}
// Sleep past TTL while holding the session. Janitor sweeps
// every 5ms but must skip our entry because refcount=1.
time.Sleep(80 * time.Millisecond)
if got := atomic.LoadInt64(cleanupCount); got != 0 {
t.Errorf("expected 0 cleanups while holder is active, got %d", got)
}
done(false)
// Now refcount=0 and entry is past TTL; releaser triggers cleanup.
deadline := time.Now().Add(200 * time.Millisecond)
for atomic.LoadInt64(cleanupCount) < 1 && time.Now().Before(deadline) {
time.Sleep(5 * time.Millisecond)
}
if got := atomic.LoadInt64(cleanupCount); got != 1 {
t.Errorf("expected 1 cleanup after release of expired entry, got %d", got)
}
if got := atomic.LoadInt64(setupCount); got != 1 {
t.Errorf("setupCount tracking: got %d, want 1", got)
}
}
// TestPooledWithEICTunnel_PanicPoisonsEntry pins that a panic
// from fn poisons the pool entry on the way out — refcount goes
// back to zero (no leak) and the entry is marked unusable so the
// next acquire builds fresh. Without the defer-release pattern, a
// panic would leave refcount=1 forever and the entry would never
// evict.
func TestPooledWithEICTunnel_PanicPoisonsEntry(t *testing.T) {
setupCount, _, _, _ := withPoolSetupStub(t)
poolTTL = 50 * time.Second
globalEICTunnelPool = newEICTunnelPool()
t.Cleanup(globalEICTunnelPool.stop)
func() {
defer func() {
if r := recover(); r == nil {
t.Errorf("expected panic to bubble up, got nil")
}
}()
_ = pooledWithEICTunnel(context.Background(), "i-panic",
func(s eicSSHSession) error { panic("boom") })
}()
// Replenish the gate so the next setup can run.
prev := poolSetupTunnel
poolSetupTunnel = func(ctx context.Context, instanceID string) (
eicSSHSession, func(), error) {
atomic.AddInt64(setupCount, 1)
return eicSSHSession{instanceID: instanceID}, func() {}, nil
}
t.Cleanup(func() { poolSetupTunnel = prev })
// Next acquire must build fresh — entry was poisoned by panic.
if err := pooledWithEICTunnel(context.Background(), "i-panic",
func(s eicSSHSession) error { return nil }); err != nil {
t.Fatalf("post-panic acquire: %v", err)
}
if got := atomic.LoadInt64(setupCount); got != 2 {
t.Errorf("expected 2 setups (panic poisoned, rebuild), got %d", got)
}
}
// TestPooledWithEICTunnel_PreservesFnErr pins that errors from the
// inner fn pass through to the caller verbatim — pool wrapping
// should not swallow or transform error semantics for app code.
func TestPooledWithEICTunnel_PreservesFnErr(t *testing.T) {
withPoolSetupStub(t)
poolTTL = 50 * time.Second
// Reset the global pool so this test is isolated from any prior
// test that may have populated it.
globalEICTunnelPool = newEICTunnelPool()
want := errors.New("file does not exist")
got := pooledWithEICTunnel(context.Background(), "i-fn-err",
func(s eicSSHSession) error { return want })
if !errors.Is(got, want) {
t.Errorf("pooledWithEICTunnel returned %v, want %v", got, want)
}
}