Compare commits

..

5 Commits

Author SHA1 Message Date
Hongming Wang 4279fecde5 fix(ci): keep codex in TEMPLATES + skip-if-no-publish-image.yml
The v2 dropped codex from TEMPLATES on the basis of "no
publish-image.yml = not part of cascade today." That was correct
about the immediate behavior but tripped cascade-list-drift-gate.yml
because manifest.json still declares codex (it IS a live runtime —
referenced from workspace/config.py and cloned into dev envs by
clone-manifest.sh; only the image-publish path is missing).

Restore codex to TEMPLATES (matching manifest) and add a runtime
soft-skip: probe each repo for .github/workflows/publish-image.yml
via the Gitea contents API and skip cleanly if 404. Final job log
distinguishes "complete across all" vs "complete with soft-skips".

This preserves the drift gate's invariant (TEMPLATES == manifest)
while honoring the empirical fact that codex has no publish-image
workflow yet. If codex later gains the workflow, no change here is
needed — the probe will see 200 and the cascade will fan out to it
naturally.

Refs molecule-core#14, molecule-core#20.
2026-05-07 03:32:53 -07:00
Hongming Wang 607444e71b feat(ci): replace curl-dispatch with push-mode cascade (v2)
Empirical blocker on v1: Gitea 1.22.6 has no repository_dispatch /
workflow_dispatch trigger API (verified across 6 candidate paths in
issuecomment-913). v1's curl-POST loop would always exit-1.

v2 pivots to push-mode: each template repo got a small companion PR
(merged 2026-05-07) adding a `.runtime-version` file at root + a
`resolve-version` job in publish-image.yml that reads the file and
forwards the value to the reusable build workflow. publish-runtime
now updates that file via git-clone + commit + push, which trips
each template's existing `on: push: branches: [main]` trigger.

Behaviour changes vs v1:
- Templates list dropped from 9 → 8 (codex has no publish-image.yml
  so was never part of the cascade in practice).
- 3-retry pull-rebase loop per template (handles concurrent-push
  races without force-push). Failures collected, job exits 1 with
  the failed-template list at the end.
- Idempotency: when re-run with the same version, templates already
  pinned to that version contribute zero commits — operator can
  safely re-run to retry partial failures.
- Author line: "publish-runtime cascade <publish-runtime@moleculesai
  .app>" trailer makes it clear the commit is workflow-driven, not
  human (per memory feedback_github_botring_fingerprint).

DISPATCH_TOKEN secret name unchanged (still consumed at
secrets.DISPATCH_TOKEN per 569df259).

Refs molecule-core#14, builds on molecule-core#20 issuecomment-923
(Phase 2 design).
2026-05-07 03:17:38 -07:00
Hongming Wang 1ff7342e91 chore: retrigger CI after runner config fix 2026-05-07 03:01:23 -07:00
Hongming Wang 569df259ba fix(ci): align secret name to plumbed DISPATCH_TOKEN (closes #14)
The cascade workflow was reading from `secrets.TEMPLATE_DISPATCH_TOKEN`
but the plumbed secret name is `DISPATCH_TOKEN` (verified just now via
GET /repos/molecule-ai/molecule-core/actions/secrets — only DISPATCH_TOKEN
is set). Without this rename the cascade would always evaluate "secret
missing" and exit 1 on the next push to staging, defeating the entire
point of grant-role-access.sh --apply that just landed.

Three references updated:
  - env mapping (`secrets.X` → `secrets.DISPATCH_TOKEN`)
  - workflow_dispatch warning text
  - push-trigger error text

The bash-side variable name is unchanged (still `DISPATCH_TOKEN`) so
the curl invocation at line 372 is unaffected. YAML round-trip parses
clean.
2026-05-07 02:38:20 -07:00
claude-ceo-assistant ce3f1f48a4 fix(ci): port publish-runtime cascade to Gitea repo-dispatch API (closes molecule-core#14)
## Symptom

`publish-runtime.yml::cascade` fired a `repository_dispatch` to 10 workspace-template
repos via direct curl to `https://api.github.com/repos/...`. Post-2026-05-06 the
org's GitHub presence is suspended; every invocation 404s. The job's
`::warning::` posture meant the failure didn't propagate, leaving the runtime
PyPI publish → template image rebuild pipeline silently broken.

## Why Option A (rewrite) and not Option B (delete)

Verified 2026-05-07 by devops-engineer (molecule-core#14 thread):

- The cron-poll mechanism (/etc/cron.d/molecule-deploy-poll) tracks ONLY the
  Vercel/Railway-deployed repos (landingpage/docs/molecule-app/molecules-market
  /molecule-controlplane). It does NOT track workspace-template-* repos.
- Each of the 9 template `publish-image.yml` workflows has
  `repository_dispatch: types: [runtime-published]` as a load-bearing trigger.
  Without the cascade, when the runtime ships a new PyPI version, templates
  don't auto-rebuild.

So Option B (delete) would silently break the runtime → template fan-out.
Option A (rewrite to Gitea's API shape) is the right call. Security-auditor
agreed after seeing the cron-poll TRACKED list.

## API surface change

| Concern | Pre-fix (GitHub) | Post-fix (Gitea) |
|---|---|---|
| URL | `https://api.github.com/repos/$REPO/dispatches` | `${GITEA_URL}/api/v1/repos/$REPO/dispatches` |
| Owner case | `Molecule-AI/...` | `molecule-ai/...` (lowercase, Gitea is case-sensitive) |
| Auth header | `Authorization: Bearer $DISPATCH_TOKEN` | `Authorization: token $DISPATCH_TOKEN` |
| Body shape | `{event_type, client_payload}` | UNCHANGED — Gitea is GitHub-compatible here |
| Success code | `204 No Content` | `204 No Content` (unchanged) |

`GITEA_URL` defaults to `https://git.moleculesai.app`; overridable via job env.

## Out-of-band: DISPATCH_TOKEN secret rotation

The DISPATCH_TOKEN secret was a GitHub PAT. It must be re-minted as a Gitea
PAT for the new API to authenticate. Per saved memory
`feedback_per_agent_gitea_identity_default`, this should be a dedicated
`publish-runtime-bot` persona token with `write:repository` scope on the
9 target repos — NOT the founder PAT.

This PR ships the workflow change. Token rotation is the operator-host
follow-up (security-auditor's lane) — coordinate the merge so the token
is in place before the next runtime release fires.

## Backwards compatibility

The workflow ran silently-broken since 2026-05-06 (every invocation 404
+ ::warning:: but no failure). So there is no functional regression from
"silently broken" to "actually working". Any in-progress operator-managed
manual dispatch path is unaffected; the Gitea API parallel path doesn't
require operator intervention.

## Test plan

- [x] YAML parse OK on the modified workflow file
- [ ] Smoke test: trigger a runtime publish (or simulate via dispatching to one
      template) post-merge; verify HTTP 204 + the template's publish-image
      workflow fires + the template's image gets re-pushed against the new
      runtime version. Phase 4 verification belongs to internal#46 follow-up.

## Hostile self-review (3 weakest spots)

1. The fan-out remains all-or-nothing: a single template failure surfaces as
   a `::warning::` but PyPI publish proceeds. With 9 templates this is a
   ~10% per-template chance of stale-image-on-runtime-bump if any one fails.
   Defense: the warning shows up in the workflow summary; operators retry.
   Future hardening: requeue-on-fail with bounded retry, or a separate
   reconcile cron that detects template/runtime version drift and re-dispatches.

2. `DISPATCH_TOKEN` validity is enforced by the Gitea API (401 on stale)
   but the workflow doesn't differentiate 401 from 404. Either way the
   warning fires. Future hardening: explicit token-shape check at the start
   of the cascade job (curl `/api/v1/user` once, fail-fast if 401).

3. Owner-case lowercase is right today but couples the workflow to the
   current Gitea org slug. If the org is ever renamed, this workflow
   breaks silently. Less fragile alternative: derive REPO from a
   canonical config (e.g. `gh repo list molecule-ai`) instead of
   string-concatenating. Acceptable today; filed as the same future
   hardening pass as item 1.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 01:31:37 -07:00
19 changed files with 160 additions and 644 deletions
+2 -2
View File
@@ -1,7 +1,7 @@
name: Block internal-flavored paths
# Hard CI gate. Internal content (positioning, competitive briefs, sales
# playbooks, PMM/press drip, draft campaigns) lives in molecule-ai/internal —
# playbooks, PMM/press drip, draft campaigns) lives in Molecule-AI/internal —
# this public monorepo must never re-acquire those paths. CEO directive
# 2026-04-23 after a fleet-wide audit found 79 internal files leaked here.
#
@@ -135,7 +135,7 @@ jobs:
echo "::error::Forbidden internal-flavored paths detected:"
printf "$OFFENDING"
echo ""
echo "These paths belong in molecule-ai/internal, not this public repo."
echo "These paths belong in Molecule-AI/internal, not this public repo."
echo "See docs/internal-content-policy.md for canonical locations."
echo ""
echo "If your file is genuinely public-facing (e.g. a blog post"
+1 -1
View File
@@ -108,7 +108,7 @@ jobs:
echo
echo "One or more canary secrets are unset (\`CANARY_TENANT_URLS\`, \`CANARY_ADMIN_TOKENS\`, \`CANARY_CP_SHARED_SECRET\`)."
echo "Phase 2 canary fleet has not been stood up yet —"
echo "see [canary-tenants.md](https://github.com/molecule-ai/molecule-controlplane/blob/main/docs/canary-tenants.md)."
echo "see [canary-tenants.md](https://github.com/Molecule-AI/molecule-controlplane/blob/main/docs/canary-tenants.md)."
echo
echo "**Skipped — promote-to-latest will NOT auto-fire.** Dispatch \`promote-latest.yml\` manually when ready."
} >> "$GITHUB_STEP_SUMMARY"
+5 -5
View File
@@ -87,7 +87,7 @@ jobs:
run: go mod download
- if: needs.changes.outputs.platform == 'true'
run: go build ./cmd/server
# CLI (molecli) moved to standalone repo: github.com/molecule-ai/molecule-cli
# CLI (molecli) moved to standalone repo: github.com/Molecule-AI/molecule-cli
- if: needs.changes.outputs.platform == 'true'
run: go vet ./... || true
- if: needs.changes.outputs.platform == 'true'
@@ -165,7 +165,7 @@ jobs:
# Strip the package-import prefix so we can match .coverage-allowlist.txt
# entries written as paths relative to workspace-server/.
# Handle both module paths: platform/workspace-server/... and platform/...
rel=$(echo "$file" | sed 's|^github.com/molecule-ai/molecule-monorepo/platform/workspace-server/||; s|^github.com/molecule-ai/molecule-monorepo/platform/||')
rel=$(echo "$file" | sed 's|^github.com/Molecule-AI/molecule-monorepo/platform/workspace-server/||; s|^github.com/Molecule-AI/molecule-monorepo/platform/||')
if echo "$ALLOWLIST" | grep -qxF "$rel"; then
echo "::warning file=workspace-server/$rel::Critical file at ${pct}% coverage (allowlisted, #1823) — fix before expiry."
@@ -243,8 +243,8 @@ jobs:
if-no-files-found: warn
# MCP Server + SDK removed from CI — now in standalone repos:
# - github.com/molecule-ai/molecule-mcp-server (npm CI)
# - github.com/molecule-ai/molecule-sdk-python (PyPI CI)
# - github.com/Molecule-AI/molecule-mcp-server (npm CI)
# - github.com/Molecule-AI/molecule-sdk-python (PyPI CI)
# e2e-api job moved to .github/workflows/e2e-api.yml (issue #458).
# It now has workflow-level concurrency (cancel-in-progress: false) so
@@ -434,5 +434,5 @@ jobs:
fi
# SDK + plugin validation moved to standalone repo:
# github.com/molecule-ai/molecule-sdk-python
# github.com/Molecule-AI/molecule-sdk-python
+1 -1
View File
@@ -62,7 +62,7 @@ jobs:
if: matrix.language == 'go'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: molecule-ai/molecule-ai-plugin-github-app-auth
repository: Molecule-AI/molecule-ai-plugin-github-app-auth
path: molecule-ai-plugin-github-app-auth
token: ${{ secrets.PLUGIN_REPO_PAT || secrets.GITHUB_TOKEN }}
+1 -1
View File
@@ -102,7 +102,7 @@ jobs:
if: needs.detect-changes.outputs.run == 'true'
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: molecule-ai/molecule-ai-plugin-github-app-auth
repository: Molecule-AI/molecule-ai-plugin-github-app-auth
path: molecule-ai-plugin-github-app-auth
token: ${{ secrets.PLUGIN_REPO_PAT || secrets.GITHUB_TOKEN }}
+1 -1
View File
@@ -19,4 +19,4 @@ permissions:
jobs:
disable-auto-merge-on-push:
uses: molecule-ai/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
uses: Molecule-AI/molecule-ci/.github/workflows/disable-auto-merge-on-push.yml@main
+127 -54
View File
@@ -25,7 +25,7 @@ name: publish-runtime
# 3. Publishes to PyPI via the PyPA Trusted Publisher action (OIDC).
# No static API token is stored — PyPI verifies the workflow's
# OIDC claim against the trusted-publisher config registered for
# molecule-ai-workspace-runtime (molecule-ai/molecule-core,
# molecule-ai-workspace-runtime (Molecule-AI/molecule-core,
# publish-runtime.yml, environment pypi-publish).
#
# After publish: the 8 template repos pick up the new version on their
@@ -166,7 +166,7 @@ jobs:
- name: Publish to PyPI (Trusted Publisher / OIDC)
# PyPI side is configured: project molecule-ai-workspace-runtime →
# publisher molecule-ai/molecule-core, workflow publish-runtime.yml,
# publisher Molecule-AI/molecule-core, workflow publish-runtime.yml,
# environment pypi-publish. The action mints a short-lived OIDC
# token and exchanges it for a PyPI upload credential — no static
# API token in this repo's secrets.
@@ -282,42 +282,33 @@ jobs:
echo "::error::Refusing to fan out cascade against stale or corrupt PyPI surfaces."
exit 1
- name: Fan out repository_dispatch
- name: Fan out via push to .runtime-version
env:
# Fine-grained PAT with `actions:write` on the 8 template repos.
# GITHUB_TOKEN can't fire dispatches across repos — needs an explicit
# token. Stored as a repo secret; rotate per the standard schedule.
DISPATCH_TOKEN: ${{ secrets.TEMPLATE_DISPATCH_TOKEN }}
# Single source of truth: the publish job's output, which handles
# tag/manual-input/auto-bump uniformly. The previous fallback
# (`steps.version.outputs.version` from inside the cascade job)
# was a dead reference — different job, no shared step scope.
# Gitea PAT with write:repository scope on the 8 cascade-active
# template repos. Used here for `git push` (NOT for an API
# dispatch — Gitea 1.22.6 has no repository_dispatch endpoint;
# empirically verified across 6 candidate paths in molecule-
# core#20 issuecomment-913). The push trips each template's
# existing `on: push: branches: [main]` trigger on
# publish-image.yml, which then reads the updated
# .runtime-version via its resolve-version job.
DISPATCH_TOKEN: ${{ secrets.DISPATCH_TOKEN }}
RUNTIME_VERSION: ${{ needs.publish.outputs.version }}
run: |
set +e # don't abort on a single repo failure — collect them all
# Schedule-vs-dispatch behaviour split (hardened 2026-04-28
# after the sweep-cf-orphans soft-skip incident — same class
# of bug):
#
# The earlier "skipping cascade. templates will pick up the
# new version on their own next rebuild" message was wrong —
# templates only build on this dispatch trigger; without it
# they stay pinned to whatever runtime version they last saw.
# A silent skip here means "PyPI is current, templates are
# not" and the gap is invisible until someone notices a
# template still on the old version weeks later.
#
# - push → exit 1 (red CI surfaces the gap)
# - workflow_dispatch → exit 0 with a warning (operator
# ran this ad-hoc; let them rerun
# after fixing the secret)
# Soft-skip on workflow_dispatch when the token is missing
# (operator ad-hoc test); hard-fail on push so unattended
# publishes can't silently skip the cascade. Same shape as
# the original v1, intentional split per the schedule-vs-
# dispatch hardening 2026-04-28.
if [ -z "$DISPATCH_TOKEN" ]; then
if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
echo "::warning::TEMPLATE_DISPATCH_TOKEN secret not set — skipping cascade."
echo "::warning::DISPATCH_TOKEN secret not set — skipping cascade."
echo "::warning::set it at Settings → Secrets and Variables → Actions, then rerun. Templates will stay on the prior runtime version until either this token is set or each template is rebuilt manually."
exit 0
fi
echo "::error::TEMPLATE_DISPATCH_TOKEN secret missing — cascade cannot fan out."
echo "::error::DISPATCH_TOKEN secret missing — cascade cannot fan out."
echo "::error::PyPI was published, but the 8 template repos will NOT pick up the new version until this token is restored and a republish dispatches the cascade."
echo "::error::set it at Settings → Secrets and Variables → Actions; then re-trigger publish-runtime via workflow_dispatch."
exit 1
@@ -327,37 +318,119 @@ jobs:
echo "::error::publish job did not expose a version output — cascade cannot fan out"
exit 1
fi
# All 9 active workspace template repos. The PR #2536 pruning
# ("deprecated, no shipping images") was empirically wrong:
# continuous-synth-e2e.yml defaults to langgraph as its primary
# canary (line 44), and every excluded template had successful
# publish-image runs as of 2026-05-03 — none were dormant.
# Symptom of the prune: today's a2a-sdk strict-mode fix
# (#2566 / commit e1628c4) cascaded to 4 templates but never
# reached langgraph, so the synth-E2E correctly canary'd a fix
# that had landed but not deployed. Re-added the 5 templates.
# Long-term: derive this list from manifest.json so cascade
# scope can't drift from E2E scope — tracked in RFC #388 as a
# Phase-1 invariant.
# All 9 workspace templates declared in manifest.json. The list
# MUST stay aligned with manifest.json's workspace_templates —
# cascade-list-drift-gate.yml enforces this in CI per the
# codex-stuck-on-stale-runtime invariant from PR #2556.
# Long-term goal: derive this list from manifest.json so it
# can't drift even on a manifest edit (RFC #388 Phase-1).
#
# Per-template publish-image.yml presence is checked at
# cascade-time below: codex doesn't ship one today, so the
# cascade soft-skips it with an informational message rather
# than dropping it from this list (which would re-introduce
# the drift the gate exists to catch).
GITEA_URL="${GITEA_URL:-https://git.moleculesai.app}"
TEMPLATES="claude-code hermes openclaw codex langgraph crewai autogen deepagents gemini-cli"
FAILED=""
SKIPPED=""
# Configure git identity once. The persona owning DISPATCH_TOKEN
# is the same identity that authored this commit on each
# template; using a generic "publish-runtime cascade" co-author
# trailer in the message keeps the audit trail honest about the
# workflow-driven origin.
git config --global user.name "publish-runtime cascade"
git config --global user.email "publish-runtime@moleculesai.app"
WORKDIR="$(mktemp -d)"
for tpl in $TEMPLATES; do
REPO="molecule-ai/molecule-ai-workspace-template-$tpl"
STATUS=$(curl -sS -o /tmp/dispatch.out -w "%{http_code}" \
-X POST "https://api.github.com/repos/$REPO/dispatches" \
-H "Authorization: Bearer $DISPATCH_TOKEN" \
-H "Accept: application/vnd.github+json" \
-H "X-GitHub-Api-Version: 2022-11-28" \
-d "{\"event_type\":\"runtime-published\",\"client_payload\":{\"runtime_version\":\"$VERSION\"}}")
if [ "$STATUS" = "204" ]; then
echo "✓ dispatched $tpl ($VERSION)"
else
echo "::warning::✗ failed to dispatch $tpl: HTTP $STATUS — $(cat /tmp/dispatch.out)"
CLONE="$WORKDIR/$tpl"
# Pre-check: skip templates without a publish-image.yml.
# The cascade's job is to trip the template's on-push
# rebuild — if there's no rebuild workflow, pushing a
# .runtime-version commit is just noise on the target
# repo. Use the Gitea contents API (no clone required for
# the probe). 200 = present; 404 = absent.
HTTP=$(curl -sS -o /dev/null -w "%{http_code}" \
-H "Authorization: token $DISPATCH_TOKEN" \
"$GITEA_URL/api/v1/repos/$REPO/contents/.github/workflows/publish-image.yml")
if [ "$HTTP" = "404" ]; then
echo "↷ $tpl has no publish-image.yml — soft-skip (informational; manifest still tracks it)"
SKIPPED="$SKIPPED $tpl"
continue
fi
if [ "$HTTP" != "200" ]; then
echo "::warning::$tpl publish-image.yml probe returned HTTP $HTTP — proceeding anyway, push will surface the real failure if any"
fi
# Use a per-template attempt loop so a transient race (e.g.
# human pushing to the same template at the same instant)
# doesn't lose the cascade. Bounded retries (3) — beyond
# that we surface the failure and let the operator retry.
attempt=0
success=false
while [ $attempt -lt 3 ]; do
attempt=$((attempt + 1))
rm -rf "$CLONE"
if ! git clone --depth=1 \
"https://x-access-token:${DISPATCH_TOKEN}@${GITEA_URL#https://}/$REPO.git" \
"$CLONE" >/tmp/clone.log 2>&1; then
echo "::warning::clone $tpl attempt $attempt failed: $(tail -n3 /tmp/clone.log)"
sleep 2
continue
fi
cd "$CLONE"
echo "$VERSION" > .runtime-version
# Idempotency guard: if the file already matches, this
# publish is a re-run for a version already cascaded.
# Don't push a no-op commit (would spuriously re-trip the
# template's on-push and rebuild for nothing).
if git diff --quiet -- .runtime-version; then
echo "✓ $tpl already at $VERSION — no commit needed (idempotent)"
success=true
cd - >/dev/null
break
fi
git add .runtime-version
git commit -m "chore: pin runtime to $VERSION (publish-runtime cascade)" \
-m "Co-Authored-By: publish-runtime cascade <publish-runtime@moleculesai.app>" \
>/dev/null
if git push origin HEAD:main >/tmp/push.log 2>&1; then
echo "✓ $tpl pushed $VERSION on attempt $attempt"
success=true
cd - >/dev/null
break
fi
# Likely a non-fast-forward — pull-rebase and retry.
# Don't force-push: that would silently overwrite a racing
# human/cascade commit.
echo "::warning::push $tpl attempt $attempt failed, pull-rebasing: $(tail -n3 /tmp/push.log)"
git pull --rebase origin main >/tmp/rebase.log 2>&1 || true
cd - >/dev/null
done
if [ "$success" != "true" ]; then
FAILED="$FAILED $tpl"
fi
done
rm -rf "$WORKDIR"
if [ -n "$FAILED" ]; then
echo "::warning::Cascade incomplete. Failed templates:$FAILED"
# Don't fail the whole job — PyPI publish already succeeded;
# operators can retry the failed templates manually.
echo "::error::Cascade incomplete after 3 retries each. Failed templates:$FAILED"
echo "::error::PyPI publish succeeded; failed templates lag the new version. Re-run this workflow_dispatch with the same version to retry only the laggers (idempotent — already-cascaded templates skip)."
exit 1
fi
if [ -n "$SKIPPED" ]; then
echo "Cascade complete: pinned $VERSION on cascade-active templates. Soft-skipped (no publish-image.yml):$SKIPPED"
else
echo "Cascade complete: $VERSION pinned across all manifest workspace_templates."
fi
@@ -80,12 +80,12 @@ jobs:
#
# Uses a fine-grained PAT (PLUGIN_REPO_PAT) because the plugin repo
# is private and the default GITHUB_TOKEN is scoped to THIS repo.
# The PAT needs Contents:Read on molecule-ai/molecule-ai-plugin-
# The PAT needs Contents:Read on Molecule-AI/molecule-ai-plugin-
# github-app-auth. Falls back to the default token for the (rare)
# case where an operator made the plugin repo public.
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: molecule-ai/molecule-ai-plugin-github-app-auth
repository: Molecule-AI/molecule-ai-plugin-github-app-auth
path: molecule-ai-plugin-github-app-auth
token: ${{ secrets.PLUGIN_REPO_PAT || secrets.GITHUB_TOKEN }}
@@ -9,7 +9,7 @@ name: redeploy-tenants-on-main
#
# This workflow closes the gap by calling the control-plane admin
# endpoint that performs a canary-first, batched, health-gated rolling
# redeploy across every live tenant. Implemented in molecule-ai/
# redeploy across every live tenant. Implemented in Molecule-AI/
# molecule-controlplane as POST /cp/admin/tenants/redeploy-fleet
# (feat/tenant-auto-redeploy, landing alongside this workflow).
#
@@ -146,7 +146,7 @@ jobs:
- name: Call CP redeploy-fleet
# CP_ADMIN_API_TOKEN must be set as a repo/org secret on
# molecule-ai/molecule-core, matching the staging/prod CP's
# Molecule-AI/molecule-core, matching the staging/prod CP's
# CP_ADMIN_API_TOKEN env. Stored in Railway, mirrored to this
# repo's secrets for CI.
env:
@@ -97,7 +97,7 @@ jobs:
- name: Call staging-CP redeploy-fleet
# CP_STAGING_ADMIN_API_TOKEN must be set as a repo/org secret
# on molecule-ai/molecule-core, matching staging-CP's
# on Molecule-AI/molecule-core, matching staging-CP's
# CP_ADMIN_API_TOKEN env var (visible in Railway controlplane
# / staging environment). Stored separately from the prod
# CP_ADMIN_API_TOKEN so a leak of one doesn't auth the other.
@@ -96,7 +96,7 @@ jobs:
--body "$(cat <<'BODY'
[retarget-bot] This PR was opened against `main` and has been retargeted to `staging` automatically.
**Why:** per [SHARED_RULES rule 8](https://github.com/molecule-ai/molecule-ai-org-template-molecule-dev/blob/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.
**Why:** per [SHARED_RULES rule 8](https://github.com/Molecule-AI/molecule-ai-org-template-molecule-dev/blob/main/SHARED_RULES.md), all feature work targets `staging` first; the CEO promotes `staging → main` separately.
**What changed:** just the base branch — no code change. CI will re-run against `staging`. If you get merge conflicts, rebase on `staging`.
+1 -1
View File
@@ -12,7 +12,7 @@ name: Secret scan
#
# jobs:
# secret-scan:
# uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging
# uses: Molecule-AI/molecule-core/.github/workflows/secret-scan.yml@staging
#
# Pin to @staging not @main — staging is the active default branch,
# main lags via the staging-promotion workflow. Updates ride along
-89
View File
@@ -1,89 +0,0 @@
package main
import "testing"
// TestResolveBindHost pins the precedence: BIND_ADDR explicit > dev-mode
// fail-open default of 127.0.0.1 > production-shape empty (all interfaces).
//
// Mutation-test invariant: removing the IsDevModeFailOpen() branch makes
// "no_bindaddr_devmode_unset_admin" fail (returns "" instead of "127.0.0.1").
// Removing the BIND_ADDR branch makes "explicit_bindaddr_*" cases fail.
func TestResolveBindHost(t *testing.T) {
cases := []struct {
name string
bindAddr string
adminToken string
molEnv string
want string
}{
{
name: "no_bindaddr_devmode_unset_admin",
bindAddr: "",
adminToken: "",
molEnv: "dev",
want: "127.0.0.1",
},
{
name: "no_bindaddr_devmode_unset_admin_full_word",
bindAddr: "",
adminToken: "",
molEnv: "development",
want: "127.0.0.1",
},
{
name: "no_bindaddr_admin_set_in_dev_env",
bindAddr: "",
adminToken: "secret",
molEnv: "dev",
want: "", // ADMIN_TOKEN flips IsDevModeFailOpen to false → all interfaces
},
{
name: "no_bindaddr_production_env",
bindAddr: "",
adminToken: "",
molEnv: "production",
want: "", // production is not a dev value → all interfaces
},
{
name: "no_bindaddr_unset_env",
bindAddr: "",
adminToken: "",
molEnv: "",
want: "", // unset MOLECULE_ENV → not dev → all interfaces
},
{
name: "explicit_bindaddr_loopback_overrides_devmode",
bindAddr: "127.0.0.1",
adminToken: "",
molEnv: "dev",
want: "127.0.0.1",
},
{
name: "explicit_bindaddr_wildcard_overrides_devmode_default",
bindAddr: "0.0.0.0",
adminToken: "",
molEnv: "dev",
want: "0.0.0.0",
},
{
name: "explicit_bindaddr_in_production",
bindAddr: "10.0.5.7",
adminToken: "secret",
molEnv: "production",
want: "10.0.5.7",
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
t.Setenv("BIND_ADDR", tc.bindAddr)
t.Setenv("ADMIN_TOKEN", tc.adminToken)
t.Setenv("MOLECULE_ENV", tc.molEnv)
got := resolveBindHost()
if got != tc.want {
t.Errorf("resolveBindHost() = %q, want %q (BIND_ADDR=%q ADMIN_TOKEN=%q MOLECULE_ENV=%q)",
got, tc.want, tc.bindAddr, tc.adminToken, tc.molEnv)
}
})
}
}
+3 -35
View File
@@ -19,7 +19,6 @@ import (
"github.com/Molecule-AI/molecule-monorepo/platform/internal/handlers"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/imagewatch"
memwiring "github.com/Molecule-AI/molecule-monorepo/platform/internal/memory/wiring"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/middleware"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/pendinguploads"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/provisioner"
"github.com/Molecule-AI/molecule-monorepo/platform/internal/registry"
@@ -338,23 +337,15 @@ func main() {
// Router
r := router.Setup(hub, broadcaster, prov, platformURL, configsDir, wh, channelMgr, memBundle)
// HTTP server with graceful shutdown.
//
// Bind host: in dev-mode (no ADMIN_TOKEN, MOLECULE_ENV=dev|development)
// the AdminAuth chain fails open by design; pairing that with a wildcard
// bind would expose unauth /workspaces to any same-LAN peer. Default to
// loopback when fail-open is active. Operators who need LAN exposure set
// BIND_ADDR=0.0.0.0 explicitly. Production (ADMIN_TOKEN set) is unchanged.
// See molecule-core#7.
bindHost := resolveBindHost()
// HTTP server with graceful shutdown
srv := &http.Server{
Addr: fmt.Sprintf("%s:%s", bindHost, port),
Addr: fmt.Sprintf(":%s", port),
Handler: r,
}
// Start server in goroutine
go func() {
log.Printf("Platform starting on %s:%s (dev-mode-fail-open=%v)", bindHost, port, middleware.IsDevModeFailOpen())
log.Printf("Platform starting on :%s", port)
if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatalf("Server failed: %v", err)
}
@@ -389,29 +380,6 @@ func envOr(key, fallback string) string {
return fallback
}
// resolveBindHost picks the listener interface for the HTTP server.
//
// Precedence:
// 1. BIND_ADDR — explicit operator override (any value, including "0.0.0.0").
// 2. dev-mode fail-open active → "127.0.0.1" (loopback only).
// 3. otherwise → "" (Go binds every interface; existing prod/self-host shape).
//
// Coupling the loopback default to middleware.IsDevModeFailOpen() means the
// two safety levers — bind narrowness and auth strength — move together. A
// production deploy (ADMIN_TOKEN set) keeps binding to all interfaces because
// the auth chain is doing its job; a dev Mac (no ADMIN_TOKEN, MOLECULE_ENV=dev)
// is reachable only via loopback because the auth chain is fail-open. See
// molecule-core#7 for the original LAN exposure finding.
func resolveBindHost() string {
if v := os.Getenv("BIND_ADDR"); v != "" {
return v
}
if middleware.IsDevModeFailOpen() {
return "127.0.0.1"
}
return ""
}
func findConfigsDir() string {
candidates := []string{
"workspace-configs-templates",
+6 -33
View File
@@ -4,7 +4,6 @@ import (
"bytes"
"context"
"io"
"log"
"os"
"path/filepath"
"strings"
@@ -178,42 +177,16 @@ func strDefault(m map[string]interface{}, key, fallback string) string {
return fallback
}
// findRunningContainer returns the live container name for workspaceID, or ""
// when the container is genuinely not running OR the daemon errored
// transiently. Routed through provisioner.RunningContainerName as the SSOT
// (molecule-core#10) so this handler agrees with healthsweep on the same
// inputs. Transient daemon errors are logged distinctly so triage doesn't
// confuse a flaky daemon with a stopped container.
func (h *PluginsHandler) findRunningContainer(ctx context.Context, workspaceID string) string {
name, err := provisioner.RunningContainerName(ctx, h.docker, workspaceID)
if err != nil {
log.Printf("plugins: docker inspect transient error for %s: %v (treating as not-running for this request)", workspaceID, err)
if h.docker == nil {
return ""
}
return name
}
// isExternalRuntime reports whether the workspace's runtime is the
// `external` (remote-pull) shape introduced in Phase 30. External
// workspaces have no local container — `POST /plugins` (push-install via
// docker exec) doesn't apply to them; they pull via the download endpoint
// instead. Returns false (allow-install) if the lookup is unwired or
// errors — failing open here is safe because the downstream
// findRunningContainer step still gates on a real container being there.
//
// Background — molecule-core#10: without this check, external workspaces
// fall through to findRunningContainer's NotFound path and return a
// misleading 503 "container not running" instead of a clear "use the
// pull endpoint" message.
func (h *PluginsHandler) isExternalRuntime(workspaceID string) bool {
if h.runtimeLookup == nil {
return false
name := provisioner.ContainerName(workspaceID)
info, err := h.docker.ContainerInspect(ctx, name)
if err == nil && info.State.Running {
return name
}
runtime, err := h.runtimeLookup(workspaceID)
if err != nil {
return false
}
return runtime == "external"
return ""
}
func (h *PluginsHandler) execAsRoot(ctx context.Context, containerName string, cmd []string) (string, error) {
@@ -1,176 +0,0 @@
package handlers
import (
"go/ast"
"go/parser"
"go/token"
"strings"
"testing"
)
// TestFindRunningContainer_RoutesThroughProvisionerSSOT is a behavior-based
// AST gate: it pins the invariant that PluginsHandler.findRunningContainer
// MUST go through provisioner.RunningContainerName for its is-running check,
// instead of carrying its own copy of cli.ContainerInspect logic.
//
// Background — molecule-core#10: a parallel impl of "is the workspace's
// container running" used to live in plugins.go. It drifted from the
// canonical impl in healthsweep (which goes through Provisioner.IsRunning
// → RunningContainerName) on edge cases like "transient daemon error" —
// the duplicate would 503 with a misleading message while healthsweep
// correctly stayed defensive. Consolidating onto RunningContainerName as
// the SSOT prevents any future copy from re-introducing that drift.
//
// Mutation invariant: if a future PR replaces the provisioner call with
// `h.docker.ContainerInspect(...)` directly, this test fails. That's the
// signal to either (a) extend RunningContainerName's contract OR (b)
// document why this call site needs to differ. Either way: the drift
// gets a reviewer's attention instead of shipping silently.
func TestFindRunningContainer_RoutesThroughProvisionerSSOT(t *testing.T) {
fset := token.NewFileSet()
file, err := parser.ParseFile(fset, "plugins.go", nil, parser.ParseComments)
if err != nil {
t.Fatalf("parse plugins.go: %v", err)
}
var fn *ast.FuncDecl
ast.Inspect(file, func(n ast.Node) bool {
f, ok := n.(*ast.FuncDecl)
if !ok || f.Name.Name != "findRunningContainer" {
return true
}
// Confirm receiver is *PluginsHandler so we don't pick up an unrelated
// helper of the same name. ast.Recv is a FieldList — receivers carry
// at most one field.
if f.Recv == nil || len(f.Recv.List) == 0 {
return true
}
fn = f
return false
})
if fn == nil {
t.Fatal("findRunningContainer not found in plugins.go — was it renamed? update this test or the SSOT routing assumption")
}
var (
callsRunningContainerName bool
callsContainerInspectRaw bool
)
ast.Inspect(fn.Body, func(n ast.Node) bool {
call, ok := n.(*ast.CallExpr)
if !ok {
return true
}
sel, ok := call.Fun.(*ast.SelectorExpr)
if !ok {
return true
}
// Pkg.Func form: provisioner.RunningContainerName(...)
if pkgIdent, ok := sel.X.(*ast.Ident); ok {
if pkgIdent.Name == "provisioner" && sel.Sel.Name == "RunningContainerName" {
callsRunningContainerName = true
}
}
// Receiver-then-method form: h.docker.ContainerInspect(...) /
// p.cli.ContainerInspect(...) — anything ending in
// .ContainerInspect that's NOT routed through provisioner.
if sel.Sel.Name == "ContainerInspect" {
callsContainerInspectRaw = true
}
return true
})
if !callsRunningContainerName {
t.Errorf(
"findRunningContainer must call provisioner.RunningContainerName for the SSOT inspect — see molecule-core#10. Found no such call.",
)
}
if callsContainerInspectRaw {
t.Errorf(
"findRunningContainer carries a direct ContainerInspect call. This is the parallel-impl drift molecule-core#10 fixed. " +
"Either route through provisioner.RunningContainerName OR — if a new use case truly needs a different inspect — extend RunningContainerName's contract first and update this gate to allow the specific delta.",
)
}
}
// TestProvisionerIsRunning_RoutesThroughRunningContainerName mirrors the
// gate above but for the OTHER consumer of the SSOT — Provisioner.IsRunning
// (called by healthsweep). If a future refactor makes IsRunning carry its
// own ContainerInspect again, the two consumers' edge-case behaviors will
// silently drift. Keep them yoked.
func TestProvisionerIsRunning_RoutesThroughRunningContainerName(t *testing.T) {
fset := token.NewFileSet()
file, err := parser.ParseFile(fset, "../provisioner/provisioner.go", nil, parser.ParseComments)
if err != nil {
t.Fatalf("parse provisioner.go: %v", err)
}
var fn *ast.FuncDecl
ast.Inspect(file, func(n ast.Node) bool {
f, ok := n.(*ast.FuncDecl)
if !ok || f.Name.Name != "IsRunning" || f.Recv == nil {
return true
}
// The receiver type must be *Provisioner specifically. CPProvisioner
// has its own IsRunning that talks HTTP to the controlplane and is
// out of scope for this gate.
if !receiverIs(f, "Provisioner") {
return true
}
fn = f
return false
})
if fn == nil {
t.Fatal("Provisioner.IsRunning not found — was it renamed? update this test")
}
var (
callsRunningContainerName bool
callsContainerInspectRaw bool
)
ast.Inspect(fn.Body, func(n ast.Node) bool {
call, ok := n.(*ast.CallExpr)
if !ok {
return true
}
// Same-package call: bare identifier (e.g. RunningContainerName(...)).
if id, ok := call.Fun.(*ast.Ident); ok && id.Name == "RunningContainerName" {
callsRunningContainerName = true
return true
}
// Selector call: pkg.Func (e.g. provisioner.RunningContainerName)
// OR recv.Method (e.g. p.cli.ContainerInspect).
sel, ok := call.Fun.(*ast.SelectorExpr)
if !ok {
return true
}
switch sel.Sel.Name {
case "RunningContainerName":
callsRunningContainerName = true
case "ContainerInspect":
callsContainerInspectRaw = true
}
return true
})
if !callsRunningContainerName {
t.Errorf("Provisioner.IsRunning must call RunningContainerName for the SSOT inspect — see molecule-core#10")
}
if callsContainerInspectRaw {
t.Errorf("Provisioner.IsRunning carries a direct ContainerInspect call; route through RunningContainerName instead")
}
}
// receiverIs reports whether fn's receiver is `*<typeName>` or `<typeName>`.
func receiverIs(fn *ast.FuncDecl, typeName string) bool {
if fn.Recv == nil || len(fn.Recv.List) == 0 {
return false
}
expr := fn.Recv.List[0].Type
if star, ok := expr.(*ast.StarExpr); ok {
expr = star.X
}
id, ok := expr.(*ast.Ident)
return ok && strings.EqualFold(id.Name, typeName)
}
@@ -32,18 +32,6 @@ import (
// inside the workspace at startup.
func (h *PluginsHandler) Install(c *gin.Context) {
workspaceID := c.Param("id")
// External-runtime guard (molecule-core#10): push-install via docker
// exec is meaningless for `runtime='external'` workspaces — they have
// no local container. Reject early with a hint pointing at the
// pull-mode endpoint, instead of falling through to a misleading
// "container not running" 503 from findRunningContainer.
if h.isExternalRuntime(workspaceID) {
c.JSON(http.StatusUnprocessableEntity, gin.H{
"error": "plugin install via push is not supported for external runtimes",
"hint": "external workspaces pull plugins via GET /workspaces/:id/plugins/:name/download",
})
return
}
// Cap the JSON body so a pathological POST can't exhaust parser memory.
bodyMax := envx.Int64("PLUGIN_INSTALL_BODY_MAX_BYTES", defaultInstallBodyMaxBytes)
c.Request.Body = http.MaxBytesReader(c.Writer, c.Request.Body, bodyMax)
@@ -105,16 +93,6 @@ func (h *PluginsHandler) Uninstall(c *gin.Context) {
pluginName := c.Param("name")
ctx := c.Request.Context()
// Mirror Install's external-runtime guard (molecule-core#10) so the
// two endpoints reject the same shape with the same message.
if h.isExternalRuntime(workspaceID) {
c.JSON(http.StatusUnprocessableEntity, gin.H{
"error": "plugin uninstall via docker exec is not supported for external runtimes",
"hint": "external workspaces manage their own plugin directory; remove it locally",
})
return
}
if err := validatePluginName(pluginName); err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": "invalid plugin name"})
return
@@ -1,176 +0,0 @@
package handlers
import (
"bytes"
"net/http"
"net/http/httptest"
"strings"
"testing"
"github.com/gin-gonic/gin"
)
// TestPluginInstall_ExternalRuntime_Returns422 — molecule-core#10.
// Install on a `runtime='external'` workspace must NOT fall through to
// findRunningContainer (which would 503 with a misleading "container not
// running"). It must return 422 with a hint pointing at the pull-mode
// download endpoint.
func TestPluginInstall_ExternalRuntime_Returns422(t *testing.T) {
h := NewPluginsHandler(t.TempDir(), nil, nil).
WithRuntimeLookup(func(workspaceID string) (string, error) {
return "external", nil
})
w := httptest.NewRecorder()
c, _ := gin.CreateTestContext(w)
c.Params = gin.Params{{Key: "id", Value: "ba1789b0-4d21-4f4f-a878-fa226bf77cf5"}}
c.Request = httptest.NewRequest(
"POST",
"/workspaces/ba1789b0-4d21-4f4f-a878-fa226bf77cf5/plugins",
bytes.NewBufferString(`{"source":"local://my-plugin"}`),
)
c.Request.Header.Set("Content-Type", "application/json")
h.Install(c)
if w.Code != http.StatusUnprocessableEntity {
t.Errorf("expected 422 (Unprocessable Entity) for runtime='external', got %d: %s", w.Code, w.Body.String())
}
if !strings.Contains(w.Body.String(), "external runtimes") {
t.Errorf("expected error body to mention 'external runtimes', got: %s", w.Body.String())
}
if !strings.Contains(w.Body.String(), "download") {
t.Errorf("expected error body to point at the download endpoint, got: %s", w.Body.String())
}
}
// TestPluginUninstall_ExternalRuntime_Returns422 — symmetric guard on the
// uninstall path (DELETE /workspaces/:id/plugins/:name). External
// workspaces manage their own plugin directory locally; the platform
// can't docker-exec into them.
func TestPluginUninstall_ExternalRuntime_Returns422(t *testing.T) {
h := NewPluginsHandler(t.TempDir(), nil, nil).
WithRuntimeLookup(func(workspaceID string) (string, error) {
return "external", nil
})
w := httptest.NewRecorder()
c, _ := gin.CreateTestContext(w)
c.Params = gin.Params{
{Key: "id", Value: "ba1789b0-4d21-4f4f-a878-fa226bf77cf5"},
{Key: "name", Value: "my-plugin"},
}
c.Request = httptest.NewRequest(
"DELETE",
"/workspaces/ba1789b0-4d21-4f4f-a878-fa226bf77cf5/plugins/my-plugin",
nil,
)
h.Uninstall(c)
if w.Code != http.StatusUnprocessableEntity {
t.Errorf("expected 422 for runtime='external', got %d: %s", w.Code, w.Body.String())
}
if !strings.Contains(w.Body.String(), "external runtimes") {
t.Errorf("expected error body to mention 'external runtimes', got: %s", w.Body.String())
}
}
// TestPluginInstall_ContainerBackedRuntime_FallsThroughGuard — the runtime
// guard MUST NOT short-circuit container-backed runtimes. With
// `runtime='claude-code'` the install proceeds past the guard; without a
// real plugin source it'll fail downstream (here: 404 from local resolver
// because no plugin staged), which is the correct error to surface.
//
// This is the mutation-test partner: deleting the `runtime == "external"`
// check would still pass TestPluginInstall_ExternalRuntime (because Install
// would 404 instead of 422 — but the test asserts 422), and would still
// pass this test (because both pre-fix and post-fix produce 404 here).
// What this case pins is "non-external still falls through," catching
// any over-eager guard that rejects all runtimes.
func TestPluginInstall_ContainerBackedRuntime_FallsThroughGuard(t *testing.T) {
h := NewPluginsHandler(t.TempDir(), nil, nil).
WithRuntimeLookup(func(workspaceID string) (string, error) {
return "claude-code", nil
})
w := httptest.NewRecorder()
c, _ := gin.CreateTestContext(w)
c.Params = gin.Params{{Key: "id", Value: "c7c28c0b-4ea5-4e75-9728-3ba860081708"}}
c.Request = httptest.NewRequest(
"POST",
"/workspaces/c7c28c0b-4ea5-4e75-9728-3ba860081708/plugins",
bytes.NewBufferString(`{"source":"local://nonexistent-plugin"}`),
)
c.Request.Header.Set("Content-Type", "application/json")
h.Install(c)
if w.Code == http.StatusUnprocessableEntity {
t.Errorf("runtime='claude-code' must fall through the external guard; got 422: %s", w.Body.String())
}
// The local resolver will fail to find the plugin → 404. Anything
// other than 422 (which would mean we mis-classified) is fine.
if w.Code != http.StatusNotFound {
t.Errorf("expected 404 (plugin not found in registry), got %d: %s", w.Code, w.Body.String())
}
}
// TestPluginInstall_NoRuntimeLookup_FailsOpen — when the runtime lookup
// is unwired (test fixtures, niche deploy shapes) the guard MUST default
// to allowing the install attempt. The downstream findRunningContainer
// step still gates on a real container, so failing open here doesn't
// expose a bypass — it just preserves backwards-compat with deployments
// that haven't wired the lookup.
func TestPluginInstall_NoRuntimeLookup_FailsOpen(t *testing.T) {
h := NewPluginsHandler(t.TempDir(), nil, nil) // NO WithRuntimeLookup
w := httptest.NewRecorder()
c, _ := gin.CreateTestContext(w)
c.Params = gin.Params{{Key: "id", Value: "ws-no-lookup"}}
c.Request = httptest.NewRequest(
"POST",
"/workspaces/ws-no-lookup/plugins",
bytes.NewBufferString(`{"source":"local://nonexistent"}`),
)
c.Request.Header.Set("Content-Type", "application/json")
h.Install(c)
if w.Code == http.StatusUnprocessableEntity {
t.Errorf("nil runtimeLookup must fall through (fail-open); got 422: %s", w.Body.String())
}
}
// TestPluginInstall_RuntimeLookupErrors_FailsOpen — same fail-open story
// for transient DB errors in the lookup. We don't want a momentary
// Postgres hiccup to flip every plugin install into a 422.
func TestPluginInstall_RuntimeLookupErrors_FailsOpen(t *testing.T) {
h := NewPluginsHandler(t.TempDir(), nil, nil).
WithRuntimeLookup(func(workspaceID string) (string, error) {
return "", errFakeDB
})
w := httptest.NewRecorder()
c, _ := gin.CreateTestContext(w)
c.Params = gin.Params{{Key: "id", Value: "ws-db-flake"}}
c.Request = httptest.NewRequest(
"POST",
"/workspaces/ws-db-flake/plugins",
bytes.NewBufferString(`{"source":"local://nonexistent"}`),
)
c.Request.Header.Set("Content-Type", "application/json")
h.Install(c)
if w.Code == http.StatusUnprocessableEntity {
t.Errorf("runtimeLookup error must fall through (fail-open); got 422: %s", w.Body.String())
}
}
// errFakeDB is a sentinel for the fail-open lookup-error case.
var errFakeDB = &fakeError{msg: "synthetic db error"}
type fakeError struct{ msg string }
func (e *fakeError) Error() string { return e.msg }
@@ -1073,53 +1073,18 @@ func (p *Provisioner) IsRunning(ctx context.Context, workspaceID string) (bool,
if p == nil || p.cli == nil {
return false, ErrNoBackend
}
name, err := RunningContainerName(ctx, p.cli, workspaceID)
name := ContainerName(workspaceID)
info, err := p.cli.ContainerInspect(ctx, name)
if err != nil {
if isContainerNotFound(err) {
return false, nil
}
// Transient daemon error: caller treats !running as dead + restarts.
// Returning true + the underlying error preserves the error for
// metrics/logging without triggering the destructive path.
return true, err
}
return name != "", nil
}
// RunningContainerName returns the container name for workspaceID iff the
// container exists AND is in the Running state. Single source of truth for
// "what live container should I exec into for this workspace?" — used by
// both Provisioner.IsRunning (healthsweep) and the plugins handler.
//
// Distinguishes three outcomes so callers can pick their own policy:
//
// - ("ws-<id>", nil): container is running. Caller can exec into it.
// - ("", nil): container does not exist OR exists but is stopped
// (NotFound, Exited, Created, Restarting…). Caller
// should treat as a definitive "not running."
// - ("", err): transient daemon error (timeout, socket EOF, ctx
// cancel). Caller should NOT infer "not running" —
// this could be a flaky daemon under load. Decide
// per-callsite whether to fail soft or hard.
//
// Background — molecule-core#10: the plugins handler used to carry its own
// copy of this inspect logic (`findRunningContainer`) which collapsed
// transient errors into the same "" return as a genuinely-stopped container.
// That hid daemon flakes as misleading 503 "container not running" responses
// AND let the two impls drift on edge-case behavior. This is the SSOT.
func RunningContainerName(ctx context.Context, cli *client.Client, workspaceID string) (string, error) {
if cli == nil {
return "", ErrNoBackend
}
name := ContainerName(workspaceID)
info, err := cli.ContainerInspect(ctx, name)
if err != nil {
if isContainerNotFound(err) {
return "", nil
}
return "", err
}
if info.State.Running {
return name, nil
}
return "", nil
return info.State.Running, nil
}
// isContainerNotFound returns true when the Docker client indicates the