Compare commits

..

2 Commits

Author SHA1 Message Date
core-devops 862a90a316 fix(runtime): complete crewai/deepagents/gemini-cli removal + fail-closed on unresolvable runtime
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 2s
cascade-list-drift-gate / check (pull_request) Failing after 2s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 4m18s
E2E API Smoke Test / detect-changes (pull_request) Successful in 4s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 4s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 2s
Harness Replays / detect-changes (pull_request) Successful in 3s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
gate-check-v3 / gate-check (pull_request) Successful in 3s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 22s
qa-review / approved (pull_request) Failing after 2s
security-review / approved (pull_request) Successful in 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 3s
sop-tier-check / tier-check (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 50s
CI / Canvas (Next.js) (pull_request) Successful in 5m38s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 37s
Harness Replays / Harness Replays (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1m11s
CI / Python Lint & Test (pull_request) Successful in 6m20s
CI / all-required (pull_request) Successful in 6m21s
E2E Chat / E2E Chat (pull_request) Failing after 4m13s
audit-force-merge / audit (pull_request) Successful in 3s
Addresses core-security REQUEST-CHANGES (review 4269) on PR #1385
(RFC internal#483). The original removal was incomplete and silently
changed behavior:

1. manifest.json workspace_templates still listed crewai/deepagents/
   gemini-cli. handlers.knownRuntimes (the workspace-create admission
   allowlist) is built at boot from manifest.json via runtime_registry.go
   — NOT from provisioner.knownRuntimes — so a create request for a
   "removed" runtime was still admitted. Removed the 3 entries; the
   manifest now resolves to the 6 kept template-backed runtimes
   (claude-code, hermes, openclaw, codex, langgraph, autogen) + injected
   meta-runtimes. langgraph/autogen retained intentionally — not in the
   internal#483 removal set and still in provisioner.knownRuntimes.

2. selectImage silently fell back to DefaultImage (langgraph) for any
   named-but-unresolvable runtime: a user asking for crewai got a
   langgraph container with no signal. selectImage now returns
   (string, error) and rejects a NAMED-but-unresolvable runtime with
   ErrUnresolvableRuntime (naming the offending runtime). Start
   propagates it; the existing markProvisionFailed path already
   broadcasts WorkspaceProvisionFailed + records the message. Implements
   the CTO hardgate directive (feedback_platform_must_hardgate_base_contract
   — fail-closed, no silent degrade). The genuinely-unspecified (empty)
   runtime remains a distinct legitimate path → DefaultImage.

go build ./... and go test ./internal/provisioner/ ./internal/handlers/
./internal/imagewatch/ ./internal/registry/ ./internal/models/ all pass.
selectImage tests updated to assert the new fail-closed contract (new
expected values verified against the security finding, not made-green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:36:16 -07:00
core-devops e809a2e319 chore(runtime): remove crewai/deepagents/gemini-cli from the runtime catalog
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 15s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 4s
E2E Staging SaaS (full lifecycle) / pr-validate (pull_request) Successful in 28s
E2E Staging SaaS (full lifecycle) / E2E Staging SaaS (pull_request) Has been skipped
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Harness Replays / detect-changes (pull_request) Successful in 5s
CI / Platform (Go) (pull_request) Successful in 4m59s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 49s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 2s
security-review / approved (pull_request) Failing after 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 2s
sop-tier-check / tier-check (pull_request) Successful in 4s
CI / Canvas (Next.js) (pull_request) Successful in 6m34s
CI / Python Lint & Test (pull_request) Successful in 6m28s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 36s
Harness Replays / Harness Replays (pull_request) Successful in 2s
CI / all-required (pull_request) Successful in 6m23s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m3s
E2E Chat / E2E Chat (pull_request) Failing after 4m58s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CTO-approved scope-narrowing to the four maintained runtimes
(claude-code, openclaw, hermes, codex) plus autogen/langgraph legacy.
RFC molecule-ai/internal#483.

- provisioner/registry.go: drop the 3 from knownRuntimes (kept
  alphabetical ordering; count pin updated 9 -> 6 in registry_test.go).
- handlers/admin_workspace_images.go: drop the 3 from AllRuntimes so
  imagewatch stops pulling now-removed template images.
- models/runtime_defaults.go: doc-comment list updated; drop the 3
  explicit drift-detection rows from runtime_defaults_test.go.
- localbuild_test.go RepoNotFound fixture switched crewai -> hermes
  (post-removal crewai is no longer a known runtime, so the test would
  hit the "unknown runtime" branch instead of the intended 404 path).
- provisioner_test.go error-classification fixture string crewai ->
  hermes (no removed-runtime names in fixtures).

Live-workspace safety gate cleared first: zero workspaces on the 3
runtimes across the production fleet (per-tenant DB query on
hongming/go-to-market/chloe-dong + CP tenant_resources provisioning
ledger covering reno-stars). Registry/pin row removal handled by a
separate controlplane migration PR, landed after this merges.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:17:26 -07:00
4 changed files with 1 additions and 106 deletions
+1 -89
View File
@@ -23,7 +23,6 @@ import dataclasses
import json
import os
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
@@ -327,43 +326,6 @@ def update_pull(pr_number: int, *, dry_run: bool) -> None:
)
def wait_for_ci(
head_sha: str,
contexts: list[str],
*,
max_wait_seconds: int = 300,
poll_interval: int = 15,
) -> bool:
"""Poll CI statuses for head_sha until all required contexts are terminal.
Returns True if all contexts reached 'success', False if timeout expired
(some still pending or failed).
Background: after a queue-triggered PR update, CI re-runs on the new head.
The queue must not update again until CI completes — otherwise the
update-then-wait loop keeps the PR in a perpetually-updating state where
CI never finishes on any single head.
"""
deadline = time.time() + max_wait_seconds
while time.time() < deadline:
time.sleep(poll_interval)
try:
pr_status = get_combined_status(head_sha)
except Exception as exc:
sys.stderr.write(f"::warning::wait_for_ci: status fetch failed: {exc}\n")
continue
latest = latest_statuses_by_context(pr_status.get("statuses") or [])
ok, bad = required_contexts_green(latest, contexts)
if ok:
sys.stderr.write(f"::notice::wait_for_ci: all contexts green after {int(time.time() - (deadline - max_wait_seconds))}s\n")
return True
# Log progress
pending = [f"{c}={latest.get(c, {}).get('status', 'missing')}" for c in contexts if latest.get(c, {}).get('status') != 'success']
sys.stderr.write(f"::notice::wait_for_ci: still waiting ({int(deadline - time.time())}s left): {', '.join(pending[:3])}\n")
sys.stderr.write(f"::warning::wait_for_ci: timeout after {max_wait_seconds}s; proceeding with merge check\n")
return False
def merge_pull(pr_number: int, *, dry_run: bool) -> None:
payload = {
"Do": "merge",
@@ -376,24 +338,7 @@ def merge_pull(pr_number: int, *, dry_run: bool) -> None:
print(f"::notice::merging PR #{pr_number}")
if dry_run:
return
# Gitea's merge endpoint returns HTTP 200 with an empty body on success.
# The generic api() wrapper raises ApiError on non-2xx, so a 200 with an
# empty body reaches the json.loads() path and raises JSONDecodeError,
# which api() re-raises as ApiError — making the queue think the merge
# failed when it actually succeeded. Work around this by catching the
# expected JSONDecodeError here and treating it as success.
try:
api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
except ApiError as exc:
# Surface non-merge errors (5xx server errors, 403 forbidden, etc.)
if "merge" in str(exc).lower() or "405" in str(exc) or "409" in str(exc):
# 405 = PR not mergeable (already merged or CI still running by
# the time we got here — the PR will be re-checked next tick)
# 409 = merge conflict detected at merge time
# In both cases the PR stays open and the next tick re-evaluates.
sys.stderr.write(f"::warning::merge call returned: {exc}\n")
else:
raise
api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
def process_once(*, dry_run: bool = False) -> int:
@@ -445,32 +390,6 @@ def process_once(*, dry_run: bool = False) -> int:
print(f"::notice::PR #{pr_number} decision={decision.action}: {decision.reason}")
if decision.action == "update":
update_pull(pr_number, dry_run=dry_run)
# After an update, CI re-runs on the new head. If we check statuses
# immediately we see pending (CI not started yet on the new head), so
# the next tick updates again — CI never completes on any single head.
# Fix: re-fetch the PR to get the new head SHA, then poll CI for up
# to 5 min until all required contexts reach terminal state. If CI
# finishes in time, proceed to merge on the same tick.
if not dry_run:
updated_pr = get_pull(pr_number)
new_head = updated_pr.get("head", {}).get("sha", "")
if new_head and new_head != head_sha:
sys.stderr.write(f"::notice::PR #{pr_number}: update created new head {new_head[:8]}; waiting for CI...\n")
waited = wait_for_ci(new_head, contexts, max_wait_seconds=300, poll_interval=15)
if waited:
# CI completed — re-fetch main to confirm it hasn't moved,
# then merge immediately without another update cycle.
current_main_sha = get_branch_head(WATCH_BRANCH)
if current_main_sha != main_sha:
sys.stderr.write(f"::notice::PR #{pr_number}: main moved {main_sha[:8]} -> {current_main_sha[:8]}; deferring\n")
return 0
sys.stderr.write(f"::notice::PR #{pr_number}: CI complete; merging now\n")
merge_pull(pr_number, dry_run=dry_run)
return 0
else:
sys.stderr.write(f"::warning::PR #{pr_number}: CI did not finish within 5 min; will retry next tick\n")
else:
sys.stderr.write(f"::notice::PR #{pr_number}: update did not change head SHA; will retry\n")
post_comment(
pr_number,
(
@@ -481,13 +400,6 @@ def process_once(*, dry_run: bool = False) -> int:
)
return 0
if decision.ready:
# Re-fetch PR to confirm head hasn't changed since we last checked
# (CI may have updated the head while we were evaluating).
current_pr = get_pull(pr_number)
current_head = current_pr.get("head", {}).get("sha", "")
if current_head != head_sha:
print(f"::notice::PR #{pr_number} head changed {head_sha[:8]} -> {current_head[:8]}; re-evaluating")
return 0
latest_main_sha = get_branch_head(WATCH_BRANCH)
if latest_main_sha != main_sha:
print(
-6
View File
@@ -32,12 +32,6 @@ on:
# iterating all open PRs when PR_NUMBER is empty.
workflow_dispatch:
# Cancel stale runs so the 8-runner pool stays available for PR jobs.
# Per-SHA group ensures push and cron runs at different SHAs don't cancel each other.
concurrency:
group: gate-check-v3-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: true
permissions:
# read: contents — for checkout (base ref, not PR head for security)
# read: pull-requests — for reading PR info via API
@@ -44,12 +44,6 @@ on:
- ".github/scripts/lint_secret_pattern_drift.py"
- ".githooks/pre-commit"
# Cancel stale runs to keep the 8-runner pool available for PR jobs.
# Per-SHA group ensures push and scheduled runs at different SHAs don't cancel each other.
concurrency:
group: secret-pattern-drift-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: true
env:
GITHUB_SERVER_URL: https://git.moleculesai.app
-5
View File
@@ -22,11 +22,6 @@ on:
- cron: '17 4 * * 1' # Mondays at 04:17 UTC
workflow_dispatch:
# Cancel stale runs to keep the 8-runner pool available for PR jobs.
concurrency:
group: weekly-platform-go-${{ github.event.pull_request.head.sha || github.sha }}
cancel-in-progress: true
permissions:
contents: read
statuses: write