Compare commits

52 Commits

Author SHA1 Message Date
molecule-ai[bot] 46c8c1de23 Merge pull request #2602 from Molecule-AI/staging
staging → main: auto-promote 6d38b96
2026-05-03 16:49:40 +00:00
Hongming Wang 6d38b96043 Merge pull request #2601 from Molecule-AI/fix/2483-negative-cache-branch-tests
test(envelope-enrichment): pin negative-cache for non-JSON 200 + non-dict JSON 200 (#2483)
2026-05-03 16:37:30 +00:00
Hongming Wang 270a95aa67 test(envelope-enrichment): pin negative-cache for non-JSON 200 + non-dict JSON 200 (#2483)
The two missing branch tests called out by the multi-axis review of #2471.

a2a_client.enrich_peer_metadata handles two failure shapes (lines 105-112)
that the existing 12 envelope-enrichment tests don't exercise:

  1. HTTP 200, response.json() raises (non-JSON body)
  2. HTTP 200, valid JSON, but body is list/string/number not dict

Both paths land at the negative-cache write, but no test verified the
discriminator. Pin both with the same call_count == 1 assertion shape
the 5xx + network-exception tests already use.

Verified: temporarily removing the negative-cache write in either
branch makes the corresponding test fail with call_count == 2 — the
assertion correctly discriminates the contract from a fall-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:35:21 -07:00
Hongming Wang 6431bdc631 Merge pull request #2600 from Molecule-AI/auto-sync/main-72b6be82
chore: sync main → staging (auto, ff to 72b6be82)
2026-05-03 16:23:16 +00:00
molecule-ai[bot] 72b6be82b0 Merge pull request #2599 from Molecule-AI/staging
staging → main: auto-promote b425995
2026-05-03 09:18:48 -07:00
Hongming Wang b42599585e Merge pull request #2598 from Molecule-AI/fix/auto-promote-skip-empty-tree
fix(auto-promote): skip empty-tree promotes to break perpetual cycle
2026-05-03 15:59:05 +00:00
Hongming Wang 06bfed2e35 Merge pull request #2597 from Molecule-AI/auto-sync/main-d1eab79d
chore: sync main → staging (auto, ff to d1eab79d)
2026-05-03 15:57:47 +00:00
Hongming Wang 80b38900de fix(auto-promote): skip empty-tree promotes to break perpetual cycle
The auto-promote ↔ auto-sync chain has been generating empty PRs
indefinitely since the staging merge_queue ruleset uses MERGE
strategy:

1. Auto-promote merges PR via queue → main = merge commit M2 not in staging
2. Auto-sync opens sync-back PR. Workflow's local `git merge --ff-only`
   succeeds (PR title even says "ff to ..."), but the queue lands the
   PR via MERGE → staging = merge commit S2 not in main
3. Auto-promote sees staging ahead by 1 → opens new promote PR. Tree
   diff vs main = 0 (S2's tree == main's tree). But the gate logic
   only checks "all required workflows green", not "actual code to
   ship" → opens an empty promote PR
4. ... repeat indefinitely

Each round costs ~30-40 min wallclock, ~2 manual approvals (the queue
requires 1 review and the bot can't self-approve without admin
bypass), and one full CodeQL Go run (~15 min).

Observed today (2026-05-03) across PRs #2592, #2594, #2595, #2596, #2597: 5 PRs, ~3 hours, all empty content.

Fix: before opening the promote PR, check that staging's tree
actually differs from main's tree. If they're identical (the
empty-merge-commit cycle), skip cleanly and let the cycle terminate.

Implementation:
- New step `Skip if staging tree == main tree` runs before the
  existing gate check.
- `git diff --quiet origin/main $HEAD_SHA` exits 0 iff trees match.
- On match: emits a step summary explaining the skip + sets
  `skip=true`; subsequent gate-check + promote steps are gated on
  `skip != 'true'` so they short-circuit.
- Fail-open: if `git fetch` errors, fall through to gate check
  (preserve existing behavior). Only skip when diff is DEFINITIVELY
  empty.
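
The guard described above can be sketched in isolation. This is a minimal illustration, not the workflow's actual step: the helper name `tree_skip_decision` and the throwaway repository built by `demo` are hypothetical. Only the `git diff --quiet <base> <head>` check, the `skip=true`/`skip=false` outputs, and the fail-open intent come from this commit message.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: prints skip=true only when both refs have identical
# trees. `git diff --quiet` exits 0 for no differences and non-zero for
# differences or errors (e.g. an unknown ref); every non-zero exit maps to
# skip=false, so errors fail open.
tree_skip_decision() {
  if git diff --quiet "$1" "$2" -- 2>/dev/null; then
    echo "skip=true"
  else
    echo "skip=false"
  fi
}

# Hypothetical demo in a throwaway repo: an empty commit stands in for the
# sync-back merge commit (new SHA, same tree as its parent).
demo() {
  repo="$(mktemp -d)"
  cd "$repo"
  git init -q
  commit() {
    git -c user.name=demo -c user.email=demo@example.invalid \
        -c commit.gpgsign=false commit -q "$@"
  }
  echo "v1" > app.txt
  git add app.txt
  commit -m "real change"
  base="$(git rev-parse HEAD)"
  commit --allow-empty -m "stand-in for sync-back merge"
  same_tree="$(git rev-parse HEAD)"
  echo "v2" > app.txt
  commit -am "another real change"
  new_tree="$(git rev-parse HEAD)"

  tree_skip_decision "$base" "$same_tree"    # identical trees
  tree_skip_decision "$base" "$new_tree"     # trees differ
  tree_skip_decision "$base" "no-such-ref"   # git error, fail open
}
```

Calling `demo` prints `skip=true` once and then `skip=false` twice: the empty commit changes the SHA but not the tree, while a real change or a git error both fall through to the gate check.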

Long-term, the cleaner fix is to switch the merge_queue ruleset's
merge_method away from MERGE so FF-able PRs land cleanly without a
new commit — but that's a broader change affecting every staging
PR's commit shape. This guard is the surgical one-step break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:44 -07:00
molecule-ai[bot] d1eab79d28 Merge pull request #2596 from Molecule-AI/staging
staging → main: auto-promote 824a2a7
2026-05-03 15:50:12 +00:00
Hongming Wang 824a2a7657 Merge pull request #2595 from Molecule-AI/auto-sync/main-876d6ec8
chore: sync main → staging (auto, ff to 876d6ec8)
2026-05-03 15:38:22 +00:00
molecule-ai[bot] 876d6ec8c9 Merge pull request #2594 from Molecule-AI/staging
staging → main: auto-promote 63e3d38
2026-05-03 08:33:51 -07:00
Hongming Wang 63e3d385d6 Merge pull request #2592 from Molecule-AI/auto-sync/main-2e78812f
chore: sync main → staging (auto, ff to 2e78812f)
2026-05-03 15:15:01 +00:00
molecule-ai[bot] 2e78812ff9 Merge pull request #2591 from Molecule-AI/staging
staging → main: auto-promote 19cc833
2026-05-03 15:04:00 +00:00
Hongming Wang 9664d66e4b Merge branch 'main' into staging
2026-05-03 07:48:31 -07:00
Hongming Wang 19cc83313a Merge pull request #2589 from Molecule-AI/fix/retarget-skip-staging-head
fix(retarget): skip PRs whose head is staging (auto-promote PRs)
2026-05-03 14:36:44 +00:00
molecule-ai[bot] 097d513b65 Merge pull request #2588 from Molecule-AI/staging
staging → main: auto-promote c45aa8d
2026-05-03 07:35:05 -07:00
Hongming Wang 2b3f44c3c8 fix(retarget): skip PRs whose head is staging (auto-promote PRs)
The retarget-main-to-staging workflow tries to PATCH base=staging on
every bot-authored PR opened against main. Auto-promote staging→main
PRs have head=staging, base=main — retargeting them sets head AND
base to staging, which GitHub rejects with HTTP 422 "no new commits
between base 'staging' and head 'staging'".

This started surfacing on PR #2588 (2026-05-03 14:30) once #2586
switched the auto-promote workflow to an App token. Before #2586
the auto-promote PR was authored by github-actions[bot], which the
retarget filter happened to skip; now it's molecule-ai[bot], which
passes the bot filter and triggers the broken retarget attempt.

Add a head-ref != 'staging' guard so auto-promote PRs short-circuit
before the PATCH. The existing 422 "duplicate base" detector is
left alone — it covers a different operational case.
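
A minimal sketch of the guard's decision logic, assuming the workflow only needs the PR's head and base refs at this point. The function name `retarget_decision` is hypothetical and the bot-author filter mentioned above is left out; the real workflow's step is not shown in this message.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical predicate: prints "retarget" when a PR opened against main
# should be moved to base=staging, and "skip" otherwise.
retarget_decision() {
  head_ref="$1"
  base_ref="$2"
  if [ "$base_ref" != "main" ]; then
    echo "skip"        # only PRs opened against main are candidates
  elif [ "$head_ref" = "staging" ]; then
    echo "skip"        # auto-promote PR: head would equal the new base (HTTP 422)
  else
    echo "retarget"
  fi
}
```

For example, `retarget_decision staging main` prints `skip`, so the PATCH is never attempted for an auto-promote PR.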
2026-05-03 07:34:24 -07:00
Hongming Wang c45aa8d7ee Merge pull request #2587 from Molecule-AI/auto-sync/main-b4e45374
chore: sync main → staging (auto, ff to b4e45374)
2026-05-03 14:19:28 +00:00
Hongming Wang b4e45374bf Merge pull request #2586 from Molecule-AI/fix/auto-promote-app-token
fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357)
2026-05-03 07:15:31 -07:00
Hongming Wang f2d69f0088 Merge pull request #2585 from Molecule-AI/fix/canvas-loading-state-aria
fix(canvas): add role=status + aria-live to remaining loading states
2026-05-03 14:14:33 +00:00
Hongming Wang bc11ed8a2b fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357)
GITHUB_TOKEN-initiated merges suppress the downstream `push` event on
main per GitHub's documented limitation:
  https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

Result before this fix: every staging→main promote landed silently —
publish-workspace-server-image, canary-verify, and redeploy-tenants-on-main
all stayed dark. The polling tail was the SOLE cascade trigger; if it
ever 30-min-timed-out the chain dead-locked invisibly.

Symptom (from the issue body, 2026-04-30):

| Time     | Event                              | Triggered? |
|----------|------------------------------------|------------|
| 05:48:04 | Promote PR #2352 merged (c140ad28) | Not fired  |
| 06:07:29 | Promote PR #2356 merged (5973c9bd) | Not fired  |

Fix: mint the molecule-ai App token BEFORE the promote-PR step and
hand it to the auto-merge call. App-token-initiated merges DO trigger
downstream workflow_run cascades.

The polling tail stays as defense-in-depth (with comments updated):
once we've observed >=10 successful natural cascades it can be dropped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 07:13:26 -07:00
Hongming Wang e2328abedc fix(canvas): add role=status + aria-live to remaining loading states
Three loading-state divs were missing the role/aria pattern that
TemplatePalette.tsx and EmptyState.tsx already follow. Screen readers
get no signal that the page is waiting:

- canvas/src/app/page.tsx — full-screen "Loading canvas..." while
  the websocket hydrates. First paint of the entire app.
- canvas/src/components/settings/TokensTab.tsx — "Loading tokens..."
- canvas/src/components/settings/OrgTokensTab.tsx — "Loading keys..."

Add role="status" + aria-live="polite" to the wrapping div so
assistive tech announces the wait and the eventual transition.
Visual rendering unchanged.
2026-05-03 07:11:48 -07:00
github-actions[bot] bdad75ae3e Merge pull request #2582 from Molecule-AI/staging
staging → main: auto-promote 90ba2cd
2026-05-03 07:06:58 -07:00
Hongming Wang 90ba2cd4df Merge pull request #2580 from Molecule-AI/auto-sync/main-b002247f
chore: sync main → staging (auto, ff to b002247f)
2026-05-03 13:54:03 +00:00
Hongming Wang b002247f12 Merge pull request #2576 from Molecule-AI/staging
staging → main: auto-promote effbcd7
2026-05-03 06:44:31 -07:00
Hongming Wang 03bcce3eb3 Merge pull request #2574 from Molecule-AI/auto-sync/main-55d85147
chore: sync main → staging (auto, ff to 55d85147)
2026-05-03 13:18:34 +00:00
Hongming Wang c74e71d604 Merge branch 'staging' into auto-sync/main-55d85147
2026-05-03 06:07:20 -07:00
Hongming Wang d7f88674d8 Merge pull request #2577 from Molecule-AI/fix/canvas-tier-legend-t3-t4-contract
fix(canvas): align tier text contracts with 4-tier reality (T1/T2/T3/T4)
2026-05-03 12:58:52 +00:00
Hongming Wang 7abb94dab8 fix(canvas): align tier text contracts with 4-tier reality (T1/T2/T3/T4)
The tier system in CreateWorkspaceDialog and design-tokens has been
T1 Sandboxed / T2 Standard / T3 Privileged / T4 Full Access, but two
chrome surfaces still showed the older 3-tier mapping with T3 as
"Full Access":

- Legend (bottom-left chrome on every canvas page) listed only T1/T2/T3
  and called T3 "Full Access". On a SaaS tenant the actual workspace
  badges render T4 (in amber/warm) — there was no T4 entry in the
  legend at all, so the user sees an undocumented orange badge.

- ConfigTab tier dropdown (per-workspace settings → Sandboxing) had no
  T4 option at all and called T3 "Full Access". So an existing T4
  workspace would show "T3 — Full Access" as the selected option,
  silently downgrading the displayed tier on the settings panel.

- tenant.ts isSaaSTenant() doc comment claimed SaaS workspaces are
  "inherently T3 Full Access" — wrong on both the number and the lock
  rationale (SaaS hides T1/T2/T3, not just T1/T2).

Fix:
- Legend now imports TIER_CONFIG and renders all four tiers
  (Sandboxed/Standard/Privileged/Full Access) using the same color
  swatches as the badges on workspace cards. Eliminates the previous
  drift where Legend's hardcoded sky/violet/warm chips didn't match
  the gray/sky/violet/amber actually rendered on nodes.
- ConfigTab adds the missing T4 — Full Access option and renames T3
  to Privileged.
- tenant.ts comment updated to match the picker's actual hide list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:56:18 -07:00
Hongming Wang effbcd737b Merge pull request #2575 from Molecule-AI/fix/cascade-include-all-active-templates
fix(publish-runtime): re-add 5 templates wrongly removed from cascade — fixes #2566
2026-05-03 12:45:48 +00:00
Hongming Wang 6eb79adfd5 manifest: re-add 5 workspace templates pruned by #2536
The cascade-list-vs-manifest drift gate (PR #2556's behavior-based
test) caught my previous-commit cascade additions as 'extra-in-cascade'.
Manifest is the source of truth — restoring there.

All 5 templates have successful publish-image runs in the past 24h
(verified before the cascade fix), and continuous-synth-e2e defaults
to langgraph as its primary canary. None deprecated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:43:07 -07:00
Hongming Wang 8f48a38550 fix(publish-runtime): re-add 5 templates wrongly removed from cascade (#2566)
The PR #2536 cascade prune ('deprecated, no shipping images') was
empirically wrong. Re-confirmed 2026-05-03:

- continuous-synth-e2e.yml defaults to langgraph as its primary canary
- All 5 'deprecated' templates have successful publish-image runs in
  the past 24h: langgraph, crewai, autogen, deepagents, gemini-cli

Symptom this fixes — issue #2566 (priority-high, failing 36+h):

  Synthetic E2E (staging): langgraph adapter A2A failure
  'Received Message object in task mode' — failing for >36h

Today at 11:06 commit e1628c4 fixed the underlying a2a-sdk strict-mode
issue in workspace/a2a_executor.py. publish-runtime fired at 11:13 and
cascaded — but only to claude-code, hermes, openclaw, codex. langgraph
was excluded by the prune, so its image stayed on the broken runtime
and the synth E2E (which defaults to langgraph) kept failing despite
the fix being live in PyPI.

After this lands + the next runtime publish fires, langgraph image
re-bakes with the fix and synth-E2E goes green.

Test plan:

- [x] yaml-validate the workflow
- [ ] After merge, watch publish-runtime cascade to all 9 templates
- [ ] Confirm langgraph publish-image fires + succeeds
- [ ] Confirm next continuous-synth-e2e run goes green

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:41:53 -07:00
github-actions[bot] 55d85147f7 Merge pull request #2573 from Molecule-AI/staging
staging → main: auto-promote dc6425f
2026-05-03 05:34:23 -07:00
github-actions[bot] f7e8f98cf7 Merge pull request #2570 from Molecule-AI/staging
staging → main: auto-promote 173e22e
2026-05-03 12:22:52 +00:00
Hongming Wang dc6425fe39 Merge pull request #2571 from Molecule-AI/fix/synth-e2e-model-slug-by-runtime
fix(synth-e2e): branch MODEL_SLUG by runtime so langgraph gets colon-form
2026-05-03 12:22:19 +00:00
Hongming Wang cbc69f5e7e fix(synth-e2e): branch MODEL_SLUG by runtime so langgraph gets colon-form
The original script hardcoded `MODEL_SLUG="openai/gpt-4o"` (slash) and
claimed "non-hermes runtimes ignore the prefix" — wrong for langgraph,
which delegates model resolution to langchain's `init_chat_model`. That
function requires `<provider>:<model>` (colon) and treats slash-form as
OpenRouter routing, falling through without auth even when
OPENAI_API_KEY is set.

Surfaced 2026-05-03 after the a2a-sdk v1 contract bugs (PR
#2558+#2563+#2567) cleared the masking layers — synth-E2E firing
2026-05-03T12:14 returned a properly-shaped task with state=failed +
"Could not resolve authentication method" inside the agent body.

continuous-synth-e2e.yml defaults E2E_RUNTIME=langgraph for the cron,
so every firing hit this. Hermes still gets the slash-form it
needs; claude-code uses the entry-id pattern.

Adds E2E_MODEL_SLUG override for operator-dispatched runs that want
to pin a specific slug.
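
The branching can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the function name `resolve_model_slug` is hypothetical, and the colon-form value `openai:gpt-4o` is inferred from the slash-form default quoted above. The claude-code entry-id value is not given in this message, so the sketch leaves that runtime to the `E2E_MODEL_SLUG` override.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical resolver: an explicit E2E_MODEL_SLUG wins; otherwise the slug
# shape depends on the runtime. Model names are illustrative.
resolve_model_slug() {
  runtime="$1"
  if [ -n "${E2E_MODEL_SLUG:-}" ]; then
    echo "$E2E_MODEL_SLUG"
    return 0
  fi
  case "$runtime" in
    langgraph) echo "openai:gpt-4o" ;;   # <provider>:<model>, colon-form
    hermes)    echo "openai/gpt-4o" ;;   # slash-form
    *)
      echo "no default slug for runtime '$runtime'; set E2E_MODEL_SLUG" >&2
      return 1
      ;;
  esac
}
```

With no override set, `resolve_model_slug langgraph` prints `openai:gpt-4o` and `resolve_model_slug hermes` prints `openai/gpt-4o`.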

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:17:55 -07:00
Hongming Wang c71f641b12 Merge pull request #2569 from Molecule-AI/fix/redeploy-canary-default
ci(redeploy): fix stale canary_slug default 'hongmingwang' → 'hongming'
2026-05-03 12:08:26 +00:00
Hongming Wang 173e22e091 Merge pull request #2568 from Molecule-AI/auto-sync/main-c0838d63
chore: sync main → staging (auto, ff to c0838d63)
2026-05-03 12:07:29 +00:00
Hongming Wang 60a516bc8d ci(redeploy): fix stale canary_slug default 'hongmingwang' → 'hongming'
The workflow_dispatch input default and the workflow_run env fallback
both pointed at 'hongmingwang', which doesn't match any current prod
tenant (slugs are: hongming, chloe-dong, reno-stars). CP silently
skipped the missing canary and put every tenant in batch-1 in parallel,
defeating the canary-first soak gate that exists to catch image-boot
regressions before they hit the whole fleet.

Concrete example from today's c0838d6 redeploy at 11:53Z (run 25278434388):
the dispatched body was `{"target_tag":"staging-c0838d6","canary_slug":"hongmingwang",...}`
and the CP response showed all 3 tenants in `"phase":"batch-1"` — no
soak, no canary. The deploy happened to be safe, but a broken image
would have hit hongming + chloe-dong + reno-stars simultaneously.

Fixed in three places: the runtime ordering comment, the
workflow_dispatch default, and the env fallback used by the
workflow_run trigger. Comment documents the rationale so the next
slug rename doesn't silently regress this again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 05:06:01 -07:00
Hongming Wang c0838d637e Merge pull request #2562 from Molecule-AI/staging
staging → main: auto-promote bb63e60
2026-05-03 04:49:36 -07:00
Hongming Wang 493ab2566e Merge pull request #2567 from Molecule-AI/fix/synth-e2e-openai-key
ci(synth-e2e): wire MOLECULE_STAGING_OPENAI_KEY into provisioned tenant
2026-05-03 11:45:17 +00:00
Hongming Wang 5e46ea70d6 ci(synth-e2e): wire MOLECULE_STAGING_OPENAI_KEY into provisioned tenant
The synth-E2E (#2342) provisions a langgraph tenant whose default
model `openai:gpt-4.1-mini` requires OPENAI_API_KEY for the first LLM
call. Sibling workflows already wire this:
- e2e-staging-saas.yml:89
- canary-staging.yml:63

continuous-synth-e2e.yml just forgot. Result: tenant boots, accepts
a2a messages, then returns:

  Agent error: "Could not resolve authentication method. Expected
  either api_key or auth_token to be set."

This was masked since 2026-04-29 (workflow creation) by a2a-sdk v0→v1
contract violations — PR #2558 (Task-enqueue) and #2563
(TaskUpdater.complete/failed terminal events) cleared those, exposing
the underlying auth gap on the synth-E2E firing at 11:39 UTC today.

The script tests/e2e/test_staging_full_saas.sh:325 already reads
E2E_OPENAI_API_KEY and persists it as a workspace_secret on tenant
create — only the workflow wiring was missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:43:07 -07:00
Hongming Wang 5cf3dc4369 Merge pull request #2565 from Molecule-AI/fix/redeploy-soft-warn-rt-e2e-prefix
ci(deploy): broaden ephemeral-prefix matchers to cover rt-e2e-*
2026-05-03 11:30:52 +00:00
Hongming Wang 596e797dca ci(deploy): broaden ephemeral-prefix matchers to cover rt-e2e-*
The redeploy-tenants-on-staging soft-warn filter and the
sweep-stale-e2e-orgs janitor both hardcoded `^e2e-` to identify
ephemeral test tenants. Runtime-test harness fixtures (RFC #2251)
mint slugs prefixed with `rt-e2e-`, which neither matcher recognized.

Concrete impact observed today:
  - Two `rt-e2e-v{5,6}-*` tenants left orphaned 8h on staging
    (sweep-stale-e2e-orgs ignored them).
  - On the next staging redeploy their phantom EC2s returned
    `InvalidInstanceId: Instances not in a valid state for account`
    from SSM SendCommand → CP returned HTTP 500 + ok=false.
  - The redeploy soft-warn missed them too, so the workflow went
    red, which broke the auto-promote-staging chain feeding the
    canvas warm-paper rollout to prod.

Fix: switch both matchers to recognize the alternation
`^(e2e-|rt-e2e-)`. Long-lived prefixes (demo-prep, dryrun-*, dryrun2-*)
remain non-ephemeral and continue to hard-fail. Comment documents
the source-of-truth list and the cross-file invariant.
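
A minimal sketch of the shared matcher, assuming both workflows can call one helper. The names `EPHEMERAL_SLUG_RE` and `is_ephemeral_slug` and the example slugs are hypothetical; only the pattern `^(e2e-|rt-e2e-)` comes from this commit.

```shell
#!/usr/bin/env bash
set -eu

# Single definition of the ephemeral-tenant prefix pattern.
EPHEMERAL_SLUG_RE='^(e2e-|rt-e2e-)'

# Hypothetical helper: exits 0 when the slug names an ephemeral test tenant,
# non-zero otherwise.
is_ephemeral_slug() {
  printf '%s\n' "$1" | grep -Eq "$EPHEMERAL_SLUG_RE"
}
```

Because the pattern is anchored at the start, `rt-e2e-v5-abc` matches while long-lived slugs such as `demo-prep` or `dryrun-1` do not.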

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:28:29 -07:00
Hongming Wang 3ce638d6e6 Merge pull request #2564 from Molecule-AI/fix/canvas-react-flow-color-mode
fix(canvas): wire ReactFlow colorMode to resolvedTheme
2026-05-03 11:14:13 +00:00
Hongming Wang df7edfcd3f fix(canvas): wire ReactFlow colorMode to resolvedTheme
PR #2555 (Tailwind v4 + warm-paper) migrated all canvas chrome (toolbar,
side panel, modal layer) to semantic tokens, but missed the React Flow
viewport's `colorMode="dark"` literal — and two paired hardcoded dark
literals on the Background dot color and MiniMap mask. Net result on
prod: the user picked light mode, the toolbar flipped warm-paper, but
the canvas backplate, edges, dots, controls, and minimap stayed black —
visibly half-themed.

Three coordinated fixes inside the canvas viewport:
- ReactFlow `colorMode={resolvedTheme}` so the library's own dark/light
  styles flip with the user's choice.
- Background dot color picks the line-soft tone in light mode (zinc-800
  was invisible-on-cream).
- MiniMap maskColor warm-tints the off-viewport dim so the unselected
  region doesn't render as a hard black bar over warm-paper.

Verification:
- `npx tsc --noEmit` clean
- `npx vitest run` 188/188 pass
- (will browser-verify post-redeploy on hongming.moleculesai.app)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:11:35 -07:00
Hongming Wang 3ecb25eb4f Merge pull request #2563 from Molecule-AI/fix/a2a-v1-terminal-event
fix(a2a): route terminal Message via TaskUpdater.complete/failed in task mode
2026-05-03 11:09:09 +00:00
Hongming Wang e1628c4d56 fix(a2a): route terminal Message via TaskUpdater.complete/failed in task mode
PR #2558 enqueued a Task at the start of new requests so the v1 SDK
would accept TaskUpdater.start_work() — fix #1 of the v0→v1 migration
gap (PR #2170). But after Task is enqueued, the executor enters
"task mode" and the SDK rejects raw Message enqueues at the terminal
step:

  {"code":-32603,"message":"Received Message object in task mode.
  Use TaskStatusUpdateEvent or TaskArtifactUpdateEvent instead."}

Synth-E2E 2026-05-03T11:00:34Z surfaced this on the very first run
after the prior fix cascaded. Validation site is the same
a2a/server/agent_execution/active_task.py — the framework's job is
to enforce the v1 invariant; we're catching up to it.

The fix routes both terminal events through TaskUpdater helpers:
- success: updater.complete(message=msg) wraps in
  TaskStatusUpdateEvent(state=COMPLETED, final=True)
- error: updater.failed(message=...) wraps in
  TaskStatusUpdateEvent(state=FAILED, final=True)

Both helpers exist in a2a-sdk ≥ 1.0; verified via
TaskUpdater.complete signature.

Tests:
- conftest TaskUpdater stub now records complete/failed calls AND
  routes the message back through event_queue.enqueue_event so the
  ~20 legacy tests asserting on enqueue_event keep working
- 2 new regression tests pin the contract:
  * test_terminal_success_routes_via_updater_complete
  * test_terminal_error_routes_via_updater_failed
- Both NEW tests verified to FAIL on staging-baseline (without this
  fix) and PASS with it — they'd catch the regression before staging
  if the wheel-smoke gate covered task-mode terminal events too
  (separate yak-shave for #131 follow-up)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 04:06:45 -07:00
Hongming Wang 78721f7a42 Merge pull request #2561 from Molecule-AI/fix/cascade-list-drift-gate
feat(ci): structural drift gate for cascade list vs manifest (RFC #388 PR-3)
2026-05-03 10:55:08 +00:00
Hongming Wang 09010212a0 feat(ci): structural drift gate for cascade list vs manifest (RFC #388 PR-3)
Closes the recurrence path of PR #2556. The data fix realigned 8→4
templates in publish-runtime.yml's TEMPLATES variable, but the
underlying drift hazard was unguarded — the next manifest change
could silently leave cascade out of sync again.

This gate fails any PR that changes manifest.json or
publish-runtime.yml in a way that makes the cascade list diverge
from manifest workspace_templates (suffix-stripped). Either
direction is caught:

  missing-from-cascade  templates that won't auto-rebuild on a new
                       wheel publish (the codex-stuck-on-stale-runtime
                       bug class — PR #2512 added codex to manifest,
                       cascade wasn't updated, codex stayed pinned to
                       its last-built runtime version for weeks).

  extra-in-cascade     cascade dispatches to deprecated templates
                       (the wasted-API-calls + dead-CI-noise class —
                       PR #2536 pruned 5 templates from manifest;
                       cascade kept dispatching to all 8 until
                       PR #2556).

Triggers narrowly: only on PRs that touch manifest.json,
publish-runtime.yml, or the script itself. Fast (single grep+sed+comm
pipeline, no Go build).

Surfaced during the RFC #388 prior-art audit; folded in as the
structural follow-up to the data fix #2556 promised.

Self-tested both failure modes locally before commit:
  - Drop codex from cascade → script fails with "MISSING: codex"
  - Add langgraph to cascade → script fails with "EXTRA: langgraph"
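
The two-direction comparison can be sketched with `sort` and `comm`. This is an illustration only: the function name `check_cascade_drift` and its list-as-argument interface are hypothetical, and the real script's grep+sed extraction from manifest.json and publish-runtime.yml is not reproduced here.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical checker: takes two whitespace-separated template lists
# (manifest first, cascade second) and prints one MISSING:/EXTRA: line per
# mismatch. Returns non-zero on any drift.
check_cascade_drift() {
  manifest_file="$(mktemp)"
  cascade_file="$(mktemp)"
  # comm requires sorted input; sort -u also drops duplicates.
  # $1 and $2 are intentionally unquoted so they split into one name per line.
  printf '%s\n' $1 | sort -u > "$manifest_file"
  printf '%s\n' $2 | sort -u > "$cascade_file"
  # comm -23: lines only in the first file; comm -13: only in the second.
  missing="$(comm -23 "$manifest_file" "$cascade_file")"
  extra="$(comm -13 "$manifest_file" "$cascade_file")"
  rm -f "$manifest_file" "$cascade_file"
  for t in $missing; do echo "MISSING: $t"; done
  for t in $extra; do echo "EXTRA: $t"; done
  [ -z "$missing$extra" ]
}
```

Passing a cascade list without `codex` reproduces the `MISSING: codex` failure, and adding `langgraph` to the cascade list reproduces `EXTRA: langgraph`.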

Refs: https://github.com/Molecule-AI/molecule-controlplane/issues/388

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:52:39 -07:00
Hongming Wang bb63e60114 Merge pull request #2560 from Molecule-AI/fix/preflight-smoke-mode-bypass
fix(preflight): skip required_env check in MOLECULE_SMOKE_MODE
2026-05-03 10:46:20 +00:00
Hongming Wang 06240ab67b fix(preflight): skip required_env check in MOLECULE_SMOKE_MODE
Boot smoke (#2275) exercises executor.execute() against stub deps
and never hits the real provider, so missing auth env is not a real
blocker. Without this bypass, every adapter that introduces a new
auth env var must be mirrored into molecule-ci's fake-env list — a
maintenance treadmill that just bit hermes-template:

- 2026-05-03 09:47 UTC: hermes publish-image smoke fails on
  HERMES_API_KEY preflight (workflow injects CLAUDE_CODE_OAUTH_TOKEN,
  ANTHROPIC_API_KEY, GEMINI_API_KEY, OPENAI_API_KEY but not
  HERMES_API_KEY or OPENROUTER_API_KEY). Failed for two cycles
  before being noticed.

The bypass demotes Required-env failures to warnings when
MOLECULE_SMOKE_MODE is truthy, so the unset env stays visible in
the boot log without blocking. Production paths are unchanged
(env unset → fail).
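
The demotion rule can be sketched as below. This is a shell illustration of the behavior, not the preflight's actual implementation. The function names are hypothetical, and the set of values treated as truthy is an assumption, since this message does not define it.

```shell
#!/usr/bin/env bash
set -eu

# Assumed truthiness rule for MOLECULE_SMOKE_MODE (not defined in the source).
smoke_mode_enabled() {
  case "${MOLECULE_SMOKE_MODE:-}" in
    1|true|TRUE|yes) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical preflight: takes required env var names as arguments.
# Returns 0 when all are set, or when smoke mode demotes failures to warnings.
check_required_env() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (names are trusted input).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      if smoke_mode_enabled; then
        echo "WARNING: $name is unset (ignored in smoke mode)" >&2
      else
        echo "ERROR: $name is unset" >&2
        rc=1
      fi
    fi
  done
  return "$rc"
}
```

With `MOLECULE_SMOKE_MODE=1`, `check_required_env HERMES_API_KEY` logs a warning and succeeds; with the variable unset, the same call fails.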

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 03:44:05 -07:00
24 changed files with 640 additions and 117 deletions
+94 -44
@@ -111,7 +111,60 @@ jobs:
all_green: ${{ steps.gates.outputs.all_green }}
head_sha: ${{ steps.gates.outputs.head_sha }}
steps:
# Skip empty-tree promotes (the perpetual auto-promote↔auto-sync cycle
# observed 2026-05-03). Sequence: auto-promote merges via the staging
# merge-queue's MERGE strategy, creating a merge commit on main that
# staging doesn't have. auto-sync then merges main back into staging
# via another merge commit (the queue's MERGE strategy applies on
# the staging side too, even when the workflow's local FF would
# have sufficed). Now staging has a new merge-commit SHA whose
# tree == main's tree — but auto-promote sees "staging ahead of
# main by 1" and opens YET another empty promote PR. Each round
# costs ~30-40 min wallclock, ~2 manual approvals, and burns a
# full CodeQL Go run (~15 min). Without this guard the cycle
# repeats indefinitely.
#
# Long-term fix is to switch the merge_queue ruleset's
# `merge_method` away from MERGE so FF-able PRs land cleanly,
# but that's a broader change affecting every staging PR's
# commit shape. This guard is the one-line surgical fix that
# breaks the cycle without touching merge-queue config.
#
# Fail-open: if `git diff` errors for any reason, fall through
# to the gate check (preserve existing behavior). Only skip
# when the diff is DEFINITIVELY empty.
- name: Checkout for tree-diff check
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
ref: staging
- name: Skip if staging tree == main tree (perpetual-cycle break)
id: tree-diff
env:
HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
run: |
set -eu
git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
# Compare staging tip's tree against main's tree. `git diff
# --quiet` exits 0 if no differences, 1 if there are.
if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
{
echo "## ⏭ Skipped — no code to promote"
echo
echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
echo "This is the auto-promote↔auto-sync merge-commit cycle: staging has a"
echo "new SHA (a sync-back merge commit) but the underlying file tree is"
echo "already on main, so there's no real code to ship."
echo
echo "Skipping to avoid opening an empty promote PR. Cycle terminates here."
} >> "$GITHUB_STEP_SUMMARY"
echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
echo "skip=true" >> "$GITHUB_OUTPUT"
else
echo "skip=false" >> "$GITHUB_OUTPUT"
fi
- name: Check all required gates on this SHA
if: steps.tree-diff.outputs.skip != 'true'
id: gates
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -209,10 +262,25 @@ jobs:
exit 0
fi
# Mint the App token BEFORE the promote-PR step so the auto-merge
# call can use it. GITHUB_TOKEN-initiated merges suppress the
# downstream `push` event on main, breaking the
# publish-workspace-server-image → canary-verify → redeploy-tenants
# chain (issue #2357). Using the App token here means the
# merge-queue-landed merge IS able to fire the cascade naturally;
# the polling tail below stays as defense-in-depth.
- name: Mint App token for promote-PR + downstream dispatch
if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
id: app-token
uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
with:
app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
- name: Open (or reuse) staging → main promote PR + enable auto-merge
if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
env:
-GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+GH_TOKEN: ${{ steps.app-token.outputs.token }}
REPO: ${{ github.repository }}
TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
run: |
@@ -267,52 +335,34 @@ jobs:
echo "promote_pr_num=${PR_NUM}" >> "$GITHUB_OUTPUT"
id: promote_pr
# Mint a short-lived GitHub App installation token for the dispatch
# step below. We CANNOT use `secrets.GITHUB_TOKEN` to dispatch the
# downstream publish chain — workflow runs created by GITHUB_TOKEN
# do not fire `workflow_run` triggers on completion (the
# documented "no recursion" rule —
# https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
#
# Symptom this caused (root-caused on 2026-04-30): publish-image
# ran successfully twice (21313dc 14:41Z, 59dec57 15:21Z) but
# canary-verify and redeploy-tenants-on-main never chained,
# because the publish run's `triggering_actor` was
# `github-actions[bot]` (i.e. GITHUB_TOKEN). A manual dispatch
# earlier in the day with the operator's PAT (d850ec7 06:52Z) did
# chain — same workflow file, only the actor differed.
#
# An App token's triggering_actor is the App user (e.g.
# `molecule-ai[bot]`), which IS allowed to fire downstream
# workflow_run cascades.
- name: Mint App token for downstream dispatch
if: steps.promote_pr.outputs.promote_pr_num != ''
id: app-token
uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
with:
app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
# The App token minted above (before the promote-PR step) is
# also used by the polling tail below. Defense-in-depth: with
# the merge-queue-landed merge now using the App token, the
# main-branch push event SHOULD fire the publish/canary/redeploy
# cascade naturally — but if for any reason it doesn't (e.g. an
# unrelated event-suppression edge case), the explicit dispatches
# below still wake the chain.
- name: Wait for promote merge, then dispatch publish + redeploy (#2357)
# GITHUB_TOKEN-initiated merges suppress downstream `push` events
# (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
# Result: when the merge queue lands the promote PR, the resulting
# main-branch push DOES NOT fire publish-workspace-server-image,
# so canary-verify and redeploy-tenants-on-main never run and
# tenants stay on stale code (issue #2357).
# Defense-in-depth dispatch. With the auto-merge call above
# now using the App token (this commit), the merge-queue-landed
# merge SHOULD fire publish-workspace-server-image naturally
# via on:push:[main] — App-token-initiated pushes DO trigger
# workflow_run cascades, unlike GITHUB_TOKEN-initiated ones
# (the documented "no recursion" rule —
# https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
#
# Workaround: poll for the merge to land, then explicitly
# `gh workflow run` publish-workspace-server-image. The dispatch
# MUST authenticate as the molecule-ai App (App token minted
# above) — not GITHUB_TOKEN — so that the resulting publish
# run's completion event can fire the workflow_run cascade
# into canary-verify + redeploy-tenants-on-main. See the prior
# step's comment for the GITHUB_TOKEN no-recursion details.
# This explicit dispatch stays as belt-and-suspenders for any
# edge case where the natural cascade misfires. If it never
# observably fires after this token swap (i.e. the publish
# workflow has already started by the time we get here), the
# second dispatch is a harmless no-op (publish-workspace-server-image
# has its own concurrency group that dedupes).
#
# Long-term fix: switch the auto-merge call above to use the
# same App token, so the merge's push event fires
# publish-workspace-server-image naturally and this polling tail
# becomes unnecessary. Tracked in #2357.
# See PR for #2357: pre-fix the merge action was via
# GITHUB_TOKEN, suppressing the cascade and forcing this tail
# to be the SOLE chain trigger. With the auto-merge token swap
# the tail becomes redundant in the happy path; keep until
# we've observed >=10 successful natural cascades, then drop.
if: steps.promote_pr.outputs.promote_pr_num != ''
env:
GH_TOKEN: ${{ steps.app-token.outputs.token }}
@@ -0,0 +1,39 @@
name: cascade-list-drift-gate
# Structural gate: TEMPLATES list in publish-runtime.yml must match
# manifest.json's workspace_templates exactly. Closes the recurrence
# path of PR #2556 (the data fix) and is the first concrete deliverable
# of RFC #388 PR-3.
#
# Why a gate, not just discipline: PR #2536 pruned the manifest, but the
# cascade list went un-updated for weeks until PR #2556 caught it
# during an unrelated audit. During that window, codex never
# rebuilt on a runtime publish. A structural gate catches the drift
# the same day either file changes.
#
# Triggers narrowly to keep CI quiet: only on PRs that actually change
# one of the two files. The path-filtered split + always-emit-result
# pattern (memory: "Required check names need a job that always runs")
# is unnecessary here because the workflow IS the check name and PR
# branch protection should require it directly. Future-proof: if this
# becomes a required check, add a no-op aggregator with always() so the
# name still emits when paths don't match.
on:
pull_request:
branches: [staging, main]
paths:
- manifest.json
- .github/workflows/publish-runtime.yml
- scripts/check-cascade-list-vs-manifest.sh
permissions:
contents: read
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
- name: Check cascade list matches manifest
run: bash scripts/check-cascade-list-vs-manifest.sh
@@ -88,6 +88,15 @@ jobs:
E2E_KEEP_ORG: ${{ github.event.inputs.keep_org == 'true' && '1' || '' }}
MOLECULE_CP_URL: ${{ vars.STAGING_CP_URL || 'https://staging-api.moleculesai.app' }}
MOLECULE_ADMIN_TOKEN: ${{ secrets.CP_STAGING_ADMIN_API_TOKEN }}
# Provisioned tenant's default model (langgraph: openai:gpt-4.1-mini)
# needs OPENAI_API_KEY at first call. Sibling workflows
# e2e-staging-saas.yml + canary-staging.yml use the same secret;
# without this wire-up the tenant boots, accepts a2a messages,
# then returns "Could not resolve authentication method" — masked
# earlier by the a2a-sdk task-mode contract bugs PR #2558+#2563
# fixed. tests/e2e/test_staging_full_saas.sh:325 reads this and
# persists it as a workspace_secret on tenant create.
E2E_OPENAI_API_KEY: ${{ secrets.MOLECULE_STAGING_OPENAI_KEY }}
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+13 -7
@@ -327,13 +327,19 @@ jobs:
echo "::error::publish job did not expose a version output — cascade cannot fan out"
exit 1
fi
# Source of truth: manifest.json workspace_templates (PR #2536 pruned
# to 4 actively-supported runtimes: claude-code, hermes, openclaw, codex).
# Removed langgraph/crewai/autogen/deepagents/gemini-cli (deprecated, no
# shipping images); added codex (had been missing since #2512).
# Long-term: derive this list from manifest.json so the cascade can't
# drift again — tracked in RFC #388 as a Phase-1 invariant.
TEMPLATES="claude-code hermes openclaw codex"
# All 9 active workspace template repos. The PR #2536 pruning
# ("deprecated, no shipping images") was empirically wrong:
# continuous-synth-e2e.yml defaults to langgraph as its primary
# canary (line 44), and every excluded template had successful
# publish-image runs as of 2026-05-03 — none were dormant.
# Symptom of the prune: today's a2a-sdk strict-mode fix
# (#2566 / commit e1628c4) cascaded to 4 templates but never
# reached langgraph, so the synth-E2E correctly canary'd a fix
# that had landed but not deployed. Re-added the 5 templates.
# Long-term: derive this list from manifest.json so cascade
# scope can't drift from E2E scope — tracked in RFC #388 as a
# Phase-1 invariant.
TEMPLATES="claude-code hermes openclaw codex langgraph crewai autogen deepagents gemini-cli"
FAILED=""
for tpl in $TEMPLATES; do
REPO="Molecule-AI/molecule-ai-workspace-template-$tpl"
@@ -17,7 +17,7 @@ name: redeploy-tenants-on-main
# 1. publish-workspace-server-image completes → new :latest in GHCR.
# 2. This workflow fires via workflow_run, waits 30s for GHCR's
# CDN to propagate the new tag to the region the tenants pull from.
# 3. Calls redeploy-fleet with canary_slug=hongmingwang and a 60s
# 3. Calls redeploy-fleet with canary_slug=hongming and a 60s
# soak. Canary proves the image boots; batches follow.
# 4. Any failure aborts the rollout and leaves older tenants on the
# prior image — safer default than half-and-half state.
@@ -56,7 +56,12 @@ on:
description: 'Tenant slug to deploy first + soak (empty = skip canary, fan out immediately).'
required: false
type: string
default: 'hongmingwang'
# Must be an actual prod tenant slug (current: hongming,
# chloe-dong, reno-stars). The previous default 'hongmingwang'
# didn't match any tenant — CP soft-skipped the missing canary
# and the fleet rolled out without the soak gate, defeating the
# whole point of canary-first.
default: 'hongming'
soak_seconds:
description: 'Seconds to wait after canary before fanning out.'
required: false
@@ -148,7 +153,7 @@ jobs:
CP_URL: ${{ vars.CP_URL || 'https://api.moleculesai.app' }}
CP_ADMIN_API_TOKEN: ${{ secrets.CP_ADMIN_API_TOKEN }}
TARGET_TAG: ${{ steps.tag.outputs.target_tag }}
CANARY_SLUG: ${{ inputs.canary_slug || 'hongmingwang' }}
CANARY_SLUG: ${{ inputs.canary_slug || 'hongming' }}
SOAK_SECONDS: ${{ inputs.soak_seconds || '60' }}
BATCH_SIZE: ${{ inputs.batch_size || '3' }}
DRY_RUN: ${{ inputs.dry_run || false }}
@@ -176,35 +176,41 @@ jobs:
#
# CP returns HTTP 500 + ok=false whenever ANY tenant in the
# fleet failed SSM or healthz. In practice the recurring source
# of these is ephemeral e2e-* tenants (saas/canvas/ext) being
# torn down by their parent E2E run mid-redeploy: the EC2 dies →
# SSM exit=2 or healthz timeout → CP marks the fleet failed →
# this workflow goes red even though every operator-facing
# tenant rolled fine.
# of these is ephemeral test tenants being torn down by their
# parent E2E run mid-redeploy: the EC2 dies → SSM exit=2 or
# healthz timeout → CP marks the fleet failed → this workflow
# goes red even though every operator-facing tenant rolled fine.
#
# Filter: if HTTP=500/ok=false AND every failed slug matches
# ^e2e-, treat as soft-warn and let the verify step downstream
# handle the unreachable-vs-stale distinction (it already knows
# the difference per #2402). Any non-e2e-* failure or a non-500
# HTTP response remains a hard failure.
# Ephemeral slug prefixes (kept in sync with sweep-stale-e2e-orgs.yml
# — see that file for the source-of-truth list and rationale):
# - e2e-* — canvas/saas/ext E2E suites
# - rt-e2e-* — runtime-test harness fixtures (RFC #2251)
# Long-lived prefixes that are NOT ephemeral and MUST hard-fail:
# demo-prep, dryrun-*, dryrun2-*, plus all human tenant slugs.
#
# Filter: if HTTP=500/ok=false AND every failed slug matches an
# ephemeral prefix, treat as soft-warn and let the verify step
# downstream handle unreachable-vs-stale (#2402). Any non-ephemeral
# failure or a non-500 HTTP response remains a hard failure.
OK=$(jq -r '.ok // "false"' "$HTTP_RESPONSE")
FAILED_SLUGS=$(jq -r '
.results[]?
| select((.healthz_ok != true) or (.ssm_status != "Success"))
| .slug' "$HTTP_RESPONSE" 2>/dev/null || true)
NON_E2E_FAILED=$(printf '%s\n' "$FAILED_SLUGS" | grep -v '^$' | grep -v '^e2e-' || true)
EPHEMERAL_PREFIX_RE='^(e2e-|rt-e2e-)'
NON_EPHEMERAL_FAILED=$(printf '%s\n' "$FAILED_SLUGS" | grep -v '^$' | grep -Ev "$EPHEMERAL_PREFIX_RE" || true)
if [ "$HTTP_CODE" = "200" ] && [ "$OK" = "true" ]; then
: # happy path — fall through to verification
elif [ "$HTTP_CODE" = "500" ] && [ -z "$NON_E2E_FAILED" ] && [ -n "$FAILED_SLUGS" ]; then
COUNT=$(printf '%s\n' "$FAILED_SLUGS" | grep -c '^e2e-' || true)
echo "::warning::redeploy-fleet returned HTTP 500 but every failed tenant ($COUNT) is e2e-* ephemeral — treating as teardown race, soft-warning."
elif [ "$HTTP_CODE" = "500" ] && [ -z "$NON_EPHEMERAL_FAILED" ] && [ -n "$FAILED_SLUGS" ]; then
COUNT=$(printf '%s\n' "$FAILED_SLUGS" | grep -Ec "$EPHEMERAL_PREFIX_RE" || true)
echo "::warning::redeploy-fleet returned HTTP 500 but every failed tenant ($COUNT) is ephemeral (e2e-*/rt-e2e-*) — treating as teardown race, soft-warning."
printf '%s\n' "$FAILED_SLUGS" | sed 's/^/::warning:: failed: /'
elif [ "$HTTP_CODE" != "200" ]; then
echo "::error::redeploy-fleet returned HTTP $HTTP_CODE"
if [ -n "$NON_E2E_FAILED" ]; then
echo "::error::non-e2e tenant(s) failed:"
printf '%s\n' "$NON_E2E_FAILED" | sed 's/^/::error:: /'
if [ -n "$NON_EPHEMERAL_FAILED" ]; then
echo "::error::non-ephemeral tenant(s) failed:"
printf '%s\n' "$NON_EPHEMERAL_FAILED" | sed 's/^/::error:: /'
fi
exit 1
else
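The soft-warn filter above can be restated outside the shell as a small pure function, which makes the branch order easier to test. This is a minimal sketch, not the workflow's implementation: `classify_redeploy` and its return labels are hypothetical names, and because the hunk is cut off at the final `else`, treating HTTP 200 with ok=false as a hard failure is an assumption.

```python
import re

# Mirrors EPHEMERAL_PREFIX_RE from the workflow; every other slug is long-lived.
EPHEMERAL_PREFIX_RE = re.compile(r"^(e2e-|rt-e2e-)")


def classify_redeploy(http_code, ok, results):
    """Return "ok", "soft-warn", or "hard-fail" for one redeploy-fleet response.

    Hypothetical helper: `results` is the decoded `.results` array, a list of
    dicts with `slug`, `healthz_ok`, and `ssm_status` keys.
    """
    failed = [
        r.get("slug", "")
        for r in results
        if r.get("healthz_ok") is not True or r.get("ssm_status") != "Success"
    ]
    failed = [s for s in failed if s]  # drop empty slugs, like `grep -v '^$'`
    non_ephemeral = [s for s in failed if not EPHEMERAL_PREFIX_RE.match(s)]

    if http_code == 200 and ok:
        return "ok"
    if http_code == 500 and failed and not non_ephemeral:
        return "soft-warn"  # teardown race: only ephemeral tenants failed
    # Assumption: everything else, including HTTP 200 with ok=false, hard-fails.
    return "hard-fail"
```

An HTTP 500 with an empty failure list still hard-fails, matching the `[ -n "$FAILED_SLUGS" ]` guard in the shell version.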
+15 -4
@@ -26,11 +26,22 @@ jobs:
runs-on: ubuntu-latest
# Only fire for bot-authored PRs. Human CEO PRs (staging→main promotion)
# are intentional and pass through.
#
# Head-ref guard: never retarget a PR whose head IS `staging` — those
# are the auto-promote staging→main PRs (opened by molecule-ai[bot]
# since #2586 switched to an App token, which now passes the bot
# filter below). Retargeting head=staging onto base=staging fails
# with HTTP 422 "no new commits between base 'staging' and head
# 'staging'", which used to surface as a noisy red workflow run on
# every auto-promote (caught 2026-05-03 on PR #2588).
if: >-
github.event.pull_request.user.type == 'Bot'
|| endsWith(github.event.pull_request.user.login, '[bot]')
|| github.event.pull_request.user.login == 'app/molecule-ai'
|| github.event.pull_request.user.login == 'molecule-ai[bot]'
github.event.pull_request.head.ref != 'staging'
&& (
github.event.pull_request.user.type == 'Bot'
|| endsWith(github.event.pull_request.user.login, '[bot]')
|| github.event.pull_request.user.login == 'app/molecule-ai'
|| github.event.pull_request.user.login == 'molecule-ai[bot]'
)
steps:
- name: Retarget PR base to staging
id: retarget
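The `if:` expression above reads as a two-part predicate: a head-ref guard, then the bot-author filter. Below is a rough sketch of the same logic; `should_retarget` is a hypothetical name. GitHub expression string comparisons ignore case, which this sketch does not reproduce.

```python
def should_retarget(head_ref, user_type, user_login):
    """Hypothetical restatement of the job-level `if:` condition."""
    # Head-ref guard: staging -> main auto-promote PRs must never be retargeted,
    # since head=staging onto base=staging has no commits to compare.
    if head_ref == "staging":
        return False
    # Bot filter: any one of these marks the PR as bot-authored.
    return (
        user_type == "Bot"
        or user_login.endswith("[bot]")
        or user_login in ("app/molecule-ai", "molecule-ai[bot]")
    )
```

Because the guard comes first, an auto-promote PR opened by `molecule-ai[bot]` is skipped even though it passes the bot filter.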
+11 -3
@@ -87,20 +87,28 @@ jobs:
> orgs.json
# Filter:
# 1. slug starts with 'e2e-' (covers e2e-, e2e-canary-,
# e2e-canvas-* — all variants the test scripts mint)
# 1. slug starts with one of the ephemeral test prefixes:
# - 'e2e-' — covers e2e-canary-, e2e-canvas-*, etc.
# - 'rt-e2e-' — runtime-test harness fixtures (RFC #2251);
# missing this prefix left two such tenants
# orphaned 8h on staging (2026-05-03), then
# hard-failed redeploy-tenants-on-staging
# and broke the staging→main auto-promote
# chain. Kept in sync with the EPHEMERAL_PREFIX_RE
# regex in redeploy-tenants-on-staging.yml.
# 2. created_at is older than MAX_AGE_MINUTES ago
# Output one slug per line to a file the next step reads.
python3 > stale_slugs.txt <<'PY'
import json, os
from datetime import datetime, timezone, timedelta
EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")
with open("orgs.json") as f:
data = json.load(f)
max_age = int(os.environ["MAX_AGE_MINUTES"])
cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_age)
for o in data.get("orgs", []):
slug = o.get("slug", "")
if not slug.startswith("e2e-"):
if not slug.startswith(EPHEMERAL_PREFIXES):
continue
created = o.get("created_at")
if not created:
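The sweep filter is easier to unit-test when the clock is passed in. This is a minimal sketch under stated assumptions: `stale_slugs` is a hypothetical function name, and because the hunk ends at `if not created:`, both skipping such orgs and parsing `created_at` as an RFC 3339 timestamp are assumptions.

```python
from datetime import datetime, timedelta, timezone

EPHEMERAL_PREFIXES = ("e2e-", "rt-e2e-")


def stale_slugs(orgs, max_age_minutes, now):
    """Return ephemeral slugs whose created_at is older than the cutoff."""
    cutoff = now - timedelta(minutes=max_age_minutes)
    stale = []
    for org in orgs:
        slug = org.get("slug", "")
        # str.startswith accepts a tuple, so one call covers every prefix.
        if not slug.startswith(EPHEMERAL_PREFIXES):
            continue
        created = org.get("created_at")
        if not created:
            continue  # assumption: orgs without a timestamp are skipped
        # Assumption: RFC 3339 timestamps, possibly with a trailing "Z".
        created_at = datetime.fromisoformat(created.replace("Z", "+00:00"))
        if created_at < cutoff:
            stale.append(slug)
    return stale
```

With `now=datetime.now(timezone.utc)` this would behave like the heredoc. Tests can pin `now` to a fixed instant.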
+1 -1
@@ -54,7 +54,7 @@ export default function Home() {
if (hydrating) {
return (
<div className="fixed inset-0 flex items-center justify-center bg-surface">
<div className="flex flex-col items-center gap-3">
<div role="status" aria-live="polite" className="flex flex-col items-center gap-3">
<Spinner size="lg" />
<span className="text-xs text-ink-soft">Loading canvas...</span>
</div>
+16 -3
@@ -13,6 +13,7 @@ import {
import "@xyflow/react/dist/style.css";
import { useCanvasStore } from "@/store/canvas";
import { useTheme } from "@/lib/theme-provider";
import { A2ATopologyOverlay } from "./A2ATopologyOverlay";
import { WorkspaceNode } from "./WorkspaceNode";
import { SidePanel } from "./SidePanel";
@@ -69,6 +70,14 @@ export function Canvas() {
}
function CanvasInner() {
// ReactFlow's `colorMode` prop drives the styling of every viewport
// primitive it renders directly (background dots, edge defaults,
// selection rings, controls, minimap mask). Pre-fix this was hard-pinned
// to "dark" — so on light theme the chrome (toolbar, side panel) flipped
// to warm-paper but the canvas backplate + edges stayed black, leaving a
// half-themed page. Pull resolvedTheme so the canvas matches the user's
// selected mode (and the system preference when they pick "system").
const { resolvedTheme } = useTheme();
const rawNodes = useCanvasStore((s) => s.nodes);
const edges = useCanvasStore((s) => s.edges);
const a2aEdges = useCanvasStore((s) => s.a2aEdges);
@@ -250,7 +259,7 @@ function CanvasInner() {
</a>
<main id="canvas-main" className="w-screen h-screen bg-surface">
<ReactFlow
colorMode="dark"
colorMode={resolvedTheme}
nodes={nodes}
edges={allEdges}
onNodesChange={onNodesChange}
@@ -273,7 +282,9 @@ function CanvasInner() {
variant={BackgroundVariant.Dots}
gap={24}
size={1}
color="#27272a"
// Match the line token so dots fade with the surface.
// Hard-coded zinc-800 was invisible on warm-paper.
color={resolvedTheme === "dark" ? "#27272a" : "#d4d0c4"}
/>
<Controls
className="!bg-surface-sunken/90 !border-line/50 !rounded-lg !shadow-xl !shadow-black/20 [&>button]:!bg-surface-card [&>button]:!border-line/50 [&>button]:!text-ink-mid [&>button:hover]:!bg-surface-card [&>button:hover]:!text-ink"
@@ -281,7 +292,9 @@ function CanvasInner() {
/>
<MiniMap
className="!bg-surface-sunken/90 !border-line/50 !rounded-lg !shadow-xl !shadow-black/20"
maskColor="rgba(0, 0, 0, 0.7)"
// Mask dims off-viewport areas; tint matches the surface so
// the dimming doesn't show as a black bar in light mode.
maskColor={resolvedTheme === "dark" ? "rgba(0, 0, 0, 0.7)" : "rgba(232, 226, 211, 0.7)"}
nodeColor={(node) => {
// Parents show as a filled region — hierarchy visible at
// a glance in the minimap without needing to zoom.
+16 -4
@@ -1,11 +1,23 @@
"use client";
import { useEffect, useState } from "react";
import { STATUS_CONFIG } from "@/lib/design-tokens";
import { STATUS_CONFIG, TIER_CONFIG } from "@/lib/design-tokens";
import { useCanvasStore } from "@/store/canvas";
const LEGEND_STATUSES = ["online", "provisioning", "degraded", "failed", "paused", "offline"] as const;
// Tier descriptions kept in sync with CreateWorkspaceDialog.tsx (the
// source of truth for what each tier means semantically). Colors come
// from TIER_CONFIG so the legend swatch matches the badge actually
// rendered on every WorkspaceNode — drift here misled users into
// thinking the legend documented a different tier than the one shown.
const LEGEND_TIERS: ReadonlyArray<{ tier: number; label: string }> = [
{ tier: 1, label: "Sandboxed" },
{ tier: 2, label: "Standard" },
{ tier: 3, label: "Privileged" },
{ tier: 4, label: "Full Access" },
];
// Persist the user's choice across sessions. Default is "open" so
// first-time users still see the symbol key; once dismissed we
// respect that until they explicitly reopen via the floating pill.
@@ -102,9 +114,9 @@ export function Legend() {
<div className="mb-2">
<div className="text-[11px] text-ink-soft font-medium mb-1">Tier</div>
<div className="flex flex-wrap gap-x-3 gap-y-1">
<TierItem tier={1} label="Sandboxed" color="text-sky-300 bg-sky-950/40 border-sky-700/30" />
<TierItem tier={2} label="Standard" color="text-violet-300 bg-violet-950/40 border-violet-700/30" />
<TierItem tier={3} label="Full Access" color="text-warm bg-amber-950/40 border-amber-700/30" />
{LEGEND_TIERS.map(({ tier, label }) => (
<TierItem key={tier} tier={tier} label={label} color={TIER_CONFIG[tier].border} />
))}
</div>
</div>
@@ -182,7 +182,7 @@ export function OrgTokensTab() {
{/* Token list */}
{loading ? (
<div className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
<div role="status" aria-live="polite" className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
<Spinner /> Loading keys...
</div>
) : tokens.length === 0 ? (
+1 -1
@@ -129,7 +129,7 @@ export function TokensTab({ workspaceId }: TokensTabProps) {
{/* Token list */}
{loading ? (
<div className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
<div role="status" aria-live="polite" className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
<Spinner /> Loading tokens...
</div>
) : tokens.length === 0 ? (
+2 -1
@@ -655,7 +655,8 @@ export function ConfigTab({ workspaceId }: Props) {
>
<option value={1}>T1 Sandboxed</option>
<option value={2}>T2 Standard</option>
<option value={3}>T3 Full Access</option>
<option value={3}>T3 Privileged</option>
<option value={4}>T4 Full Access</option>
</select>
</div>
</div>
+2 -2
@@ -59,8 +59,8 @@ export function getTenantSlug(): string {
* isSaaSTenant reports whether the canvas is running as the UI for a
* SaaS tenant (served at <slug>.moleculesai.app). Use for client-side
* UX branches that should behave differently on SaaS vs self-hosted —
* e.g. the workspace tier picker hides T1/T2 sandbox tiers because every
* SaaS workspace gets its own EC2 VM (inherently T3 Full Access).
* e.g. the workspace tier picker hides T1/T2/T3 sandbox tiers because
* every SaaS workspace gets its own EC2 VM (inherently T4 Full Access).
*
* SSR-safe: returns false on the server to avoid hydration drift; call
* sites should tolerate a flip from false→true on first client render.
+6 -1
@@ -28,7 +28,12 @@
{"name": "claude-code-default", "repo": "Molecule-AI/molecule-ai-workspace-template-claude-code", "ref": "main"},
{"name": "hermes", "repo": "Molecule-AI/molecule-ai-workspace-template-hermes", "ref": "main"},
{"name": "openclaw", "repo": "Molecule-AI/molecule-ai-workspace-template-openclaw", "ref": "main"},
{"name": "codex", "repo": "Molecule-AI/molecule-ai-workspace-template-codex", "ref": "main"}
{"name": "codex", "repo": "Molecule-AI/molecule-ai-workspace-template-codex", "ref": "main"},
{"name": "langgraph", "repo": "Molecule-AI/molecule-ai-workspace-template-langgraph", "ref": "main"},
{"name": "crewai", "repo": "Molecule-AI/molecule-ai-workspace-template-crewai", "ref": "main"},
{"name": "autogen", "repo": "Molecule-AI/molecule-ai-workspace-template-autogen", "ref": "main"},
{"name": "deepagents", "repo": "Molecule-AI/molecule-ai-workspace-template-deepagents", "ref": "main"},
{"name": "gemini-cli", "repo": "Molecule-AI/molecule-ai-workspace-template-gemini-cli", "ref": "main"}
],
"org_templates": [
{"name": "molecule-dev", "repo": "Molecule-AI/molecule-ai-org-template-molecule-dev", "ref": "main"},
+95
@@ -0,0 +1,95 @@
#!/usr/bin/env bash
# check-cascade-list-vs-manifest.sh — structural drift gate for the
# publish-runtime cascade list vs manifest.json workspace_templates.
#
# WHY: PR #2536 pruned the manifest to 4 supported runtimes; PR #2556
# realigned the cascade list to match. The underlying drift hazard
# (cascade-list ≠ manifest) was unguarded — the data fix didn't prevent
# recurrence. This script is the structural gate that does.
#
# Behavior-based per project pattern: derives the expected set from
# manifest.json and the actual set from the workflow YAML, fails on
# any divergence in either direction.
#
# missing-from-cascade → templates in manifest that publish-runtime.yml
# won't auto-rebuild on a new wheel publish
# (the codex-stuck-on-stale-runtime bug class)
# extra-in-cascade → cascade dispatches to deprecated templates
# (the wasted-API-calls + dead-CI-noise class)
#
# Suffix mapping: manifest names map to GHCR repos via
# {name without -default suffix} → molecule-ai-workspace-template-<suffix>
# That's the same map publish-runtime.yml's TEMPLATES variable iterates.
#
# Exit:
# 0 cascade matches manifest exactly
# 1 drift detected (script prints the diff)
# 2 bad usage / missing inputs
set -eu
MANIFEST="${1:-manifest.json}"
WORKFLOW="${2:-.github/workflows/publish-runtime.yml}"
if [ ! -f "$MANIFEST" ]; then
echo "::error::manifest not found: $MANIFEST" >&2
exit 2
fi
if [ ! -f "$WORKFLOW" ]; then
echo "::error::workflow not found: $WORKFLOW" >&2
exit 2
fi
# Expected cascade entries: manifest workspace_templates → suffix-only
# (strip -default tail, e.g. claude-code-default → claude-code, since
# publish-runtime.yml's TEMPLATES uses suffixes that match the
# molecule-ai-workspace-template-<suffix> repo naming).
EXPECTED=$(jq -r '.workspace_templates[].name' "$MANIFEST" \
| sed 's/-default$//' \
| sort -u)
# Actual cascade entries: extract from the TEMPLATES="…" line. We look
# for the line, pull the contents between the quotes, and split into
# one-per-line. Single source of truth in the workflow itself, no
# parallel registry needed.
#
# Why not \s in the regex: BSD sed (macOS) doesn't recognize \s as
# whitespace — treats it as literal `s`. POSIX [[:space:]] works on
# both BSD and GNU sed. Same hazard nuked the original draft of this
# script: \s* matched empty-prefix-of-literal-s, then the leading
# whitespace stayed in the captured group.
ACTUAL=$(grep -E '^[[:space:]]*TEMPLATES="' "$WORKFLOW" \
| head -1 \
| sed -E 's/^[[:space:]]*TEMPLATES="([^"]*)".*$/\1/' \
| tr ' ' '\n' \
| grep -v '^$' \
| sort -u)
if [ -z "$ACTUAL" ]; then
echo "::error::could not extract TEMPLATES=\"…\" from $WORKFLOW — has the variable name or quoting changed?" >&2
exit 2
fi
MISSING=$(comm -23 <(printf '%s\n' "$EXPECTED") <(printf '%s\n' "$ACTUAL"))
EXTRA=$(comm -13 <(printf '%s\n' "$EXPECTED") <(printf '%s\n' "$ACTUAL"))
if [ -z "$MISSING" ] && [ -z "$EXTRA" ]; then
echo "✓ cascade list matches manifest workspace_templates ($(echo "$EXPECTED" | wc -l | tr -d ' ') entries)"
exit 0
fi
echo "::error::cascade list drift detected between $MANIFEST and $WORKFLOW" >&2
echo "" >&2
if [ -n "$MISSING" ]; then
echo " Templates in manifest but MISSING from cascade (won't auto-rebuild on wheel publish):" >&2
echo "$MISSING" | sed 's/^/ - /' >&2
echo "" >&2
fi
if [ -n "$EXTRA" ]; then
echo " Templates in cascade but NOT in manifest (deprecated, wasting dispatch calls):" >&2
echo "$EXTRA" | sed 's/^/ - /' >&2
echo "" >&2
fi
echo " Fix: edit the TEMPLATES=\"…\" line in $WORKFLOW so the set matches" >&2
echo " manifest.json's workspace_templates (suffix-stripped). See PR #2556 for context." >&2
exit 1
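The gate reduces to a two-way set difference between the manifest names and the `TEMPLATES="..."` list. Here is a rough Python equivalent for readers who want to reason about it away from sed and comm. The function names are hypothetical, and it takes file contents as strings rather than paths so it can be exercised without a checkout.

```python
import json
import re


def expected_from_manifest(manifest_text):
    """Manifest names with the `-default` tail stripped, as a set."""
    names = [t["name"] for t in json.loads(manifest_text)["workspace_templates"]]
    return {re.sub(r"-default$", "", n) for n in names}


def actual_from_workflow(workflow_text):
    """Entries of the first line that starts with TEMPLATES="...", as a set."""
    m = re.search(r'^[ \t]*TEMPLATES="([^"]*)"', workflow_text, flags=re.MULTILINE)
    if m is None:
        raise ValueError('could not find a TEMPLATES="..." line')
    return set(m.group(1).split())


def drift(manifest_text, workflow_text):
    """Return (missing_from_cascade, extra_in_cascade), each sorted."""
    expected = expected_from_manifest(manifest_text)
    actual = actual_from_workflow(workflow_text)
    return sorted(expected - actual), sorted(actual - expected)
```

An empty pair corresponds to exit code 0 in the shell script. A non-empty list on either side corresponds to exit code 1.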
+36 -9
@@ -352,15 +352,42 @@ print(json.dumps({
")
fi
# Model slug MUST be provider-prefixed for hermes: the template's
# derive-provider.sh parses the slug prefix (`openai/…`, `anthropic/…`,
# `minimax/…`) to set HERMES_INFERENCE_PROVIDER at install time. A bare
# "gpt-4o" has no prefix → provider falls back to hermes auto-detect →
# picks Anthropic default → tries Anthropic API with the OpenAI key →
# 401 on A2A. Same trap that trapped prod users in PR #1714. We pin
# "openai/gpt-4o" here because the E2E's secret is always the OpenAI
# key; non-hermes runtimes ignore the prefix.
MODEL_SLUG="openai/gpt-4o"
# Model slug format depends on the runtime — different model resolvers
# parse it differently:
#
# hermes → "openai/gpt-4o" (slash-form: derive-provider.sh splits
# on the prefix to set
# HERMES_INFERENCE_PROVIDER. Bare
# "gpt-4o" falls through to Anthropic
# default + 401, see PR #1714.)
#
# langgraph → "openai:gpt-4o" (colon-form: langchain init_chat_model
# requires "<provider>:<model>".
# Slash-form was misinterpreted as
# OpenRouter routing → fell through
# without auth, surfaced 2026-05-03
# after the a2a-sdk v1 contract bugs
# PR #2558+#2563+#2567 cleared the
# masking layers.)
#
# claude-code → "sonnet" (entry-id form: claude-code template's
# config.yaml uses bare model names,
# auth comes via CLAUDE_CODE_OAUTH_TOKEN
# or ANTHROPIC_API_KEY rather than the
# slug.)
#
# When E2E_MODEL_SLUG is set, it overrides this dispatch — useful when an
# operator dispatches the workflow to test a specific slug.
if [ -n "${E2E_MODEL_SLUG:-}" ]; then
MODEL_SLUG="$E2E_MODEL_SLUG"
else
case "$RUNTIME" in
hermes) MODEL_SLUG="openai/gpt-4o" ;;
langgraph) MODEL_SLUG="openai:gpt-4o" ;;
claude-code) MODEL_SLUG="sonnet" ;;
*) MODEL_SLUG="openai/gpt-4o" ;; # safest fallback (matches hermes)
esac
fi
log "5/11 Provisioning parent workspace (runtime=$RUNTIME)..."
PARENT_RESP=$(tenant_call POST /workspaces \
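The per-runtime slug choice is a lookup table with an override and a fallback. Below is a minimal sketch of that selection logic. `pick_model_slug` and the two constants are hypothetical names. The slugs are the ones in the `case` statement above.

```python
import os

# Defaults copied from the shell `case`; each resolver parses a different form.
DEFAULT_SLUGS = {
    "hermes": "openai/gpt-4o",     # slash form, split by derive-provider.sh
    "langgraph": "openai:gpt-4o",  # colon form, "<provider>:<model>"
    "claude-code": "sonnet",       # bare model name, auth comes from env
}
FALLBACK_SLUG = "openai/gpt-4o"    # same fallback as the `*)` arm


def pick_model_slug(runtime, env=None):
    """Return the model slug for `runtime`, honoring E2E_MODEL_SLUG if set."""
    env = os.environ if env is None else env
    override = env.get("E2E_MODEL_SLUG", "")
    if override:  # empty string counts as unset, like `-n "${VAR:-}"`
        return override
    return DEFAULT_SLUGS.get(runtime, FALLBACK_SLUG)
```

Passing `env` explicitly keeps the function testable without mutating the process environment.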
+16 -5
@@ -509,7 +509,15 @@ class LangGraphA2AExecutor(AgentExecutor):
# accept the assignment. See #1787 + commit dcbcf19
# for the original test-mock motivation.
logger.debug("metadata attach skipped (non-Message return from new_text_message)")
await event_queue.enqueue_event(msg)
# A2A v1 (a2a-sdk ≥ 1.0): once Task is enqueued (above, PR #2558),
# the executor is in task mode and raw Message enqueues are
# rejected with InvalidAgentResponseError("Received Message
# object in task mode. Use TaskStatusUpdateEvent or
# TaskArtifactUpdateEvent instead."). updater.complete()
# wraps the Message in a terminal TaskStatusUpdateEvent
# (state=COMPLETED, final=True) which both streaming and
# non-streaming clients accept.
await updater.complete(message=msg)
_result = final_text
except Exception as e:
@@ -520,10 +528,13 @@ class LangGraphA2AExecutor(AgentExecutor):
task_span.set_status(StatusCode.ERROR, str(e))
except Exception:
pass
# Emit a Message so both streaming and non-streaming clients
# receive an error response rather than hanging.
await event_queue.enqueue_event(
new_text_message(
# A2A v1: in task mode, terminal errors must publish a
# FAILED TaskStatusUpdateEvent (carrying the error Message)
# rather than a raw Message enqueue. updater.failed() does
# exactly this — both streaming and non-streaming clients
# receive the error and stop polling.
await updater.failed(
message=new_text_message(
f"Agent error: {e}", task_id=task_id, context_id=context_id
)
)
+27 -5
@@ -180,16 +180,38 @@ def run_preflight(config: WorkspaceConfig, config_path: str) -> PreflightReport:
required_env = list(entry.get("required_env") or [])
break
# Smoke mode skips the auth-env block: the boot smoke (CI publish-image,
# issue #2275) exercises executor.execute() against stub deps, never
# hits the real provider, and CI cannot enumerate every adapter's auth
# env without forming a maintenance treadmill. Hermes 2026-05-03 outage:
# template smoke crashed for two cycles because molecule-ci injected
# CLAUDE_CODE_OAUTH_TOKEN/ANTHROPIC_API_KEY/etc. but not HERMES_API_KEY.
# Bypass here means new templates can ship without the workflow
# learning their env names.
smoke_mode = os.environ.get("MOLECULE_SMOKE_MODE", "").strip().lower() in (
"1", "true", "yes", "on",
)
for env_var in required_env:
if not os.environ.get(env_var):
report.failures.append(
if os.environ.get(env_var):
continue
if smoke_mode:
report.warnings.append(
PreflightIssue(
severity="fail",
severity="warn",
title="Required env",
detail=f"Missing required environment variable: {env_var}",
fix=f"Set {env_var} via the secrets API (global or workspace-level).",
detail=f"Missing {env_var} (skipped — MOLECULE_SMOKE_MODE)",
fix="",
)
)
continue
report.failures.append(
PreflightIssue(
severity="fail",
title="Required env",
detail=f"Missing required environment variable: {env_var}",
fix=f"Set {env_var} via the secrets API (global or workspace-level).",
)
)
# Backward compat: if legacy auth_token_file is set, warn but don't block
# if the token is available via required_env or auth_token_env.
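The smoke-mode downgrade can be isolated from `PreflightReport` to show the control flow on its own. This is a sketch, not the module's code: `is_smoke_mode` and `check_required_env` are hypothetical helpers, and they return plain name lists instead of `PreflightIssue` objects.

```python
import os

TRUTHY = ("1", "true", "yes", "on")


def is_smoke_mode(env=None):
    """True when MOLECULE_SMOKE_MODE holds a truthy value, ignoring case/space."""
    env = os.environ if env is None else env
    return env.get("MOLECULE_SMOKE_MODE", "").strip().lower() in TRUTHY


def check_required_env(required_env, env=None):
    """Return (failures, warnings) as lists of missing variable names."""
    env = os.environ if env is None else env
    smoke = is_smoke_mode(env)
    failures, warnings = [], []
    for name in required_env:
        if env.get(name):
            continue  # present and non-empty: nothing to report
        # In smoke mode a missing variable is downgraded to a warning.
        (warnings if smoke else failures).append(name)
    return failures, warnings
```

A value such as `"0"` or `"off"` leaves smoke mode disabled, so missing variables still block startup.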
+20 -6
@@ -35,27 +35,41 @@ def _make_a2a_mocks():
events_mod.EventQueue = EventQueue
# a2a.server.tasks needs a TaskUpdater stub whose async methods are no-ops.
# In tests, TaskUpdater calls go to this stub rather than the real SDK so
# event_queue.enqueue_event is only called via explicit executor code paths.
# a2a.server.tasks needs a TaskUpdater stub whose async methods are no-ops
# for status transitions but ROUTE the terminal message back through
# event_queue.enqueue_event so legacy assertions on enqueue_event keep
# working. The wrapper preserves identity (the same Message object the
# executor passed in) so tests inspecting str(event_arg) still see the
# response text. complete()/failed() also record their last call on the
# event_queue itself (`_complete_calls`, `_failed_calls`) so the v1
# contract regression test (#262 follow-on to #2558) can pin that the
# proper path was taken — raw enqueue from executor would NOT touch these.
tasks_mod = ModuleType("a2a.server.tasks")
class TaskUpdater:
"""Stub TaskUpdater — no-op async methods for unit tests."""
"""Stub TaskUpdater — terminal helpers route through event_queue."""
def __init__(self, event_queue, task_id, context_id, *args, **kwargs):
self.event_queue = event_queue
self.task_id = task_id
self.context_id = context_id
if not hasattr(event_queue, "_complete_calls"):
event_queue._complete_calls = []
if not hasattr(event_queue, "_failed_calls"):
event_queue._failed_calls = []
async def start_work(self, message=None):
pass
async def complete(self, message=None):
pass
self.event_queue._complete_calls.append(message)
if message is not None:
await self.event_queue.enqueue_event(message)
async def failed(self, message=None):
pass
self.event_queue._failed_calls.append(message)
if message is not None:
await self.event_queue.enqueue_event(message)
async def add_artifact(
self, parts, artifact_id=None, name=None, metadata=None,
+78
@@ -1123,3 +1123,81 @@ async def test_no_task_enqueue_on_continuation():
assert not isinstance(event, Task), (
f"continuation must not re-enqueue Task, but got Task at {call}"
)
# ---------------------------------------------------------------------------
# A2A v1 task-mode terminal-event contract (PR #2558 follow-up, task #262)
# ---------------------------------------------------------------------------
# After PR #2558 enqueues a Task at the start of new requests, the executor
# is in v1 "task mode". The SDK then rejects any subsequent raw Message
# enqueue with InvalidAgentResponseError("Received Message object in task
# mode. Use TaskStatusUpdateEvent or TaskArtifactUpdateEvent instead.") —
# see a2a/server/agent_execution/active_task.py validation site. Synth-E2E
# 2026-05-03T11:00:34Z surfaced this. The fix routes the terminal Message
# through TaskUpdater.complete()/failed() which wrap it in a
# TaskStatusUpdateEvent. Both tests below pin that path so the regression
# can't recur (raw enqueue at the terminal step would NOT touch
# event_queue._complete_calls / _failed_calls).

@pytest.mark.asyncio
async def test_terminal_success_routes_via_updater_complete():
    """A successful run must terminate via updater.complete(message=...) —
    raw event_queue.enqueue_event(Message) crashes the v1 SDK in task mode."""
    agent = MagicMock()
    agent.astream_events = MagicMock(return_value=_stream(_text_chunk("Hello")))
    executor = LangGraphA2AExecutor(agent)

    part = MagicMock()
    part.text = "Hi"
    context = _make_context([part], "ctx-term-ok", task_id="task-term-ok")
    context.current_task = None  # forces task-mode (Task gets enqueued)
    eq = _make_event_queue()
    # Pre-init real lists so the AsyncMock event_queue doesn't auto-spec
    # _complete_calls/_failed_calls into child MagicMocks. The conftest
    # TaskUpdater stub appends to these lists when complete/failed fire.
    eq._complete_calls = []
    eq._failed_calls = []

    await executor.execute(context, eq)

    assert eq._complete_calls, (
        "terminal Message must route via updater.complete() in task mode — "
        "raw event_queue.enqueue_event(Message) is rejected by a2a-sdk v1"
    )
    final_msg = eq._complete_calls[-1]
    assert "Hello" in str(final_msg)


@pytest.mark.asyncio
async def test_terminal_error_routes_via_updater_failed():
    """An agent crash must terminate via updater.failed(message=...) — raw
    enqueue in task mode hits the same v1 contract violation."""
    async def _error_stream(*args, **kwargs):
        raise RuntimeError("model crashed")
        yield  # pragma: no cover — makes this an async generator

    agent = MagicMock()
    agent.astream_events = MagicMock(return_value=_error_stream())
    executor = LangGraphA2AExecutor(agent)

    part = MagicMock()
    part.text = "Break things"
    context = _make_context([part], "ctx-term-err", task_id="task-term-err")
    context.current_task = None  # forces task-mode
    eq = _make_event_queue()
    eq._complete_calls = []
    eq._failed_calls = []

    await executor.execute(context, eq)

    assert eq._failed_calls, (
        "terminal error Message must route via updater.failed() in task mode"
    )
    err_msg = eq._failed_calls[-1]
    assert "model crashed" in str(err_msg)
    # And complete() must NOT have been called on the failure path.
    assert not eq._complete_calls, (
        "complete() should not fire when the agent stream raises"
    )
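
The contract these two tests pin can be illustrated with a small standalone sketch. Everything below is a hypothetical stand-in, not the a2a-sdk API: `ToyQueue`, `ToyUpdater`, `StatusUpdate`, and the `ValueError` are assumptions chosen to mirror the behaviour described in the comment block above (the real SDK raises `InvalidAgentResponseError` from its own validation site, not from the queue).

```python
import asyncio


class Message:
    """Hypothetical stand-in for the SDK's Message type."""
    def __init__(self, text):
        self.text = text


class Task:
    """Hypothetical stand-in for the SDK's Task type."""
    def __init__(self, task_id):
        self.task_id = task_id


class StatusUpdate:
    """Hypothetical stand-in for TaskStatusUpdateEvent."""
    def __init__(self, state, message=None):
        self.state = state
        self.message = message


class ToyQueue:
    """Toy queue: once a Task has been enqueued, raw Messages are rejected."""
    def __init__(self):
        self.events = []
        self.task_mode = False

    async def enqueue_event(self, event):
        if isinstance(event, Task):
            self.task_mode = True
        elif isinstance(event, Message) and self.task_mode:
            # Simplification: the real check lives elsewhere in the SDK.
            raise ValueError("Received Message object in task mode")
        self.events.append(event)


class ToyUpdater:
    """Wraps the terminal Message in a status event before enqueueing."""
    def __init__(self, queue):
        self.queue = queue

    async def complete(self, message=None):
        await self.queue.enqueue_event(StatusUpdate("completed", message))

    async def failed(self, message=None):
        await self.queue.enqueue_event(StatusUpdate("failed", message))


async def demo():
    queue = ToyQueue()
    await queue.enqueue_event(Task("task-1"))  # enters task mode
    try:
        await queue.enqueue_event(Message("Hello"))  # raw enqueue
        raw_rejected = False
    except ValueError:
        raw_rejected = True
    await ToyUpdater(queue).complete(Message("Hello"))  # wrapped enqueue
    return raw_rejected, queue.events[-1].state


result = asyncio.run(demo())
```

In this toy model the raw enqueue is refused and the wrapped one succeeds, which is the distinction the `_complete_calls` / `_failed_calls` assertions are meant to catch.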
@@ -462,6 +462,68 @@ def test_envelope_enrichment_negative_caches_network_exception(_reset_peer_metad
assert cached[1] is None


def test_envelope_enrichment_negative_caches_non_json_200(_reset_peer_metadata_cache):
    """HTTP 200 but the body isn't JSON (registry returns HTML, an empty
    string, or a partial response): ``response.json()`` raises. The
    enrichment block must absorb the exception, write the negative-cache
    entry, and never re-fetch this peer until TTL elapses.

    Without this contract a registry that mistakenly returns a non-JSON
    200 (proxy injecting an HTML error page; partial response from a
    flapping pod) would re-fire the 2s-bounded GET on every push for
    that peer — same DoS-on-self pattern the 5xx negative-cache test
    pins. #2483.
    """
    import json as _json

    import a2a_client
    from a2a_mcp_server import _build_channel_notification

    # 200 OK shape but .json() raises. side_effect overrides the
    # _make_httpx_response default of `return_value` so the helper can
    # stay shape-stable for callers that DO want a JSON body.
    resp = _make_httpx_response(200, {})
    resp.json.side_effect = _json.JSONDecodeError("not json", "<html>", 0)

    p, client = _patch_httpx_client(resp)
    with p:
        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "first"})
        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "second"})

    assert client.get.call_count == 1, (
        f"non-JSON 200 must be negative-cached, got {client.get.call_count} GETs"
    )
    cached = a2a_client._peer_metadata[_PEER_UUID]
    assert cached[1] is None, "negative cache stores None as the record"


def test_envelope_enrichment_negative_caches_non_dict_json_200(_reset_peer_metadata_cache):
    """HTTP 200, valid JSON, but the body is a list / string / number /
    null instead of the expected dict. ``isinstance(record, dict)``
    skips enrichment but the call must still write to the negative
    cache so a second push doesn't re-fetch.

    Pins behaviour for a registry that mistakenly returns
    ``[{"id": ...}, ...]`` (collection shape) or just ``null`` (no-record
    sentinel) — both should land at the same negative-cache outcome as a
    5xx or a non-JSON 200. #2483.
    """
    import a2a_client
    from a2a_mcp_server import _build_channel_notification

    p, client = _patch_httpx_client(
        _make_httpx_response(200, ["not", "a", "dict"]),
    )
    with p:
        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "first"})
        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "second"})

    assert client.get.call_count == 1, (
        f"non-dict JSON 200 must be negative-cached, got {client.get.call_count} GETs"
    )
    cached = a2a_client._peer_metadata[_PEER_UUID]
    assert cached[1] is None, "negative cache stores None as the record"
def test_envelope_enrichment_re_fetches_after_ttl(_reset_peer_metadata_cache):
"""Cached entry past TTL: registry is hit again. Pin the TTL
behaviour so a future caller bumping ``_PEER_METADATA_TTL_SECONDS``
@@ -286,6 +286,55 @@ def test_required_env_empty_list_passes(tmp_path):
assert report.ok is True


def test_required_env_skipped_in_smoke_mode(tmp_path, monkeypatch):
    """MOLECULE_SMOKE_MODE=1 demotes Required-env failures to warnings.

    Boot smoke (issue #2275) exercises executor.execute() against stub
    deps and never hits the real provider, so missing auth env is not
    a real blocker. Without this bypass, every adapter that introduces
    a new auth env var (HERMES_API_KEY, OPENROUTER_API_KEY, etc.)
    would silently break the publish-image gate until molecule-ci's
    fake-env list catches up — the 2026-05-03 hermes outage. The
    warning still surfaces in the report so unset env doesn't go
    completely silent.
    """
    monkeypatch.delenv("HERMES_API_KEY", raising=False)
    monkeypatch.setenv("MOLECULE_SMOKE_MODE", "1")
    config = make_config(
        runtime_config=RuntimeConfig(required_env=["HERMES_API_KEY"]),
    )

    report = run_preflight(config, str(tmp_path))

    assert report.ok is True
    assert any(
        issue.title == "Required env" and "HERMES_API_KEY" in issue.detail
        for issue in report.warnings
    ), "smoke-mode bypass should still warn so unset env stays visible"
    assert not any(
        issue.title == "Required env" for issue in report.failures
    )


def test_required_env_smoke_mode_off_still_fails(tmp_path, monkeypatch):
    """Sanity: smoke bypass is OFF when MOLECULE_SMOKE_MODE is unset."""
    monkeypatch.delenv("HERMES_API_KEY", raising=False)
    monkeypatch.delenv("MOLECULE_SMOKE_MODE", raising=False)
    config = make_config(
        runtime_config=RuntimeConfig(required_env=["HERMES_API_KEY"]),
    )

    report = run_preflight(config, str(tmp_path))

    assert report.ok is False
    assert any(
        issue.title == "Required env" and "HERMES_API_KEY" in issue.detail
        for issue in report.failures
    )
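
A minimal sketch of the demotion rule these two tests describe, assuming a simplified report shape. `Issue`, `Report`, and `check_required_env` are hypothetical names for illustration; the real `run_preflight` and its report type are not shown in this diff, and the exact truthiness check on `MOLECULE_SMOKE_MODE` is an assumption.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Issue:
    title: str
    detail: str


@dataclass
class Report:
    failures: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def ok(self):
        # Only failures block; warnings stay visible but do not fail the run.
        return not self.failures


def check_required_env(required, environ=None):
    """Report missing env vars as warnings in smoke mode, failures otherwise."""
    environ = os.environ if environ is None else environ
    smoke = environ.get("MOLECULE_SMOKE_MODE") == "1"  # assumed check
    report = Report()
    for name in required:
        if environ.get(name):
            continue
        issue = Issue("Required env", f"{name} is not set")
        (report.warnings if smoke else report.failures).append(issue)
    return report


smoke_report = check_required_env(["HERMES_API_KEY"], {"MOLECULE_SMOKE_MODE": "1"})
strict_report = check_required_env(["HERMES_API_KEY"], {})
```

Routing the same `Issue` into a different list, rather than dropping it, is what keeps the unset variable visible in smoke mode.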
# ---------- Per-model required_env (models[] override) ----------