Compare commits

...

24 Commits

Author SHA1 Message Date
molecule-ai[bot] 46c8c1de23 Merge pull request #2602 from Molecule-AI/staging
staging → main: auto-promote 6d38b96
2026-05-03 16:49:40 +00:00
Hongming Wang 6d38b96043 Merge pull request #2601 from Molecule-AI/fix/2483-negative-cache-branch-tests
test(envelope-enrichment): pin negative-cache for non-JSON 200 + non-dict JSON 200 (#2483)
2026-05-03 16:37:30 +00:00
Hongming Wang 270a95aa67 test(envelope-enrichment): pin negative-cache for non-JSON 200 + non-dict JSON 200 (#2483)
The two missing branch tests called out by the multi-axis review of #2471.

a2a_client.enrich_peer_metadata handles two failure shapes (lines 105-112)
that the existing 12 envelope-enrichment tests don't exercise:

  1. HTTP 200, response.json() raises (non-JSON body)
  2. HTTP 200, valid JSON, but body is list/string/number not dict

Both paths land at the negative-cache write, but no test verified the
discriminator. Pin both with the same call_count == 1 assertion shape
the 5xx + network-exception tests already use.

Verified: temporarily removing the negative-cache write in either
branch makes the corresponding test fail with call_count == 2 — the
assertion correctly discriminates the contract from a fall-through.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:35:21 -07:00
Hongming Wang 6431bdc631 Merge pull request #2600 from Molecule-AI/auto-sync/main-72b6be82
chore: sync main → staging (auto, ff to 72b6be82)
2026-05-03 16:23:16 +00:00
molecule-ai[bot] 72b6be82b0 Merge pull request #2599 from Molecule-AI/staging
staging → main: auto-promote b425995
2026-05-03 09:18:48 -07:00
Hongming Wang b42599585e Merge pull request #2598 from Molecule-AI/fix/auto-promote-skip-empty-tree
fix(auto-promote): skip empty-tree promotes to break perpetual cycle
2026-05-03 15:59:05 +00:00
Hongming Wang 06bfed2e35 Merge pull request #2597 from Molecule-AI/auto-sync/main-d1eab79d
chore: sync main → staging (auto, ff to d1eab79d)
2026-05-03 15:57:47 +00:00
Hongming Wang 80b38900de fix(auto-promote): skip empty-tree promotes to break perpetual cycle
The auto-promote ↔ auto-sync chain has been generating empty PRs
indefinitely since the staging merge_queue ruleset uses MERGE
strategy:

1. Auto-promote merges PR via queue → main = merge commit M2 not in staging
2. Auto-sync opens sync-back PR. Workflow's local `git merge --ff-only`
   succeeds (PR title even says "ff to ..."), but the queue lands the
   PR via MERGE → staging = merge commit S2 not in main
3. Auto-promote sees staging ahead by 1 → opens new promote PR. Tree
   diff vs main = 0 (S2's tree == main's tree). But the gate logic
   only checks "all required workflows green", not "actual code to
   ship" → opens an empty promote PR
4. ... repeat indefinitely
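
The steps above can be modelled with a toy commit graph. This is only an illustrative sketch: `Commit`, `ancestors`, and `ahead_by` are hypothetical helpers, and the SHAs and tree names are made up to mirror M2 and S2 in the list.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Commit:
    sha: str
    tree: str  # stands in for the commit's tree hash
    parents: Tuple["Commit", ...] = ()


def ancestors(commit: Commit) -> Set[str]:
    """All SHAs reachable from commit, including itself."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        cur = stack.pop()
        if cur.sha in seen:
            continue
        seen.add(cur.sha)
        stack.extend(cur.parents)
    return seen


def ahead_by(head: Commit, base: Commit) -> int:
    """Number of commits reachable from head but not from base."""
    return len(ancestors(head) - ancestors(base))


m1 = Commit("M1", tree="t1")
s1 = Commit("S1", tree="t2", parents=(m1,))      # real change on staging
m2 = Commit("M2", tree="t2", parents=(m1, s1))   # promote lands via MERGE
s2 = Commit("S2", tree="t2", parents=(s1, m2))   # sync-back lands via MERGE

# staging looks "ahead by 1", yet there is nothing to ship.
print(ahead_by(s2, m2), s2.tree == m2.tree)
```

A gate that looks only at commit counts sees S2 as new work, while a tree comparison sees that nothing changed.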

Each round costs ~30-40 min wallclock, ~2 manual approvals (the queue
requires 1 review and the bot can't self-approve without admin
bypass), and one full CodeQL Go run (~15 min).

Observed today (2026-05-03) across PRs #2592, #2594, #2595, #2596, #2597: 5 PRs, ~3 hours, all empty content.

Fix: before opening the promote PR, check that staging's tree
actually differs from main's tree. If they're identical (the
empty-merge-commit cycle), skip cleanly and let the cycle terminate.

Implementation:
- New step `Skip if staging tree == main tree` runs before the
  existing gate check.
- `git diff --quiet origin/main $HEAD_SHA` exits 0 iff trees match.
- On match: emits a step summary explaining the skip + sets
  `skip=true`; subsequent gate-check + promote steps are gated on
  `skip != 'true'` so they short-circuit.
- Fail-open: if `git fetch` errors, fall through to gate check
  (preserve existing behavior). Only skip when diff is DEFINITIVELY
  empty.
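
The fail-open rule in the list above reduces to a small decision function. A sketch, assuming only what the bullets state: `should_skip` and its parameters are hypothetical names, and the exit codes follow `git diff --quiet` (0 for no differences, 1 for differences, anything else treated as an error).

```python
def should_skip(fetch_ok: bool, diff_exit_code: int) -> bool:
    """Skip the promote only when the diff is definitively empty.

    fetch_ok:       whether fetching main succeeded
    diff_exit_code: exit status of the tree comparison
    """
    if not fetch_ok:
        return False  # fail-open: fall through to the gate check
    return diff_exit_code == 0  # any non-zero status means "do not skip"


for fetch_ok, code in [(True, 0), (True, 1), (True, 128), (False, 0)]:
    print(fetch_ok, code, should_skip(fetch_ok, code))
```

Only the first case skips; a failed fetch or an erroring diff keeps the pre-existing gate-check behaviour.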

Long-term, the cleaner fix is to switch the merge_queue ruleset's
merge_method away from MERGE so FF-able PRs land cleanly without a
new commit — but that's a broader change affecting every staging
PR's commit shape. This guard is the surgical one-step break.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 08:56:44 -07:00
molecule-ai[bot] d1eab79d28 Merge pull request #2596 from Molecule-AI/staging
staging → main: auto-promote 824a2a7
2026-05-03 15:50:12 +00:00
Hongming Wang 824a2a7657 Merge pull request #2595 from Molecule-AI/auto-sync/main-876d6ec8
chore: sync main → staging (auto, ff to 876d6ec8)
2026-05-03 15:38:22 +00:00
molecule-ai[bot] 876d6ec8c9 Merge pull request #2594 from Molecule-AI/staging
staging → main: auto-promote 63e3d38
2026-05-03 08:33:51 -07:00
Hongming Wang 63e3d385d6 Merge pull request #2592 from Molecule-AI/auto-sync/main-2e78812f
chore: sync main → staging (auto, ff to 2e78812f)
2026-05-03 15:15:01 +00:00
molecule-ai[bot] 2e78812ff9 Merge pull request #2591 from Molecule-AI/staging
staging → main: auto-promote 19cc833
2026-05-03 15:04:00 +00:00
Hongming Wang 9664d66e4b Merge branch 'main' into staging
2026-05-03 07:48:31 -07:00
Hongming Wang 19cc83313a Merge pull request #2589 from Molecule-AI/fix/retarget-skip-staging-head
fix(retarget): skip PRs whose head is staging (auto-promote PRs)
2026-05-03 14:36:44 +00:00
molecule-ai[bot] 097d513b65 Merge pull request #2588 from Molecule-AI/staging
staging → main: auto-promote c45aa8d
2026-05-03 07:35:05 -07:00
Hongming Wang 2b3f44c3c8 fix(retarget): skip PRs whose head is staging (auto-promote PRs)
The retarget-main-to-staging workflow tries to PATCH base=staging on
every bot-authored PR opened against main. Auto-promote staging→main
PRs have head=staging, base=main — retargeting them sets head AND
base to staging, which GitHub rejects with HTTP 422 "no new commits
between base 'staging' and head 'staging'".

This started surfacing on PR #2588 (2026-05-03 14:30) once #2586
switched the auto-promote workflow to an App token. Before #2586
the auto-promote PR was authored by github-actions[bot], which the
retarget filter happened to skip; now it's molecule-ai[bot], which
passes the bot filter and triggers the broken retarget attempt.

Add a head-ref != 'staging' guard so auto-promote PRs short-circuit
before the PATCH. The existing 422 "duplicate base" detector is
left alone — it covers a different operational case.
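
The guard can be read as a predicate over the PR's head ref and author. A sketch that mirrors the workflow's `if:` expression in Python; `should_retarget` and its parameter names are hypothetical, and the logins are the ones the workflow condition lists:

```python
def should_retarget(head_ref: str, user_type: str, login: str) -> bool:
    """True when a PR opened against main should have its base moved to staging."""
    is_bot = (
        user_type == "Bot"
        or login.endswith("[bot]")
        or login == "app/molecule-ai"
        or login == "molecule-ai[bot]"
    )
    # Auto-promote PRs have head=staging; retargeting them would make
    # head and base identical, which GitHub rejects with HTTP 422.
    return head_ref != "staging" and is_bot


print(should_retarget("staging", "Bot", "molecule-ai[bot]"))
print(should_retarget("auto-sync/main-b4e45374", "Bot", "molecule-ai[bot]"))
```

Putting the head-ref test first keeps the bot filter unchanged for every other PR.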
2026-05-03 07:34:24 -07:00
Hongming Wang c45aa8d7ee Merge pull request #2587 from Molecule-AI/auto-sync/main-b4e45374
chore: sync main → staging (auto, ff to b4e45374)
2026-05-03 14:19:28 +00:00
Hongming Wang b4e45374bf Merge pull request #2586 from Molecule-AI/fix/auto-promote-app-token
fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357)
2026-05-03 07:15:31 -07:00
Hongming Wang f2d69f0088 Merge pull request #2585 from Molecule-AI/fix/canvas-loading-state-aria
fix(canvas): add role=status + aria-live to remaining loading states
2026-05-03 14:14:33 +00:00
Hongming Wang bc11ed8a2b fix(auto-promote): use App token for auto-merge to fire downstream cascade (#2357)
GITHUB_TOKEN-initiated merges suppress the downstream `push` event on
main per GitHub's documented limitation:
  https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow

Result before this fix: every staging→main promote landed silently —
publish-workspace-server-image, canary-verify, and redeploy-tenants-on-main
all stayed dark. The polling tail was the SOLE cascade trigger; if it
ever 30-min-timed-out the chain dead-locked invisibly.

Symptom (from the issue body, 2026-04-30):

| Time     | Event                              | Triggered? |
|----------|------------------------------------|------------|
| 05:48:04 | Promote PR #2352 merged (c140ad28) | Not fired  |
| 06:07:29 | Promote PR #2356 merged (5973c9bd) | Not fired  |

Fix: mint the molecule-ai App token BEFORE the promote-PR step and
hand it to the auto-merge call. App-token-initiated merges DO trigger
downstream workflow_run cascades.

The polling tail stays as defense-in-depth (with comments updated):
once we've observed >=10 successful natural cascades it can be dropped.
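
For readers unfamiliar with the pattern, a polling tail with a hard timeout can be sketched as below. This is not the workflow's actual script: `poll_until`, its parameters, and the 30-second interval are assumptions; only the 30-minute bound comes from the message above.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() every interval_s seconds until it returns True or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True  # e.g. the promote PR has merged: dispatch downstream
        if clock() >= deadline:
            return False  # timed out: nothing is dispatched
        sleep(interval_s)


# Simulated clock so the example finishes instantly.
t = [0.0]
merged_at = 90.0
ok = poll_until(
    check=lambda: t[0] >= merged_at,
    timeout_s=1800.0,
    interval_s=30.0,
    clock=lambda: t[0],
    sleep=lambda s: t.__setitem__(0, t[0] + s),
)
print(ok, t[0])
```

The `False` branch is the silent failure mode described above: if the tail is the only trigger and it times out, nothing downstream runs.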

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 07:13:26 -07:00
Hongming Wang e2328abedc fix(canvas): add role=status + aria-live to remaining loading states
Three loading-state divs were missing the role/aria pattern that
TemplatePalette.tsx and EmptyState.tsx already follow. Screen readers
get no signal that the page is waiting:

- canvas/src/app/page.tsx — full-screen "Loading canvas..." while
  the websocket hydrates. First paint of the entire app.
- canvas/src/components/settings/TokensTab.tsx — "Loading tokens..."
- canvas/src/components/settings/OrgTokensTab.tsx — "Loading keys..."

Add role="status" + aria-live="polite" to the wrapping div so
assistive tech announces the wait and the eventual transition.
Visual rendering unchanged.
2026-05-03 07:11:48 -07:00
github-actions[bot] bdad75ae3e Merge pull request #2582 from Molecule-AI/staging
staging → main: auto-promote 90ba2cd
2026-05-03 07:06:58 -07:00
Hongming Wang 90ba2cd4df Merge pull request #2580 from Molecule-AI/auto-sync/main-b002247f
chore: sync main → staging (auto, ff to b002247f)
2026-05-03 13:54:03 +00:00
6 changed files with 174 additions and 51 deletions
+94 -44
@@ -111,7 +111,60 @@ jobs:
       all_green: ${{ steps.gates.outputs.all_green }}
       head_sha: ${{ steps.gates.outputs.head_sha }}
     steps:
+      # Skip empty-tree promotes (the perpetual auto-promote↔auto-sync cycle
+      # observed 2026-05-03). Sequence: auto-promote merges via the staging
+      # merge-queue's MERGE strategy, creating a merge commit on main that
+      # staging doesn't have. auto-sync then merges main back into staging
+      # via another merge commit (the queue's MERGE strategy applies on
+      # the staging side too, even when the workflow's local FF would
+      # have sufficed). Now staging has a new merge-commit SHA whose
+      # tree == main's tree — but auto-promote sees "staging ahead of
+      # main by 1" and opens YET another empty promote PR. Each round
+      # costs ~30-40 min wallclock, ~2 manual approvals, and burns a
+      # full CodeQL Go run (~15 min). Without this guard the cycle
+      # repeats indefinitely.
+      #
+      # Long-term fix is to switch the merge_queue ruleset's
+      # `merge_method` away from MERGE so FF-able PRs land cleanly,
+      # but that's a broader change affecting every staging PR's
+      # commit shape. This guard is the one-line surgical fix that
+      # breaks the cycle without touching merge-queue config.
+      #
+      # Fail-open: if `git diff` errors for any reason, fall through
+      # to the gate check (preserve existing behavior). Only skip
+      # when the diff is DEFINITIVELY empty.
+      - name: Checkout for tree-diff check
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          fetch-depth: 0
+          ref: staging
+      - name: Skip if staging tree == main tree (perpetual-cycle break)
+        id: tree-diff
+        env:
+          HEAD_SHA: ${{ github.event.workflow_run.head_sha || github.sha }}
+        run: |
+          set -eu
+          git fetch origin main --depth=50 || { echo "::warning::git fetch main failed — proceeding (fail-open)"; exit 0; }
+          # Compare staging tip's tree against main's tree. `git diff
+          # --quiet` exits 0 if no differences, 1 if there are.
+          if git diff --quiet origin/main "$HEAD_SHA" -- 2>/dev/null; then
+            {
+              echo "## ⏭ Skipped — no code to promote"
+              echo
+              echo "staging tip (\`${HEAD_SHA:0:8}\`) and \`main\` have identical trees."
+              echo "This is the auto-promote↔auto-sync merge-commit cycle: staging has a"
+              echo "new SHA (a sync-back merge commit) but the underlying file tree is"
+              echo "already on main, so there's no real code to ship."
+              echo
+              echo "Skipping to avoid opening an empty promote PR. Cycle terminates here."
+            } >> "$GITHUB_STEP_SUMMARY"
+            echo "::notice::auto-promote: staging tree == main tree — no code to promote, skipping"
+            echo "skip=true" >> "$GITHUB_OUTPUT"
+          else
+            echo "skip=false" >> "$GITHUB_OUTPUT"
+          fi
       - name: Check all required gates on this SHA
+        if: steps.tree-diff.outputs.skip != 'true'
         id: gates
         env:
           GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
@@ -209,10 +262,25 @@ jobs:
             exit 0
           fi
+      # Mint the App token BEFORE the promote-PR step so the auto-merge
+      # call can use it. GITHUB_TOKEN-initiated merges suppress the
+      # downstream `push` event on main, breaking the
+      # publish-workspace-server-image → canary-verify → redeploy-tenants
+      # chain (issue #2357). Using the App token here means the
+      # merge-queue-landed merge IS able to fire the cascade naturally;
+      # the polling tail below stays as defense-in-depth.
+      - name: Mint App token for promote-PR + downstream dispatch
+        if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
+        id: app-token
+        uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
+        with:
+          app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
+          private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
       - name: Open (or reuse) staging → main promote PR + enable auto-merge
         if: ${{ vars.AUTO_PROMOTE_ENABLED == 'true' || github.event.inputs.force == 'true' }}
         env:
-          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+          GH_TOKEN: ${{ steps.app-token.outputs.token }}
           REPO: ${{ github.repository }}
           TARGET_SHA: ${{ needs.check-all-gates-green.outputs.head_sha }}
         run: |
@@ -267,52 +335,34 @@ jobs:
           echo "promote_pr_num=${PR_NUM}" >> "$GITHUB_OUTPUT"
         id: promote_pr
-      # Mint a short-lived GitHub App installation token for the dispatch
-      # step below. We CANNOT use `secrets.GITHUB_TOKEN` to dispatch the
-      # downstream publish chain — workflow runs created by GITHUB_TOKEN
-      # do not fire `workflow_run` triggers on completion (the
-      # documented "no recursion" rule —
-      # https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
-      #
-      # Symptom this caused (root-caused on 2026-04-30): publish-image
-      # ran successfully twice (21313dc 14:41Z, 59dec57 15:21Z) but
-      # canary-verify and redeploy-tenants-on-main never chained,
-      # because the publish run's `triggering_actor` was
-      # `github-actions[bot]` (i.e. GITHUB_TOKEN). A manual dispatch
-      # earlier in the day with the operator's PAT (d850ec7 06:52Z) did
-      # chain — same workflow file, only the actor differed.
-      #
-      # An App token's triggering_actor is the App user (e.g.
-      # `molecule-ai[bot]`), which IS allowed to fire downstream
-      # workflow_run cascades.
-      - name: Mint App token for downstream dispatch
-        if: steps.promote_pr.outputs.promote_pr_num != ''
-        id: app-token
-        uses: actions/create-github-app-token@1b10c78c7865c340bc4f6099eb2f838309f1e8c3 # v3.1.1
-        with:
-          app-id: ${{ secrets.MOLECULE_AI_APP_ID }}
-          private-key: ${{ secrets.MOLECULE_AI_APP_PRIVATE_KEY }}
+      # The App token minted above (before the promote-PR step) is
+      # also used by the polling tail below. Defense-in-depth: with
+      # the merge-queue-landed merge now using the App token, the
+      # main-branch push event SHOULD fire the publish/canary/redeploy
+      # cascade naturally — but if for any reason it doesn't (e.g. an
+      # unrelated event-suppression edge case), the explicit dispatches
+      # below still wake the chain.
       - name: Wait for promote merge, then dispatch publish + redeploy (#2357)
-        # GITHUB_TOKEN-initiated merges suppress downstream `push` events
-        # (https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
-        # Result: when the merge queue lands the promote PR, the resulting
-        # main-branch push DOES NOT fire publish-workspace-server-image,
-        # so canary-verify and redeploy-tenants-on-main never run and
-        # tenants stay on stale code (issue #2357).
+        # Defense-in-depth dispatch. With the auto-merge call above
+        # now using the App token (this commit), the merge-queue-landed
+        # merge SHOULD fire publish-workspace-server-image naturally
+        # via on:push:[main] — App-token-initiated pushes DO trigger
+        # workflow_run cascades, unlike GITHUB_TOKEN-initiated ones
+        # (the documented "no recursion" rule —
+        # https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow).
         #
-        # Workaround: poll for the merge to land, then explicitly
-        # `gh workflow run` publish-workspace-server-image. The dispatch
-        # MUST authenticate as the molecule-ai App (App token minted
-        # above) — not GITHUB_TOKEN — so that the resulting publish
-        # run's completion event can fire the workflow_run cascade
-        # into canary-verify + redeploy-tenants-on-main. See the prior
-        # step's comment for the GITHUB_TOKEN no-recursion details.
+        # This explicit dispatch stays as belt-and-suspenders for any
+        # edge case where the natural cascade misfires. If it never
+        # observably fires after this token swap (i.e. the publish
+        # workflow has already started by the time we get here), the
+        # second dispatch is a harmless no-op (publish-workspace-server-image
+        # has its own concurrency group that dedupes).
         #
-        # Long-term fix: switch the auto-merge call above to use the
-        # same App token, so the merge's push event fires
-        # publish-workspace-server-image naturally and this polling tail
-        # becomes unnecessary. Tracked in #2357.
+        # See PR for #2357: pre-fix the merge action was via
+        # GITHUB_TOKEN, suppressing the cascade and forcing this tail
+        # to be the SOLE chain trigger. With the auto-merge token swap
+        # the tail becomes redundant in the happy path; keep until
+        # we've observed >=10 successful natural cascades, then drop.
         if: steps.promote_pr.outputs.promote_pr_num != ''
         env:
           GH_TOKEN: ${{ steps.app-token.outputs.token }}
+15 -4
@@ -26,11 +26,22 @@ jobs:
     runs-on: ubuntu-latest
     # Only fire for bot-authored PRs. Human CEO PRs (staging→main promotion)
     # are intentional and pass through.
+    #
+    # Head-ref guard: never retarget a PR whose head IS `staging` — those
+    # are the auto-promote staging→main PRs (opened by molecule-ai[bot]
+    # since #2586 switched to an App token, which now passes the bot
+    # filter below). Retargeting head=staging onto base=staging fails
+    # with HTTP 422 "no new commits between base 'staging' and head
+    # 'staging'", which used to surface as a noisy red workflow run on
+    # every auto-promote (caught 2026-05-03 on PR #2588).
     if: >-
-      github.event.pull_request.user.type == 'Bot'
-      || endsWith(github.event.pull_request.user.login, '[bot]')
-      || github.event.pull_request.user.login == 'app/molecule-ai'
-      || github.event.pull_request.user.login == 'molecule-ai[bot]'
+      github.event.pull_request.head.ref != 'staging'
+      && (
+        github.event.pull_request.user.type == 'Bot'
+        || endsWith(github.event.pull_request.user.login, '[bot]')
+        || github.event.pull_request.user.login == 'app/molecule-ai'
+        || github.event.pull_request.user.login == 'molecule-ai[bot]'
+      )
     steps:
       - name: Retarget PR base to staging
         id: retarget
+1 -1
@@ -54,7 +54,7 @@ export default function Home() {
   if (hydrating) {
     return (
       <div className="fixed inset-0 flex items-center justify-center bg-surface">
-        <div className="flex flex-col items-center gap-3">
+        <div role="status" aria-live="polite" className="flex flex-col items-center gap-3">
           <Spinner size="lg" />
           <span className="text-xs text-ink-soft">Loading canvas...</span>
         </div>
@@ -182,7 +182,7 @@ export function OrgTokensTab() {
       {/* Token list */}
       {loading ? (
-        <div className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
+        <div role="status" aria-live="polite" className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
           <Spinner /> Loading keys...
         </div>
       ) : tokens.length === 0 ? (
+1 -1
@@ -129,7 +129,7 @@ export function TokensTab({ workspaceId }: TokensTabProps) {
       {/* Token list */}
       {loading ? (
-        <div className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
+        <div role="status" aria-live="polite" className="flex items-center justify-center gap-2 py-6 text-ink-soft text-xs">
           <Spinner /> Loading tokens...
         </div>
       ) : tokens.length === 0 ? (
+62
@@ -462,6 +462,68 @@ def test_envelope_enrichment_negative_caches_network_exception(_reset_peer_metad
     assert cached[1] is None
+
+
+def test_envelope_enrichment_negative_caches_non_json_200(_reset_peer_metadata_cache):
+    """HTTP 200 but the body isn't JSON (registry returns HTML, an empty
+    string, or a partial response): ``response.json()`` raises. The
+    enrichment block must absorb the exception, write the negative-cache
+    entry, and never re-fetch this peer until TTL elapses.
+
+    Without this contract a registry that mistakenly returns a non-JSON
+    200 (proxy injecting an HTML error page; partial response from a
+    flapping pod) would re-fire the 2s-bounded GET on every push for
+    that peer — same DoS-on-self pattern the 5xx negative-cache test
+    pins. #2483.
+    """
+    import json as _json
+
+    import a2a_client
+    from a2a_mcp_server import _build_channel_notification
+
+    # 200 OK shape but .json() raises. side_effect overrides the
+    # _make_httpx_response default of `return_value` so the helper can
+    # stay shape-stable for callers that DO want a JSON body.
+    resp = _make_httpx_response(200, {})
+    resp.json.side_effect = _json.JSONDecodeError("not json", "<html>", 0)
+    p, client = _patch_httpx_client(resp)
+    with p:
+        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "first"})
+        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "second"})
+
+    assert client.get.call_count == 1, (
+        f"non-JSON 200 must be negative-cached, got {client.get.call_count} GETs"
+    )
+    cached = a2a_client._peer_metadata[_PEER_UUID]
+    assert cached[1] is None, "negative cache stores None as the record"
+
+
+def test_envelope_enrichment_negative_caches_non_dict_json_200(_reset_peer_metadata_cache):
+    """HTTP 200, valid JSON, but the body is a list / string / number /
+    null instead of the expected dict. ``isinstance(record, dict)``
+    skips enrichment but the call must still write to the negative
+    cache so a second push doesn't re-fetch.
+
+    Pins behaviour for a registry that mistakenly returns
+    ``[{"id": ...}, ...]`` (collection shape) or just ``null`` (no-record
+    sentinel) — both should land at the same negative-cache outcome as a
+    5xx or a non-JSON 200. #2483.
+    """
+    import a2a_client
+    from a2a_mcp_server import _build_channel_notification
+
+    p, client = _patch_httpx_client(
+        _make_httpx_response(200, ["not", "a", "dict"]),
+    )
+    with p:
+        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "first"})
+        _build_channel_notification({"peer_id": _PEER_UUID, "kind": "peer_agent", "text": "second"})
+
+    assert client.get.call_count == 1, (
+        f"non-dict JSON 200 must be negative-cached, got {client.get.call_count} GETs"
+    )
+    cached = a2a_client._peer_metadata[_PEER_UUID]
+    assert cached[1] is None, "negative cache stores None as the record"


 def test_envelope_enrichment_re_fetches_after_ttl(_reset_peer_metadata_cache):
     """Cached entry past TTL: registry is hit again. Pin the TTL
     behaviour so a future caller bumping ``_PEER_METADATA_TTL_SECONDS``