test(e2e): gate fresh-provision peer-visibility via literal MCP list_peers #1298
Summary
Codifies the literal user-facing peer-visibility path as an automated staging-E2E gate so it can never silently regress.
Hermes and OpenClaw were repeatedly reported "fleet-verified / cascade-complete" because the proxy signals were green — registry registration + heartbeat (Hermes), model round-trip 200 (OpenClaw). But a freshly-provisioned workspace asked on canvas "can you see your peers" actually FAILS:
- Hermes: list_peers returns 401
- OpenClaw: calls the sessions_list fallback, sees no platform peers
Tasks #142/#159 were even marked "completed" under this same proxy-verification flaw. This PR makes the literal call an objective, non-bypassable gate.
(Reopened from #1297: that PR was based on staging; molecule-core is trunk-based per feedback_agents_target_staging_default, so the base is main. Identical 2-file diff, cherry-picked onto origin/main.)
What the assertion actually drives (proof it is NOT a proxy)
tests/e2e/test_peer_visibility_mcp_staging.sh:
1. Provisions a fresh org (POST /cp/admin/orgs), the same path a user's "deploy a workspace" click takes.
2. Provisions sibling workspaces (hermes, openclaw, claude-code) under a shared parent.
3. Sends {"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"list_peers","arguments":{}}} to POST /workspaces/:id/mcp using that workspace's own bearer token, through the real WorkspaceAuth + MCPRateLimiter middleware chain (workspace-server/internal/router/router.go:446, mcp.go dispatch → toolListPeers). This is the exact call mcp_molecule_list_peers makes from a canvas agent.
4. Asserts HTTP 200 and a JSON-RPC result (not an error object) AND that the returned peer text literally contains the other provisioned sibling workspace IDs: not an empty list, not a native-sessions_list fallback (the OpenClaw symptom is explicitly pattern-detected and failed).
It does not read a registry row, /health, the heartbeat table, or GET /registry/:id/peers. (For contrast: the pre-existing tests/e2e/test_2307_peer_visibility_staging.sh and test_staging_full_saas.sh step 9b only check the GET /registry/:id/peers HTTP code; that registry-row proxy is exactly what made the broken runtimes look "verified".)
Design decision: new workflow vs extend
New dedicated e2e-peer-visibility.yml rather than folding into e2e-staging-saas.yml, because:
- e2e-staging-saas.yml is single-runtime-per-run (E2E_RUNTIME); a multi-runtime matrix would conflate concerns and bloat its already-45-min run.
Teardown scoping
Scoped to only the e2e-pv-<run_id> org this run created: DELETE /cp/admin/tenants/$SLUG with the {"confirm":$SLUG} fat-finger guard. Three nested nets, none cluster-wide (honors feedback_cleanup_after_each_test, feedback_never_run_cluster_cleanup_tests_on_live_platform):
1. The script's EXIT/INT/TERM trap.
2. A workflow always() step, filtered to this run's e2e-pv-<date>-<run_id>- prefix (today + yesterday UTC for midnight-crossing runs).
3. The sweep-stale-e2e-orgs final net (slug starts with e2e-).
Required-vs-not + flip tracking
Landed NON-required (not added to branch_protections/main status_check_contexts; verified locally via lint-required-no-paths.py: only CI / all-required + sop-checklist / all-items-acked are required). Rationale:
- No continue-on-error mask: that would defeat its entire purpose long-term (feedback_fix_root_not_symptom). It is an honest, visible, red, non-required signal that goes green only when the fixes actually land.
- Flip-to-required is tracked in #1296 (address the paths: filter per feedback_path_filtered_workflow_cant_be_required before flipping).
The pr-validate job shares the E2E Peer Visibility check name (proven e2e-staging-saas.yml shape), so the context is already flip-to-required-ready and a workflow-only PR is never silently statusless.
Gitea-1.22.6 / act_runner hardening
- Mirrored actions/checkout SHA de0fac2e..., the one e2e-staging-canvas.yml uses successfully (feedback_gitea_cross_repo_uses_blocked; re #1277/PR#1292 unmirrored-SHA root-cause).
- Per-SHA concurrency group (feedback_concurrency_group_per_sha).
- GITHUB_SERVER_URL pinned (feedback_act_runner_github_server_url).
- No cross-repo uses:.
- Ran lint-workflow-yaml (0 warnings), lint-continue-on-error-tracking, and lint-required-no-paths locally on the main tree.
Test plan
- bash -n on the driving script
- lint-workflow-yaml.py --workflow-dir: 54 files, 0 fatal, 0 warnings
- lint-continue-on-error-tracking.py: pass (zero continue-on-error in the new file by design)
- lint-required-no-paths.py (with token, BRANCH=main): pass; confirms the new context is NOT required, so the paths: filter is safe to land
- The peer-visibility run on main after merge is expected RED (Hermes-401 / OpenClaw-MCP-wiring not yet fixed): this is the gate working as designed
Refs: #1296
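To make the assertion described under "What the assertion actually drives" concrete, here is a minimal sketch of how one response could be classified. It is not the code in tests/e2e/test_peer_visibility_mcp_staging.sh. The function names, the Authorization: Bearer header scheme, the exit code 2 for misuse, and the string-matching approach are all assumptions made for illustration; only the JSON-RPC envelope, the endpoint path, and exit code 10 come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify one list_peers JSON-RPC response.
# All function names here are illustrative, not taken from the real script.

# The literal JSON-RPC envelope quoted in the PR description.
RPC_BODY='{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"list_peers","arguments":{}}}'

# call_list_peers BASE_URL WORKSPACE_ID TOKEN OUT_FILE
# Writes the response body to OUT_FILE and prints the HTTP status code.
# ASSUMPTION: the bearer token travels in a standard Authorization header.
call_list_peers() {
  curl -sS -o "$4" -w '%{http_code}' -X POST "$1/workspaces/$2/mcp" \
    -H "Authorization: Bearer $3" \
    -H 'Content-Type: application/json' \
    -d "$RPC_BODY"
}

# check_list_peers_response HTTP_CODE BODY PEER_ID...
# Returns 0 only when the body is a JSON-RPC result that names every
# expected sibling; returns 10 (the "designed regression" code) otherwise.
check_list_peers_response() {
  if [ "$#" -lt 3 ]; then
    echo "usage: check_list_peers_response HTTP_CODE BODY PEER_ID..." >&2
    return 2
  fi
  local http_code="$1" body="$2" id
  shift 2

  if [ "$http_code" != "200" ]; then
    echo "FAIL: HTTP $http_code (401 is the Hermes symptom)" >&2
    return 10
  fi
  # A top-level error member means the tool call itself failed.
  case $body in
    *'"error"'*) echo "FAIL: JSON-RPC error object" >&2; return 10 ;;
  esac
  case $body in
    *'"result"'*) ;;
    *) echo "FAIL: no JSON-RPC result" >&2; return 10 ;;
  esac
  # OpenClaw symptom: the agent fell back to its native sessions_list.
  case $body in
    *sessions_list*) echo "FAIL: sessions_list fallback detected" >&2; return 10 ;;
  esac
  # Every expected sibling ID must literally appear in the peer text.
  for id in "$@"; do
    case $body in
      *"$id"*) ;;
      *) echo "FAIL: peer $id missing" >&2; return 10 ;;
    esac
  done
  return 0
}
```

A caller would capture the status with call_list_peers and pass it, the body, and the sibling IDs to check_list_peers_response. Plain string matching keeps the sketch dependency-free; a real gate would more likely parse the JSON properly.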
🤖 Generated with Claude Code
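The scoping rule from "Teardown scoping" can be sketched in a few lines of shell. This is an illustration under stated assumptions, not the PR's actual teardown code: the function names, the Authorization: Bearer header, the date format, and quoting the slug as a JSON string in the confirm body are all hypothetical. Only the DELETE /cp/admin/tenants/$SLUG path, the confirm guard, and the e2e-pv-<date>-<run_id>- prefix come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of run-scoped teardown; all names are illustrative.

# slug_belongs_to_run SLUG RUN_ID TODAY YESTERDAY
# Succeeds only for slugs shaped e2e-pv-<date>-<run_id>-... where <date>
# is today or yesterday, so a run that crosses midnight UTC still matches.
slug_belongs_to_run() {
  case $1 in
    "e2e-pv-$3-$2-"* | "e2e-pv-$4-$2-"*) return 0 ;;
    *) return 1 ;;
  esac
}

# filter_run_slugs RUN_ID TODAY YESTERDAY
# Reads candidate slugs on stdin (one per line) and prints only those
# that belong to this run; everything else is left untouched.
filter_run_slugs() {
  local slug
  while IFS= read -r slug; do
    if slug_belongs_to_run "$slug" "$1" "$2" "$3"; then
      printf '%s\n' "$slug"
    fi
  done
}

# delete_org CP_URL ADMIN_TOKEN SLUG
# Scoped DELETE carrying the fat-finger guard: the body must repeat the slug.
# ASSUMPTION: admin auth is a bearer token in the Authorization header.
delete_org() {
  curl -sS -X DELETE "$1/cp/admin/tenants/$3" \
    -H "Authorization: Bearer $2" \
    -H 'Content-Type: application/json' \
    -d "{\"confirm\":\"$3\"}"
}

# Innermost net, as it might be installed by the driving script:
#   trap 'delete_org "$CP_URL" "$ADMIN_TOKEN" "$SLUG"' EXIT INT TERM
```

Passing the dates in as arguments keeps the filter free of any particular date command, and it means the workflow-level sweep and the script-level trap can share the same predicate.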
Five-axis review (genuine non-author, qa-team).
Correctness: driving script issues the byte-for-byte JSON-RPC tools/call name=list_peers envelope to POST /workspaces/:id/mcp with each workspace's own bearer token through the real WorkspaceAuth+MCPRateLimiter chain; asserts HTTP 200 + JSON-RPC result + expected sibling peer-ID set; correctly distinguishes 401 (Hermes), native sessions_list fallback (OpenClaw), and a missing/empty peer set. Exit 10 = designed regression. Not a proxy (no registry row/health/heartbeat read).
Safety/teardown: scoped DELETE of only the e2e-pv-<date>-<run_id> slug with {confirm} fat-finger guard + EXIT/INT/TERM trap; workflow always() safety-net sweeps only this run's slug (today+yesterday for midnight crossing); sweep-stale-e2e-orgs outer net. Honors never-run-cluster-cleanup-on-live-platform.
CI design: intentionally non-required (RED on today's broken Hermes/OpenClaw = gate working; flip-to-required tracked in #1296); honest gate, no continue-on-error mask; per-SHA concurrency; mirrored checkout SHA; no cross-repo uses; GITHUB_SERVER_URL pinned; pr-validate posts status under the check name so a workflow-only PR is not silently statusless.
Scope: exactly 2 new files, zero deletions, branch protection untouched.
Approved.
[core-security-agent] N/A — E2E infrastructure. New e2e-peer-visibility.yml workflow + test_peer_visibility_mcp_staging.sh. Tests literal MCP list_peers against real staging endpoints using each workspace's own bearer token. All curl calls: hardcoded staging-api.moleculesai.app URLs, RPC_BODY is hardcoded JSON ({"jsonrpc":"2.0","method":"tools/call","params":{"name":"list_peers"}}), tokens from platform API response. No exec from user input. No security surface.
Re-approve after empty-commit CI re-trigger. Prior run did NOT real-red this PR: all-required aggregator (task 107802) was starved by 29+ consecutive act_runner->Gitea API read-timeouts during the workflow storm and never observed any required context as failure; CI/Platform(Go) status never propagated to the API in any successful poll; Platform(Go) go test -race -timeout 10m blew its budget on a contention-starved shared runner (log-stream died 06:29:35, job ran to 06:42:08). PR adds ZERO Go code (only the 2 new files), so a Platform(Go) timeout cannot be caused by this PR — it is infra/runner+API saturation. Empty no-op commit; tree unchanged; diff still exactly .gitea/workflows/e2e-peer-visibility.yml + tests/e2e/test_peer_visibility_mcp_staging.sh. Five-axis review unchanged from review 4014 (genuine non-author core-qa; honest non-required gate; scoped teardown; branch protection untouched, flip tracked in #1296). Approved.