os.Chmod(dst, 0o555) silently passes when os.Geteuid() == 0 because
root bypasses POSIX permission checks. A previous attempt to use a
symlink to /dev/full also fails: Go's os.MkdirAll resolves the symlink
during path traversal and the kernel allows mkdir("/dev/full") as a
device-table entry — io.Copy to /dev/full then succeeds with 0 bytes
written and returns nil.
The honest, consistent fix mirrors TestLocalResolver_CopyFileSourceUnreadable:
skip when running as root. The write-failure propagation logic is
exercised correctly in non-root CI environments.
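
The root caveat above can be sketched as follows. This is a minimal illustration, not the repository's test: `runningAsRoot` and `probeReadOnlyWrite` are hypothetical helper names, and the real test would presumably call `t.Skip` at the top of the test function when the effective UID is 0.

```go
package main

import (
	"os"
	"path/filepath"
)

// runningAsRoot reports whether the effective UID is 0. Root bypasses POSIX
// permission checks, so a chmod-based write-failure test cannot fail there.
// (Hypothetical helper, named for this sketch only.)
func runningAsRoot() bool {
	return os.Geteuid() == 0
}

// probeReadOnlyWrite creates a temp dir, marks it read-only (0o555), and
// tries to create a file inside it. The first result is the error from that
// create attempt; the second reports problems preparing the directory.
func probeReadOnlyWrite() (error, error) {
	dir, err := os.MkdirTemp("", "ro-probe")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	if err := os.Chmod(dir, 0o555); err != nil {
		return nil, err
	}
	// Restore write permission first so RemoveAll can clean up (defers run LIFO).
	defer os.Chmod(dir, 0o755)

	f, createErr := os.Create(filepath.Join(dir, "probe"))
	if createErr == nil {
		f.Close()
	}
	return createErr, nil
}
```

For a non-root user the create attempt is expected to fail with a permission error; as root it is expected to succeed, which is why the test has nothing to assert in that case.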
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(template_import): Remove silent template-dir fallback in ReplaceFiles offline path
When the workspace container is offline and writeViaEphemeral fails
(docker unavailable), ReplaceFiles previously fell back to writing
to the host-side template directory. This silently returned 200 with
"source: template" while the file change was invisible after restart
because the restart handler reads from the Docker volume, not the
template dir (issue #151).
Now returns 503 Service Unavailable with a message telling the caller
to retry after the workspace starts. The ephemeral write path is
the only correct mechanism for offline-container updates.
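
A minimal sketch of the new decision, assuming a hypothetical `offlineWriteResult` shape and message wording. Only the 503 status and the retry hint come from this change; the success `Source` value is an assumption.

```go
package main

import (
	"errors"
	"net/http"
)

// errDockerUnavailable is an example failure used in the usage note below.
var errDockerUnavailable = errors.New("docker unavailable")

// offlineWriteResult is a hypothetical response shape for illustration.
type offlineWriteResult struct {
	Status  int
	Source  string // set only on success (value is an assumption)
	Message string // set only on failure
}

// resolveOfflineWrite maps the outcome of the ephemeral write to an HTTP
// result. There is deliberately no fallback branch: a failed ephemeral write
// is reported as 503 instead of being masked by a template-dir write.
func resolveOfflineWrite(ephemeralErr error) offlineWriteResult {
	if ephemeralErr != nil {
		return offlineWriteResult{
			Status:  http.StatusServiceUnavailable,
			Message: "workspace is offline and the write failed; retry after the workspace starts",
		}
	}
	return offlineWriteResult{Status: http.StatusOK, Source: "ephemeral"}
}
```

Calling `resolveOfflineWrite(errDockerUnavailable)` yields status 503 with an empty `Source`, so the caller can no longer mistake a failed write for a persisted one.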
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mark PR #175 (keyboard shortcuts dialog) as ✅ done.
Note that screen reader announcements (HIGH) is in progress by Core-FE.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #160: workspace tests fail when MOLECULE_WORKSPACE_TOKEN is set in
the test environment (or when /configs/.auth_token exists on disk, as it
does in a container CI runner).
Root cause:
- test_resolve_token_returns_none_when_missing: monkeypatch.delenv()
removes the env var, but _resolve_token() falls through to
configs_dir.resolve()/.auth_token which exists in the container.
- Multi-workspace tests: clear_cache() resets _cached_token, but
get_token() immediately re-reads /configs/.auth_token and caches
the real token before the env var is even checked.
Fix:
- test_mcp_doctor: patch configs_dir.resolve() to return a bare tmp_path
so the disk-file fallback finds nothing.
- Multi-workspace tests: patch platform_auth._token_file() to return a
non-existent path (via tmp_path) alongside clear_cache(), ensuring
the env var wins as intended.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix: Treat delivery-confirmed proxy errors as delegation success
When proxyA2ARequest returns an error but we have a non-empty
response body with a 2xx status code, the agent completed the work
successfully. The error is a delivery/transport error (e.g., connection
reset after response was received).
Previously, executeDelegation would mark these as "failed" even though
the work was done, causing:
- Retry storms (canvas suggests restart, user retries)
- "error" rendering in canvas even though result is available
- Data loss risk from unnecessary restarts
Now we check for valid response data before marking as failed.
Fixes issue #159.
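
The check described above can be sketched roughly as follows. The function names and the "completed" status string are assumptions made for illustration; only the 2xx-plus-non-empty-body condition and the "failed" status come from this change.

```go
package main

import "errors"

// errConnReset is an example transport error used in the usage note below.
var errConnReset = errors.New("connection reset by peer")

// deliveryConfirmed reports whether a proxy error happened after a usable
// response had already arrived: a 2xx status with a non-empty body.
func deliveryConfirmed(proxyErr error, statusCode int, body []byte) bool {
	if proxyErr == nil {
		return false // no error at all, so there is nothing to reclassify
	}
	return statusCode >= 200 && statusCode < 300 && len(body) > 0
}

// delegationStatus picks the status to record. "completed" is a hypothetical
// name for the success status.
func delegationStatus(proxyErr error, statusCode int, body []byte) string {
	if proxyErr == nil || deliveryConfirmed(proxyErr, statusCode, body) {
		return "completed"
	}
	return "failed"
}
```

With these definitions, `delegationStatus(errConnReset, 200, []byte("done"))` is treated as success, while the same error with an empty body or a 502 status is still recorded as failed.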
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #152: claude-code workspace plugin adapter import fails with
'No module named plugins_registry'. Plugin adapter code
(workspace-template-*) uses bare `from plugins_registry import ...`
but molecule-runtime only shipped it at
molecule_runtime/plugins_registry/ (the package namespace path).
Fix: copy workspace/plugins_registry/ to the top level of the wheel
in addition to molecule_runtime/plugins_registry/. Both copies coexist
— the top-level one satisfies bare imports from plugin adapters,
the nested one satisfies the rewritten
`from molecule_runtime.plugins_registry import ...` in adapter_base.py.
pyproject.toml updated to include plugins_registry* in the packages find
directive so setuptools ships it from the wheel root.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the "no keyboard shortcut help dialog" audit gap (MEDIUM).
Changes:
- Add KeyboardShortcutsDialog component: portal-based, accessible
dialog listing all canvas + navigation + agent shortcuts grouped by
category. WCAG 2.1 compliant (focus trap, Esc close, aria-modal,
aria-labelledby, focus restoration on close).
- Add global ? shortcut: opens the dialog when pressed outside any
input field and no modal is already open.
- Add "See all shortcuts →" link in the Toolbar quick-start popup
linking to the dialog.
Test plan:
- [x] npx vitest run (182 tests pass)
- [x] tsc --noEmit (no type errors)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue: HIGH priority item from canvas accessibility audit (2026-05-09).
Screen reader users had no way to know when workspace status changed
— the canvas updated visually but no announcement was made.
Changes:
- canvas.ts: add `liveAnnouncement: string` + `setLiveAnnouncement` to
CanvasState so the store can hold the current announcement text.
- canvas-events.ts: set `liveAnnouncement` in handleCanvasEvent for 6
key status transitions: ONLINE, OFFLINE, PAUSED, DEGRADED, PROVISIONING,
REMOVED, PROVISION_FAILED. Names are looked up from store nodes so
announcements are human-readable ("Alpha is now online" not "ws-1").
TASK_UPDATED and AGENT_MESSAGE are intentionally excluded — they fire
on every heartbeat and would overwhelm the user.
- Canvas.tsx: subscribe to `liveAnnouncement` from the store; render a
visually-hidden `aria-live="polite" aria-atomic="true"` region that
speaks the announcement then clears it after 500 ms so the same
message doesn't re-announce on re-render. Fallback still announces
workspace count on initial load.
- canvas-events.test.ts: 12 new test cases covering announcement
content for all 6 event types, empty/no-announcement cases, and
payload-name fallback when a node isn't yet in the store.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Issue #159: successful delegation responses were rendered as error
banners because extractResponseText() only handled the A2A result
format (body.result.parts[].text) but delegation.go stores
response_body as {text: "...", delegation_id: "..."}. The error
status was set when the HTTP transport failed even though the actual
agent response was received.
Fixes:
1. extractResponseText: check body.text before the result path so
delegation response_body.text is extracted correctly
2. extractResponseText: also check body.response_preview (WS event shape
from DELEGATION_COMPLETE handler)
3. GroupedCommsView: render NormalMessage when status=error but
responseText is populated (delegation succeeded, transport failed)
instead of burying the content in an error banner
Tests: 8 new cases (4 extractResponseText + 2 extractRequestText
regression + 2 render tests). 189 tests pass across 10 files.
Closes #159.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[core-lead-agent] Closes Core-Security audit finding (2026-05-09 audit cycle, MEDIUM):
1. workspace-server/internal/handlers/workspace_crud.go:335
`DELETE /workspaces/:id` returned `err.Error()` verbatim in the 500
body, leaking wrapped lib/pq driver strings (schema column names,
index hints) to HTTP clients. Replaced with sanitized message;
raw error already logged server-side via the existing log.Printf
immediately above.
2. workspace-server/internal/handlers/org.go:610
`OrgImport` echoed the user-supplied `body.Dir` verbatim in the 404
"org template not found: %s" response. Path traversal is already
blocked by resolveInsideRoot earlier in the handler, but echoing
raw input back lets a client probe filesystem layout (404-with-echo
vs. 400-from-resolve is itself a signal). Dropped the input from the
client-facing message; preserved full context in a new log.Printf
(orgFile path + the requested body.Dir) for operator triage.
Both fixes preserve operator-side diagnostics (logs unchanged in
content, only client-facing JSON sanitized). No behavior change for
legitimate clients — error type, status code, and JSON shape all stay
the same.
Tier: low. Defensive hardening only; reduces info-disclosure surface
without altering control-flow or auth gates.
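
The pattern behind both fixes, sketched under assumptions: `clientMessage`, `genericDeleteError`, and the example error text are made up for this illustration and are not the handler's actual identifiers or wording.

```go
package main

import (
	"bytes"
	"errors"
	"log"
)

// genericDeleteError is the only text the HTTP client sees (hypothetical wording).
const genericDeleteError = "failed to delete workspace"

// clientMessage logs the raw error for operators and returns a sanitized
// message for the response body, so driver details never reach the client.
func clientMessage(logger *log.Logger, workspaceID string, err error) string {
	logger.Printf("delete workspace %s: %v", workspaceID, err)
	return genericDeleteError
}

// demoSanitize shows what each audience receives for a made-up driver error.
func demoSanitize() (clientBody, operatorLog string) {
	var buf bytes.Buffer
	logger := log.New(&buf, "", 0)
	raw := errors.New("driver: internal detail the client should not see")
	return clientMessage(logger, "ws-1", raw), buf.String()
}
```

The operator log keeps the full wrapped error, while the client body stays constant regardless of what the database layer returned.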
Agent Comms tab rendered outbound delegations as blank bubbles because
extractRequestText only checked the A2A JSON-RPC format
(body.params.message.parts[].text) while delegation.go stores
request_body as {"task": "...", "delegation_id": "..."}.
Fix: check body.task first for delegation activities, then fall back to
the A2A format. Add six test cases covering the delegation shape,
precedence over A2A params when both present, empty-string guard, and
non-string type guard.
Closes #158.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
[FORCE-MERGE AUDIT — §SOP-7]
- Approver: hongming via chat-go ("go") in conversation transcript ~21:00 UTC on 2026-05-09
- Bypassed: required status checks (all pending — runner pickup issue, separate from PR correctness)
- Audit channel: orchestrator force-merge log + this commit message
Part of overnight team shipping cycle. PR authored by team persona under per-persona Gitea identity (post #156 merge).
Renames Docker network across all code, configs, scripts, and docs.
Per issue #93: the network was named molecule-monorepo-net as a holdover
from when the repo was called molecule-monorepo. The canonical repo name is
now molecule-core, so the network should be molecule-core-net.
Files changed:
- docker-compose.yml, docker-compose.infra.yml: network definition
- infra/scripts/setup.sh: docker network create
- scripts/nuke-and-rebuild.sh: docker network rm
- workspace-server/internal/provisioner/provisioner.go: DefaultNetwork
- All comments/docs: updated wording
Acceptance: grep -rn 'molecule-monorepo-net' returns zero matches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MCP delegate_task and delegate_task_async bypassed the delegation activity
lifecycle entirely — no activity_log row was written for MCP-initiated
delegations. As a result the canvas Agent Comms tab rendered outbound
delegations as bare "Delegation dispatched" events with no task body.
Fix: insert a delegation row (mirroring insertDelegationRow from
delegation.go) before the A2A call so the canvas can show the task text.
The sync tool updates status to 'dispatched' after the HTTP call; the
async tool inserts with 'dispatched' directly (goroutine won't update).
Closes #158.
Closes #49 (partial: addresses the canvas-display gap; full lifecycle
parity requires DelegationWriter extraction, tracked separately).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per issue #153: `docker compose up -d` (docker-compose.yml) did not start
Temporal because it lived only in docker-compose.infra.yml. Users had to know
to run `setup.sh`, which explicitly uses `-f docker-compose.infra.yml`.
Adding `include: - docker-compose.infra.yml` makes the full infra stack
(starting with Temporal) start with the default `docker compose up` command.
Both compose files define postgres/redis — the main file's definitions take
precedence via compose merge semantics, so no service conflicts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Major correction from Core-FE review:
- Canvas has THREE themes: System/Light/Dark, not dark-only
- Warm paper tones for light, zinc-adjacent dark for dark mode
- ThemeProvider handles switching, persisted in mol_theme cookie
- Use semantic tokens: bg-surface, bg-surface-card, border-line, text-ink
- NEVER use raw zinc for surfaces — only for borders/disabled/code
Updated:
- Section 1: Three-mode theme palette with exact hex values
- Section 4: Component patterns now use semantic tokens
- Added Section 4.6: ThemeProvider + useTheme() usage
- Section 7: Enforcement checklist now includes token rules
Co-Authored-By: Core-FE <core-fe@moleculesai.app>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cross-reference the Core-FE draft against actual molecule-core/canvas/src/
codebase. Creates two new docs:
- canvas-design-system-v1.md: Full design system with verified color
palette, typography scale, animation tokens (from theme-tokens.css),
component patterns, WCAG 2.1 AA checklist. Marks all items as
VERIFIED with source file citations.
- canvas-audit-items.md: Updated architecture brain dump with verified
findings on React Flow canvas accessibility. Flags remaining gaps
(screen reader announcements, keyboard shortcuts help, keyboard drag).
Key verified discrepancies from draft:
- Font: system-ui stack (not Inter/Geist)
- Tooltip: uses aria-describedby + role=tooltip (not group-hover CSS)
- Animation tokens: already defined in theme-tokens.css
- ContextMenu: has full keyboard nav (arrow keys, wrap-around)
Co-Authored-By: Core-FE <core-fe@moleculesai.app>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #155.
Without this, every commit from a workspace booted via the standard
provisioner lands with an empty `user.name`/`user.email` and Gitea
attributes the work to whichever PAT pushed (typically the founder's
`claude-ceo-assistant`), instead of the persona that actually authored
the commit. That's the same fingerprint pattern that got us suspended
on GitHub 2026-05-06.
GITEA_USER is already injected per-workspace by the provisioner from
workspace_secrets (verified: 8/8 Core-* workspaces have it set,
correctly-named, on operator + local). Boot picks it up unconditionally;
falls through cleanly if unset (e.g. legacy boxes without persona
identity wiring).
Email uses `bot.moleculesai.app` so agent commits are visually distinct
from human-authored commits in Gitea history. The `gitconfig` copy from
`/root/.gitconfig` to `/home/agent/.gitconfig` is now unconditional —
previously it was nested inside the `molecule-git-token-helper.sh`
block, which meant the per-persona identity wouldn't propagate to the
agent user when the helper was unavailable.
Also added an inline note that the github.com credential-helper block
is post-suspension legacy. Full removal tracked under #171; this PR
deliberately doesn't touch it (smaller blast radius).
Tested: docker exec sets the same config in 8 running Core-* workspaces
locally and they pick up correct identity for `git config -l`. Will
reset when those containers restart, hence this PR for the persistent
fix.
molecule-core/main branch protection requires the status-check context
'Secret scan / Scan diff for credential-shaped strings (pull_request)'
but the workflow lived only in .github/workflows/, which Gitea Actions
doesn't see — every PR's required-status-checks rollup left the context
in 'expected' / never-fires state, blocking merge.
Port to .gitea/workflows/secret-scan.yml. Drops:
- merge_group event (Gitea has no merge queue)
- workflow_call (no cross-repo reusable invocation on Gitea)
SELF exclude lists both .github/ and .gitea/ paths so a future sync
between them stays clean. Job + step names match the GitHub workflow
so the produced status-check context name matches branch protection
unchanged.
Same regex set as the runtime's pre-commit hook
(molecule-ai-workspace-runtime: molecule_runtime/scripts/pre-commit-checks.sh).
This unblocks PR #150 (audit-force-merge fan-out) and every future
PR on molecule-core/main.
Mirrors the canonical workflow shipped on internal#120 + #122. Same
shape: pull_request_target on closed, base.sha checkout, structured
JSON event to runner stdout that Vector ships to Loki on
molecule-canonical-obs.
REQUIRED_CHECKS env declares both molecule-core/main protected
contexts (sop-tier-check + Secret scan). Mirror against branch
protection if either is added/removed.
Verified end-to-end on internal: synthetic force-merge of internal#123
emitted incident.force_merge with all expected fields, indexable in
Loki via {host="molecule-canonical-1"} |= "incident.force_merge".
Tier: low (CI workflow, no platform code path).
Closes the post-PR-#174 self-review gap: the matched-pair contract
between ADMIN_TOKEN (server-side bearer gate) and NEXT_PUBLIC_ADMIN_TOKEN
(canvas client-side bearer attach) was descriptive only, living in a
.env file comment. Future agents/devs could re-misconfigure with one
of the two unset and silently 401 — every workspace API call refused
with no actionable diagnostic.
Adds checkAdminTokenPair() to canvas/next.config.ts, run after
loadMonorepoEnv() so it sees the post-load state. Two distinct
warnings (server-set/client-unset and the inverse) so an operator can
tell which half is missing without grep'ing. Empty string is treated
as unset so KEY= and unset KEY produce the same verdict.
Warn-only, not exit — production canvas Docker images bake these vars
at image-build time and a hard exit would turn a recoverable auth
issue into a crashloop. The console.error fires in `next dev`, the
standalone server's stdout, and the canvas Docker container logs —
the three places an operator looks when "everything 401s."
Tests pin exact stderr strings (per feedback_assert_exact_not_substring)
across 6 cases: both unset, both set, ADMIN_TOKEN-only, NEXT_PUBLIC-only,
empty-string-as-unset, and the empty-string-asymmetric mismatch.
Mutation-tested: flipping the if-condition from === to !== fails all 6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The forks pool's implicit maxWorkers=1 (2-CPU runner) was insufficient
to prevent concurrent jsdom worker cold-starts. Each jsdom worker
allocates ~30-50 MB RSS at boot; multiple workers starting simultaneously
exhaust available memory, causing 5 test files to fail with:
[vitest-pool]: Failed to start forks worker for test files ...
[vitest-pool-runner]: Timeout waiting for worker to respond
Individual jsdom test files take 12-15 s in isolation and pass cleanly.
Failures only occur when 51 files are run together through the pool.
Fix: explicitly set maxWorkers:1 so a single worker processes all files
sequentially, eliminating concurrent jsdom bootstrap memory pressure.
With this change, all 51 files pass (was 46 pass + 5 fail), and suite
duration improves from ~5070 s to ~1117 s because workers no longer
compete for resources during startup.
Ref: issue #148
Ref: vitest-pool investigation for issue #22 (canvas side)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the canonical refactor: workflow YAML shrinks (env+invocation),
logic moves to .gitea/scripts/sop-tier-check.sh, debug echoes gated on
SOP_DEBUG, checkout@v6 pinned to base.sha.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fans the security fix from internal#116 (cce89067) to molecule-core. Same
rationale: pull_request loads workflow from PR HEAD, allowing any
write-access contributor to rewrite the workflow file in their PR and
exfiltrate SOP_TIER_CHECK_TOKEN. pull_request_target loads from base
(main), neutralising the attack.
Verified post-merge on internal: synthetic PR rewriting the workflow to
print the token did NOT execute the modified version — main's
pull_request_target version ran instead. ATTACK_PROBE never fired.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase-1 fan-out of §SOP-6 enforcement to molecule-core. No branch
protection change in this PR — workflow runs and reports a status,
doesn't block any merge yet.
Branch protection update is the follow-up PR after the workflow
demonstrates a green run on its own PR, per the Phase 2 plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The org.import.started event was firing immediately after request body
bind, before the YAML at body.Dir was loaded. Result: payload.name was
"" whenever the caller passed `dir` (the common path — the canvas and
all live imports use dir, not inline template). Three started rows
already in the local platform's structure_events have empty name.
Fix: move the started emit (and importStart timestamp) to after the
YAML unmarshal / inline-template fallthrough, where tmpl.Name is
guaranteed populated.
Bonus: pre-parse error returns (invalid body, traversal-rejected dir,
file-not-found, YAML expansion fail, YAML unmarshal fail, neither dir
nor template provided) no longer emit an orphan started row — every
started is now guaranteed a paired completed/failed.
Verified live against running platform: re-imported molecule-dev-only,
new started row in structure_events carries
"Molecule AI Dev Team (dev-only)" instead of "".
Tests: full handler suite green (`go test ./internal/handlers/`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops ~150 lines of duplicated cascade logic from the Delete HTTP
handler — workspace_crud.go's CascadeDelete (added in PR #137) and
Delete() were running the same #73 race-guard sequence (status update →
canvas_layouts → tokens → schedules → container stop → broadcast),
just with Delete() inlined and CascadeDelete owning the OrgImport
reconcile path.
CascadeDelete now returns the descendant id list (was: count) so
Delete() can drive the optional ?purge=true hard-delete against the
same set the cascade just touched.
Net diff: workspace_crud.go shrinks from ~270 lines in Delete() to
~75 lines (parse + 409 confirm gate + CascadeDelete call + stop-error
500 + purge block + 200 response). Behavior identical — same SQL
ordering, same #73 race guard, same response shapes. Three sqlmock
tests for the 0-children case gained one extra ExpectQuery for the
recursive-CTE descendants scan (the old inline code skipped that
query when len(children)==0; CascadeDelete walks unconditionally —
returns 0 rows, same end state, one extra cheap query).
Tests: full handler suite green (`go test ./internal/handlers/`).
Live-tested against the running local platform: DELETE on a fake
workspace returns `{"cascade_deleted":0,"status":"removed"}`,
fleet of 9 workspaces preserved, refactored handler matches the
prior wire-shape exactly.
Tracked as the PR #137 follow-up tech-debt item.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the additive-import zombie bug — re-running /org/import with a
tree shape that reparents same-named roles left the prior workspace
online because lookupExistingChild's dedupe is parent-scoped (different
parent_id → "different" workspace). Caught 2026-05-08 after a dev-tree
re-import left 8 orphans co-existing with the new tree on canvas until
manual cascade-delete.
Three layers in this PR:
- mode="reconcile" on /org/import — after the import loop, online
workspaces whose name matches an imported name but whose id isn't in
the result set are cascade-deleted. Default mode "" / "merge"
preserves existing additive behavior. Empty-set guards prevent
accidental "delete everything" if either array comes up empty.
- WorkspaceHandler.CascadeDelete extracted as a callable helper from
the existing Delete HTTP handler so OrgImport's reconcile path shares
the same teardown sequence (#73 race guard, container stop, volume
removal, token revocation, schedule disable, event broadcast). The
HTTP Delete handler still inlines the same logic; deduplication
tracked as tech-debt follow-up.
- emitOrgEvent(structure_events) records org.import.started +
org.import.completed with mode, created/skipped/reconcile_removed
counts, duration_ms, error. Replaces the lost-on-restart stdout-only
log shape for an audit-trail surface that's queryable by SQL. Closes
the "what happened at 20:13?" debugging gap that motivated this fix.
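
The reconcile selection in the first layer can be sketched like this. It is an illustration under assumptions: the `workspace` type and `reconcileTargets` name are hypothetical, and the real handler works against database rows, not in-memory slices.

```go
package main

// workspace is a hypothetical minimal view of an online workspace row.
type workspace struct {
	ID   string
	Name string
}

// reconcileTargets returns the IDs of online workspaces whose name matches an
// imported name but whose ID is not in the import result set. Those are the
// leftovers that reconcile mode would cascade-delete.
func reconcileTargets(online []workspace, importedNames, resultIDs []string) []string {
	// Empty-set guards: if either input is empty, remove nothing, so a bad
	// import can never degrade into "delete everything".
	if len(importedNames) == 0 || len(resultIDs) == 0 {
		return nil
	}
	names := make(map[string]bool, len(importedNames))
	for _, n := range importedNames {
		names[n] = true
	}
	kept := make(map[string]bool, len(resultIDs))
	for _, id := range resultIDs {
		kept[id] = true
	}
	var orphans []string
	for _, ws := range online {
		if names[ws.Name] && !kept[ws.ID] {
			orphans = append(orphans, ws.ID)
		}
	}
	return orphans
}
```

Workspaces with names the import never mentioned are left alone, which matches the description of reconcile as scoped to imported names only.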
Verified live against the local platform: cascade-delete on an old
tree's removed root cleared 8 surviving orphans; mode="reconcile" with
a freshly-INSERTed fake orphan removed exactly the fake; idempotent
re-run of reconcile is a no-op (0 removed, no errors); structure_events
captures every started+completed pair with full payload.
7 new unit tests (walkOrgWorkspaceNames flat/nested/spawning:false/
empty-name; emitOrgEvent success + DB-error-swallow; errString). Full
handler suite green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 follow-up to template-claude-code PR #9 (2026-05-08 dev-tree wedge).
Pre-fix: applyRuntimeModelEnv unconditionally overwrote envVars["MODEL"]
with the MODEL_PROVIDER slug whenever payload.Model was empty (the restart
path). This silently wiped the operator's explicit per-persona MODEL
secret on every restart.
Symptom: dev-tree workspaces booted correctly on first /org/import (the
envVars map was populated direct from the persona env file with both
MODEL=MiniMax-M2.7-highspeed and MODEL_PROVIDER=minimax), then on the
next Restart the MODEL secret got clobbered to literal "minimax" (a
provider slug, not a valid model id) and the workspace template's
adapter failed to match any registry prefix, fell through to providers[0]
(anthropic-oauth), and wedged at SDK initialize.
Fix: resolution order in applyRuntimeModelEnv is now:
1. payload.Model (caller passed the canvas-picked model id verbatim)
2. envVars["MODEL"] (workspace_secret persisted from persona env)
3. envVars["MODEL_PROVIDER"] (legacy canvas Save+Restart shape)
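
The resolution order above, as a minimal sketch. `resolveModel` and `applyModel` are hypothetical names; the actual `applyRuntimeModelEnv` signature is not shown in this message and may differ.

```go
package main

// resolveModel returns the model id to use, following the precedence:
// payload model, then the MODEL secret, then the MODEL_PROVIDER slug.
// It returns "" when none of them is set.
func resolveModel(payloadModel string, envVars map[string]string) string {
	if payloadModel != "" {
		return payloadModel
	}
	if m := envVars["MODEL"]; m != "" {
		return m
	}
	return envVars["MODEL_PROVIDER"]
}

// applyModel writes the resolved value back to envVars["MODEL"], and leaves
// the map untouched when nothing resolves (the "both absent" no-op case).
func applyModel(payloadModel string, envVars map[string]string) {
	if m := resolveModel(payloadModel, envVars); m != "" {
		envVars["MODEL"] = m
	}
}
```

On the restart path (`payloadModel == ""`), an env map holding `MODEL=MiniMax-M2.7-highspeed` and `MODEL_PROVIDER=minimax` keeps the explicit model id, where the old behavior replaced it with the provider slug.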
Tests
-----
TestApplyRuntimeModelEnv_PersonaEnvMODELSecretPreserved — locks in
the new resolution order with four cases:
- MODEL secret wins over MODEL_PROVIDER slug (persona-env shape)
- MODEL secret wins even when same as MODEL_PROVIDER
- MODEL absent → fall back to MODEL_PROVIDER (legacy shape)
- Both absent → no MODEL set (no-op)
Existing TestApplyRuntimeModelEnv_SetsUniversalMODELForAllRuntimes
continues to pass — fix is strictly additive on the precedence chain.
Lets a workspace declare it (and its entire subtree) should be skipped
during /org/import. Pointer-typed `*bool` so we distinguish "explicitly
false" from "unset" (default = spawn).
## Use case
The dev-tree org template ships the full role taxonomy (Dev Lead with
Core Platform / Controlplane / App & Docs / Infra / SDK Leads, each with
their own engineering / QA / security / UI-UX children — 27 personas
total in a single import). Some setups need a smaller set:
- Local dev on a memory-constrained machine
- Demo / smoke runs that don't need the full org breathing
- Customer trials starting with leadership-only before fan-out
Pre-fix the only options were:
- Edit the canonical template (mutates shared state)
- Author a parallel slimmer template (duplicates structure)
- Manual workspace deprovision after full import (wasteful — already paid
the docker pull / build cost)
`spawning: false` is the per-workspace knob that solves this without
touching the canonical template structure.
## Semantics
- Unset: workspace spawns (current behaviour, no migration)
- `spawning: true`: explicitly spawns (same as unset)
- `spawning: false`: workspace is skipped AND every descendant is
skipped. The guard sits BEFORE any side effect in
createWorkspaceTree — no DB row, no docker provision, no children
recursion. A false-spawning subtree is genuinely a no-op except for
the log line. countWorkspaces still counts the subtree (so /org/templates
numbers reflect the full structure).
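
A small sketch of these semantics, assuming a hypothetical `orgWorkspace` type and helper names. It only models which nodes would spawn versus which are counted, with no database or Docker side effects.

```go
package main

// orgWorkspace is a hypothetical, trimmed-down template node.
type orgWorkspace struct {
	Name     string
	Spawning *bool // nil = unset, which defaults to spawning
	Children []orgWorkspace
}

// boolPtr is a convenience for building literals in examples.
func boolPtr(b bool) *bool { return &b }

// shouldSpawn treats unset and explicit true the same way.
func shouldSpawn(ws orgWorkspace) bool {
	return ws.Spawning == nil || *ws.Spawning
}

// spawnedNames walks the tree and skips a non-spawning node together with
// its entire subtree; the guard runs before any recursion into children.
func spawnedNames(ws orgWorkspace) []string {
	if !shouldSpawn(ws) {
		return nil
	}
	names := []string{ws.Name}
	for _, c := range ws.Children {
		names = append(names, spawnedNames(c)...)
	}
	return names
}

// countAll counts every node regardless of the spawning flag, mirroring the
// note that template counts still reflect the full structure.
func countAll(ws orgWorkspace) int {
	n := 1
	for _, c := range ws.Children {
		n += countAll(c)
	}
	return n
}
```

The pointer type is what lets `nil` mean "unset": a plain `bool` would make an omitted field indistinguishable from an explicit `false`.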
## Stage A — verified
Local dev-only template that wraps teams/dev.yaml (Dev Lead) with
children:[] cleared on the 5 sub-team yaml files, plus 3 floater
personas (Release Manager / Integration Tester / Fullstack Engineer).
/org/import returned 9 workspaces. Drop-in: same result via
`spawning: false` on each sub-tree root in the future.
## Stage B — N/A
Pure additive feature on the org-template handler. No SaaS deploy chain
implications.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## org_import.go — persona env injection root-cause fix
The Phase-3 fix from earlier today (`feedback/per-agent-gitea-identity-default`)
introduced loadPersonaEnvFile to inject persona-specific creds into
workspace_secrets on /org/import. It passed `ws.Role` as the persona-dir
lookup key, but in our dev-tree org.yaml shape `role:` carries the
multi-line descriptive text the agent reads from its prompt
("Engineering planning and team coordination — leads Core Platform,
Controlplane, ..."), while `files_dir:` holds the short slug
(`core-lead`, `dev-lead`, etc.) matching
`~/.molecule-ai/personas/<files_dir>/env`.
isSafeRoleName silently rejected the multi-word role text → no persona
env loaded → every imported workspace booted with zero
workspace_secrets rows → no ANTHROPIC / CLAUDE_CODE / MINIMAX auth in
the container env → claude_agent_sdk wedged on `query.initialize()`
with a 60s control-request timeout.
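
To illustrate why the lookup key matters, here is a sketch under stated assumptions. `isSafeSlug` is a hypothetical stand-in for `isSafeRoleName` (whose real rules are not shown here), and `personaEnvPath` is an invented helper name.

```go
package main

import "path/filepath"

// isSafeSlug is a hypothetical stand-in for isSafeRoleName: it accepts only
// lowercase letters, digits, '-' and '_'. The real validation may differ.
func isSafeSlug(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' || r == '_'
		if !ok {
			return false
		}
	}
	return true
}

// personaEnvPath builds <personasRoot>/<key>/env, or reports false when the
// key is not a safe slug. Passing the descriptive role text here is what
// produced the silent "no persona env loaded" outcome.
func personaEnvPath(personasRoot, key string) (string, bool) {
	if !isSafeSlug(key) {
		return "", false
	}
	return filepath.Join(personasRoot, key, "env"), true
}
```

A short slug such as `core-lead` resolves to a path, while a multi-word role description is rejected, so keying the lookup on `files_dir` is what makes the env file discoverable.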
After the fix, /org/import on the dev tree (27 personas) populates
8 workspace_secrets per workspace (Gitea identity + MODEL/MODEL_PROVIDER
+ provider-specific token), 5 of 6 leads boot online, and the
remaining wedges trace to a separate runtime-template-repo bug
(workspace-template-claude-code's claude_sdk_executor.py doesn't
dispatch on MODEL_PROVIDER=minimax — filed separately).
## Dockerfile.dev — docker-cli + docker-cli-buildx
Without these, every claude-code/tier-2 workspace POST fails-fast:
- without docker-cli, the call fails with `exec: "docker": executable file not found`
- docker-cli alone (no buildx) fails on `docker build` with
`ERROR: BuildKit is enabled but the buildx component is missing or broken`
Both packages are now installed in the dev image; verified with
`docker exec molecule-core-platform-1 docker buildx version`.
## Stage A verified
Local /org/import dev-only path: 27 workspaces created, all 27 receive
persona env injection (8 secrets each — Gitea identity + provider creds).
Lead workspaces (claude-code-OAuth tier) boot online.
## Stage B — N/A
Local-dev-only path (docker-compose.dev.yml + dev image). Tenant EC2
provisioning uses Dockerfile.tenant (untouched).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the workspace-template visibility flip in 558e4fee. After
flipping the 5 private workspace-templates public (#192 root cause),
the harness-replays clone moved one step deeper to the org-templates
list, where 6 of 7 were also private. Hongming-confirmed flip plan:
- 5 of 6 (molecule-dev, free-beats-all, medo-smoke, molecule-worker-gemini,
ux-ab-lab) — flipped public per `feedback_oss_first_repo_visibility_default`.
These are unambiguously OSS-template-shape: generic README, no
customer-shaped names, no creds in content.
- 1 of 6 (reno-stars) — name itself is customer-shaped (would expose
customer/tenant identity). Kept private; removed from manifest.json
per Hongming. Will be handled at provision-time via the per-tenant
credential resolver designed in internal#102 (Layer-3 RFC).
Documents the OSS-surface contract in two places:
- manifest.json _comment: every entry MUST be public; Layer-3 lives elsewhere
- clone-manifest.sh comment block: rationale + the explicit ci-readonly
team-grant escape hatch (review-gated, not default).
Closes the second clone-fail layer of #192. Combined with 558e4fee +
the workspace-template visibility flips, the Pre-clone manifest deps
step should now succeed anonymously for the full registered set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 of 9 workspace-template repos (openclaw, codex, crewai, deepagents,
gemini-cli) had been marked private with no team grant for the AUTO_SYNC_TOKEN
bearer (devops-engineer persona). Pre-clone manifest deps step 404'd on
the first private repo encountered, failing every Harness Replays run.
Resolution path taken:
1. Flipped the 5 to public per `feedback_oss_first_repo_visibility_default`
— runtime/template/plugin repos default public; that's what makes them
OSS surface.
2. Scoped existing `ci-readonly` org team to legitimately-internal repos
only (compliance docs, RFCs-in-flight). Workspace templates removed
from it.
3. Filed internal#102 RFC for Layer-3 (customer-owned + marketplace
third-party private repos) — that's a different shape entirely;
needs per-tenant credential-resolver, not org-team grants.
This commit is a documentation-only touch on the workflow file to (a)
record the root cause inline next to the existing pre-clone-fail
narrative, (b) trigger a fresh Harness Replays run that should now pass
the clone step.
Closes #192.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Investigating molecule-core#129 failure mode #1 (claude-code "Agent
error (Exception)") needs the workspace's docker logs to find the
actual exception. The canary tears down the tenant on every failure,
so the workspace container is destroyed before anyone can SSM in.
Add a workflow_dispatch input `keep_on_failure: bool` (default false).
When true, sets `E2E_KEEP_ORG=1` for the canary script — its existing
debug path skips teardown, leaving the tenant + EC2 + CF tunnel + DNS
alive. Operator can then SSM into the workspace EC2 (via the same
flow as recover-tunnels.py) and capture `docker logs` from the
claude-code container.
Cron-triggered runs never set the input (it only exists on dispatch),
so unattended scheduled canaries always tear down — no risk of
unattended cost leak.
Operator workflow:
1. Dispatch canary-staging.yml with keep_on_failure=true
2. Watch CI; on failure (likely, given the 38h chronic red),
note the SLUG / TENANT_URL printed at step 1/11
3. SSM exec into the workspace EC2 (us-east-2) and run
`docker logs <claude-code-container>` to find the actual
exception traceback
4. Manually delete via DELETE /cp/admin/tenants/<slug> when done
(the script logs this reminder on E2E_KEEP_ORG=1 path)
Refs: molecule-core#129 (canary investigation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the legacy nohup `go run ./cmd/server` setup with a fully
containerized local stack: postgres + redis + platform + canvas, all
with `restart: unless-stopped` so they survive Mac sleep/wake and
Docker Desktop daemon restarts.
## Changes
- **docker-compose.yml**
- `restart: unless-stopped` on platform/postgres/redis
- `BIND_ADDR=0.0.0.0` for platform — the dev-mode-fail-open default
of 127.0.0.1 (PR #7) made the host unable to reach the container
even with port mapping. Container netns is already isolated, so
binding all interfaces inside is safe.
- Healthchecks switched from `wget --spider` (HEAD → 404 forever
because /health is GET-only) to `wget -qO /dev/null` (GET).
Same regression existed on canvas; fixed both.
- **workspace-server/Dockerfile.dev**
- `CGO_ENABLED=1` → `0` to match prod Dockerfile + Dockerfile.tenant.
Without this, the alpine dev image fails with "gcc: not found"
because workspace-server has no actual cgo deps but the env was
forcing the cgo build path. Closes a divergence introduced in
9d50a6da (today's air hot-reload PR).
- **canvas/Dockerfile**
- `npm install` → `npm ci --include=optional` for lockfile-exact
installs that include platform-specific @tailwindcss/oxide native
binaries. Without these, `next build` fails with "Cannot read
properties of undefined (reading 'All')" on the
`@import "tailwindcss"` directive.
- **canvas/.dockerignore** (new)
- Excludes `node_modules` and `.next` so the Dockerfile's
`COPY . .` step doesn't clobber the freshly-installed container
node_modules with the host's (potentially stale or wrong-arch)
copy. This was the actual root cause of the canvas build break.
- **workspace-server/.gitignore**
- Adds `/tmp/` for air's live-reload build cache.
## Stage A verified
```
container status restart
postgres-1 Up (healthy) unless-stopped
redis-1 Up (healthy) unless-stopped
platform-1 Up (healthy, air-mode) unless-stopped
canvas-1 Up (healthy) unless-stopped
GET :8080/health → 200
GET :3000/ → 200
DB preserved: 407 workspace rows + 5 named personas
Persona mount: 28 dirs at /etc/molecule-bootstrap/personas
```
## Stage B — N/A
This is local-dev infrastructure only. None of these files ship to
SaaS tenants — production EC2s use `Dockerfile.tenant` + `ec2.go`
user-data, not docker-compose.
## Out of scope
- The decorative-but-broken `wget --spider` healthcheck has presumably
also been silently 404'ing on prod tenants. Ship a follow-up to
audit + fix the prod path; not done here to keep the PR scoped.
- Docker Desktop "Start at login" is a per-machine GUI setting that
must be toggled manually (Settings → General).
- The legacy heartbeat-all.sh that pinged 5 persona workspaces from
the host has been deleted (~/.molecule-ai/heartbeat-all.sh).
Per Hongming: each workspace is responsible for its own heartbeat.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "Open issue on failure" step was failing on every canary run
because Gitea 1.22.6 doesn't expose /api/v1/actions endpoints
(per memory reference_gitea_actions_log_fetch). The threshold check
called github.rest.actions.listWorkflowRuns() to count consecutive
prior failures and gate issue creation behind 3 reds — that call
ALWAYS 404'd on Gitea, breaking the entire alerting step.
Net effect: the canary's own self-alerting was broken, so the
underlying staging regression went unflagged for 38h+
(2026-05-07 02:30 UTC → 2026-05-08 17:34 UTC, every cron tick red,
zero issues filed).
Fix: drop the consecutive-failures threshold entirely. File a
sticky issue on the FIRST failure; comment-on-existing handles
deduplication for subsequent failures. The auto-close-on-success
step is unchanged.
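The file-on-first-failure flow can be sketched as below. This is a
minimal Python model, not the workflow's actual github-script code:
`Tracker`, `Issue`, `STICKY_TITLE` and `on_canary_result` are
hypothetical names, and the in-memory tracker stands in for the
issues API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

STICKY_TITLE = "Canary failing on staging"  # hypothetical sticky-issue title


@dataclass
class Issue:
    title: str
    comments: List[str] = field(default_factory=list)
    is_open: bool = True


class Tracker:
    """Hypothetical in-memory stand-in for the issue tracker."""

    def __init__(self) -> None:
        self.issues: List[Issue] = []

    def find_open(self, title: str) -> Optional[Issue]:
        for issue in self.issues:
            if issue.is_open and issue.title == title:
                return issue
        return None


def on_canary_result(tracker: Tracker, succeeded: bool, run_url: str) -> str:
    """Apply the sticky-issue rule to one canary run and report the action."""
    existing = tracker.find_open(STICKY_TITLE)
    if succeeded:
        # Auto-close-on-success: a green run closes the open sticky issue.
        if existing is not None:
            existing.is_open = False
            return "closed"
        return "noop"
    if existing is None:
        # First failure of a streak: file immediately, no threshold.
        tracker.issues.append(Issue(STICKY_TITLE, [run_url]))
        return "filed"
    # Later failures in the same streak: comment instead of filing again.
    existing.comments.append(run_url)
    return "commented"
```

A transient flake therefore produces exactly one "filed" followed by one
"closed", while a long red streak accumulates comments on a single issue.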
Why not a Gitea-compatible threshold (e.g., walk recent commit
statuses): comment-on-existing already gives ops a single
accumulating issue per regression streak. The threshold's purpose
was to avoid spamming on transient flakes — but with sticky issue
+ auto-close-on-green, transient flakes get one issue + one quick
close, which is fine signal. Filing on first failure is also
better UX: catches the regression in 30 min instead of 90 min.
Also: rewrote runURL from hardcoded https://github.com/... to
context.serverUrl so the link actually points at Gitea
(https://git.moleculesai.app) — was always broken on Gitea but
nobody noticed because the issue-filing step itself was broken.
Net: 21 insertions, 40 deletions. Removes WORKFLOW_PATH +
CONSECUTIVE_THRESHOLD env vars (no longer needed).
Tracked in: molecule-core#129 (failure mode 3 of 3)
Verification: yaml syntax-valid; no remaining github.rest.actions.*
calls; only github.rest.issues.* (all Gitea-supported per
memory feedback_persona_token_v2_scope).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes core#242 LOCAL surface. The PROD surface (CP user-data fetching
persona env files into tenant EC2's /etc/molecule-bootstrap/personas
via Secrets Manager) is filed as a follow-up.
WHAT THIS ADDS
Bind-mount on the platform service in docker-compose.yml:
${MOLECULE_PERSONA_ROOT_HOST:-${HOME}/.molecule-ai/personas}
→ /etc/molecule-bootstrap/personas (read-only)
Default source = ${HOME}/.molecule-ai/personas (the operator-host-mirrored
local dir populated by today's persona rotation work). Override via
MOLECULE_PERSONA_ROOT_HOST when running on a machine with a different
layout (CI runners, etc.).
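A minimal Python sketch of how the `${VAR:-default}` fallback resolves,
assuming the usual rule that an unset or empty variable takes the
default. `persona_root_host` is a hypothetical helper for illustration,
not code from this PR.

```python
from typing import Mapping


def persona_root_host(env: Mapping[str, str]) -> str:
    """Resolve the host-side persona directory the bind mount would use."""
    # Assumption: an unset HOME interpolates as an empty string.
    default = f"{env.get('HOME', '')}/.molecule-ai/personas"
    # ':-' semantics: fall back when the override is unset OR empty.
    return env.get("MOLECULE_PERSONA_ROOT_HOST") or default
```

For example, `persona_root_host({"HOME": "/home/dev"})` resolves to the
default directory under `/home/dev`, while setting
MOLECULE_PERSONA_ROOT_HOST to a non-empty path returns that path
unchanged.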
WHY READ-ONLY
workspace-server only reads persona env files; never writes back. The
read-only mount enforces that contract — a hostile plugin install path
can't tamper with the persona credentials it's about to consume.
WHY THIS PATH MATCHES PROD
/etc/molecule-bootstrap/personas is the same in-container path the
prod tenant EC2 will use. Same code path (org_import.go::loadPersonaEnvFile)
reads the same file regardless of mode — local-dev parity with prod
per feedback_local_must_mimic_production.
STAGE A VERIFICATION
- docker compose config: resolves to /Users/hongming/.molecule-ai/personas
correctly (28 persona dirs visible at source path)
- Persona env file shape verified: dev-lead's env contains GITEA_USER,
GITEA_USER_EMAIL, GITEA_TOKEN_SCOPES, GITEA_SSH_KEY_PATH,
MODEL_PROVIDER=claude-code, MODEL=opus (lead tier matches Hongming's
2026-05-08 mapping)
- Full handler test suite green (TestLoadPersonaEnvFile_HappyPath +
7 sibling tests pass; rejection tests still catch path traversal)
- Build clean
STAGE B SKIPPED (with justification per § Skip conditions)
This change is config-only (docker-compose.yml volume addition). The
prod tenant EC2s do NOT use docker-compose.yml — they use CP user-data
+ ec2.go's docker run script. So this PR has no prod blast radius.
Stage B (staging tenant probe) would be checking 'is the platform
using the new compose mount' on a SaaS tenant — and SaaS tenants
don't run docker compose. The actual prod-surface change is the
follow-up issue.
PROD SURFACE — FOLLOW-UP FILED
Tenant EC2 user-data needs to fetch persona env files from operator
host (or AWS Secrets Manager per the established
feedback_unified_credentials_file pattern) and stage them at
/etc/molecule-bootstrap/personas inside the workspace-server container.
Touches molecule-controlplane/internal/provisioner/ec2.go user-data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Partially closes core#115. Schema-only change; the apply-endpoint filter
logic that reads this column lands with core#123 (drift detector +
queue + apply endpoint, the deferred follow-up of core#113).
Default 'production' so existing customers (Reno-Stars + any future
tenant) are default-safe. Synthetic dogfooding workspaces opt INTO
'canary' explicitly.
CHECK constraint pins the closed value set ('canary' | 'production') —
the apply endpoint's filter relies on the database to reject anything
else, so a future operator typo in PATCH /workspaces/:id ({update_tier:
'canery'}) returns a constraint violation, not silent fan-out to
nobody.
Partial index on canary rows since the apply-endpoint query path
('apply this update only to canary tier first') hits canary much more
often than production, and the production set is the much larger
default.
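The column shape described above can be tried out with the sketch
below. It is an illustration only: it uses Python's bundled SQLite
rather than the real Postgres migration, and the table name
`workspaces`, the index name and the `set_tier` helper are assumptions.
Only the `update_tier` column and its two allowed values come from this
commit.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create a toy table, then apply an additive tier migration to it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE workspaces (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        INSERT INTO workspaces (name) VALUES ('existing-customer');

        -- Additive ALTER: existing rows pick up the 'production' default.
        ALTER TABLE workspaces
            ADD COLUMN update_tier TEXT NOT NULL DEFAULT 'production'
            CHECK (update_tier IN ('canary', 'production'));

        -- Partial index: only the small canary set is indexed.
        CREATE INDEX IF NOT EXISTS idx_workspaces_canary
            ON workspaces (id) WHERE update_tier = 'canary';
        """
    )
    return conn


def set_tier(conn: sqlite3.Connection, workspace_id: int, tier: str) -> bool:
    """Return False when the database rejects the tier value."""
    try:
        conn.execute(
            "UPDATE workspaces SET update_tier = ? WHERE id = ?",
            (tier, workspace_id),
        )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint turns a typo into a hard error.
        return False


def get_tier(conn: sqlite3.Connection, workspace_id: int) -> str:
    row = conn.execute(
        "SELECT update_tier FROM workspaces WHERE id = ?", (workspace_id,)
    ).fetchone()
    return row[0]
```

In this sketch `set_tier(conn, 1, "canery")` is rejected by the
constraint, whereas `set_tier(conn, 1, "canary")` succeeds and moves the
row into the partial index.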
WHAT THIS DOES NOT DO (lands with core#123)
- PATCH endpoint to flip a workspace to canary
- The apply endpoint that consults the column
- Tests that exercise canary-vs-production fan-out
Schema-only foundation; same pattern as core#113 (workspace_plugins).
PHASE 4 SELF-REVIEW
Correctness: No finding — IF NOT EXISTS guards, DEFAULT clause means
existing rows get 'production' on migration apply.
Readability: No finding — comment block documents the tier semantics
+ the deferral to core#123.
Architecture: No finding — additive ALTER, partial index for the
expected access pattern.
Security: No finding — no code path; column constraint reduces blast
radius of bad PATCH input.
Performance: No finding — partial index minimizes write amplification
on the production-default rows.
REFS
core#115 — this issue
core#123 — apply endpoint follow-up (will exercise this column)
core#113 — version subscription DB foundation (sibling pattern)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the post-Task-#176 self-review gap: the bearer-token + tenant-
slug header construction was duplicated across 7 raw-fetch callsites
in the canvas (lib/api.ts request(), uploads.ts × 2, and 5 Attachment*
components). Each callsite read NEXT_PUBLIC_ADMIN_TOKEN, attached
Authorization: Bearer manually, computed getTenantSlug locally
(three of them inline-redefined it from /lib/tenant!), and attached
X-Molecule-Org-Slug. A new poller / raw-fetch added without going
through this exact recipe silently 401s against workspace-server when
ADMIN_TOKEN is set on the server side — the bug shape called out in
the original task.
Adds platformAuthHeaders() to lib/api.ts as the single source of truth
and routes all 7 raw-fetch callsites through it. Removes 4 duplicate
local getTenantSlug() copies (Image, Video, Audio, PDF, TextPreview)
that were inline-redefining what /lib/tenant.ts already exports.
Also preserves the AttachmentTextPreview off-platform branch — when
isPlatformAttachment() is false, headers is {} (no bearer leakage to
third-party URLs).
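The helper's contract, as described here, can be modelled with the
Python sketch below. The real helper is TypeScript in lib/api.ts and
reads its inputs from the environment and /lib/tenant; the function
names and explicit parameters in this sketch are assumptions made so
the logic is self-contained.

```python
from typing import Dict, Optional


def platform_auth_headers(
    admin_token: Optional[str], tenant_slug: Optional[str]
) -> Dict[str, str]:
    """Build the auth headers for a request to the platform."""
    headers: Dict[str, str] = {}  # fresh dict on every call
    if admin_token:  # an empty string counts as unset
        headers["Authorization"] = f"Bearer {admin_token}"
    if tenant_slug:
        headers["X-Molecule-Org-Slug"] = tenant_slug
    return headers


def attachment_headers(
    is_platform_attachment: bool,
    admin_token: Optional[str],
    tenant_slug: Optional[str],
) -> Dict[str, str]:
    """Off-platform URLs get no headers, so the bearer never leaks."""
    if not is_platform_attachment:
        return {}
    return platform_auth_headers(admin_token, tenant_slug)
```

Passing the token and slug as arguments, rather than reading globals,
is a choice made for this sketch; it keeps each of the cases listed
under Tests a one-line exact-equality check.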
Tests:
- 6 unit tests in platform-auth-headers.test.ts covering: empty,
bearer-only, slug-only, both, empty-string-as-unset, fresh-object-
per-call. Mutation-tested: removing the bearer attach inside the
helper fails 2 of 6 tests immediately.
- All 1389 existing canvas vitest tests pass unchanged.
- npx tsc --noEmit clean.
- npm run build succeeds (canvas Next.js build).
Per feedback_assert_exact_not_substring: tests use exact toEqual()
equality, not substring/contains, so an extra-header bug also fails
the assertion. Per feedback_oss_design_philosophy: this is the
"plugin/abstract/modular/SSOT" move applied to the auth-header
construction surface — one helper, six call sites, no duplication.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:36:02 -07:00
72 changed files with 3239 additions and 425 deletions
echo " [debug] teams-list body (first 300 chars):" >&2
head -c 300 "$ORG_TEAMS_FILE" >&2; echo >&2
fi
if [ "$HTTP_CODE" != "200" ]; then
  echo "::error::GET /orgs/${OWNER}/teams returned HTTP $HTTP_CODE — token likely lacks read:org scope. Add a SOP_TIER_CHECK_TOKEN secret with read:organization scope at the org level."
  exit 1
fi
declare -A TEAM_ID
for T in $ELIGIBLE; do
ID=$(jq -r --arg t "$T" '.[] | select(.name==$t) | .id' < "$ORG_TEAMS_FILE" | head -1)
debug "probe: $U in team $T (id=$ID) → HTTP $CODE"
if [ "$CODE" = "200" ] || [ "$CODE" = "204" ]; then
  echo "::notice::approver $U is in team $T (eligible for $TIER)"
  OK="yes"
  break
fi
done
[ -n "$OK" ] && break
done
if [ -z "$OK" ]; then
  echo "::error::Tier $TIER requires approval from a non-author member of {$ELIGIBLE}. Got approvers: $APPROVERS — none of them satisfied team membership. Set SOP_DEBUG=1 to see per-probe HTTP codes."
  exit 1
fi
echo "::notice::sop-tier-check passed: $TIER, approver in {$ELIGIBLE}"
@@ -67,7 +67,7 @@ On-demand fits naturally with how agents work — an agent only needs to know ab
This is acceptable for MVP because:
- All workspaces are provisioned by the same platform on trusted infrastructure
- Docker network isolation (`molecule-monorepo-net`) limits who can reach workspace endpoints
- Docker network isolation (`molecule-core-net`) limits who can reach workspace endpoints
- The tool is self-hosted — the operator controls the network
**Known gap:** Once workspace A caches workspace B's URL, nothing stops A from calling B directly even after the hierarchy changes and A is no longer supposed to reach B. The cached URL remains valid until the container is restarted or the URL changes.
@@ -19,7 +19,7 @@ The provisioner is the platform component that deploys workspace containers and
## Docker Networking (Tier 1-3, Tier 4 uses host)
-All workspace containers join the `molecule-monorepo-net` Docker network. Containers are named `ws-{id[:12]}` (first 12 chars of workspace UUID). Two exported helpers in `provisioner` package provide the canonical naming:
+All workspace containers join the `molecule-core-net` Docker network. Containers are named `ws-{id[:12]}` (first 12 chars of workspace UUID). Two exported helpers in `provisioner` package provide the canonical naming:
@@ -38,7 +38,7 @@ This URL is pre-stored in both Postgres and Redis before the agent registers. Wh
**Why not use Docker-internal URLs?** In local dev, the platform runs on the host (not in Docker), so it cannot resolve Docker container hostnames. The ephemeral port mapping lets the A2A proxy reach agents via localhost. In production (platform in Docker), the Docker-internal URL (`http://ws-{id}:8000`) would work directly.
-**Workspace-to-workspace discovery:** When a workspace discovers another workspace (via `X-Workspace-ID` header on `GET /registry/discover/:id`), the platform returns the Docker-internal URL (`http://ws-{first12chars}:8000`) so containers can reach each other directly on `molecule-monorepo-net`. The internal URL is cached in Redis at provision time and also synthesized as a fallback if the cache misses (only for online/degraded workspaces).
+**Workspace-to-workspace discovery:** When a workspace discovers another workspace (via `X-Workspace-ID` header on `GET /registry/discover/:id`), the platform returns the Docker-internal URL (`http://ws-{first12chars}:8000`) so containers can reach each other directly on `molecule-core-net`. The internal URL is cached in Redis at provision time and also synthesized as a fallback if the cache misses (only for online/degraded workspaces).
For external HTTPS access (multi-host mode), Nginx on the host handles TLS termination and proxies to the container.
**Issue:** `getGrouped()` selector creates new objects every call. Not memoized.
**Impact:** Mitigated — `SecretsTab.tsx` wraps the call in `useMemo`, so no active re-render issues in the single consumer. The store-level fix (memoizing `getGrouped` itself) is optional but low priority now.
### 🟡 MEDIUM: Pre-commit Hook Verification
**Issue:** Pre-commit hook checks 'use client' on hook-using components but unclear if it actually fails on violations.
**Action:** Verify the hook is enforcing the rule correctly.
## Verified Findings
### Node Rendering ✅ (with notes)
- **Framework:** `@xyflow/react` (React Flow) — DOM-based, not SVG/Canvas
- **Node selection:** `aria-pressed` + border ring (`border-accent/70`) + shadow
- **Node drag:** React Flow native drag — mouse only, no keyboard alternative yet
- **Node resize:** `NodeResizer` component visible on selected card, keyboard-inaccessible
- **Status:** Accessible via `aria-label` on node cards — "Alpha Workspace workspace — online"
Canvas supports **three themes**: System (follows OS), Light, Dark. Controlled via `ThemeProvider` in `theme-provider.tsx` with preference persisted in `mol_theme` cookie.
**Key principle: Use semantic tokens, NOT raw zinc values for surfaces.**
### 1.1 Theme-Mutable Tokens (use these for surfaces)
Defined in `globals.css` via Tailwind v4 `@theme` block. Automatically flip between light/dark.
@@ -73,7 +73,7 @@ These are applied after CORS middleware on every response.
## 14. No Exposed Database Ports
-Postgres and Redis must not expose host ports. They communicate exclusively over the internal Docker network (`molecule-monorepo-net`). Use `docker compose exec` for direct access during development.
+Postgres and Redis must not expose host ports. They communicate exclusively over the internal Docker network (`molecule-core-net`). Use `docker compose exec` for direct access during development.
-"_comment":"Pin refs to release tags for reproducible builds. 'main' is OK while all repos are internal.",
+"_comment":"OSS surface registry — every repo listed here MUST be public on git.moleculesai.app. Layer-3 customer/private templates are NOT registered here; they are handled at provision-time via the per-tenant credential resolver (see internal#102 RFC). 'main' refs are pinned to tags before broad rollout.",
@@ -3,7 +3,7 @@ the workspace auth token is not yet on disk.
Prior behaviour (regressed in #287): `if expected:` skipped the auth
check when `get_token()` returned None, so any container on
-`molecule-monorepo-net` could read the full session log during the
+`molecule-core-net` could read the full session log during the
bootstrap window. The fix lifts the guard into transcript_auth.py for
testability.
"""