Commit Graph

121 Commits

Author SHA1 Message Date
core-devops 4f4604eabe feat(image): bake molecule-askpass binary for env-driven HTTPS git auth
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 2s
CI / Template validation (static) (pull_request) Successful in 28s
CI / Template validation (static) (push) Successful in 1m10s
CI / Adapter unit tests (push) Successful in 1m17s
CI / T4 tier-4 conformance (live) (push) Failing after 4s
CI / Adapter unit tests (pull_request) Successful in 1m11s
CI / Template validation (runtime) (pull_request) Successful in 3m41s
CI / Template validation (runtime) (push) Successful in 3m33s
CI / T4 tier-4 conformance (live) (pull_request) Successful in 3m44s
CI / validate (push) Failing after 1s
CI / validate (pull_request) Successful in 1s
Image-side companion to molecule-core PR #1525 (merge_sha 73a09443a086,
workspace-server applyAgentGitIdentity). PR #1525 sets GIT_ASKPASS=
/usr/local/bin/molecule-askpass on every workspace container so git can
authenticate to private HTTPS remotes from the persona env vars already
arriving via workspace_secrets — but until this binary ships in the
runtime image, git invocations error with 'exec: /usr/local/bin/
molecule-askpass: not found' (forward-only pin gap).

This is the same class as Hermes list_peers / codex #219: ws-server
changed contract, runtime image hadn't yet caught up. Closing the
image-side gap unblocks Dev-A/Dev-B (claude-code runtime) durable
HTTPS git auth on any private host.

Generic by design — no hardcoded hostnames, no vendor literals. Script
body is identical to workspace/scripts/molecule-askpass in molecule-core
and the parallel external workspace template repos, so any deployer
can fork this template and use it against their own git host without
editing.
2026-05-18 15:05:58 -07:00
devops-engineer 9c2ad2562f Merge pull request 'fix(claude-code): pin publish-image build/push job to Linux publish runner (internal#512)' (#28) from fix/publish-image-pin-linux-publish-runner into main
publish-image / Resolve runtime version (push) Successful in 7s
CI / Template validation (static) (push) Successful in 34s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 5s
CI / Adapter unit tests (push) Successful in 35s
publish-image / Build & push workspace-template-claude-code image (push) Successful in 6m17s
CI / Template validation (runtime) (push) Successful in 4m39s
CI / T4 tier-4 conformance (live) (push) Successful in 4m23s
CI / validate (push) Successful in 2s
2026-05-18 11:31:09 +00:00
core-devops d86c6b7943 fix(claude-code): pin publish-image build/push job to Linux publish runner
CI / validate (push) Blocked by required conditions
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
CI / Adapter unit tests (push) Successful in 1m9s
CI / Template validation (static) (pull_request) Successful in 1m6s
CI / Template validation (static) (push) Successful in 1m7s
CI / Adapter unit tests (pull_request) Successful in 1m13s
CI / T4 tier-4 conformance (live) (push) Failing after 5s
CI / Template validation (runtime) (push) Successful in 1m44s
CI / Template validation (runtime) (pull_request) Successful in 1m44s
CI / T4 tier-4 conformance (live) (pull_request) Successful in 4m39s
CI / validate (pull_request) Successful in 2s
The publish (docker build + ECR push) job used `runs-on: ubuntu-latest`.
That label is advertised by BOTH the Linux self-hosted runners and the
Windows/WSL `hongming-pc-runner-*`. When the job lands on the Windows
runner, the ECR-login step `aws ecr get-login-password | docker login
--password-stdin` fails with "Failed to initialize: protocol not
available", so the image is never published. Non-deterministic placement,
not a transient flake.

Pin the build/push job to `runs-on: publish` (dedicated Linux-only
runners molecule-runner-publish-1/2). The file-read-only resolve-version
job stays on ubuntu-latest. Mirrors molecule-core prior art
(publish-workspace-server-image.yml / publish-runtime.yml /
publish-canvas-image.yml) and the codex sibling fix.

Class defect tracked in molecule-ai/internal#512; reference fix
molecule-ai/molecule-ai-workspace-template-codex#9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 02:54:30 -07:00
devops-engineer bbc2daea4a Merge pull request 'feat(claude-code): T4 host-root escalation leg + real tier-4 conformance gate (RFC internal#456 §9-11)' (#25) from feat/t4-escalation-leg-claude-code into main
CI / Template validation (static) (push) Successful in 1m43s
publish-image / Resolve runtime version (push) Successful in 13s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 13s
CI / Adapter unit tests (push) Successful in 1m50s
CI / Template validation (runtime) (push) Successful in 2m11s
CI / T4 tier-4 conformance (live) (push) Successful in 2m9s
publish-image / Build & push workspace-template-claude-code image (push) Successful in 2m40s
CI / validate (push) Successful in 1s
2026-05-16 20:06:37 +00:00
core-devops 12dd60413d feat(claude-code): T4 host-root escalation leg + real tier-4 conformance gate (RFC internal#456 §9-11)
CI / validate (push) Blocked by required conditions
CI / Template validation (static) (push) Successful in 2m5s
CI / Adapter unit tests (push) Successful in 1m57s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 17s
CI / Template validation (static) (pull_request) Successful in 1m44s
CI / Adapter unit tests (pull_request) Successful in 1m49s
CI / Template validation (runtime) (push) Successful in 12m24s
CI / T4 tier-4 conformance (live) (push) Failing after 12m20s
CI / Template validation (runtime) (pull_request) Successful in 9m27s
CI / T4 tier-4 conformance (live) (pull_request) Successful in 8m59s
CI / validate (pull_request) Successful in 16s
T4 currently ships only the provisioner privileged-container shape;
the in-image uid-1000 agent has NO wired path to host root inside
--privileged --pid=host -v /:/host (--privileged grants caps to root,
not uid-1000; root:docker 0660 docker.sock unusable). This adds the
ADDITIVE escalation leg, preserving the uid-1000 + agent-owned-token
contract:

- Dockerfile: bake sudo + util-linux(nsenter) + docker.io CLI;
  /etc/sudoers.d/agent-t4 `agent ALL=(ALL) NOPASSWD:ALL` (0440,
  visudo-validated at build); `agent` in `docker` group. useradd
  -u 1000 + `exec gosu agent` UNCHANGED — agent stays uid-1000.
- entrypoint.sh: document the agent-owned-token half of the §10
  atomic co-sequencing contract on the existing `chown -R agent
  /configs` (token ownership NOT regressed).
- ci.yml: new `t4-conformance` job — NOT a string-match. Builds the
  real image, runs it under the EXACT controlplane tier-4 flags, and
  asserts on the RUNNING container, atomically: (a) the uid-1000
  agent attains host root (sudo nsenter --target 1 + host-fs
  write/readback through /host) AND (b) /configs/.auth_token
  owner_uid==1000. Wired into the required `validate` aggregator and
  fails closed (no skip except fork-PR short-circuit).

RFC internal#456 §9-11 / PR#474. Atomic per §10: uid-1000 enforcement
and the escalation leg ship in this one image revision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 11:44:43 -07:00
devops-engineer c93214e4e0 Merge pull request 'feat(claude-code): route Kimi K2.6 to api.kimi.com/coding per official spec' (#24) from feat/kimi-k2.6-claude-code-routing into main
publish-image / Resolve runtime version (push) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 15s
CI / Template validation (static) (push) Successful in 1m40s
CI / Adapter unit tests (push) Successful in 1m43s
CI / Template validation (runtime) (push) Successful in 12m22s
publish-image / Build & push workspace-template-claude-code image (push) Successful in 13m22s
CI / validate (push) Successful in 10s
2026-05-16 12:50:17 +00:00
infra-runtime-be 66e3b7edb3 feat(claude-code): route Kimi K2.6 to api.kimi.com/coding per official spec
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 16s
CI / Adapter unit tests (push) Successful in 1m23s
CI / Template validation (static) (push) Successful in 1m27s
CI / Adapter unit tests (pull_request) Successful in 1m25s
CI / Template validation (static) (pull_request) Successful in 1m30s
CI / Template validation (runtime) (push) Successful in 10m27s
CI / Template validation (runtime) (pull_request) Successful in 9m52s
CI / validate (pull_request) Successful in 6s
CI / validate (push) Successful in 5s
Kimi (Kimi-For-Coding / K2.6) was structurally unreachable from the
claude-code runtime: the `kimi-` model prefix matched the `moonshot`
provider, which set ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
and projected KIMI_API_KEY -> ANTHROPIC_AUTH_TOKEN. Both are wrong per
kimi.com's official Claude Code integration doc
(kimi.com/code/docs/en/third-party-tools/other-coding-agents.html):
  - the sk-kimi-* key (KIMI_API_KEY in SSOT) authenticates ONLY against
    https://api.kimi.com/coding/ — the legacy api.moonshot.ai/anthropic
    surface 401s it (invalid_authentication_error);
  - that gateway authenticates with the x-api-key header, which the
    Anthropic SDK / claude CLI emits from ANTHROPIC_API_KEY, NOT the
    Bearer ANTHROPIC_AUTH_TOKEN.

So a Kimi pick on claude-code 401'd every LLM call.

Fix (config + minimal adapter, scoped to this template — adapter.py and
config.yaml are template-local, COPY'd in the Dockerfile; zero blast
radius on other runtimes):

- config.yaml: repoint the existing kimi- provider entry (renamed
  moonshot -> kimi-coding) to base_url https://api.kimi.com/coding/
  (trailing slash, per the doc) and add a new optional per-provider
  field `auth_token_env: ANTHROPIC_API_KEY` so the boot-time vendor-key
  projection writes KIMI_API_KEY into ANTHROPIC_API_KEY (x-api-key)
  instead of the default ANTHROPIC_AUTH_TOKEN (Bearer). Renaming the
  existing entry (vs adding a parallel one) keeps the kimi- model-prefix
  matcher working with the least change; still 7 providers total.
- config.yaml: add a selectable "Kimi K2.6" model catalog entry
  (id kimi-for-coding — the gateway's own served-model name, mirroring
  the proven OpenClaw kimi-for-coding route; the gateway routes to K2.6
  regardless of the wire model id). kimi-k2.5 / kimi-k2 retained as
  aliases hitting the same gateway for back-compat.
- adapter.py: _normalize_provider parses the optional `auth_token_env`
  (default ANTHROPIC_AUTH_TOKEN — preserves MiniMax/GLM/DeepSeek
  behavior bit-for-bit); _project_vendor_auth projects into that
  per-provider target and is idempotent on it (explicit operator value
  still wins).

Wire-verified before commit: POST https://api.kimi.com/coding/v1/messages
with x-api-key=<SSOT KIMI_API_KEY> + anthropic-version + claude-cli UA
-> HTTP 200, model=kimi-for-coding, real completion. The shipped routing
produces exactly this wire shape.

Tests: added 4 tests (Kimi -> ANTHROPIC_API_KEY projection, operator
override idempotency, _normalize_provider auth_token_env parse,
prevalidate routing matrix incl. kimi-for-coding); updated the
moonshot-named fixtures/assertions to the new kimi-coding contract.
Full suite 85 passed.
2026-05-16 04:56:49 -07:00
devops-engineer 5bc87ea75d Merge pull request 'ci: port secret-scan + publish-image workflows to .gitea/ (T4 close-out)' (#22) from feat/port-secret-scan-and-publish-image-workflows into main
publish-image / Resolve runtime version (push) Successful in 21s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 23s
CI / Template validation (static) (push) Successful in 1m52s
CI / Adapter unit tests (push) Successful in 1m57s
publish-image / Build & push workspace-template-claude-code image (push) Successful in 7m12s
CI / Template validation (runtime) (push) Successful in 9m53s
CI / validate (push) Successful in 25s
2026-05-15 23:28:58 +00:00
infra-sre 73827045bc ci: port secret-scan + publish-image workflows to .gitea/ (T4 close-out) (#22)
Co-authored-by: infra-sre <infra-sre@agents.moleculesai.app>
Co-committed-by: infra-sre <infra-sre@agents.moleculesai.app>
2026-05-15 23:23:47 +00:00
infra-sre 38353e9a4f ci: port secret-scan + publish-image workflows to .gitea/ (T4 close-out)
CI / Adapter unit tests (push) Successful in 1m31s
CI / Template validation (static) (push) Successful in 1m34s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 8s
CI / Template validation (static) (pull_request) Successful in 1m20s
CI / Adapter unit tests (pull_request) Successful in 1m21s
CI / Template validation (runtime) (pull_request) Successful in 14m10s
CI / Template validation (runtime) (push) Successful in 14m54s
CI / validate (pull_request) Successful in 10s
CI / validate (push) Successful in 7s
The .github/workflows/ tree is silently shadowed on this repo because
.gitea/workflows/ exists (reference_molecule_core_actions_gitea_only) —
so both files were never firing on Gitea Actions:

- Secret scan / Scan diff for credential-shaped strings is a required
  status-check on main branch protection; until now it has been satisfied
  only via a compensating signed POST /statuses/{SHA}. Porting restores
  the gate.
- publish-image was dormant, so the claude-code template image stayed
  stale and never rebuilt against new runtime versions. After this port
  the cascade signal (molecule-core/publish-runtime.yml git-pushes
  .runtime-version to main) trips on: push: branches: [main] here and
  pushes ECR :latest + :sha-<7> to
  153263036946.dkr.ecr.us-east-2.amazonaws.com/molecule-ai/workspace-template-claude-code.

Both files copy the canonical Gitea-ported shape verbatim from
molecule-core and molecule-ai-workspace-template-hermes respectively
(only repo-specific identifiers — image name + descriptions — adjusted).
Gitea 1.22.6 hostile-shape constraints already baked in:
  - no workflow_dispatch.inputs (feedback_gitea_workflow_dispatch_inputs_unsupported)
  - no cross-repo uses: (feedback_gitea_cross_repo_uses_blocked)
  - no on.push.paths: (feedback_path_filtered_workflow_cant_be_required)
  - GITHUB_SERVER_URL pinned at workflow level
    (feedback_act_runner_github_server_url)

T4 close-out — Hongming authorized direct merge.
2026-05-15 15:44:47 -07:00
devops-engineer 8bcc19c38e fix(claude-code): chown idempotency + settings.json stub + CLAUDE.md T4 note (#21)
CI / Template validation (static) (push) Successful in 1m21s
CI / Adapter unit tests (push) Successful in 1m28s
CI / Template validation (runtime) (push) Successful in 8m19s
CI / validate (push) Successful in 3s
T4-tier workspace owner permission regression on /home/agent/.claude/ ownership.

Entrypoint now creates well-known subdirs idempotently and runs chown unconditionally. Stubs ~/.claude/settings.json so introspection works. Adds T4 CLAUDE.md note documenting host-control semantics + new MCP tool surface (get_runtime_identity / update_agent_card — tools land via molecule-core monorepo route, not this template).

CI: 8/8 green.
Compensating Secret-scan status posted by core-devops review #3874 (workflow file only present in .github/, which is shadowed by .gitea/ on this repo). Follow-up: port secret-scan.yml to .gitea/workflows/.

Reviewed-by: core-devops
Merged-by: devops-engineer (BP merge whitelist)
2026-05-15 21:47:08 +00:00
fullstack-engineer 47263db7ad fix(claude-code): chown idempotency + settings.json stub + T4 ownership note
CI / Template validation (static) (push) Successful in 1m12s
CI / Adapter unit tests (push) Successful in 1m19s
CI / Adapter unit tests (pull_request) Successful in 1m16s
CI / Template validation (static) (pull_request) Successful in 1m18s
CI / Template validation (runtime) (push) Successful in 6m15s
CI / Template validation (runtime) (pull_request) Successful in 5m24s
CI / validate (push) Successful in 5s
CI / validate (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Manual scan — no credential-shaped strings in diff. Workflow exists only at .github/workflows/secret-scan.yml; this repo uses .gitea/workflows/ so workflow does not fire. Filed by core-devops review #3874 with audit trail.
Closes the three template-side gaps in the T4-tier workspace owner
permission report:

1. entrypoint.sh chown idempotency.
   The chown of /home/agent/.claude was previously only fired inside
   the `if [ -d /root/.claude/sessions ]` guard. On first boot that's
   harmless — entrypoint creates the dir and the chown lands. But on
   second boot with a populated host volume (which T4 always has,
   because the workspace dir is bind-mounted for persistence) the dir
   may already be root-owned from a prior boot or from a newer
   claude-code release writing subdirs the entrypoint didn't pre-create.
   Result: uid-1000 agent EPERMs on every settings/session write,
   surfaced to the canvas as a generic Bash "permission restrictions"
   failure. Fix: pre-create sessions/ and session-env/, and run the
   chown unconditionally — idempotent + fast on small trees.

2. ~/.claude/settings.json stub.
   The Dockerfile + entrypoint never created this file. The agent's
   `cat ~/.claude/settings.json` correctly reported "No such file or
   directory" and the agent then assumed the workspace had no operating
   mode. Stub a minimal informational settings.json documenting that
   permission_mode='bypassPermissions' is the canonical mode (set
   programmatically in claude_sdk_executor.py — the file is NOT the
   source of truth, the SDK kwargs are). Idempotent: existing file is
   left alone.

3. CLAUDE.md — T4 ownership documentation.
   Add a "Workspace ownership tier — T4" section so the agent knows
   it has full host control and how to recover from EPERM if the
   ownership ever drifts. Add a "Knowing your own model" section
   pointing at the new `get_runtime_identity` MCP tool (shipped in
   molecule-ai-workspace-runtime 0.1.18) and an "Editing your own
   agent_card" section pointing at the new `update_agent_card` MCP
   tool.

Test plan:
- sh -n + bash -n on entrypoint.sh → syntax OK.
- Idempotency probe: ran the chown/mkdir/stub fragment twice on a
  scratch tmpdir; second run does NOT overwrite a tampered
  settings.json, dirs already-existing is a `mkdir -p` no-op.
- pytest tests/ → 81 passed (baseline maintained).

Follow-up:
- Bump .runtime-version to 0.1.18 in a follow-up PR after the runtime
  wheel hits PyPI via the publish workflow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 14:28:57 -07:00
claude-ceo-assistant 43a86d44da Merge pull request 'fix(ci): port CI/validate to .gitea/ + inline (closes main-red)' (#17) from infra/main-red-fix-ci-validate into main
CI / Template validation (static) (push) Successful in 1m29s
CI / Adapter unit tests (push) Successful in 1m47s
CI / Template validation (runtime) (push) Successful in 8m55s
CI / validate (push) Successful in 4s
2026-05-11 19:53:44 +00:00
core-devops c2a0bdea96 fix(ci): port CI/validate to .gitea/ + inline (closes main-red)
CI / Template validation (static) (push) Successful in 1m7s
CI / Adapter unit tests (push) Successful in 1m26s
CI / Template validation (static) (pull_request) Successful in 1m10s
CI / Adapter unit tests (pull_request) Successful in 1m12s
CI / Template validation (runtime) (pull_request) Successful in 6m10s
CI / Template validation (runtime) (push) Successful in 7m35s
CI / validate (push) Successful in 7s
CI / validate (pull_request) Successful in 5s
Class-A root fix for internal#326 (main-red sweep). The .github/ci.yml
used cross-repo `uses:` to molecule-ci/.github/workflows/validate-workspace-template.yml@main,
which Gitea 1.22.6 rejects (DEFAULT_ACTIONS_URL=github → 404, per
feedback_gitea_cross_repo_uses_blocked). Because Gitea 1.22.6 reads
.github/ as a fallback when .gitea/ is absent
(reference_per_repo_gitea_vs_github_actions_dir), the .github/ workflow
was firing and failing at parse time in 1s.

Fix: inline the validate-workspace-template logic directly. The canonical
validator in molecule-ci already self-clones into the runner via
`git clone --depth 1 https://git.moleculesai.app/molecule-ai/molecule-ci.git`,
so the inline port preserves single-source-of-truth — every CI run still
fetches the canonical validator script fresh.

Shape preserved from the source workflow:
  - validate-static (always runs, including fork PRs): secret-scan +
    --static-only validator
  - validate-runtime (skipped on fork PRs for security): pip install
    requirements.txt + import adapter.py + docker build smoke test
  - validate (aggregator): emits the single `validate` check name that
    historically gates branch protection
  - tests: per-repo adapter unit tests (preserved verbatim from
    .github/ci.yml)

Gitea 1.22.6 compat additions:
  - env.GITHUB_SERVER_URL=https://git.moleculesai.app (workflow-level
    belt-and-suspenders per feedback_act_runner_github_server_url)
  - permissions: contents: read (defense-in-depth on GITHUB_TOKEN scope,
    matching the source workflow_call's permission posture)
  - actions/checkout pinned to SHA (v6.0.2) per molecule-core canonical
    port style

The .github/ original is preserved verbatim for future GitHub-mirror
compatibility (no behaviour change there).

Refs: internal#326
2026-05-11 12:30:26 -07:00
claude-ceo-assistant d2585700f5 fix(adapter): mirror provider alias map onto YAML path (#12)
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
CI / Adapter unit tests (push) Successful in 1m21s
CI / validate (push) Failing after 2m9s
[FORCE-MERGE AUDIT — §SOP-7] hongming chat-go ("do both") in transcript ~03:54 UTC 2026-05-10. Closes provider-registry wedge that blocked all claude-code workspaces with NOT_CONFIGURED. Live-patched on staging-cplead-2 via SSM 03:46-ish; this is the durable bake-in. 81 tests pass + 3 new regression tests.
2026-05-10 03:51:28 +00:00
Claude CEO Assistant aaa2a79e81 fix(adapter): alias-map yaml_provider for runtime-wheel default
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 1m21s
CI / Adapter unit tests (pull_request) Successful in 1m18s
CI / validate (pull_request) Failing after 2m15s
CI / validate (push) Failing after 5m36s
The molecule-runtime wheel auto-derives `runtime_config.provider =
"anthropic"` from its default model slug `anthropic:claude-opus-4-7`
when the per-workspace YAML omits both fields. The adapter receives
that derived `anthropic` as `yaml_provider` and rejects it because the
providers registry only knows `anthropic-oauth` / `anthropic-api`. The
existing alias map (`anthropic` → `anthropic-api`,
`claude-code` → `anthropic-oauth`) was applied only on the env-var
path; mirroring it on the YAML path resolves the wheel default to a
registered provider name.

Symptom on staging-cplead-2 (2026-05-09): every workspace booted with
`configuration_status=not_configured` and
`configuration_error="ValueError: claude-code adapter: workspace
config picks provider='anthropic' but it is not in the providers
registry"`. Live-patched the running cp-lead workspaces to confirm the
fix; this commit lands the durable change in the template repo so
freshly-provisioned workspaces don't repeat the wedge.

Tests:
  - test_yaml_provider_anthropic_is_aliased_to_anthropic_api (regression)
  - test_yaml_provider_claude_code_is_aliased_to_anthropic_oauth (symmetry)
  - test_yaml_provider_unknown_passes_through_for_actionable_error
    (guards the silent-fallback bug from #180; unaliased unknowns must
    still reach _resolve_provider so it raises with the helpful
    "Known providers: ..." message)

All 81 tests pass locally.

Refs: staging-cplead-2 incident 2026-05-09
Live-patched workspaces: 941a929e, 99de7cab, a8ba9dc8, a00e74df

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:46:02 -07:00
claude-ceo-assistant 4b038f2947 Merge pull request 'fix(adapter): map persona-friendly slugs (claude-code, anthropic) to registry names' (#10) from fix/dispatch-alias-map-followup into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 37s
CI / Adapter unit tests (push) Failing after 12m10s
CI / validate (push) Failing after 17m11s
2026-05-08 21:24:27 +00:00
claude-ceo-assistant 8adc3576fd fix(adapter): map persona-friendly slugs (claude-code, anthropic) to registry names
CI / Adapter unit tests (push) Successful in 1m46s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 50s
CI / Adapter unit tests (pull_request) Successful in 2m16s
CI / validate (pull_request) Successful in 6m18s
CI / validate (push) Failing after 18m56s
Phase 4 verification surfaced a follow-up edge case the initial fix missed:
the persona env files use friendlier slugs than the registry's canonical names:
  * MODEL_PROVIDER=claude-code  -> anthropic-oauth (Claude Code subscription)
  * MODEL_PROVIDER=anthropic    -> anthropic-api  (direct Anthropic API key)

Without an alias map, a lead workspace's MODEL_PROVIDER=claude-code env
fell through the slug-detection path; when the YAML didn't pin a
provider, the model-prefix matcher saw MODEL=MiniMax-M2.7 and routed the
lead to MiniMax — even though CLAUDE_CODE_OAUTH_TOKEN was clearly the
intended auth path.

Add _PROVIDER_SLUG_ALIASES with the two operator-facing slugs that don't
match registry names verbatim. The alias map is consulted before the
slug-vs-legacy detection, so claude-code now resolves to anthropic-oauth
and the lead boots through OAuth as intended.

Tests
-----
+ test_persona_env_lead_with_minimax_model_routes_via_oauth — lock in
  the alias-map behavior so a future contributor can't silently re-introduce
  the lead-mis-routed-to-MiniMax bug.
+ test_anthropic_alias_resolves_to_anthropic_api — covers the second
  alias path.

Updated test_persona_env_lead_claude_code_resolves_correctly to assert
the new (correct) behavior: provider == 'anthropic-oauth', not None.

Full adapter suite: 78/78 pass.
2026-05-08 14:23:59 -07:00
claude-ceo-assistant 134ba7f82c fix(adapter): honor MODEL/MODEL_PROVIDER env (persona-env convention) (#9)
Secret scan / Scan diff for credential-shaped strings (push) Successful in 16s
CI / Adapter unit tests (push) Failing after 37s
CI / validate (push) Failing after 50s
Fix 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces stuck at SDK initialize timeout because MODEL_PROVIDER=minimax was read as model id instead of provider slug.
2026-05-08 21:12:21 +00:00
claude-ceo-assistant 1742b60e62 fix(adapter): honor MODEL/MODEL_PROVIDER env (persona-env convention)
CI / Adapter unit tests (push) Successful in 1m40s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
CI / Adapter unit tests (pull_request) Failing after 52s
CI / validate (push) Failing after 2m17s
CI / validate (pull_request) Successful in 13m19s
Fixes the 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces (minimax tier)
stuck in degraded after /org/import, every chat hanging on
`Control request timeout: initialize`.

Root cause
----------
The persona env files (`~/.molecule-ai/personas/<name>/env`) declare a TWO-
variable convention:
  - MODEL          = model id   ("MiniMax-M2.7-highspeed")
  - MODEL_PROVIDER = provider slug ("minimax")

The runtime wheel's legacy `workspace/config.py` interprets MODEL_PROVIDER
as the *model id* — a name chosen long before there was a separate MODEL
env. With both set, the legacy code reads MODEL_PROVIDER="minimax" into
runtime_config.model. The literal string "minimax" doesn't match any
registry prefix (`minimax-` requires a hyphen suffix), falls through to
providers[0] (anthropic-oauth), the auth check fails on the absent
CLAUDE_CODE_OAUTH_TOKEN, the claude CLI launches anyway, and the SDK's
`query.initialize()` 60s control timeout fires.

The brief hypothesised `claude_sdk_executor.py` lacked dispatch logic.
Phase 1 evidence: dispatch ALREADY exists in adapter.py — model -> provider
-> base_url + auth_env routing was correctly built for #180. The bug was
upstream: MODEL_PROVIDER's name collision with the persona-env convention
silently corrupted the picked model BEFORE adapter.py saw it.

Fix
---
New helper `_resolve_model_and_provider_from_env` reconciles env vars
against YAML inside adapter.setup() and create_executor():

  1. MODEL env -> picked_model (authoritative when set).
  2. MODEL_PROVIDER env -> explicit_provider IFF the value matches a
     registered provider name. Backward-compat: if MODEL is unset and
     MODEL_PROVIDER doesn't match a registered slug, treat it as a
     legacy model id (canvas Save+Restart pre-this-fix).
  3. YAML runtime_config.{model,provider} fills any field env didn't
     supply.

Contained in the template repo per the brief's scope guidance — does NOT
touch the runtime wheel's workspace/config.py (which would need a separate
molecule-core PR), and does NOT change the persona-env dispatch policy
(Phase 2 mapping 2026-05-08).

Tests
-----
Eleven new cases in tests/test_env_model_provider_dispatch.py covering:
  - persona-env shape (minimax, GLM, lead claude-code) -> correct model + slug
  - legacy MODEL_PROVIDER-as-model-id shape still works
  - env wins over YAML
  - YAML fallback when env unset
  - whitespace/empty defensive handling
  - case-insensitive provider slug matching

Full adapter test suite: 76/76 pass.

Verification path
-----------------
After image rebuild + workspace re-provision, ws-* containers will boot
with provider=minimax (not anthropic-oauth), ANTHROPIC_BASE_URL set to
https://api.minimax.io/anthropic, MINIMAX_API_KEY projected onto
ANTHROPIC_AUTH_TOKEN, and the SDK init handshake succeeding.

Refs: task #181, brief 2026-05-08, related #180 (#7 in this repo)
2026-05-08 14:11:42 -07:00
claude-ceo-assistant 56a045f38e Merge pull request 'fix(adapter,tests): isolate _load_providers tests from multi-path lookup' (#8) from fix/load-providers-tests-isolate-multipath into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 3s
CI / Adapter unit tests (push) Successful in 54s
CI / validate (push) Successful in 3m6s
2026-05-08 20:28:14 +00:00
dev-lead 291f356dab fix(adapter,tests): isolate _load_providers tests from multi-path lookup
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
CI / Adapter unit tests (push) Successful in 1m1s
CI / Adapter unit tests (pull_request) Successful in 1m2s
CI / validate (push) Successful in 3m23s
CI / validate (pull_request) Successful in 3m22s
The 5 _load_providers tests were single-path-only: they wrote a
config.yaml to tmp_path and called _load_providers(str(tmp_path)),
expecting the lookup to read tmp_path/config.yaml.

After the multi-path fix in #7, _load_providers also checks:
  1. _CANONICAL_ADAPTER_DIR/config.yaml  (= /opt/adapter/config.yaml)
  2. _TEMPLATE_DIR/config.yaml           (= dirname(__file__)/config.yaml)
  3. ${config_path}/config.yaml          (the test's tmp_path)

Path 2 finds the repo's bundled config.yaml on the test runner's
disk before path 3 — the tests then see the bundled providers list
instead of the test's expected behavior.

Two surface changes:

  1. adapter.py — extract `os.path.dirname(os.path.abspath(__file__))`
     into a module-level `_TEMPLATE_DIR` constant, mirroring
     `_CANONICAL_ADAPTER_DIR`. Production behavior identical
     (resolved once at import). Tests can monkeypatch the module
     attribute to redirect the path-2 lookup.

  2. tests/test_adapter_prevalidate.py — 5 _load_providers tests
     monkeypatch `_CANONICAL_ADAPTER_DIR` and `_TEMPLATE_DIR` to a
     non-existent tmp subdir, isolating the test to the workspace
     config_path branch they always meant to test.

The 6th _load_providers test (`test_load_providers_parses_yaml_and_normalizes`)
already passed because path 2 returns 7 providers and that's what
that test expects — left unchanged.

Verification:
  pytest tests/                                 65/65 PASS
  pytest tests/test_adapter_prevalidate.py -k load_providers
                                                  6/6 PASS

Closes molecule-core#129 follow-up — the unit tests were the last
red on the template repo's CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:27:56 -07:00
claude-ceo-assistant 91022654cd Merge pull request 'fix(adapter): restore multi-path _load_providers (closes molecule-core#129 failure mode #1)' (#7) from fix/load-providers-multipath-restore into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 5s
CI / Adapter unit tests (push) Failing after 1m5s
CI / validate (push) Successful in 3m9s
2026-05-08 20:12:37 +00:00
dev-lead b96a6d2569 fix(adapter): restore multi-path _load_providers (canonical + template + workspace)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (pull_request) Failing after 59s
CI / Adapter unit tests (push) Failing after 1m6s
CI / validate (pull_request) Successful in 3m22s
CI / validate (push) Successful in 3m21s
The template's _load_providers had only ONE lookup path
(${config_path}/config.yaml = /configs/config.yaml) — which is the
per-workspace override, NOT the template's bundled provider registry.
Every MiniMax/GLM/Kimi/DeepSeek model resolved to anthropic-oauth
and crashed at first LLM call:

  None of CLAUDE_CODE_OAUTH_TOKEN set for model=MiniMax-M2.7-highspeed
    (provider=anthropic-oauth) — the adapter will fail on the first
    LLM call with AuthenticationError
  ...
  probed_cli_error='Not logged in · Please run /login'

Canary chronic red 38h+ on 2026-05-07/08 traced to this. The fix
that the May-4 image already had bundled — a 4-path lookup with
canonical /opt/adapter/config.yaml + __file__-adjacent + workspace
override + builtins fallback — was never on Gitea main, so post-
suspension rebuilds dropped it. Restoring here.

Resolution order:
  1. /opt/adapter/config.yaml (canonical, provisioner-contracted)
  2. dirname(__file__)/config.yaml (covers /app/config.yaml from
     Dockerfile #6 as well as dev/test imports)
  3. ${config_path}/config.yaml (per-workspace override)
  4. _BUILTIN_PROVIDERS (oauth + anthropic-api fallback)

Verified locally: ps=_load_providers('/nonexistent') returns the
7 providers from /tmp/cctmpl/config.yaml via path 2 (the
__file__-adjacent lookup). Without the fix, returns 2 (builtins).

Closes molecule-core#129 failure mode #1 (the original "Agent error
(Exception)" 38h chronic red).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:12:24 -07:00
claude-ceo-assistant 2edd78c154 Merge pull request 'fix(dockerfile): bundle config.yaml into /app so providers registry loads' (#6) from fix/dockerfile-bundle-config-yaml into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
CI / Adapter unit tests (push) Successful in 57s
CI / validate (push) Successful in 3m14s
2026-05-08 18:19:10 +00:00
dev-lead ad4241cebb fix(dockerfile): bundle config.yaml into /app so providers registry loads
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 55s
CI / Adapter unit tests (pull_request) Successful in 1m0s
CI / validate (pull_request) Successful in 3m10s
CI / validate (push) Successful in 3m10s
The adapter's _load_providers tries 4 paths in order:
  1. /opt/adapter/config.yaml  — provisioner-managed (currently missing)
  2. os.path.dirname(__file__)/config.yaml  — alongside adapter.py
  3. ${WORKSPACE_CONFIG_PATH}/config.yaml  — workspace overrides
  4. _BUILTIN_PROVIDERS  — oauth + anthropic-api only

On this template's docker image /opt/adapter/ is never populated by
the platform provisioner (verified 2026-05-08 by SSM-exec on a live
canary's workspace EC2: ls /opt/adapter/ → no such file or directory).
That makes path 2 — the dir adjacent to /app/adapter.py — the
load-bearing one for production workloads.

The Dockerfile copies adapter.py + claude_sdk_executor.py + scripts/
+ entrypoint.sh + __init__.py into /app, but it does NOT copy
config.yaml. So /app/config.yaml doesn't exist, path 2 fails, and
the adapter falls all the way through to _BUILTIN_PROVIDERS.

_BUILTIN_PROVIDERS contains only anthropic-oauth + anthropic-api.
Every MiniMax / GLM / Kimi / DeepSeek model id has no matching
prefix in those two, so _resolve_provider returns providers[0] =
anthropic-oauth (per "unknown ids fall back to providers[0]" rule).
That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for
non-OAuth tenants. The claude CLI fails with:
  Not logged in · Please run /login

…which surfaces in the A2A response as "Agent error (Exception)".

This is the root cause of:
  • Canary chronic red since 2026-05-07 02:30 UTC (38h+ at time of
    investigation)
  • molecule-core#129 failure mode #1
  • Memory feedback_template_vs_workspace_config_separation
    (template-claude-code PR #37 added the multi-path lookup but
    didn't bundle config.yaml into the image — the lookup paths
    point at files that don't exist)

Fix: one-line `COPY config.yaml .` in the Dockerfile.

Verification path (post-merge): publish-runtime workflow rebuilds
the image, deploys to staging tenant fleet, next canary cron run
sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY
matches → claude CLI auths → A2A returns PONG → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:15:39 -07:00
claude-ceo-assistant 3c849b3ba7 Merge pull request 'fix(adapter): honor explicit provider config — fail fast when not in registry (#180)' (#4) from fix/180-explicit-provider-validation into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
CI / Adapter unit tests (push) Successful in 1m14s
CI / validate (push) Successful in 4m25s
v1.0.0
2026-05-07 18:09:01 +00:00
claude-ceo-assistant f8d7f8f3a8 test(adapter): install adapter import shims via conftest
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 58s
CI / Adapter unit tests (pull_request) Successful in 58s
CI / validate (pull_request) Successful in 2m59s
CI / validate (push) Successful in 3m0s
CI runner installs only `pytest pytest-asyncio pyyaml`; without the
molecule_runtime/a2a/claude_sdk_executor stubs, the new
test_provider_resolution.py fails to collect with
ModuleNotFoundError. test_adapter_prevalidate.py owned the same
shims via a per-file _install_stubs(), but two files maintaining
parallel stub copies eventually disagree on shape (BaseAdapter
needing install_plugins_via_registry, etc.).

Move the shim install + sys.path bump into tests/conftest.py so
every test module shares a single canonical stub set, collected
before any test imports adapter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:58:51 -07:00
claude-ceo-assistant a2c7bf3d3b fix(adapter): honor explicit provider config — fail fast when not in registry (#180)
Workspace operators set 'provider: minimax' in /configs/config.yaml
expecting the adapter to route to MiniMax. Pre-fix behavior: adapter
ignored 'provider:' entirely, _resolve_provider model-matched against
_BUILTIN_PROVIDERS (anthropic-oauth + anthropic-api only), no model_prefix
matched 'MiniMax-M2.7-highspeed', silent fallback to providers[0]
(anthropic-oauth) — SDK kept using CLAUDE_CODE_OAUTH_TOKEN, hit OAuth
quota under a name the operator never asked for.

Fix: _resolve_provider now takes an explicit_provider arg. setup() reads
it from runtime_config.provider OR top-level config.yaml provider:.
Explicit name in registry → returned. Not in registry → ValueError with
the two paths to fix (add provider entry, or switch runtime template).

10 new tests cover: explicit-in-registry returns match, case-insensitive,
not-in-registry raises with actionable message, defense-in-depth against
silent fallback regression, custom-registry lookup, empty/None treated as
no-explicit (back-compat).

Closes #180.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 10:58:51 -07:00
claude-ceo-assistant a5c9acd950 Merge pull request 'chore(ci): adopt .runtime-version push-mode cascade signal' (#3) from chore/runtime-version-file into main
CI / Adapter unit tests (push) Successful in 20s
Secret scan / Scan diff for credential-shaped strings (push) Successful in 3s
CI / validate (push) Successful in 11m48s
2026-05-07 10:12:38 +00:00
claude-ceo-assistant 3e491c673b chore(ci): adopt .runtime-version push-mode cascade signal
CI / Adapter unit tests (pull_request) Successful in 21s
CI / validate (push) Successful in 11m50s
CI / validate (pull_request) Successful in 11m38s
CI / Adapter unit tests (push) Successful in 20s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
Background: post-2026-05-06 SCM is Gitea, not GitHub. Gitea 1.22.6 has
no repository_dispatch / workflow_dispatch trigger API (empirically
verified across 6 candidate paths in molecule-core#20 issuecomment-913).
The molecule-core/publish-runtime.yml cascade therefore cannot fire
templates via curl-dispatch — pivots to push-mode instead.

This PR is the consumer side of that pivot:

- `.runtime-version` file at repo root — single line, plain version
  string. Currently 0.1.129 (latest published as of 2026-05-07).
  publish-runtime overwrites this on each cascade.

- publish-image.yml gains a `resolve-version` job that reads the file
  and forwards the value to the reusable build workflow as the
  third-priority source in the resolution chain:
    1. client_payload.runtime_version (forward-compat with future
       GitHub-style dispatch if Gitea ever adds it)
    2. inputs.runtime_version (manual workflow_dispatch override)
    3. .runtime-version file (push-mode cascade — the new path)
    4. '' (Dockerfile requirements.txt default)

No behavioural change for PRs / manual dispatches; only fills in the
on-push case where previously the version was empty.

Sequencing context: this PR (and 8 sibling PRs to the other template
repos) MUST land before molecule-core#20 v2 is merged — otherwise the
first cascade push would trigger an on-push rebuild that pins the OLD
requirements.txt floor instead of the freshly-published version.

Refs molecule-core#14, molecule-core#20, molecule-core/issues/20.
2026-05-07 03:03:02 -07:00
security-auditor 91e5010888 ci: re-trigger after orchestrator restarted runners 1-8
Secret scan / Scan diff for credential-shaped strings (push) Successful in 9s
CI / Adapter unit tests (push) Successful in 50s
CI / validate (push) Successful in 12m11s
Per saved memory feedback_runner_config_partial_deploy: orchestrator
identified that runners 1-8 last restarted before AGENT_TOOLSDIRECTORY
+ RUNNER_TOOL_CACHE were added; cycle 7 retrigger landed ~50% on stale
runners. Orchestrator restarted 1-8 at ~09:37; this empty commit
re-triggers CI on the now-consistent runner pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 02:40:53 -07:00
security-auditor b91f1ab694 fix(ci): inline secret-scan body, drop cross-repo uses: of private molecule-core
Secret scan / Scan diff for credential-shaped strings (push) Successful in 7s
CI / Adapter unit tests (push) Failing after 16s
CI / validate (push) Failing after 18s
The 3-line wrapper at .github/workflows/secret-scan.yml referenced
`uses: molecule-ai/molecule-core/.github/workflows/secret-scan.yml@staging`.
molecule-core is private; act_runner clones cross-repo reusable
workflows anonymously, so the resolve fails at 0s with no logs.

Same root cause + same fix that molecule-controlplane already shipped
(see its secret-scan.yml comment block lines 10-22). Inlining keeps
the gate functional until Gitea is upgraded or the canonical scanner
moves to a public repo. When either lands, this file reverts to the
3-line wrapper.

Refs: internal#46 Phase 3 Class 2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 02:29:04 -07:00
security-auditor cd68aae474 ci: re-trigger after runner-config v2 (AGENT_TOOLSDIRECTORY etc.)
Secret scan / secret-scan (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 15s
CI / validate (push) Failing after 18s
Empty commit to re-run CI against the act_runner config that landed
in /opt/molecule/runners/config.yaml (cycle ~58 internal#46 Phase 3).
No source change. CI now runs setup-python with /tmp/hostedtoolcache,
which works (verified in cycle 6 task 1022 log, careful-bash#2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 02:27:50 -07:00
claude-ceo-assistant f549d0e4f3 Merge pull request 'docs(install): migrate git clone URL to git.moleculesai.app (#37)' (#1) from fix/install-path-gitea into main
Secret scan / secret-scan (push) Failing after 0s
CI / validate (push) Failing after 11s
CI / Adapter unit tests (push) Successful in 18s
2026-05-07 09:24:04 +00:00
claude-ceo-assistant 09c95308fd Merge pull request 'fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs' (#2) from fix/lowercase-org-slug into main
Secret scan / secret-scan (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 17s
CI / validate (push) Failing after 23s
2026-05-07 08:59:12 +00:00
security-auditor fb450b0758 fix(ci): lowercase 'molecule-ai/' in cross-repo workflow refs
CI / validate (pull_request) Failing after 0s
Secret scan / secret-scan (pull_request) Failing after 0s
CI / validate (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 13s
CI / Adapter unit tests (pull_request) Failing after 13s
Gitea is case-sensitive on owner slugs; canonical is lowercase
`molecule-ai/...`. Mixed-case `Molecule-AI/...` refs fail-at-0s
when the runner tries to resolve the cross-repo workflow / checkout.

Same fix as molecule-controlplane#12. Mechanical case-correction;
no behavior change beyond making CI resolve again.

Refs: internal#46

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 00:59:45 -07:00
documentation-specialist e28c2d0fd7 docs(install): migrate git clone URL to git.moleculesai.app (#37)
CI / Adapter unit tests (push) Failing after 10s
CI / Adapter unit tests (pull_request) Failing after 10s
CI / validate (pull_request) Failing after 0s
Secret scan / secret-scan (pull_request) Failing after 0s
CI / validate (push) Failing after 0s
One anonymous git-clone ref in runbooks/local-dev-setup.md:27.
Public repo, no auth-shape change.

Refs: molecule-ai/internal#37, molecule-ai/internal#38
2026-05-07 00:31:16 -07:00
Hongming Wang 50e16c5c73 Merge pull request #33 from Molecule-AI/feat/per-vendor-env-routing-task-244
CI / validate (push) Failing after 0s
Secret scan / secret-scan (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 7s
feat: per-vendor env routing for third-party providers (task #244)
2026-05-02 22:24:03 -07:00
Hongming Wang b9a1fa1b1f feat: per-vendor env routing for third-party providers (task #244)
CI / validate (push) Failing after 0s
CI / Adapter unit tests (push) Failing after 6s
Third-party Anthropic-compat providers (MiniMax, GLM, Kimi, DeepSeek)
all reuse the Anthropic SDK's wire format, which means the claude CLI
and claude-code-sdk read the bearer token from ANTHROPIC_AUTH_TOKEN no
matter which vendor is being talked to. Pre-#244:

  * Canvas surfaced the vendor-specific name (MINIMAX_API_KEY, etc.)
    to the user — so a user who saved only MINIMAX_API_KEY hit a
    silent 401 on first call.
  * The boot audit said `MINIMAX_API_KEY=set`, making it look like an
    SDK bug rather than a routing gap.
  * A user with multiple vendor keys could only run one workspace at a
    time because they all fought over the shared ANTHROPIC_AUTH_TOKEN
    slot.

Diagnostic-only audit logging shipped earlier (#32) but the actual
routing was never written — task #244 was mismarked complete.

Changes:
  * config.yaml: third-party model `required_env` now references the
    per-vendor name (MINIMAX_API_KEY, GLM_API_KEY, KIMI_API_KEY,
    DEEPSEEK_API_KEY) so canvas asks the user for the right key.
    First-party Anthropic models still use ANTHROPIC_AUTH_TOKEN /
    CLAUDE_CODE_OAUTH_TOKEN.
  * config.yaml: each third-party provider's `auth_env` lists the
    vendor name FIRST (priority order) so projection picks the
    vendor key over a stale ANTHROPIC_AUTH_TOKEN.
  * adapter.py: new `_project_vendor_auth(provider)` helper, called
    from `setup()` right after `_resolve_provider`. Idempotent — only
    projects when ANTHROPIC_AUTH_TOKEN is unset (operator override
    always wins). Logs the projection by NAME, never by VALUE
    (mirrors `_audit_auth_env_presence`).
  * tests/test_provider_routing.py: 6 new tests pin the contract —
    vendor-key-set projects, AUTH_TOKEN-already-set is never
    clobbered, first-party providers skip projection, secret value
    never leaks into a log record, empty-string vendor env doesn't
    trigger projection, and the same routing fires for GLM / Kimi /
    DeepSeek.

Mirrors the parallel hermes-side fix from task #249 / hermes PR #38;
keeps the two runtimes' multi-vendor UX in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 22:20:03 -07:00
Hongming Wang c09a268ae4 Merge pull request #32 from Molecule-AI/fix/canvas-picked-model-and-boot-debug-logging
fix: canvas-picked model + boot debug logging + restore claude_sdk_executor.py
2026-05-02 21:44:33 -07:00
Hongming Wang 78ae139609 feat(adapter,entrypoint): boot env audit + crash-loop diagnosis logging
Adds two operator-visible boot diagnostics that close the diagnosis gap
exposed by the 2026-05-02 MiniMax E2E crash-loop. The universal
canvas-picked-model fix (Bug B) and per-model required_env (Bug D) live
in molecule-core PR #2538 — this PR adds the per-template visibility
that complements them so operators can answer "is the key missing or is
routing wrong?" from `docker logs` alone.

Changes
-------
adapter.py:
- _AUTH_ENV_AUDIT tuple of 8 vendor env names (CLAUDE_CODE_OAUTH_TOKEN,
  ANTHROPIC_API_KEY/AUTH_TOKEN/BASE_URL, MINIMAX/GLM/KIMI/DEEPSEEK_API_KEY).
- _audit_auth_env_presence() helper — single INFO line of NAME=set/unset
  pairs. NEVER logs values; the test pins this with a "fake-secret-MUST-
  NOT-LEAK" sentinel that must never appear in the log message.
- One call site at the end of setup()'s boot banner so every workspace
  start emits both "which provider got picked" and "which envs are present"
  in adjacent log lines.

entrypoint.sh:
- log_boot_context() function fired once before the gosu drop (as root)
  and once after (as agent) so an operator can spot env values lost
  across the privilege drop. Emits uid/gid/user/hostname/workspace_id/
  platform_url/configs_dir/workspace_dir + the same 8 env names as
  NAME=set/unset. Mirror of _AUTH_ENV_AUDIT — list pinned in sync by a
  new AST-style test (test_audit_env_list_matches_entrypoint_sh) that
  parses entrypoint.sh and asserts set-equality with adapter.py's tuple.

tests/test_adapter_logging.py (new):
- 4 tests covering the audit contract: every name appears, all-unset
  scenario, empty-string treated as unset (matches routing semantics),
  and the cross-file sync gate against entrypoint.sh's for-loop.
- Stubs molecule_runtime + a2a so the helpers can be imported without
  the real wheel installed in CI (mirrors test_adapter_prevalidate.py's
  scaffolding pattern).

Why this complements molecule-core PR #2538
-------------------------------------------
- PR #2538 makes Bug B (canvas-picked model silently dropped) impossible
  by resolving model centrally in workspace/config.py:load_config —
  every adapter (claude-code, hermes, codex, future ones) gets the
  passthrough for free.
- PR #2538 makes Bug D (preflight rejects valid auth for non-default
  models) impossible by REPLACE-not-union per-entry required_env.
- This template PR is the per-template observability layer: when one
  of those universal fixes regresses (or when an operator misconfigs a
  vendor key), the boot logs say exactly which env was present at each
  tier. Validated end-to-end on workspace
  be27badd-00a7-4cef-91e8-af428175c76f (clean boot, MINIMAX_API_KEY=set
  audited, no crash-loop).

Closes part of molecule-monorepo task #248. Sibling of #2538 for
molecule-core.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:41:05 -07:00
Hongming Wang 863a1e6f87 Merge pull request #31 from Molecule-AI/docs/claude-md-runtime-wedge-channels
docs(claude): add runtime_wedge + dev-channels CLI flag sections
2026-05-01 20:10:02 -07:00
Hongming Wang 7c23ea65a5 docs(claude): document runtime_wedge integration + dev-channels server tag 2026-05-01 20:04:11 -07:00
Hongming Wang ed05990ffe Merge pull request #30 from Molecule-AI/chore/runtime-wedge-mirror-followups
chore(executor): runtime_wedge mirror follow-ups from PR #29 review
2026-05-01 18:07:43 -07:00
Hongming Wang 02e4520cf3 chore(executor): runtime_wedge mirror follow-ups from PR #29 review
Two review nits:

1. Narrow the import-arm catch in _mark_sdk_wedged and
   _clear_sdk_wedge_on_success to (ImportError, ModuleNotFoundError).
   The bare `except Exception:` would have masked an AttributeError /
   TypeError from a runtime_wedge API rename — silently degrading the
   mirror to "no-op" and making heartbeat + the smoke gate (#131)
   blind to claude-code wedges. The structural snapshot test in
   molecule-core (task #169) catches the rename at PR-time. Older
   runtimes that don't ship runtime_wedge at all still hit ImportError
   and silently no-op — the local sticky flag still gates is_wedged()
   inside this module so internal callers keep working.

2. Add mirror-CALL-failure injection tests. The recorder used by the
   original tests never raised, so the inner try around
   _mark_runtime_wedged(reason) (and the symmetric clear) wasn't
   pinned. New tests inject a recorder whose mark/clear raise on call,
   then assert: (a) the call attempt was recorded, (b) the local
   sticky flag stayed correct, (c) the failure was logged at ERROR.
   Pins both the contract ("mirror is best-effort, local is source of
   truth") AND the operator-visible signal (an ERROR log line is the
   only way to see a silent mirror regression).

Regression-injection-checked: removing the call-side try arm makes
both new tests fail with clear messages. Tests: 7 in
test_runtime_wedge_mirror.py, 45 across the whole tests/ tree.
2026-05-01 18:04:24 -07:00
Hongming Wang 2dcedc14d3 Merge pull request #29 from Molecule-AI/feat/wire-sdk-wedge-to-runtime-wedge
feat(executor): mirror SDK wedge into molecule_runtime.runtime_wedge
2026-05-01 18:02:40 -07:00
Hongming Wang b2561aa825 feat(executor): mirror SDK wedge into molecule_runtime.runtime_wedge
The local _sdk_wedged_reason flag was only observed inside this module
— heartbeat reads runtime_wedge.is_wedged() (universal cross-cutting
holder) and so does the new boot-smoke gate from molecule-core PR
#2473 / task #131. Without the mirror, a wedged claude-code workspace
stayed green-dot on the canvas while every chat hung, AND the
publish-image gate could not catch PR-25-class init wedges before
the broken image shipped to GHCR.

_mark_sdk_wedged now mirrors into runtime_wedge.mark_wedged, and
_clear_sdk_wedge_on_success mirrors into runtime_wedge.clear_wedge.
Both are best-effort — older runtimes that don't ship runtime_wedge
silently no-op the mirror, so a template pinned to an older runtime
still boots. Mirror exceptions are logged but don't suppress the
local sticky flag, so internal callers (retry loop, cancel handler)
see consistent state regardless of the universal-side outcome.

Tests cover: mark mirrors with reason, first-call-wins propagates,
clear mirrors, no-op when not wedged, ImportError-resilience.
Regression-injection-checked: silencing the mirror branch fails the
mark+first-wins tests at unit-test time with a clear message naming
the missing runtime_wedge call.
2026-05-01 17:52:24 -07:00
Hongming Wang c72b4b10ee Merge pull request #27 from Molecule-AI/fix/gate-dev-channels-flag-behind-env
fix(executor): pass tagged server:molecule to --dangerously-load-development-channels
2026-05-01 17:18:02 -07:00
Hongming Wang 9eb7d7b6cd fix(executor): pass tagged server:molecule to dev-channels flag
Claude Code 2.1.x changed the flag's signature to take an *allowlist* of
tagged entries — `server:<name>` for manually-configured MCP servers,
`plugin:<name>@<marketplace>` for plugin channels. PR #25's
`{flag: None}` rendered as a bare `--<flag>` with no value, the CLI
rejected with `argument missing`, and the SDK timed out at `initialize`,
surfacing upstream as `Control request timeout: initialize` (caught
live on workspace dd40faf8 on 2026-05-01 — 100% of A2A turns wedged).

Pass `server:molecule` so the SDK forwards
`--dangerously-load-development-channels server:molecule`. Live-verified
end-to-end: A2A returns coherent replies AND the host claude session
renders inbound canvas messages as `<channel source="molecule" ...>`
tags inline (push UX without inbox poll).

Tests: replace the unconditional `None` pin with a tagged-form pin
that asserts the exact `server:molecule` value, plus a defense-in-depth
test that pins the invariants (non-None, non-empty, contains tag
colon) so any regression to the bare-switch shape fails at unit-test
time instead of surfacing as a live SDK initialize wedge. 38/38 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:15:49 -07:00