Compare commits

13 Commits

Author SHA1 Message Date
hongming-pc2 062096b20d fix(executor): surface the CLI stream error instead of the swallowed-stderr placeholder
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 21s
CI / Adapter unit tests (pull_request) Successful in 1m38s
CI / Adapter unit tests (push) Successful in 1m46s
CI / validate (pull_request) Successful in 7m0s
CI / validate (push) Successful in 7m3s
When the `claude` CLI errors mid-stream, claude-agent-sdk throws a bare
`Exception("Command failed with exit code 1 …")` whose only text is the
useless `Check stderr output for details` placeholder — but the *actual*
failure reason (model 404, rate limit, auth) arrived a moment earlier as a
stream-json `ResultMessage(is_error=True)` carrying `result` text and
`api_error_status`. That was thrown away.

`_run_query` now captures `ResultMessage(is_error=True)` detail (and, as a
fallback, the trailing AssistantMessage text) and re-attaches it to the
raised exception as `_molecule_stream_detail`. `_format_process_error`
surfaces it as `cli_stream_error=…` and, when present, skips the
`_probe_claude_cli_error` re-probe (#160) — the probe can't replay the
failing `--model`/`--system-prompt` argv, so it may even succeed and
mislead. The probe stays as the last resort when there's nothing to salvage.
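
A minimal sketch of the salvage-then-format flow described above. The
`ResultMessage` stand-in, `run_query_sketch`, `format_process_error_sketch`,
and `failing_stream` are hypothetical simplifications, not the SDK's or the
executor's real code; only the `_molecule_stream_detail` attribute name and
the `cli_stream_error=` / `probed_cli_error=` labels come from this repo's
messages. The AssistantMessage fallback is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ResultMessage:
    """Hypothetical stand-in for the SDK's stream-json result message."""
    is_error: bool
    result: str = ""
    api_error_status: Optional[int] = None


def run_query_sketch(stream: Iterable[object]) -> None:
    """Drain the stream; if it raises, attach any error detail seen so far."""
    detail = None
    try:
        for msg in stream:
            if isinstance(msg, ResultMessage) and msg.is_error:
                detail = f"{msg.result} (api_error_status={msg.api_error_status})"
    except Exception as exc:
        if detail:
            # Attribute name taken from the commit message.
            exc._molecule_stream_detail = detail
        raise


def format_process_error_sketch(exc: Exception, probe: Callable[[], str]) -> str:
    """Prefer the salvaged detail; call the probe only when nothing was salvaged."""
    detail = getattr(exc, "_molecule_stream_detail", None)
    if detail:
        return f"{exc} cli_stream_error={detail!r}"
    return f"{exc} probed_cli_error={probe()!r}"


def failing_stream():
    """Simulates a CLI that reports a model 404 and then exits non-zero."""
    yield ResultMessage(
        is_error=True,
        result="There's an issue with the selected model (claude-code)",
        api_error_status=404,
    )
    raise Exception("Command failed with exit code 1")
```

Passing the probe as a callable keeps it lazy, so it is never invoked when
the stream already explained the failure.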

Regression context: the 2026-05-10 dev-team incident — six lead workspaces
404ing on every turn (`--model claude-code` → `api_error_status=404`,
"There's an issue with the selected model (claude-code)"), invisible for an
hour because the CLI wrote nothing to stderr and this text was discarded.
See internal#226 follow-up #5.

Tests: tests/test_executor_error_detail.py — 6 cases (format surfaces the
salvaged detail; format still probes when there's nothing salvaged; salvaged
detail takes precedence over the probe; _run_query annotates from
ResultMessage(is_error); _run_query falls back to assistant text; clean
success path unaffected). `pytest tests/` → 87 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-10 03:10:36 -07:00
claude-ceo-assistant d2585700f5 fix(adapter): mirror provider alias map onto YAML path (#12)
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
CI / Adapter unit tests (push) Successful in 1m21s
CI / validate (push) Failing after 2m9s
[FORCE-MERGE AUDIT — §SOP-7] hongming chat-go ("do both") in transcript ~03:54 UTC 2026-05-10. Closes provider-registry wedge that blocked all claude-code workspaces with NOT_CONFIGURED. Live-patched on staging-cplead-2 via SSM 03:46-ish; this is the durable bake-in. 81 tests pass + 3 new regression tests.
2026-05-10 03:51:28 +00:00
Claude CEO Assistant aaa2a79e81 fix(adapter): alias-map yaml_provider for runtime-wheel default
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 1m21s
CI / Adapter unit tests (pull_request) Successful in 1m18s
CI / validate (pull_request) Failing after 2m15s
CI / validate (push) Failing after 5m36s
The molecule-runtime wheel auto-derives `runtime_config.provider =
"anthropic"` from its default model slug `anthropic:claude-opus-4-7`
when the per-workspace YAML omits both fields. The adapter receives
that derived `anthropic` as `yaml_provider` and rejects it because the
providers registry only knows `anthropic-oauth` / `anthropic-api`. The
existing alias map (`anthropic` → `anthropic-api`,
`claude-code` → `anthropic-oauth`) was applied only on the env-var
path; mirroring it on the YAML path resolves the wheel default to a
registered provider name.
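
A rough sketch of the YAML-path aliasing, assuming the alias map from this
repo and a hypothetical strict resolver (`resolve_provider_sketch`, not the
adapter's real `_resolve_provider`). It illustrates why unaliased unknown
names are passed through unchanged: the downstream lookup can then fail
with an actionable message instead of silently falling back.

```python
_PROVIDER_SLUG_ALIASES = {
    "anthropic": "anthropic-api",
    "claude-code": "anthropic-oauth",
}


def alias_yaml_provider(yaml_provider: str) -> str:
    """Map a known alias to its registry name; pass anything else through."""
    return _PROVIDER_SLUG_ALIASES.get(yaml_provider.lower(), yaml_provider)


def resolve_provider_sketch(name: str, registry: tuple) -> str:
    """Hypothetical strict lookup that fails loudly on unknown names."""
    if name not in registry:
        raise ValueError(
            f"provider={name!r} is not in the providers registry. "
            f"Known providers: {', '.join(registry)}"
        )
    return name


REGISTRY = ("anthropic-oauth", "anthropic-api", "minimax")

# The wheel-derived default now resolves to a registered name.
wheel_default = resolve_provider_sketch(alias_yaml_provider("anthropic"), REGISTRY)
```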

Symptom on staging-cplead-2 (2026-05-09): every workspace booted with
`configuration_status=not_configured` and
`configuration_error="ValueError: claude-code adapter: workspace
config picks provider='anthropic' but it is not in the providers
registry"`. Live-patched the running cp-lead workspaces to confirm the
fix; this commit lands the durable change in the template repo so
freshly-provisioned workspaces don't repeat the wedge.

Tests:
  - test_yaml_provider_anthropic_is_aliased_to_anthropic_api (regression)
  - test_yaml_provider_claude_code_is_aliased_to_anthropic_oauth (symmetry)
  - test_yaml_provider_unknown_passes_through_for_actionable_error
    (guards the silent-fallback bug from #180; unaliased unknowns must
    still reach _resolve_provider so it raises with the helpful
    "Known providers: ..." message)

All 81 tests pass locally.

Refs: staging-cplead-2 incident 2026-05-09
Live-patched workspaces: 941a929e, 99de7cab, a8ba9dc8, a00e74df

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-09 20:46:02 -07:00
claude-ceo-assistant 4b038f2947 Merge pull request 'fix(adapter): map persona-friendly slugs (claude-code, anthropic) to registry names' (#10) from fix/dispatch-alias-map-followup into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 37s
CI / Adapter unit tests (push) Failing after 12m10s
CI / validate (push) Failing after 17m11s
2026-05-08 21:24:27 +00:00
claude-ceo-assistant 8adc3576fd fix(adapter): map persona-friendly slugs (claude-code, anthropic) to registry names
CI / Adapter unit tests (push) Successful in 1m46s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 50s
CI / Adapter unit tests (pull_request) Successful in 2m16s
CI / validate (pull_request) Successful in 6m18s
CI / validate (push) Failing after 18m56s
Phase 4 verification surfaced a follow-up edge case the initial fix missed.
The persona env files use friendlier slugs than the registry's canonical names:
  * MODEL_PROVIDER=claude-code  -> anthropic-oauth (Claude Code subscription)
  * MODEL_PROVIDER=anthropic    -> anthropic-api  (direct Anthropic API key)

Without an alias map, a lead workspace's MODEL_PROVIDER=claude-code env
fell through the slug-detection path; when the YAML didn't pin a
provider, the model-prefix matcher saw MODEL=MiniMax-M2.7 and routed the
lead to MiniMax — even though CLAUDE_CODE_OAUTH_TOKEN was clearly the
intended auth path.

Add _PROVIDER_SLUG_ALIASES with the two operator-facing slugs that don't
match registry names verbatim. The alias map is consulted before the
slug-vs-legacy detection, so claude-code now resolves to anthropic-oauth
and the lead boots through OAuth as intended.
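
The ordering above (alias first, then slug-vs-legacy detection) might look
roughly like this. `classify_model_provider_env` and the three-name registry
are illustrative assumptions, not the adapter's actual helper.

```python
from typing import Optional, Set, Tuple

_PROVIDER_SLUG_ALIASES = {
    "claude-code": "anthropic-oauth",
    "anthropic": "anthropic-api",
}


def classify_model_provider_env(
    value: str, registry_names: Set[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Return (provider_slug, legacy_model_id); one of the two is None.

    The alias map is applied before the registry check, so a persona slug
    such as "claude-code" is recognised as a provider instead of being
    treated as a legacy model id.
    """
    canonical = _PROVIDER_SLUG_ALIASES.get(value.lower(), value)
    if canonical.lower() in registry_names:
        return canonical, None
    return None, value


REGISTRY_NAMES = {"anthropic-oauth", "anthropic-api", "minimax"}
```

With this shape, `MODEL_PROVIDER=claude-code` yields an explicit provider,
so the model-prefix matcher is never asked to guess from `MODEL`.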

Tests
-----
+ test_persona_env_lead_with_minimax_model_routes_via_oauth — lock in
  the alias-map behavior so a future contributor can't silently re-introduce
  the lead-mis-routed-to-MiniMax bug.
+ test_anthropic_alias_resolves_to_anthropic_api — covers the second
  alias path.

Updated test_persona_env_lead_claude_code_resolves_correctly to assert
the new (correct) behavior: provider == 'anthropic-oauth', not None.

Full adapter suite: 78/78 pass.
2026-05-08 14:23:59 -07:00
claude-ceo-assistant 134ba7f82c fix(adapter): honor MODEL/MODEL_PROVIDER env (persona-env convention) (#9)
Secret scan / Scan diff for credential-shaped strings (push) Successful in 16s
CI / Adapter unit tests (push) Failing after 37s
CI / validate (push) Failing after 50s
Fixes the 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces were stuck at the SDK initialize timeout because MODEL_PROVIDER=minimax was read as a model id instead of a provider slug.
2026-05-08 21:12:21 +00:00
claude-ceo-assistant 1742b60e62 fix(adapter): honor MODEL/MODEL_PROVIDER env (persona-env convention)
CI / Adapter unit tests (push) Successful in 1m40s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 19s
CI / Adapter unit tests (pull_request) Failing after 52s
CI / validate (push) Failing after 2m17s
CI / validate (pull_request) Successful in 13m19s
Fixes the 2026-05-08 dev-tree wedge: 22/27 non-lead workspaces (minimax tier)
stuck in degraded after /org/import, every chat hanging on
`Control request timeout: initialize`.

Root cause
----------
The persona env files (`~/.molecule-ai/personas/<name>/env`) declare a TWO-
variable convention:
  - MODEL          = model id   ("MiniMax-M2.7-highspeed")
  - MODEL_PROVIDER = provider slug ("minimax")

The runtime wheel's legacy `workspace/config.py` interprets MODEL_PROVIDER
as the *model id* — a name chosen long before there was a separate MODEL
env. With both set, the legacy code reads MODEL_PROVIDER="minimax" into
runtime_config.model. The literal string "minimax" doesn't match any
registry prefix (`minimax-` requires a hyphen suffix), so resolution
falls through to providers[0] (anthropic-oauth). The auth check then
fails on the absent CLAUDE_CODE_OAUTH_TOKEN, the claude CLI launches
anyway, and the SDK's `query.initialize()` 60s control timeout fires.
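
To make the fall-through concrete, here is a simplified prefix matcher.
The registry shape (`prefixes` field), the prefix values other than
`minimax-`, and case-insensitive matching are assumptions made for
illustration; only the "unknown ids fall back to providers[0]" rule and the
`minimax-` prefix are taken from this repo's messages.

```python
PROVIDERS = (
    {"name": "anthropic-oauth", "prefixes": ("claude-",)},  # assumed prefix
    {"name": "minimax", "prefixes": ("minimax-",)},
)


def resolve_by_prefix(model: str, providers: tuple = PROVIDERS) -> str:
    """Pick the first provider whose prefix matches; else providers[0]."""
    lowered = model.lower()  # assumption: matching ignores case
    for provider in providers:
        if any(lowered.startswith(prefix) for prefix in provider["prefixes"]):
            return provider["name"]
    # Silent fallback: unknown ids route to the first registry entry.
    return providers[0]["name"]


real_model = resolve_by_prefix("MiniMax-M2.7-highspeed")
corrupted = resolve_by_prefix("minimax")  # provider slug misread as model id
```

The bare slug lacks the hyphen suffix, so it misses the `minimax-` prefix
and lands on the OAuth provider, which is the misroute described above.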

The brief hypothesised `claude_sdk_executor.py` lacked dispatch logic.
Phase 1 evidence: dispatch ALREADY exists in adapter.py — model -> provider
-> base_url + auth_env routing was correctly built for #180. The bug was
upstream: MODEL_PROVIDER's name collision with the persona-env convention
silently corrupted the picked model BEFORE adapter.py saw it.

Fix
---
New helper `_resolve_model_and_provider_from_env` reconciles env vars
against YAML inside adapter.setup() and create_executor():

  1. MODEL env -> picked_model (authoritative when set).
  2. MODEL_PROVIDER env -> explicit_provider IFF the value matches a
     registered provider name. Backward-compat: if MODEL is unset and
     MODEL_PROVIDER doesn't match a registered slug, treat it as a
     legacy model id (as canvas Save+Restart set it before this fix).
  3. YAML runtime_config.{model,provider} fills any field env didn't
     supply.

Contained in the template repo per the brief's scope guidance — does NOT
touch the runtime wheel's workspace/config.py (which would need a separate
molecule-core PR), and does NOT change the persona-env dispatch policy
(Phase 2 mapping 2026-05-08).

Tests
-----
Eleven new cases in tests/test_env_model_provider_dispatch.py covering:
  - persona-env shape (minimax, GLM, lead claude-code) -> correct model + slug
  - legacy MODEL_PROVIDER-as-model-id shape still works
  - env wins over YAML
  - YAML fallback when env unset
  - whitespace/empty defensive handling
  - case-insensitive provider slug matching

Full adapter test suite: 76/76 pass.

Verification path
-----------------
After image rebuild + workspace re-provision, ws-* containers will boot
with provider=minimax (not anthropic-oauth), ANTHROPIC_BASE_URL set to
https://api.minimax.io/anthropic, and MINIMAX_API_KEY projected onto
ANTHROPIC_AUTH_TOKEN, so the SDK init handshake succeeds.
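
One way to picture the projection step, assuming a provider entry carries
`base_url` and `auth_env` fields (the field names follow the "base_url +
auth_env routing" wording above; the helper `project_provider_env` and the
exact entry shape are hypothetical):

```python
def project_provider_env(provider: dict, environ: dict) -> dict:
    """Return a copy of environ with the provider's routing applied."""
    env = dict(environ)
    base_url = provider.get("base_url")
    if base_url:
        env["ANTHROPIC_BASE_URL"] = base_url
    # Copy the provider-specific secret onto the variable the CLI reads.
    secret = environ.get(provider.get("auth_env", ""))
    if secret:
        env["ANTHROPIC_AUTH_TOKEN"] = secret
    return env


MINIMAX = {
    "name": "minimax",
    "base_url": "https://api.minimax.io/anthropic",
    "auth_env": "MINIMAX_API_KEY",
}

child_env = project_provider_env(MINIMAX, {"MINIMAX_API_KEY": "placeholder-key"})
```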

Refs: task #181, brief 2026-05-08, related #180 (#7 in this repo)
2026-05-08 14:11:42 -07:00
claude-ceo-assistant 56a045f38e Merge pull request 'fix(adapter,tests): isolate _load_providers tests from multi-path lookup' (#8) from fix/load-providers-tests-isolate-multipath into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 3s
CI / Adapter unit tests (push) Successful in 54s
CI / validate (push) Successful in 3m6s
2026-05-08 20:28:14 +00:00
dev-lead 291f356dab fix(adapter,tests): isolate _load_providers tests from multi-path lookup
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
CI / Adapter unit tests (push) Successful in 1m1s
CI / Adapter unit tests (pull_request) Successful in 1m2s
CI / validate (push) Successful in 3m23s
CI / validate (pull_request) Successful in 3m22s
The 5 _load_providers tests were single-path-only: they wrote a
config.yaml to tmp_path and called _load_providers(str(tmp_path)),
expecting the lookup to read tmp_path/config.yaml.

After the multi-path fix in #7, _load_providers also checks:
  1. _CANONICAL_ADAPTER_DIR/config.yaml  (= /opt/adapter/config.yaml)
  2. _TEMPLATE_DIR/config.yaml           (= dirname(__file__)/config.yaml)
  3. ${config_path}/config.yaml          (the test's tmp_path)

Path 2 finds the repo's bundled config.yaml on the test runner's
disk before path 3 is consulted, so the tests see the bundled providers
list instead of the config.yaml each test wrote to tmp_path.

Two surface changes:

  1. adapter.py — extract `os.path.dirname(os.path.abspath(__file__))`
     into a module-level `_TEMPLATE_DIR` constant, mirroring
     `_CANONICAL_ADAPTER_DIR`. Production behavior identical
     (resolved once at import). Tests can monkeypatch the module
     attribute to redirect the path-2 lookup.

  2. tests/test_adapter_prevalidate.py — 5 _load_providers tests
     monkeypatch `_CANONICAL_ADAPTER_DIR` and `_TEMPLATE_DIR` to a
     non-existent tmp subdir, isolating the test to the workspace
     config_path branch they always meant to test.
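
The isolation technique can be sketched as follows. This is not the repo's
test code: a `SimpleNamespace` stands in for the adapter module,
`first_config` is a hypothetical reduction of the lookup, and
`unittest.mock.patch.object` is swapped in for pytest's `monkeypatch` so the
sketch runs on its own.

```python
import os
import tempfile
from types import SimpleNamespace
from unittest import mock

# Stand-in for the adapter module and its two module-level constants.
adapter = SimpleNamespace(_CANONICAL_ADAPTER_DIR="/opt/adapter", _TEMPLATE_DIR="/app")


def first_config(config_path: str):
    """Return the first existing config.yaml in lookup order, else None."""
    for directory in (adapter._CANONICAL_ADAPTER_DIR, adapter._TEMPLATE_DIR, config_path):
        candidate = os.path.join(directory, "config.yaml")
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Show the shadowing, then the isolated lookup, by directory name."""
    with tempfile.TemporaryDirectory() as tmp:
        template = os.path.join(tmp, "template")
        workspace = os.path.join(tmp, "workspace")
        missing = os.path.join(tmp, "missing")
        for directory in (template, workspace):
            os.makedirs(directory)
            with open(os.path.join(directory, "config.yaml"), "w") as f:
                f.write("providers: []\n")
        # Simulate a bundled config.yaml sitting next to adapter.py.
        adapter._CANONICAL_ADAPTER_DIR = missing
        adapter._TEMPLATE_DIR = template
        shadowed = first_config(workspace)
        # Redirect both earlier paths to a non-existent directory.
        with mock.patch.object(adapter, "_CANONICAL_ADAPTER_DIR", missing):
            with mock.patch.object(adapter, "_TEMPLATE_DIR", missing):
                isolated = first_config(workspace)
        names = [os.path.basename(os.path.dirname(p)) for p in (shadowed, isolated)]
        return names, adapter._TEMPLATE_DIR == template
```

The patches are undone on exit, so production lookups keep their original
paths once the test finishes.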

The 6th _load_providers test (`test_load_providers_parses_yaml_and_normalizes`)
already passed because path 2 returns 7 providers and that's what
that test expects — left unchanged.

Verification:
  pytest tests/                                 65/65 PASS
  pytest tests/test_adapter_prevalidate.py -k load_providers
                                                  6/6 PASS

Closes molecule-core#129 follow-up — the unit tests were the last
red on the template repo's CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:27:56 -07:00
claude-ceo-assistant 91022654cd Merge pull request 'fix(adapter): restore multi-path _load_providers (closes molecule-core#129 failure mode #1)' (#7) from fix/load-providers-multipath-restore into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 5s
CI / Adapter unit tests (push) Failing after 1m5s
CI / validate (push) Successful in 3m9s
2026-05-08 20:12:37 +00:00
dev-lead b96a6d2569 fix(adapter): restore multi-path _load_providers (canonical + template + workspace)
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (pull_request) Failing after 59s
CI / Adapter unit tests (push) Failing after 1m6s
CI / validate (pull_request) Successful in 3m22s
CI / validate (push) Successful in 3m21s
The template's _load_providers had only ONE lookup path
(${config_path}/config.yaml = /configs/config.yaml) — which is the
per-workspace override, NOT the template's bundled provider registry.
Every MiniMax/GLM/Kimi/DeepSeek model resolved to anthropic-oauth
and crashed at first LLM call:

  None of CLAUDE_CODE_OAUTH_TOKEN set for model=MiniMax-M2.7-highspeed
    (provider=anthropic-oauth) — the adapter will fail on the first
    LLM call with AuthenticationError
  ...
  probed_cli_error='Not logged in · Please run /login'

The canary's 38h+ chronic red on 2026-05-07/08 traces to this. The
May-4 image already bundled the fix (a 4-path lookup: canonical
/opt/adapter/config.yaml, __file__-adjacent, workspace override, then
builtins fallback), but it was never on Gitea main, so post-suspension
rebuilds dropped it. This commit restores it.

Resolution order:
  1. /opt/adapter/config.yaml (canonical, provisioner-contracted)
  2. dirname(__file__)/config.yaml (covers /app/config.yaml from
     Dockerfile #6 as well as dev/test imports)
  3. ${config_path}/config.yaml (per-workspace override)
  4. _BUILTIN_PROVIDERS (oauth + anthropic-api fallback)

Verified locally: ps=_load_providers('/nonexistent') returns the
7 providers from /tmp/cctmpl/config.yaml via path 2 (the
__file__-adjacent lookup). Without the fix, returns 2 (builtins).

Closes molecule-core#129 failure mode #1 (the original "Agent error
(Exception)" 38h chronic red).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 13:12:24 -07:00
claude-ceo-assistant 2edd78c154 Merge pull request 'fix(dockerfile): bundle config.yaml into /app so providers registry loads' (#6) from fix/dockerfile-bundle-config-yaml into main
Secret scan / Scan diff for credential-shaped strings (push) Successful in 4s
CI / Adapter unit tests (push) Successful in 57s
CI / validate (push) Successful in 3m14s
2026-05-08 18:19:10 +00:00
dev-lead ad4241cebb fix(dockerfile): bundle config.yaml into /app so providers registry loads
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
CI / Adapter unit tests (push) Successful in 55s
CI / Adapter unit tests (pull_request) Successful in 1m0s
CI / validate (pull_request) Successful in 3m10s
CI / validate (push) Successful in 3m10s
The adapter's _load_providers tries 4 paths in order:
  1. /opt/adapter/config.yaml  — provisioner-managed (currently missing)
  2. os.path.dirname(__file__)/config.yaml  — alongside adapter.py
  3. ${WORKSPACE_CONFIG_PATH}/config.yaml  — workspace overrides
  4. _BUILTIN_PROVIDERS  — oauth + anthropic-api only

On this template's docker image /opt/adapter/ is never populated by
the platform provisioner (verified 2026-05-08 by SSM-exec on a live
canary's workspace EC2: ls /opt/adapter/ → no such file or directory).
That makes path 2 — the dir adjacent to /app/adapter.py — the
load-bearing one for production workloads.

The Dockerfile copies adapter.py + claude_sdk_executor.py + scripts/
+ entrypoint.sh + __init__.py into /app, but it does NOT copy
config.yaml. So /app/config.yaml doesn't exist, path 2 fails, and
the adapter falls all the way through to _BUILTIN_PROVIDERS.

_BUILTIN_PROVIDERS contains only anthropic-oauth + anthropic-api.
Every MiniMax / GLM / Kimi / DeepSeek model id has no matching
prefix in those two, so _resolve_provider returns providers[0] =
anthropic-oauth (per "unknown ids fall back to providers[0]" rule).
That provider needs CLAUDE_CODE_OAUTH_TOKEN, which is unset for
non-OAuth tenants. The claude CLI fails with:
  Not logged in · Please run /login

…which surfaces in the A2A response as "Agent error (Exception)".

This is the root cause of:
  • Canary chronic red since 2026-05-07 02:30 UTC (38h+ at time of
    investigation)
  • molecule-core#129 failure mode #1
  • Memory feedback_template_vs_workspace_config_separation
    (template-claude-code PR #37 added the multi-path lookup but
    didn't bundle config.yaml into the image — the lookup paths
    point at files that don't exist)

Fix: one-line `COPY config.yaml .` in the Dockerfile.

Verification path (post-merge): publish-runtime workflow rebuilds
the image, deploys to staging tenant fleet, next canary cron run
sees /app/config.yaml → loads minimax provider → MINIMAX_API_KEY
matches → claude CLI auths → A2A returns PONG → green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 11:15:39 -07:00
6 changed files with 890 additions and 43 deletions
+13
@@ -43,6 +43,19 @@ RUN pip install --no-cache-dir -r requirements.txt && \
# Copy adapter code
COPY adapter.py .
COPY __init__.py .
# Provider registry. The adapter's _load_providers walks 4 paths:
# 1. /opt/adapter/config.yaml — provisioner-managed canonical
# 2. os.path.dirname(__file__)/config.yaml — alongside adapter.py (this image)
# 3. ${WORKSPACE_CONFIG_PATH}/config.yaml — workspace per-instance overrides
# 4. _BUILTIN_PROVIDERS — oauth + anthropic-api only
# On this image /opt/adapter/ is never populated by the platform
# provisioner, so path 2 (/app/config.yaml) is the load-bearing one.
# Without this COPY the file isn't in the image, all 3 file paths fail,
# and _load_providers falls through to _BUILTIN_PROVIDERS — every
# MiniMax/GLM/Kimi/DeepSeek model silently routes to anthropic-oauth →
# "Not logged in. Please run /login" at first LLM call. Caused the
# canary's 38h chronic red on 2026-05-07/08 (molecule-core#129).
COPY config.yaml .
# Adapter-specific executor — owned by THIS template (universal-runtime
# refactor, molecule-core task #87). Lives alongside adapter.py so
# Python's import system picks the local /app/claude_sdk_executor.py
+274 -32
@@ -147,36 +147,118 @@ def _normalize_provider(entry: dict):
}
# Canonical install path the platform provisioner is contracted to clone
# the template repo into. Hardcoded so the adapter's config.yaml lookup
# is invariant across Docker (mounted /app→/opt/adapter) and EC2-host
# (cloned by molecule-controlplane's ec2.go) install paths — robust
# against the site-packages copy that bit us 2026-05-04 11:08Z.
_CANONICAL_ADAPTER_DIR = "/opt/adapter"
# Adjacent-to-adapter.py path. Module-level so tests can monkeypatch it
# to redirect the path-2 lookup at a controlled tmp dir. Production code
# resolves this once at import time and never touches it again — same
# semantics as before.
_TEMPLATE_DIR = os.path.dirname(os.path.abspath(__file__))
def _load_providers(config_path: str) -> tuple:
"""Load the provider registry from /configs/config.yaml.
"""Load the provider registry from the template's bundled config.yaml.
The YAML's top-level ``providers:`` list is the canonical source —
canvas Config tab reads the same list to populate its Provider
dropdown so the UI and the adapter never disagree on what's
available. Falls back to ``_BUILTIN_PROVIDERS`` (oauth + anthropic-api)
if the file is missing, malformed, or has no providers section, so a
bare-bones workspace still boots with the historical defaults.
The providers list is a TEMPLATE concern — it describes which
models/auth-modes this runtime image supports — and ships in the
template's own config.yaml alongside adapter.py. The per-workspace
``${WORKSPACE_CONFIG_PATH}/config.yaml`` (default ``/configs/``)
only contains workspace-specific overrides (model, runtime, skills,
prompt files) and does NOT carry a providers section.
Per-entry isolation: a single bad provider entry is dropped with a
warning; the rest of the registry survives. Used to be a generator
inside tuple(...) that propagated any AttributeError out and reverted
the whole registry to builtins — exactly the silent-fallback failure
mode this file's existence was meant to fix.
Two-step incident history:
• Pre-2026-05-04 09:00Z: only checked ``config_path``, fell back
to ``_BUILTIN_PROVIDERS`` (oauth + anthropic-api). Every
MiniMax / GLM / Kimi / DeepSeek model resolved to
``anthropic-oauth`` and crashed at first LLM call with
"Not logged in. Please run /login". Fixed by adding a
template-bundled lookup using
``os.path.dirname(os.path.abspath(__file__))``.
• 2026-05-04 11:08Z: that ``__file__`` lookup misses on EC2-host
installs because the provisioner copies adapter.py to
``/opt/molecule-venv/lib/python3.12/site-packages/`` —
site-packages wins over PYTHONPATH=/opt/adapter (which the
host install doesn't set), so __file__ resolves to the venv
path WITHOUT an adjacent config.yaml. Same silent fallback
to anthropic-oauth + same "Not logged in" symptom.
• 2026-05-08 (#129): the multi-path lookup that fixed both of
the above was lost in a post-suspension migration cycle (the
Gitea main branch never carried the fix even though the
:latest image had it baked in from a prior build). Canary
chronic red for 38h before this commit restored the lookup.
Resolution order:
1. ``/opt/adapter/config.yaml`` — canonical provisioner-managed
install dir. Hardcoded because the platform contract is
"provisioner clones template repo into /opt/adapter"; this
is invariant across Docker (mounted /app→/opt/adapter) and
EC2-host (cloned by ec2.go) install paths. Robust against
site-packages copy.
2. Adjacent to ``adapter.__file__`` — works in dev/test where
the canonical path doesn't exist. Also covers the Docker
image's /app/config.yaml (bundled by Dockerfile #6).
3. Per-workspace ``${config_path}/config.yaml`` — fallback for
operator-shipped overrides on a private deployment that
wants a custom providers list.
4. ``_BUILTIN_PROVIDERS`` — oauth + anthropic-api defaults so a
bare-bones workspace still boots even with no config.yaml
anywhere.
Per-entry isolation: a single bad provider entry is dropped with
a warning; the rest of the registry survives.
"""
yaml_path = os.path.join(config_path, "config.yaml")
canonical_yaml = os.path.join(_CANONICAL_ADAPTER_DIR, "config.yaml")
template_yaml = os.path.join(_TEMPLATE_DIR, "config.yaml")
workspace_yaml = os.path.join(config_path, "config.yaml")
# Deduplicate while preserving order — _CANONICAL_ADAPTER_DIR and
# the __file__ dir collide in dev/test (when imported from
# /opt/adapter directly), and workspace_yaml may also collide if
# config_path == /opt/adapter in tests.
seen = set()
candidates = []
for path in (canonical_yaml, template_yaml, workspace_yaml):
if path not in seen:
seen.add(path)
candidates.append(path)
raw = None
chosen_path = None
try:
import yaml # transitive dep via molecule-ai-workspace-runtime
with open(yaml_path, "r") as f:
data = yaml.safe_load(f) or {}
except FileNotFoundError:
logger.info("providers: %s not found, using builtin defaults", yaml_path)
return _BUILTIN_PROVIDERS
except Exception as exc: # noqa: BLE001 — defensive: never block boot on YAML
logger.warning("providers: failed to load from %s (%s); using builtins", yaml_path, exc)
except ImportError:
logger.warning("providers: yaml import failed; using builtins")
return _BUILTIN_PROVIDERS
raw = data.get("providers") if isinstance(data, dict) else None
if not isinstance(raw, list) or not raw:
for yaml_path in candidates:
try:
with open(yaml_path, "r") as f:
data = yaml.safe_load(f) or {}
except FileNotFoundError:
logger.info("providers: %s not found, trying next candidate", yaml_path)
continue
except Exception as exc: # noqa: BLE001 — defensive: never block boot on YAML
logger.warning(
"providers: failed to load from %s (%s); trying next candidate",
yaml_path, exc,
)
continue
candidate_raw = data.get("providers") if isinstance(data, dict) else None
if isinstance(candidate_raw, list) and candidate_raw:
raw = candidate_raw
chosen_path = yaml_path
break
if raw is None:
logger.info(
"providers: no providers section found in %s; using builtin defaults",
" or ".join(candidates),
)
return _BUILTIN_PROVIDERS
parsed = []
@@ -190,11 +272,139 @@ def _load_providers(config_path: str) -> tuple:
parsed.append(normalized)
if not parsed:
logger.warning("providers: no valid entries in %s; using builtins", yaml_path)
logger.warning("providers: no valid entries in %s; using builtins", chosen_path)
return _BUILTIN_PROVIDERS
logger.info("providers: loaded %d entries from %s", len(parsed), chosen_path)
return tuple(parsed)
# Aliases for `MODEL_PROVIDER` env values that should map to a registry
# provider name. The persona env files use shorter / friendlier slugs
# than the registry's canonical names — without this alias map a value
# like ``MODEL_PROVIDER=claude-code`` would fall through to YAML-based
# resolution and (when the YAML doesn't pin a provider) hit the
# model-prefix matcher with the operator-picked MODEL, mis-routing a
# lead workspace through MiniMax even though its CLAUDE_CODE_OAUTH_TOKEN
# was clearly meant to be used.
#
# Maintain this list in sync with the persona env file convention:
# - ``claude-code`` → ``anthropic-oauth`` (Claude Code subscription path)
# - ``anthropic`` → ``anthropic-api`` (direct Anthropic API key)
# Provider names already in the registry alias to themselves implicitly
# (the ``in registry`` check catches them before this map is consulted).
_PROVIDER_SLUG_ALIASES = {
"claude-code": "anthropic-oauth",
"anthropic": "anthropic-api",
}
def _resolve_model_and_provider_from_env(
yaml_model: str,
yaml_provider: str,
providers: tuple,
) -> tuple:
"""Reconcile model + provider from env vars vs YAML, with the persona-env
convention winning over the legacy ``MODEL_PROVIDER``-as-model-id usage.
The persona env files (``~/.molecule-ai/personas/<name>/env`` on the host,
sourced into each workspace container at provision time) declare TWO env
vars with distinct semantics:
* ``MODEL`` — the model id (e.g. ``MiniMax-M2.7-highspeed``, ``opus``).
* ``MODEL_PROVIDER`` — the provider slug (e.g. ``minimax``,
``claude-code``, ``anthropic``).
The legacy ``workspace/config.py`` (in molecule-ai-workspace-runtime)
historically interpreted ``MODEL_PROVIDER`` as the *model id* — a name
chosen before there was a separate ``MODEL`` env var. When both env vars
are set with the persona convention, the legacy code reads
``MODEL_PROVIDER=minimax`` into ``runtime_config.model``, which then
fails to match any registry prefix (``minimax-`` requires a hyphen
suffix) and silently falls through to providers[0] (``anthropic-oauth``).
OAuth-token-less workspaces then wedge at ``query.initialize()`` because
the claude CLI can't authenticate. This is the 2026-05-08 dev-tree
incident — 22/27 non-lead workspaces stuck in ``degraded``.
Resolution order (this function):
1. ``MODEL`` env var → picked_model. Authoritative when set; the
persona env always sets it alongside ``MODEL_PROVIDER`` so the
model id never has to be inferred.
2. ``MODEL_PROVIDER`` env var → explicit_provider, BUT only when the
value matches a known provider name in the registry. This guards
against the legacy case where some callers still set
``MODEL_PROVIDER`` to a model id (e.g. canvas Save+Restart prior to
this fix). If the value isn't a registered provider name and
``MODEL`` is unset, treat it as a model id for back-compat.
3. YAML ``runtime_config.model`` / ``provider`` — used for any field
the env didn't supply. Carries the operator's canvas selection
on workspaces that haven't yet adopted the persona env shape.
Returns ``(picked_model, explicit_provider_name)``. Either may be
empty/None — the caller (``setup``) handles the empty cases via
``_resolve_provider``'s registry fallback.
"""
env_model = (os.environ.get("MODEL") or "").strip()
env_provider = (os.environ.get("MODEL_PROVIDER") or "").strip()
provider_names_lower = {p.get("name", "").lower() for p in providers}
# Detect whether MODEL_PROVIDER carries the persona-convention slug
# (provider name) vs. the legacy convention (model id). Persona-
# convention wins when the value matches a registered provider; we
# fall back to legacy interpretation only when it doesn't.
#
# First, apply the alias map so persona-friendly slugs like
# ``claude-code`` resolve to the canonical registry name
# ``anthropic-oauth``. Without this, a lead workspace's
# ``MODEL_PROVIDER=claude-code`` env would fall through to the model-
# prefix matcher, see ``MODEL=MiniMax-M2.7`` and mis-route to MiniMax
# even though the operator's intent (and the OAuth token they set)
# was the OAuth subscription path.
env_provider_resolved = _PROVIDER_SLUG_ALIASES.get(
env_provider.lower(), env_provider,
) if env_provider else ""
env_provider_is_slug = (
bool(env_provider_resolved)
and env_provider_resolved.lower() in provider_names_lower
)
# Picked model resolution
if env_model:
picked_model = env_model
elif env_provider and not env_provider_is_slug:
# Legacy: MODEL_PROVIDER env carried the model id. Honor it so
# canvas Save+Restart workflows that predate this fix keep working.
picked_model = env_provider
else:
picked_model = yaml_model or ""
# Explicit provider resolution — env wins when it's a registered slug
# (after alias mapping), otherwise fall back to YAML.
#
# YAML aliasing: the molecule-runtime wheel (config.py) auto-derives
# ``runtime_config.provider`` from the YAML/default model slug — the
# default model ``anthropic:claude-opus-4-7`` yields ``anthropic`` as
# the inferred provider. Without applying the alias map here, that
# auto-derived ``anthropic`` slug fails registry lookup and the
# adapter raises ValueError ("provider='anthropic' but it is not in
# the providers registry"), wedging the workspace at boot. The alias
# map already handles this for the env-var path above; mirror the
# same treatment for the YAML path so the runtime-wheel default
# produces a registered provider name in both cases. Caught
# 2026-05-09 on staging-cplead-2 — every workspace booted with
# ``configuration_status=not_configured`` because the YAML provider
# ``anthropic`` was passed through verbatim instead of being aliased
# to ``anthropic-api``.
if env_provider_is_slug:
explicit_provider = env_provider_resolved
elif yaml_provider:
yp_lower = yaml_provider.lower()
explicit_provider = _PROVIDER_SLUG_ALIASES.get(yp_lower, yaml_provider)
else:
explicit_provider = None
return picked_model, explicit_provider
def _strip_provider_prefix(model: str) -> str:
"""Strip LangChain-style "<provider>:<model>" prefix from a model id.
@@ -453,11 +663,11 @@ class ClaudeCodeAdapter(BaseAdapter):
# validation + ANTHROPIC_BASE_URL routing from that single decision.
rc = config.runtime_config
if isinstance(rc, dict):
picked_model = rc.get("model") or "sonnet"
explicit_provider_name = rc.get("provider")
yaml_model = rc.get("model") or ""
yaml_provider_name = rc.get("provider") or ""
else:
picked_model = getattr(rc, "model", None) or "sonnet"
explicit_provider_name = getattr(rc, "provider", None)
yaml_model = getattr(rc, "model", None) or ""
yaml_provider_name = getattr(rc, "provider", None) or ""
# Also honor the top-level `provider:` field in /configs/config.yaml.
# The canvas Config-tab Provider dropdown writes there (not into
@@ -465,7 +675,7 @@ class ClaudeCodeAdapter(BaseAdapter):
# whichever is set wins. Root cause of #180: the adapter used to
# ignore both, silently routing every non-Anthropic provider pick
# through anthropic-oauth.
if not explicit_provider_name:
if not yaml_provider_name:
yaml_path = os.path.join(config.config_path, "config.yaml")
try:
import yaml # transitive dep via molecule-ai-workspace-runtime
@@ -474,7 +684,7 @@ class ClaudeCodeAdapter(BaseAdapter):
if isinstance(data, dict):
val = data.get("provider")
if isinstance(val, str) and val.strip():
explicit_provider_name = val.strip()
yaml_provider_name = val.strip()
except FileNotFoundError:
pass
except Exception as exc: # noqa: BLE001 — defensive: never block boot
@@ -484,6 +694,21 @@ class ClaudeCodeAdapter(BaseAdapter):
yaml_path, exc,
)
# Reconcile env vars (persona convention: MODEL=<id>,
# MODEL_PROVIDER=<slug>) against YAML. Env wins over YAML — the
# persona env files are the canonical per-agent provider mapping
# (Phase 2 mapping 2026-05-08), and the workspace-runtime wheel's
# legacy ``MODEL_PROVIDER``-as-model-id reading would otherwise
# silently route non-leads to providers[0] = anthropic-oauth.
# Documented in detail at _resolve_model_and_provider_from_env.
picked_model, explicit_provider_name = _resolve_model_and_provider_from_env(
yaml_model=yaml_model,
yaml_provider=yaml_provider_name,
providers=providers,
)
if not picked_model:
picked_model = "sonnet"
# NOTE: do NOT strip the provider prefix here. The pre-fix routing
# behavior — `anthropic:claude-opus-4-7` falls through to
# providers[0] (anthropic-oauth) when no model_prefixes match — is
@@ -611,9 +836,26 @@ class ClaudeCodeAdapter(BaseAdapter):
# RuntimeConfig dataclass. Read `model` defensively from either shape.
rc = config.runtime_config
if isinstance(rc, dict):
explicit_model = rc.get("model") or ""
yaml_model = rc.get("model") or ""
yaml_provider = rc.get("provider") or ""
else:
explicit_model = getattr(rc, "model", None) or ""
yaml_model = getattr(rc, "model", None) or ""
yaml_provider = getattr(rc, "provider", None) or ""
# Reconcile against env vars (persona convention: MODEL=<id>,
# MODEL_PROVIDER=<slug>) using the same helper that ``setup`` uses,
# so the executor and the boot banner agree on the picked model.
# Without this, a workspace whose env says ``MODEL=MiniMax-M2.7``
# but whose runtime wheel pre-dates the persona-env fix would set
# runtime_config.model="minimax" (the slug, mistakenly read by the
# legacy ``MODEL_PROVIDER``-as-model-id path); this helper restores
# the correct model id before it reaches the SDK.
providers = _load_providers(config.config_path)
explicit_model, _ = _resolve_model_and_provider_from_env(
yaml_model=yaml_model,
yaml_provider=yaml_provider,
providers=providers,
)
explicit_model = _strip_provider_prefix(explicit_model)
# Pre-validation: detect the misconfiguration combo that drove the
@@ -644,7 +886,7 @@ class ClaudeCodeAdapter(BaseAdapter):
"The default fallback ('sonnet') is an Anthropic-native "
"alias; non-Anthropic shims (MiniMax, OpenAI gateways, "
"etc.) won't recognize it and the SDK --print probe will "
"hang for 30s before timing out. Fix: set MODEL_PROVIDER "
"hang for 30s before timing out. Fix: set MODEL "
"as a workspace secret (canvas: Save+Restart with model "
"picked) or set runtime_config.model in /configs/config.yaml."
)
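The env-versus-YAML precedence in the hunks above can be restated in isolation. What follows is a minimal sketch under stated assumptions, not the adapter's actual code: `resolve_model_and_provider`, `ALIASES`, and the explicit `env` mapping are illustrative names, the two alias entries are inferred from the tests in this change, and stripping whitespace from env values is assumed from the whitespace test cases.

```python
from typing import Mapping, Optional, Sequence, Tuple

# Assumed alias entries, inferred from the tests in this change set.
ALIASES = {"claude-code": "anthropic-oauth", "anthropic": "anthropic-api"}


def resolve_model_and_provider(
    env: Mapping[str, str],
    yaml_model: str,
    yaml_provider: str,
    provider_names: Sequence[str],
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper restating the precedence rules: env wins over YAML."""
    env_model = env.get("MODEL", "").strip()
    env_provider = env.get("MODEL_PROVIDER", "").strip()
    names_lower = {n.lower() for n in provider_names}

    # Map persona-friendly slugs to registry names before the registry check.
    resolved = ALIASES.get(env_provider.lower(), env_provider) if env_provider else ""
    is_slug = bool(resolved) and resolved.lower() in names_lower

    if env_model:
        model = env_model
    elif env_provider and not is_slug:
        # Legacy shape: MODEL_PROVIDER carried the model id.
        model = env_provider
    else:
        model = yaml_model or ""

    if is_slug:
        provider: Optional[str] = resolved
    elif yaml_provider:
        # The YAML path gets the same alias treatment as the env path.
        provider = ALIASES.get(yaml_provider.lower(), yaml_provider)
    else:
        provider = None
    return model, provider
```

Passing the environment as a plain mapping keeps the sketch testable without `monkeypatch`; the real helper appears to read the process environment directly.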
+54 -4
@@ -377,21 +377,37 @@ def _format_process_error(exc: BaseException) -> str:
``_probe_claude_cli_error`` so the operator sees the real failure
reason (e.g. ``You've hit your limit · resets Apr 17``) instead of
chasing ghosts in the workspace logs.
internal#226: prefer ``exc._molecule_stream_detail`` — the failure
reason ``_run_query`` salvaged from the CLI's stream-json
``ResultMessage(is_error=True)`` (model 404, api_error_status, etc.)
before the SDK threw it away. That's the *exact* error for *this*
invocation; the ``_probe_claude_cli_error`` re-probe is a last resort
(it can't replay the failing ``--model``/``--system-prompt`` argv, so
it may even succeed and mislead — which is exactly what happened with
``--model claude-code`` on 2026-05-10).
"""
parts = [f"{type(exc).__name__}: {exc}"]
exit_code = getattr(exc, "exit_code", None)
if exit_code is not None:
parts.append(f"exit_code={exit_code}")
stream_detail = getattr(exc, "_molecule_stream_detail", None)
if stream_detail:
trimmed = stream_detail[:_PROCESS_ERROR_STDERR_MAX_CHARS]
if len(stream_detail) > _PROCESS_ERROR_STDERR_MAX_CHARS:
trimmed += f"... [{len(stream_detail) - _PROCESS_ERROR_STDERR_MAX_CHARS} more chars truncated]"
parts.append(f"cli_stream_error={trimmed!r}")
stderr = getattr(exc, "stderr", None)
if stderr:
trimmed = stderr[:_PROCESS_ERROR_STDERR_MAX_CHARS]
if len(stderr) > _PROCESS_ERROR_STDERR_MAX_CHARS:
trimmed += f"... [{len(stderr) - _PROCESS_ERROR_STDERR_MAX_CHARS} more chars truncated]"
parts.append(f"stderr={trimmed!r}")
elif exit_code is None and _SWALLOWED_STDERR_MARKER in str(exc):
# #160: generic exception with the swallowed-stderr placeholder.
# Probe the CLI directly — this is the only way to surface the real
# error when the SDK lost it in translation.
elif exit_code is None and not stream_detail and _SWALLOWED_STDERR_MARKER in str(exc):
# #160: generic exception with the swallowed-stderr placeholder AND no
# stream detail to fall back on — probe the CLI directly as a last
# resort. (If _run_query salvaged a stream detail we already have the
# real error; the probe is unreliable since it can't replay the argv.)
probed = _probe_claude_cli_error()
if probed:
parts.append(f"probed_cli_error={probed!r}")
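The `cli_stream_error` and `stderr` branches above apply the same truncation rule. As a rough illustration only, that rule could be factored into one helper; `truncate_for_log` is a hypothetical name and is not part of this change.

```python
def truncate_for_log(text: str, max_chars: int) -> str:
    """Hypothetical helper: cap ``text`` at ``max_chars`` and note how much was cut."""
    trimmed = text[:max_chars]
    if len(text) > max_chars:
        trimmed += f"... [{len(text) - max_chars} more chars truncated]"
    return trimmed
```

With such a helper, each branch would reduce to a single `parts.append(...)` call, and the suffix format would live in one place.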
@@ -586,6 +602,12 @@ class ClaudeSDKExecutor(AgentExecutor):
assistant_chunks: list[str] = []
result_text: str | None = None
session_id: str | None = None
# Captured from a ResultMessage(is_error=True) — the CLI's stream-json
# carries the *actual* failure reason (model 404, rate limit, auth) in
# the result text + api_error_status BEFORE the SDK throws a bare
# "Command failed with exit code 1" that loses it. Stashed so the
# except arm below can re-attach it (see _format_process_error).
stream_error_detail: str | None = None
self._active_stream = sdk.query(prompt=prompt, options=options)
try:
async for message in self._active_stream:
@@ -606,6 +628,34 @@ class ClaudeSDKExecutor(AgentExecutor):
if sid:
session_id = sid
result_text = getattr(message, "result", None)
if getattr(message, "is_error", False):
api_status = getattr(message, "api_error_status", None)
stream_error_detail = (
(f"api_error_status={api_status} " if api_status else "")
+ f"result={result_text!r}"
)
except BaseException as exc: # noqa: BLE001 — re-raised; we only annotate
# The claude-agent-sdk raises a bare Exception / ProcessError when
# the CLI subprocess errors mid-stream — but the actionable detail
# (model not found, rate limit, auth) arrived earlier as a
# ResultMessage(is_error) / synthetic AssistantMessage and is about
# to be discarded. Re-attach it so _format_process_error surfaces
# it instead of the useless "Check stderr output for details"
# placeholder. (The 2026-05-10 dev-team incident: `--model
# claude-code` → api_error_status=404, "There's an issue with the
# selected model (claude-code)" — invisible for an hour because
# the CLI wrote nothing to stderr and this text was thrown away.)
detail = stream_error_detail
if not detail:
last_assistant = "".join(assistant_chunks).strip()
if last_assistant:
detail = last_assistant[:_PROCESS_ERROR_STDERR_MAX_CHARS]
if detail and getattr(exc, "_molecule_stream_detail", None) is None:
try:
exc._molecule_stream_detail = detail # type: ignore[attr-defined]
except Exception: # pragma: no cover — exotic frozen exception
pass
raise
finally:
self._active_stream = None
text = result_text if result_text is not None else "".join(assistant_chunks)
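The annotate-and-re-raise pattern used in `_run_query` can be shown without the SDK. This is a minimal sketch under assumptions: messages are plain dicts rather than SDK message objects, and `failing_stream`, `consume`, `run_demo`, and the `stream_detail` attribute are illustrative names only.

```python
import asyncio
from typing import AsyncIterator, Optional


async def failing_stream() -> AsyncIterator[dict]:
    """Stand-in for the SDK stream: one error result, then a bare failure."""
    yield {"is_error": True, "api_error_status": 404, "result": "model not found"}
    raise Exception("Command failed with exit code 1")


async def consume(stream: AsyncIterator[dict]) -> None:
    detail: Optional[str] = None
    try:
        async for message in stream:
            if message.get("is_error"):
                status = message.get("api_error_status")
                detail = (
                    (f"api_error_status={status} " if status else "")
                    + f"result={message.get('result')!r}"
                )
    except Exception as exc:
        # Attach the salvaged detail, then re-raise the same exception object.
        if detail and getattr(exc, "stream_detail", None) is None:
            exc.stream_detail = detail  # type: ignore[attr-defined]
        raise


def run_demo() -> Optional[str]:
    """Run the failing stream and return whatever detail was attached."""
    try:
        asyncio.run(consume(failing_stream()))
    except Exception as exc:
        return getattr(exc, "stream_detail", None)
    return None
```

Because the original exception is re-raised rather than wrapped, callers that match on its type or message keep working; the detail rides along as an extra attribute.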
+41 -7
@@ -514,8 +514,15 @@ async def test_setup_auth_token_alone_satisfies_third_party_check(
# ---- _load_providers / _resolve_provider unit tests ----
def test_load_providers_returns_builtin_when_yaml_missing(tmp_path):
"""FileNotFoundError path returns the in-code defaults verbatim."""
def test_load_providers_returns_builtin_when_yaml_missing(tmp_path, monkeypatch):
"""FileNotFoundError path returns the in-code defaults verbatim.
Monkeypatches the canonical + template paths to a non-existent dir
so only the workspace config_path is in scope. Without this, the
multi-path lookup picks up the repo-root config.yaml that ships
with the template (path 2 finds the bundled providers list and
returns it instead of falling through to builtins).
"""
_install_stubs()
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if parent_dir not in sys.path:
@@ -523,6 +530,10 @@ def test_load_providers_returns_builtin_when_yaml_missing(tmp_path):
sys.modules.pop("adapter", None)
import adapter as adapter_module
nonexistent = str(tmp_path / "_isolate_canonical")
monkeypatch.setattr(adapter_module, "_CANONICAL_ADAPTER_DIR", nonexistent)
monkeypatch.setattr(adapter_module, "_TEMPLATE_DIR", nonexistent)
result = adapter_module._load_providers(str(tmp_path))
assert result == adapter_module._BUILTIN_PROVIDERS
@@ -576,8 +587,12 @@ async def test_setup_routes_extra_providers(
assert os.environ.get("ANTHROPIC_BASE_URL") == expected_url
def test_load_providers_falls_back_on_malformed_yaml(tmp_path, caplog):
"""Malformed YAML → log warning + fallback (don't kill boot)."""
def test_load_providers_falls_back_on_malformed_yaml(tmp_path, caplog, monkeypatch):
"""Malformed YAML → log warning + fallback (don't kill boot).
Isolated from the multi-path lookup by pinning canonical + template
dirs at a non-existent path; only the workspace config_path is read.
"""
_install_stubs()
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if parent_dir not in sys.path:
@@ -585,6 +600,10 @@ def test_load_providers_falls_back_on_malformed_yaml(tmp_path, caplog):
sys.modules.pop("adapter", None)
import adapter as adapter_module
nonexistent = str(tmp_path / "_isolate_canonical")
monkeypatch.setattr(adapter_module, "_CANONICAL_ADAPTER_DIR", nonexistent)
monkeypatch.setattr(adapter_module, "_TEMPLATE_DIR", nonexistent)
(tmp_path / "config.yaml").write_text("providers: [not valid yaml: {{{")
import logging
@@ -622,7 +641,7 @@ def test_resolve_provider_minimax_prefix_matches_minimax_provider():
assert result2["name"] == "minimax"
def test_load_providers_drops_bad_entry_keeps_rest(tmp_path, caplog):
def test_load_providers_drops_bad_entry_keeps_rest(tmp_path, caplog, monkeypatch):
"""Per-entry isolation: one malformed entry shouldn't nuke the registry.
Pre-fix: ``_load_providers`` built the registry via a generator inside
@@ -634,6 +653,9 @@ def test_load_providers_drops_bad_entry_keeps_rest(tmp_path, caplog):
Post-fix: per-entry try/except drops the bad entry with a warning,
rest of the registry survives.
Isolated from the multi-path lookup so only the test's tmp config.yaml
is read.
"""
_install_stubs()
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
@@ -642,6 +664,10 @@ def test_load_providers_drops_bad_entry_keeps_rest(tmp_path, caplog):
sys.modules.pop("adapter", None)
import adapter as adapter_module
nonexistent = str(tmp_path / "_isolate_canonical")
monkeypatch.setattr(adapter_module, "_CANONICAL_ADAPTER_DIR", nonexistent)
monkeypatch.setattr(adapter_module, "_TEMPLATE_DIR", nonexistent)
yaml_with_typo = textwrap.dedent("""
providers:
- name: good-zai
@@ -690,7 +716,7 @@ def test_load_providers_drops_bad_entry_keeps_rest(tmp_path, caplog):
)
def test_load_providers_string_as_prefix_does_not_split_into_chars(tmp_path, caplog):
def test_load_providers_string_as_prefix_does_not_split_into_chars(tmp_path, caplog, monkeypatch):
"""A YAML field declared as list-of-strings but written as a bare
string (operator forgot brackets) used to silently iterate over
characters → ``('m','i','m','o','-')``. Post-fix: non-list value
@@ -705,6 +731,10 @@ def test_load_providers_string_as_prefix_does_not_split_into_chars(tmp_path, cap
sys.modules.pop("adapter", None)
import adapter as adapter_module
nonexistent = str(tmp_path / "_isolate_canonical")
monkeypatch.setattr(adapter_module, "_CANONICAL_ADAPTER_DIR", nonexistent)
monkeypatch.setattr(adapter_module, "_TEMPLATE_DIR", nonexistent)
yaml_str_prefix = textwrap.dedent("""
providers:
- name: typo-prefix
@@ -723,7 +753,7 @@ def test_load_providers_string_as_prefix_does_not_split_into_chars(tmp_path, cap
)
def test_load_providers_drops_entry_without_name(tmp_path, caplog):
def test_load_providers_drops_entry_without_name(tmp_path, caplog, monkeypatch):
"""An entry without ``name`` is operator error — no silent fallback
to ``<unnamed>``. Drop the entry with a warning so the boot log
surfaces the typo.
@@ -735,6 +765,10 @@ def test_load_providers_drops_entry_without_name(tmp_path, caplog):
sys.modules.pop("adapter", None)
import adapter as adapter_module
nonexistent = str(tmp_path / "_isolate_canonical")
monkeypatch.setattr(adapter_module, "_CANONICAL_ADAPTER_DIR", nonexistent)
monkeypatch.setattr(adapter_module, "_TEMPLATE_DIR", nonexistent)
yaml_no_name = textwrap.dedent("""
providers:
- name: good
+287
@@ -0,0 +1,287 @@
"""Tests for ``_resolve_model_and_provider_from_env`` — the env-vs-YAML
reconciliation that fixes the 2026-05-08 dev-tree wedge incident.
Symptom: 22/27 non-lead workspaces (minimax tier) wedged on
``Control request timeout: initialize`` because the runtime wheel's
``workspace/config.py`` interpreted ``MODEL_PROVIDER=minimax`` as the
*model id* instead of the provider slug. ``model="minimax"`` failed to
match the ``minimax-`` registry prefix, fell through to providers[0]
(anthropic-oauth), demanded ``CLAUDE_CODE_OAUTH_TOKEN`` (unset on
non-leads), and the claude CLI hung at SDK init.
The persona env files (``~/.molecule-ai/personas/<name>/env``) declare
the new convention:
* ``MODEL`` — model id (e.g. ``MiniMax-M2.7-highspeed``)
* ``MODEL_PROVIDER`` — provider slug (e.g. ``minimax``)
These tests cover the matrix of (env shape) × (YAML shape) so a future
contributor can't silently regress the wedge fix.
"""
import pytest
from adapter import (
_BUILTIN_PROVIDERS,
_resolve_model_and_provider_from_env,
)
# A registry that contains both anthropic-oauth (providers[0]) and
# minimax/zai (third-party slugs) — matches the shipped config.yaml.
_REGISTRY = _BUILTIN_PROVIDERS + (
{
"name": "minimax",
"auth_mode": "third_party_anthropic_compat",
"model_prefixes": ("minimax-",),
"model_aliases": (),
"base_url": "https://api.minimax.io/anthropic",
"auth_env": ("MINIMAX_API_KEY",),
},
{
"name": "zai",
"auth_mode": "third_party_anthropic_compat",
"model_prefixes": ("glm-",),
"model_aliases": (),
"base_url": "https://api.z.ai/api/anthropic",
"auth_env": ("GLM_API_KEY",),
},
)
def _clear_env(monkeypatch):
monkeypatch.delenv("MODEL", raising=False)
monkeypatch.delenv("MODEL_PROVIDER", raising=False)
# ------------------------------------------------------------------
# Persona env convention: MODEL=<id>, MODEL_PROVIDER=<slug>
# ------------------------------------------------------------------
def test_persona_env_minimax_resolves_correctly(monkeypatch):
"""The 2026-05-08 wedge regression test: persona env shape must
yield model=MiniMax-M2.7-highspeed (not "minimax") and explicit
provider=minimax."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "MiniMax-M2.7-highspeed")
monkeypatch.setenv("MODEL_PROVIDER", "minimax")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == "MiniMax-M2.7-highspeed"
assert provider == "minimax"
def test_persona_env_lead_claude_code_resolves_correctly(monkeypatch):
"""Lead persona env (MODEL=opus, MODEL_PROVIDER=claude-code):
``claude-code`` is the persona-friendly alias for the canonical
``anthropic-oauth`` registry name. It must resolve via the alias map
so the lead boots through the OAuth subscription path. The case where
MODEL is a non-Anthropic model id is covered separately by
``test_persona_env_lead_with_minimax_model_routes_via_oauth``."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "opus")
monkeypatch.setenv("MODEL_PROVIDER", "claude-code")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == "opus"
# claude-code → anthropic-oauth via the alias map
assert provider == "anthropic-oauth"
def test_persona_env_lead_with_minimax_model_routes_via_oauth(monkeypatch):
"""Lead workspace whose persona pins MODEL_PROVIDER=claude-code but
whose YAML/canvas selection happens to be a MiniMax model still
routes via OAuth — the persona's provider pin wins over the
model-prefix matcher. Without the alias map, the fall-through
mis-routed leads to MiniMax even when their CLAUDE_CODE_OAUTH_TOKEN
was set."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "MiniMax-M2.7")
monkeypatch.setenv("MODEL_PROVIDER", "claude-code")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == "MiniMax-M2.7"
assert provider == "anthropic-oauth"
def test_anthropic_alias_resolves_to_anthropic_api(monkeypatch):
"""``MODEL_PROVIDER=anthropic`` alias → ``anthropic-api`` (direct
Anthropic API key path)."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "claude-opus-4-7")
monkeypatch.setenv("MODEL_PROVIDER", "anthropic")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == "claude-opus-4-7"
assert provider == "anthropic-api"
def test_persona_env_glm_resolves_correctly(monkeypatch):
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "GLM-4.6")
monkeypatch.setenv("MODEL_PROVIDER", "zai")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == "GLM-4.6"
assert provider == "zai"
def test_env_provider_slug_case_insensitive(monkeypatch):
"""Mixed-case slugs such as ``MiniMax`` are still recognized as a
registered provider; the operator's casing is returned unchanged."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "MiniMax-M2.7-highspeed")
monkeypatch.setenv("MODEL_PROVIDER", "MiniMax") # mixed case
_, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert provider == "MiniMax" # caller compares case-insensitively
# ------------------------------------------------------------------
# Legacy convention: MODEL_PROVIDER=<model-id>, MODEL unset
# ------------------------------------------------------------------
def test_legacy_model_provider_as_model_id_still_works(monkeypatch):
"""Pre-2026-05-08 canvas Save+Restart shape: MODEL_PROVIDER carried
the model id directly (e.g. ``MODEL_PROVIDER=MiniMax-M2.7``) and
no MODEL env. Must keep working so existing canvas users don't
break overnight."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL_PROVIDER", "MiniMax-M2.7-highspeed")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
# MiniMax-M2.7-highspeed is not a registered provider name, so
# it's treated as a legacy model-id-in-MODEL_PROVIDER value.
assert model == "MiniMax-M2.7-highspeed"
assert provider is None
# ------------------------------------------------------------------
# Env wins over YAML
# ------------------------------------------------------------------
def test_env_model_wins_over_yaml_model(monkeypatch):
"""When both env MODEL and YAML model are set, env wins."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "GLM-4.6")
model, _ = _resolve_model_and_provider_from_env(
yaml_model="MiniMax-M2.7", yaml_provider="", providers=_REGISTRY,
)
assert model == "GLM-4.6"
def test_env_provider_wins_over_yaml_provider(monkeypatch):
"""Env MODEL_PROVIDER (when a registered slug) wins over YAML provider."""
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "GLM-4.6")
monkeypatch.setenv("MODEL_PROVIDER", "zai")
_, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="minimax", providers=_REGISTRY,
)
assert provider == "zai"
# ------------------------------------------------------------------
# YAML fallback (no env)
# ------------------------------------------------------------------
def test_no_env_falls_back_to_yaml(monkeypatch):
"""Workspace whose env doesn't set MODEL/MODEL_PROVIDER falls back
to the YAML config — preserves existing operator workflows."""
_clear_env(monkeypatch)
model, provider = _resolve_model_and_provider_from_env(
yaml_model="claude-sonnet-4-6",
yaml_provider="anthropic-api",
providers=_REGISTRY,
)
assert model == "claude-sonnet-4-6"
assert provider == "anthropic-api"
def test_no_env_no_yaml_returns_empty(monkeypatch):
"""Pure default path — caller (setup) substitutes ``sonnet``."""
_clear_env(monkeypatch)
model, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="", providers=_REGISTRY,
)
assert model == ""
assert provider is None
def test_yaml_provider_anthropic_is_aliased_to_anthropic_api(monkeypatch):
"""Regression for 2026-05-09 staging-cplead-2 incident: every
workspace booted ``configuration_status=not_configured`` because the
molecule-runtime wheel auto-derives ``runtime_config.provider =
"anthropic"`` from the default model slug ``anthropic:claude-opus-4-7``.
The adapter received ``yaml_provider="anthropic"`` from the wheel and
rejected it with ``ValueError: provider='anthropic' but it is not in
the providers registry`` — but ``anthropic`` is already in
``_PROVIDER_SLUG_ALIASES`` for the env-var path. Mirror the alias map
on the YAML path so the wheel default produces a registered provider
name."""
_clear_env(monkeypatch)
_, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="anthropic", providers=_REGISTRY,
)
assert provider == "anthropic-api", (
f"yaml_provider='anthropic' must resolve through the alias map to "
f"'anthropic-api'; got {provider!r}. Without this aliasing the "
f"wheel-default workspace boot wedges at adapter.setup()."
)
def test_yaml_provider_claude_code_is_aliased_to_anthropic_oauth(monkeypatch):
"""Symmetric coverage: persona-friendly ``claude-code`` slug from the
YAML ``provider:`` field must alias to ``anthropic-oauth``, the same
way the env-var path resolves it. Lead workspaces that pin the OAuth
path in YAML (instead of via env) must not wedge."""
_clear_env(monkeypatch)
_, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="claude-code", providers=_REGISTRY,
)
assert provider == "anthropic-oauth"
def test_yaml_provider_unknown_passes_through_for_actionable_error(monkeypatch):
"""An unaliased, unknown YAML provider (e.g. ``yaml_provider="mystery"``)
must NOT be silently swapped to providers[0] — it must reach
``_resolve_provider`` so the adapter raises the actionable
``Known providers: ...`` error message. The alias map is a
convenience for the two persona-convention slugs only; everything
else must keep its original semantics."""
_clear_env(monkeypatch)
_, provider = _resolve_model_and_provider_from_env(
yaml_model="", yaml_provider="mystery", providers=_REGISTRY,
)
assert provider == "mystery"
# ------------------------------------------------------------------
# Whitespace / empty-value defensive cases
# ------------------------------------------------------------------
def test_whitespace_only_env_treated_as_unset(monkeypatch):
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", " ")
monkeypatch.setenv("MODEL_PROVIDER", " ")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="opus", yaml_provider="", providers=_REGISTRY,
)
assert model == "opus"
assert provider is None
def test_empty_env_value_treated_as_unset(monkeypatch):
_clear_env(monkeypatch)
monkeypatch.setenv("MODEL", "")
monkeypatch.setenv("MODEL_PROVIDER", "")
model, provider = _resolve_model_and_provider_from_env(
yaml_model="sonnet", yaml_provider="", providers=_REGISTRY,
)
assert model == "sonnet"
assert provider is None
+221
@@ -0,0 +1,221 @@
"""Pin the CLI-stream-error surfacing in _run_query + _format_process_error.
When the `claude` CLI errors mid-stream, the claude-agent-sdk throws a bare
``Exception("Command failed with exit code 1 …")`` whose only text is the
useless ``Check stderr output for details`` placeholder — but the *actual*
failure reason (model 404, rate limit, auth) arrived a moment earlier as a
stream-json ``ResultMessage(is_error=True)`` carrying ``result`` text and
``api_error_status``. ``_run_query`` salvages that onto the exception
(``_molecule_stream_detail``); ``_format_process_error`` surfaces it.
Regression context: the 2026-05-10 dev-team incident — six lead workspaces
404ing on every turn (``--model claude-code`` → ``api_error_status=404``,
"There's an issue with the selected model (claude-code)"), invisible for an
hour because the CLI wrote nothing to stderr and that text was thrown away.
See internal#226.
Stub pattern mirrors test_runtime_wedge_mirror.py — same _ensure_module /
_ensure_attr / _load_executor helpers so a real-package install on a
workstation still wins over the stubs.
"""
import os
import sys
import types
from unittest.mock import MagicMock
import pytest
# ---- Stubs (mirror test_runtime_wedge_mirror.py) ----
def _ensure_module(dotted: str) -> types.ModuleType:
if dotted not in sys.modules:
sys.modules[dotted] = types.ModuleType(dotted)
return sys.modules[dotted]
def _ensure_attr(mod: types.ModuleType, name: str, value: object) -> None:
if not hasattr(mod, name):
setattr(mod, name, value)
def _install_executor_stubs():
sdk = _ensure_module("claude_agent_sdk")
_ensure_attr(sdk, "ClaudeAgentOptions", MagicMock(name="ClaudeAgentOptions"))
_ensure_attr(sdk, "AssistantMessage", type("AssistantMessage", (), {}))
_ensure_attr(sdk, "TextBlock", type("TextBlock", (), {}))
_ensure_attr(sdk, "ResultMessage", type("ResultMessage", (), {}))
_ensure_attr(sdk, "query", MagicMock(name="query"))
_ensure_module("a2a")
_ensure_module("a2a.server")
a2a_exec = _ensure_module("a2a.server.agent_execution")
_ensure_attr(a2a_exec, "AgentExecutor", type("AgentExecutor", (), {}))
_ensure_attr(a2a_exec, "RequestContext", type("RequestContext", (), {}))
a2a_events = _ensure_module("a2a.server.events")
_ensure_attr(a2a_events, "EventQueue", type("EventQueue", (), {}))
a2a_helpers = _ensure_module("a2a.helpers")
_ensure_attr(a2a_helpers, "new_text_message", lambda *_a, **_kw: None)
_ensure_module("molecule_runtime")
rw = _ensure_module("molecule_runtime.runtime_wedge")
_ensure_attr(rw, "mark_wedged", lambda *_a, **_kw: None)
_ensure_attr(rw, "clear_wedge", lambda *_a, **_kw: None)
helpers = _ensure_module("molecule_runtime.executor_helpers")
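# collect_outbound_files / extract_attached_files are stubbed below with
# list-returning lambdas; listing them here too would make those later
# _ensure_attr calls no-ops (the attribute would already exist).
for name in (
"auto_push_hook", "brief_summary", "commit_memory",
"extract_message_text", "get_a2a_instructions",
"get_hma_instructions", "read_delegation_results", "recall_memories",
"set_current_task",
):
# Bind ``name`` per iteration via a default argument; a bare closure
# would only ever see the loop's final value.
_ensure_attr(helpers, name, lambda *_a, _name=name, **_kw: ("" if "instr" in _name or "summary" in _name else None))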
_ensure_attr(helpers, "collect_outbound_files", lambda *_a, **_kw: [])
_ensure_attr(helpers, "extract_attached_files", lambda *_a, **_kw: [])
_ensure_attr(helpers, "get_mcp_server_path", lambda *_a, **_kw: "/dev/null")
_ensure_attr(helpers, "get_system_prompt", lambda *_a, **_kw: "")
_ensure_attr(helpers, "sanitize_agent_error", lambda e: str(e))
_ensure_attr(helpers, "CONFIG_MOUNT", "/configs")
_ensure_attr(helpers, "WORKSPACE_MOUNT", "/workspace")
_ensure_attr(helpers, "MEMORY_CONTENT_MAX_CHARS", 10000)
def _load_executor():
_install_executor_stubs()
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if parent_dir not in sys.path:
sys.path.insert(0, parent_dir)
sys.modules.pop("claude_sdk_executor", None)
import claude_sdk_executor # noqa: WPS433
return claude_sdk_executor
def _async_stream(messages, raise_at_end=None):
"""Build a fake `sdk.query(...)` return value: an async iterator that
yields ``messages`` then (optionally) raises — exactly the shape the
claude-agent-sdk produces when the CLI errors after emitting a
ResultMessage(is_error)."""
class _Stream:
def __aiter__(self):
return self
def __init__(self):
self._it = iter(messages)
async def __anext__(self):
try:
return next(self._it)
except StopIteration:
if raise_at_end is not None:
raise raise_at_end
raise StopAsyncIteration
return _Stream()
# ─── _format_process_error: surface the salvaged stream detail ───────────
def test_format_process_error_surfaces_molecule_stream_detail():
mod = _load_executor()
exc = Exception("Command failed with exit code 1 — Check stderr output for details")
exc._molecule_stream_detail = (
'api_error_status=404 result="There\'s an issue with the selected '
'model (claude-code). It may not exist or you may not have access."'
)
out = mod._format_process_error(exc)
assert "cli_stream_error=" in out
assert "api_error_status=404" in out
assert "claude-code" in out
# When we already salvaged the real error, don't ALSO re-probe the CLI
# (the probe can't replay the failing --model argv and may mislead).
assert "probed_cli_error" not in out
def test_format_process_error_still_probes_when_no_stream_detail(monkeypatch):
"""The #160 fallback (probe the CLI when only the swallowed-stderr
placeholder is present) still fires when _run_query had nothing to
salvage."""
mod = _load_executor()
monkeypatch.setattr(mod, "_probe_claude_cli_error", lambda: "You've hit your limit · resets Apr 17")
exc = Exception("Command failed with exit code 1 — Check stderr output for details")
out = mod._format_process_error(exc)
assert "probed_cli_error=" in out
assert "hit your limit" in out
def test_format_process_error_stream_detail_takes_precedence_over_probe(monkeypatch):
mod = _load_executor()
probe = MagicMock(name="_probe_claude_cli_error", return_value="<should not be called>")
monkeypatch.setattr(mod, "_probe_claude_cli_error", probe)
exc = Exception("Command failed with exit code 1 — Check stderr output for details")
exc._molecule_stream_detail = "api_error_status=429 result='rate limited'"
out = mod._format_process_error(exc)
assert "cli_stream_error=" in out
probe.assert_not_called()
# ─── _run_query: salvage the detail onto the raised exception ────────────
@pytest.mark.asyncio
async def test_run_query_annotates_exception_from_is_error_result_message():
mod = _load_executor()
sdk = sys.modules["claude_agent_sdk"]
rm = sdk.ResultMessage()
rm.session_id = "sess-1"
rm.result = "There's an issue with the selected model (claude-code)."
rm.is_error = True
rm.api_error_status = 404
boom = Exception("Command failed with exit code 1 (exit code: 1)\nCheck stderr output for details")
sdk.query = MagicMock(return_value=_async_stream([rm], raise_at_end=boom))
ex = mod.ClaudeSDKExecutor(system_prompt="", config_path="/tmp", heartbeat=None, model="opus")
with pytest.raises(Exception) as ei:
await ex._run_query("hi", options=MagicMock())
detail = getattr(ei.value, "_molecule_stream_detail", None)
assert detail is not None
assert "api_error_status=404" in detail
assert "claude-code" in detail
# And it threads through the formatter the executor's error path uses.
assert "cli_stream_error=" in mod._format_process_error(ei.value)
@pytest.mark.asyncio
async def test_run_query_falls_back_to_assistant_text_when_no_error_result():
mod = _load_executor()
sdk = sys.modules["claude_agent_sdk"]
tb = sdk.TextBlock()
tb.text = "helpful pre-crash context from the model"
am = sdk.AssistantMessage()
am.content = [tb]
boom = Exception("Command failed with exit code 1 — Check stderr output for details")
sdk.query = MagicMock(return_value=_async_stream([am], raise_at_end=boom))
ex = mod.ClaudeSDKExecutor(system_prompt="", config_path="/tmp", heartbeat=None, model="opus")
with pytest.raises(Exception) as ei:
await ex._run_query("hi", options=MagicMock())
assert getattr(ei.value, "_molecule_stream_detail", None) == "helpful pre-crash context from the model"
@pytest.mark.asyncio
async def test_run_query_clean_success_unaffected():
"""No exception → no annotation, normal QueryResult."""
mod = _load_executor()
sdk = sys.modules["claude_agent_sdk"]
rm = sdk.ResultMessage()
rm.session_id = "sess-ok"
rm.result = "done"
rm.is_error = False
sdk.query = MagicMock(return_value=_async_stream([rm]))
ex = mod.ClaudeSDKExecutor(system_prompt="", config_path="/tmp", heartbeat=None, model="opus")
res = await ex._run_query("hi", options=MagicMock())
assert res.text == "done"
assert res.session_id == "sess-ok"