fix(runtime+canvas): surface actionable provider error reason instead of opaque "Agent error (Exception)"
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m38s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m38s
CI / Platform (Go) (pull_request) Successful in 7m2s
CI / Python Lint & Test (pull_request) Successful in 6m39s
CI / Canvas (Next.js) (pull_request) Successful in 7m56s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Blocked by required conditions
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 12s
E2E Chat / detect-changes (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
publish-runtime-autobump / bump-and-tag (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Failing after 6s
gate-check-v3 / gate-check (pull_request) Successful in 6s
qa-review / approved (pull_request) Successful in 6s
security-review / approved (pull_request) Successful in 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
publish-runtime-autobump / pr-validate (pull_request) Successful in 33s
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s
E2E Chat / E2E Chat (pull_request) Failing after 13s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Failing after 55s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2m38s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2m38s
CI / Platform (Go) (pull_request) Successful in 7m2s
CI / Python Lint & Test (pull_request) Successful in 6m39s
CI / Canvas (Next.js) (pull_request) Successful in 7m56s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Blocked by required conditions
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 9s
E2E API Smoke Test / detect-changes (pull_request) Successful in 12s
E2E Chat / detect-changes (pull_request) Successful in 10s
Harness Replays / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 10s
publish-runtime-autobump / bump-and-tag (pull_request) Has been skipped
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 11s
Secret scan / Scan diff for credential-shaped strings (pull_request) Failing after 6s
gate-check-v3 / gate-check (pull_request) Successful in 6s
qa-review / approved (pull_request) Successful in 6s
security-review / approved (pull_request) Successful in 6s
sop-checklist / na-declarations (pull_request) N/A: (none)
publish-runtime-autobump / pr-validate (pull_request) Successful in 33s
sop-checklist / all-items-acked (pull_request) Successful in 6s
sop-tier-check / tier-check (pull_request) Successful in 6s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m9s
E2E Chat / E2E Chat (pull_request) Failing after 13s
Harness Replays / Harness Replays (pull_request) Successful in 2s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Failing after 55s
internal#212 (P0 from internal#211). When the embedded `claude` CLI emits a terminal result message with is_error=true (e.g. 403 oauth_org_not_allowed "Your organization has disabled Claude subscription access · Use an Anthropic API key instead, or ask your admin to enable access"), the user saw only `Agent error (Exception) — see workspace logs for details.` — a dead end (no such logs UI) that discards the exact secret-safe, actionable text the user needs. Root cause was a multi-cut loss of the CLI's result/error/api_error_status: cut #2 sanitize_agent_error reduced every failure to type(exc).__name__. → add a `reason` passthrough: a pre-curated, user-actionable, secret-safe explanation is surfaced verbatim (still scrubbed for key/token/bearer as a second pass). reason wins over stderr; omitting it preserves the prior generic behavior exactly. cut #3a workspace-server dropped error_detail from the live ACTIVITY_LOGGED websocket broadcast (it was persisted to the DB column but never sent), so the canvas had nothing to render. → include error_detail in the broadcast payload (already capped at 4096 by the runtime's report_activity helper). cut #3b canvas useChatSocket hardcoded the opaque string, ignoring even the activity summary. → render error_detail (fallback: summary, then a generic retry hint). The dead "see workspace logs for details." phrase that pointed at nonexistent UI is removed (a full logs tab is a separate larger follow-up, not this PR — reason-first per CTO). The runtime-side cut #1 (template-claude-code claude_sdk_executor._run_query ignoring is_error and the SDK collapsing errors[] to the bare subtype "success") is fixed in a stacked PR on molecule-ai-workspace-template-claude-code (depends on this PR's sanitize_agent_error `reason` kwarg, which ships via the molecule-ai-workspace-runtime package). Tests: 4 new sanitize_agent_error reason tests (verbatim surfacing, secret scrub still applied, reason>stderr precedence, no-reason unchanged). Verified fail-before / pass-after; full sanitize suite green; no new regressions (the 2 pre-existing test_get_a2a_instructions_mcp failures are unrelated). Refs: internal#211, internal#212 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -67,9 +67,21 @@ export function useChatSocket(
|
||||
const own = (targetId || msg.workspace_id) === workspaceId;
|
||||
if (own) {
|
||||
callbacksRef.current.onSendComplete?.();
|
||||
callbacksRef.current.onSendError?.(
|
||||
"Agent error (Exception) — see workspace logs for details.",
|
||||
);
|
||||
// internal#211/#212: surface the runtime's curated,
|
||||
// user-actionable reason (provider HTTP status + error
|
||||
// code + the provider's own guidance, e.g. a 403 "org
|
||||
// disabled · use an API key / ask your admin"). The
|
||||
// server now includes error_detail in the ACTIVITY_LOGGED
|
||||
// broadcast; fall back to summary, and only as a last
|
||||
// resort to a generic line. The old hardcoded
|
||||
// "Agent error (Exception) — see workspace logs for
|
||||
// details." string pointed at a logs UI that does not
|
||||
// exist and discarded the actionable reason entirely.
|
||||
const detail =
|
||||
(p.error_detail as string) ||
|
||||
(p.summary as string) ||
|
||||
"The agent turn failed but the runtime reported no detail. Retry once; if it repeats the workspace runtime may need a restart.";
|
||||
callbacksRef.current.onSendError?.(detail);
|
||||
}
|
||||
}
|
||||
} else if (type === "a2a_send") {
|
||||
|
||||
@@ -691,6 +691,19 @@ func logActivityExec(ctx context.Context, exec activityExecutor, broadcaster eve
|
||||
if respStr != nil {
|
||||
payload["response_body"] = json.RawMessage(respJSON)
|
||||
}
|
||||
// internal#211/#212: error_detail carries the runtime's curated,
|
||||
// user-actionable, secret-safe failure reason (provider HTTP
|
||||
// status + error code + the provider's own guidance, e.g. a 403
|
||||
// "org disabled · use an API key / ask your admin"). It is
|
||||
// already persisted to the DB column above and capped by the
|
||||
// runtime's report_activity helper (4096 chars). Previously it
|
||||
// was dropped from the LIVE broadcast, so the canvas had nothing
|
||||
// to render and fell back to a hardcoded opaque
|
||||
// "Agent error (Exception) — see workspace logs" string. Include
|
||||
// it so the chat bubble shows the real reason in real time.
|
||||
if params.ErrorDetail != nil && *params.ErrorDetail != "" {
|
||||
payload["error_detail"] = *params.ErrorDetail
|
||||
}
|
||||
}
|
||||
|
||||
return func() {
|
||||
|
||||
@@ -608,6 +608,7 @@ def sanitize_agent_error(
|
||||
exc: BaseException | None = None,
|
||||
category: str | None = None,
|
||||
stderr: str | None = None,
|
||||
reason: str | None = None,
|
||||
) -> str:
|
||||
"""Render an agent-side failure into a user-safe error message.
|
||||
|
||||
@@ -615,6 +616,18 @@ def sanitize_agent_error(
|
||||
category string (e.g. from `classify_subprocess_error`). If both are
|
||||
given, `category` wins. If neither, the tag defaults to "unknown".
|
||||
|
||||
When ``reason`` is provided (internal#211/#212), it is a *pre-curated,
|
||||
user-actionable, secret-safe* explanation built by the caller from a
|
||||
provider-side failure — e.g. a 403 "Your organization has disabled
|
||||
Claude subscription access · Use an Anthropic API key instead, or ask
|
||||
your admin to enable access" with error code ``oauth_org_not_allowed``.
|
||||
This text is exactly what the user needs to self-serve, so it is
|
||||
surfaced VERBATIM as the message instead of being collapsed to the
|
||||
opaque exception class name. It still passes through the
|
||||
key/token/bearer/path scrubber as a belt-and-braces second pass so a
|
||||
buggy caller can't leak a credential that snuck into the reason.
|
||||
``reason`` wins over ``stderr``; both lose to neither being set.
|
||||
|
||||
When ``stderr`` is provided (e.g. the first ~1 KB of a subprocess stderr
|
||||
or HTTP error body), it is sanitized and appended to the output so the
|
||||
A2A caller gets actionable context without needing to dig through workspace
|
||||
@@ -629,6 +642,13 @@ def sanitize_agent_error(
|
||||
else:
|
||||
tag = "unknown"
|
||||
|
||||
if reason:
|
||||
# Curated, user-actionable reason — surface it as the message.
|
||||
# Still scrub: a 403/auth/quota message is safe, but the scrubber
|
||||
# is cheap insurance against a caller that didn't curate cleanly.
|
||||
clean = _sanitize_for_external(reason[:_MAX_STDERR_PREVIEW])
|
||||
return f"Agent error ({tag}): {clean}"
|
||||
|
||||
if stderr:
|
||||
# Truncate and sanitize before including — prevents DoS via
|
||||
# a malicious or buggy peer injecting a huge error body, and
|
||||
|
||||
@@ -788,6 +788,64 @@ def test_sanitize_agent_error_stderr_combined_with_existing_tests():
|
||||
assert "workspace logs" in out
|
||||
|
||||
|
||||
# ─── reason passthrough (internal#211/#212: surface actionable provider error) ───
|
||||
|
||||
|
||||
def test_sanitize_agent_error_reason_surfaced_verbatim():
|
||||
"""A curated provider reason is shown to the user, not collapsed to the
|
||||
exception class name. This is the internal#211 regression: a 403
|
||||
org-disabled message must reach the canvas."""
|
||||
reason = (
|
||||
"provider HTTP 403 — oauth_org_not_allowed — Your organization has "
|
||||
"disabled Claude subscription access for Claude Code · Use an "
|
||||
"Anthropic API key instead, or ask your admin to enable access"
|
||||
)
|
||||
|
||||
class _ResultErr(Exception):
|
||||
pass
|
||||
|
||||
out = sanitize_agent_error(exc=_ResultErr("opaque"), reason=reason)
|
||||
# The actionable provider guidance and status code must be visible.
|
||||
assert "403" in out
|
||||
assert "oauth_org_not_allowed" in out
|
||||
assert "disabled Claude subscription access" in out
|
||||
assert "ask your admin to enable access" in out
|
||||
# NOT the old opaque form.
|
||||
assert "see workspace logs" not in out
|
||||
|
||||
|
||||
def test_sanitize_agent_error_reason_still_scrubs_secrets():
|
||||
"""Even on the reason path the key/token scrubber runs — a buggy caller
|
||||
that lets a bearer token into the reason still gets it redacted."""
|
||||
leaky = (
|
||||
"provider HTTP 401 — auth failed — Authorization: Bearer "
|
||||
"sk-ant-DEADBEEFDEADBEEFDEADBEEF0123456789abcdef please re-auth"
|
||||
)
|
||||
out = sanitize_agent_error(reason=leaky)
|
||||
assert "[REDACTED]" in out
|
||||
assert "sk-ant-DEADBEEFDEADBEEFDEADBEEF0123456789abcdef" not in out
|
||||
# The non-secret guidance still survives the scrub.
|
||||
assert "401" in out
|
||||
assert "please re-auth" in out
|
||||
|
||||
|
||||
def test_sanitize_agent_error_reason_wins_over_stderr():
|
||||
"""When both reason and stderr are passed, the curated reason wins."""
|
||||
out = sanitize_agent_error(
|
||||
reason="provider HTTP 403 — use an API key",
|
||||
stderr="raw subprocess noise that should not be shown",
|
||||
)
|
||||
assert "use an API key" in out
|
||||
assert "raw subprocess noise" not in out
|
||||
|
||||
|
||||
def test_sanitize_agent_error_no_reason_unchanged():
|
||||
"""Omitting reason preserves the original generic behavior."""
|
||||
out = sanitize_agent_error(exc=ValueError("boom"))
|
||||
assert "ValueError" in out
|
||||
assert "workspace logs" in out
|
||||
|
||||
|
||||
|
||||
# ======================================================================
|
||||
# classify_subprocess_error
|
||||
|
||||
Reference in New Issue
Block a user