Compare commits

..

9 Commits

Author SHA1 Message Date
core-devops 8399e8b525 fix(queue): correct status deduplication order so newest entry wins
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 5/7 — missing: root-cause, no-backwards-compat
sop-checklist / na-declarations (pull_request) N/A: (none)
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Chat / detect-changes (pull_request) Successful in 7s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 7s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 57s
gate-check-v3 / gate-check (pull_request) Successful in 3s
qa-review / approved (pull_request) Failing after 3s
security-review / approved (pull_request) Failing after 3s
sop-tier-check / tier-check (pull_request) Successful in 4s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m5s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
CI / Platform (Go) (pull_request) Successful in 4m21s
E2E Chat / E2E Chat (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 1s
CI / Canvas (Next.js) (pull_request) Successful in 5m43s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6m19s
CI / all-required (pull_request) Successful in 6m22s
The queue was incorrectly seeing main's CI/all-required (push) as
"pending" instead of "success". Two bugs interacting:

1. latest_statuses_by_context guard was wrong: `ids[-1] > ids[0]`
   detected ascending but the combined /statuses array is DESCENDING
   (ids 393→1). Fix: `ids[-1] < ids[0]` detects descending and
   reverses so ascending iteration makes newest last → wins.

2. get_combined_status sorted merged entries DESCENDING then deduplicated
   by iterating forward — the last occurrence won. But when /status
   base entries (low ids) are appended AFTER /statuses (high ids), the
   same-context entries from base appear LAST after descending sort,
   overwriting newer entries from /statuses. Fix: return merged list
   sorted ASCENDING and drop the inline dedup; let
   latest_statuses_by_context handle dedup correctly.

Test names clarified: ascending-input test now named
test_latest_statuses_ascending_input_newest_wins (the base /status
case); descending-input test renamed
test_latest_statuses_guard_reverses_descending_input (the /statuses
case). Both verify newest (largest id) wins.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:29:57 +00:00
core-devops 5e47d2e385 fix(queue): query merge-queue label by name not resolved ID
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 12s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 56s
gate-check-v3 / gate-check (pull_request) Successful in 3s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 53s
qa-review / approved (pull_request) Failing after 3s
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request) Failing after 3s
sop-checklist / all-items-acked (pull_request) Successful in 3s
sop-tier-check / tier-check (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4m34s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 6m14s
E2E Chat / E2E Chat (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 1s
CI / Python Lint & Test (pull_request) Successful in 6m28s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 5m52s
Gitea orders /issues?labels=<id> by PR number ascending with limit
applied before PR #1233 appears — the 50-result page starts at PR #1309
and misses #1233 entirely. Querying by label name returns #1233
correctly. Drop the _ensure_label_ids() startup call (one less API
round-trip per tick) and the now-dead _QUEUE_LABEL_ID/_HOLD_LABEL_ID
globals. Resolves the queue label query bug root-causing SEV-1 #487.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 09:14:36 +00:00
core-devops 6c06227871 fix(queue): correct latest_statuses_by_context guard for descending input
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 2s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
E2E API Smoke Test / detect-changes (pull_request) Successful in 6s
E2E Chat / detect-changes (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 58s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 6s
gate-check-v3 / gate-check (pull_request) Successful in 6s
qa-review / approved (pull_request) Failing after 6s
security-review / approved (pull_request) Failing after 5s
sop-tier-check / tier-check (pull_request) Successful in 4s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 2s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
CI / Platform (Go) (pull_request) Successful in 4m18s
CI / Canvas (Next.js) (pull_request) Successful in 5m33s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6m23s
CI / all-required (pull_request) Successful in 6m36s
Gitea /statuses returns newest-first (desc id order). After
get_combined_status sorts by id descending, the combined list is also
descending. The old guard `ids[-1] > ids[0]` detected ascending input
but NOT descending — for main (130+ statuses) the guard did not fire,
causing forward iteration to grab the newest entry instead of the oldest
(which is the correct authoritative status when iterating a descending
list). The fix inverts the comparison to `ids[-1] < ids[0]`, so that
descending input triggers reversal and the oldest (authoritative) entry
per context wins. Ascending test fixtures work unchanged.

Also adds explicit-id test fixture for the ascending-guard case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 08:41:36 +00:00
core-devops f6abdb9dc1 fix(queue): proper merge of base + extended statuses by id sort
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Platform (Go) (pull_request) Successful in 4m4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 3s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 57s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 2s
CI / Canvas (Next.js) (pull_request) Successful in 5m25s
qa-review / approved (pull_request) Failing after 2s
security-review / approved (pull_request) Failing after 2s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 56s
CI / Python Lint & Test (pull_request) Successful in 6m29s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 3s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 5s
CI / all-required (pull_request) Successful in 6m37s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
gate-check-v3 / gate-check (pull_request) Successful in 6s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 13s
sop-tier-check / tier-check (pull_request) Successful in 4s
sop-checklist / all-items-acked (pull_request) acked: 5/7 — missing: root-cause, no-backwards-compat
sop-checklist / na-declarations (pull_request) N/A: (none)
The previous supplement logic only added contexts MISSING from base, but didn't
overwrite base entries with newer statuses from /statuses. Result: stale
"failure" entries from base (id=27) overwrote newer "pending" entries from
/statuses (id=25) because supplement only filled gaps.

Fix: collect all entries from both /status (base) and /statuses (extended),
sort by id descending (highest = newest), and iterate in that order so the
newest entry for each context wins regardless of source.

The combined statuses[] is now correct for all cases:
- Newest in base only: wins (from sorted iteration)
- Newest in extended only: wins (supplements base)
- Newest in base, older in extended: wins (base entry processed later in sort)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 08:11:58 +00:00
core-devops ec79a6bb20 fix(queue): supplement statuses overwrite base, not just fill gaps
sop-checklist / all-items-acked (pull_request) Successful in 2s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 2s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 4m4s
E2E API Smoke Test / detect-changes (pull_request) Successful in 4s
E2E Chat / detect-changes (pull_request) Successful in 3s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 3s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 2s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 50s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 3s
CI / Canvas (Next.js) (pull_request) Successful in 5m25s
gate-check-v3 / gate-check (pull_request) Successful in 2s
qa-review / approved (pull_request) Failing after 2s
sop-checklist / na-declarations (pull_request) N/A: (none)
security-review / approved (pull_request) Failing after 3s
sop-tier-check / tier-check (pull_request) Successful in 3s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 56s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
CI / Python Lint & Test (pull_request) Successful in 6m27s
CI / all-required (pull_request) Successful in 6m22s
The base /status endpoint returns only 26-30 entries; newer statuses for
the same context may not be in the base array. The supplement logic
was only adding contexts MISSING from base, but the base already contained
an old "pending" entry for CI/all-required while the newer "success" entry
was beyond the base array's cutoff. Now the supplement OVERWRITES base
entries for the same context so newer statuses always win.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 08:02:52 +00:00
core-devops f8d4512e1f fix(queue): correct status ordering and supplement missing contexts
CI / Detect changes (pull_request) Successful in 4s
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 2s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 4m42s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 4s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 2s
gate-check-v3 / gate-check (pull_request) Successful in 2s
qa-review / approved (pull_request) Failing after 4s
security-review / approved (pull_request) Failing after 2s
sop-checklist / all-items-acked (pull_request) Successful in 2s
sop-tier-check / tier-check (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 52s
CI / Canvas (Next.js) (pull_request) Successful in 6m13s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 53s
CI / Python Lint & Test (pull_request) Successful in 6m27s
CI / all-required (pull_request) Successful in 6m22s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 1s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Two related fixes to get_combined_status() + latest_statuses_by_context():

1. Ordering: Gitea /statuses returns entries in DESCENDING id order
   (newest first). The script was reversing, treating it as ascending,
   which made the OLDEST entry win instead of the newest. Now iterate
   forward so newer entries overwrite older ones (newest wins).

2. Context gaps: The /status endpoint returns only 30 statuses in its
   statuses[] array. The /statuses endpoint (limit=100) may not include
   all contexts from /status. Now merge: start with /status's statuses[]
   (authoritative, ascending), supplement missing contexts from
   /statuses (descending, reversed for correct iteration order).

Also fixes test_latest_statuses_dedupes_by_context_newest_first to
assert the correct "newest wins" semantics.

PR #1403 now correctly shows ready=True action=merge with this fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:59:39 +00:00
core-devops b0ec931595 fix(queue): resolve merge-queue label by ID not name
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 4s
CI / Detect changes (pull_request) Successful in 5s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 9s
CI / Platform (Go) (pull_request) Successful in 5m10s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 4s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 4s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 1m1s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
CI / Canvas (Next.js) (pull_request) Successful in 6m34s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 3s
security-review / approved (pull_request) Failing after 4s
sop-checklist / na-declarations (pull_request) N/A: (none)
sop-checklist / all-items-acked (pull_request) Successful in 3s
sop-tier-check / tier-check (pull_request) Successful in 4s
CI / Python Lint & Test (pull_request) Successful in 6m54s
CI / all-required (pull_request) Successful in 6m9s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 2s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 55s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
Gitea allows multiple repo labels with the same name but different
colours. The /issues endpoint with labels=<name> matches at most one
of them — not reliably the canonical colour. This caused
list_queued_issues() to miss PRs that only had the canonical
merge-queue label (id=27, colour 1f883d) when duplicates with a
different colour existed in the repo.

Fix: _resolve_label_id() looks up the label's numeric id at startup
and list_queued_issues() queries by that id instead of the name.
This is stable regardless of how many duplicate labels exist.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 07:40:42 +00:00
core-devops 8ccf3a844c fix(queue): surface merge API errors instead of silent catch
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 8s
E2E API Smoke Test / detect-changes (pull_request) Successful in 7s
E2E Chat / detect-changes (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 5s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 3s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 56s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 5s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
gate-check-v3 / gate-check (pull_request) Successful in 3s
qa-review / approved (pull_request) Failing after 3s
security-review / approved (pull_request) Failing after 3s
sop-tier-check / tier-check (pull_request) Successful in 3s
CI / Platform (Go) (pull_request) Successful in 4m11s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 2s
E2E Chat / E2E Chat (pull_request) Successful in 2s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 1s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m2s
sop-checklist / all-items-acked (pull_request) acked: 0/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +4 — body-unfilled: comprehensive-testing, local-postgres-e2
sop-checklist / na-declarations (pull_request) N/A: (none)
CI / Canvas (Next.js) (pull_request) Successful in 5m33s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / Python Lint & Test (pull_request) Successful in 6m33s
CI / all-required (pull_request) Successful in 5m9s
audit-force-merge / audit (pull_request) Has been skipped
When the merge API returns a non-transient error (HTTP 405 permission
denied, HTTP 422 pre-receive hook block, etc.), the queue was catching
ApiError in the generic main-loop handler and exiting 0 — indistinguishable
from a successful-no-op tick.

Fix: catch ApiError specifically around merge_pull(), post a PR comment
with the error detail and a reference to SEV-1 internal#487, and return
exit code 2 so the workflow run is marked failed.

Exit codes:
  0 — success (merged, updated, or nothing to do)
  2 — merge API error (permission/hook issue, non-transient)

Fixes: SEV-1 internal#487 — queue silently failing to merge while
reporting success; merge permission error invisible without workflow
log inspection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 06:24:11 +00:00
core-devops 9ede993f3d fix(sop-checklist): probe() KeyError for gate names in compute_na_state
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 3s
CI / Detect changes (pull_request) Successful in 4s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 14s
E2E API Smoke Test / detect-changes (pull_request) Successful in 5s
E2E Chat / detect-changes (pull_request) Successful in 6s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 5s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 7s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 6s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 4s
lint-required-no-paths / lint-required-no-paths (pull_request) Successful in 58s
gate-check-v3 / gate-check (pull_request) Successful in 4s
qa-review / approved (pull_request) Failing after 3s
security-review / approved (pull_request) Failing after 3s
Ops Scripts Tests / Ops scripts (unittest) (pull_request) Successful in 1m0s
sop-tier-check / tier-check (pull_request) Successful in 4s
CI / Platform (Go) (pull_request) Successful in 4m52s
CI / Canvas (Next.js) (pull_request) Successful in 6m35s
CI / Python Lint & Test (pull_request) Successful in 6m38s
CI / all-required (pull_request) Successful in 6m39s
E2E Chat / E2E Chat (pull_request) Successful in 2s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 1s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 1s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 2s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 2s
sop-checklist / all-items-acked (pull_request) [info tier:low] acked: 2/7 — missing: comprehensive-testing, local-postgres-e2e, staging-smoke, +2 — body-unfilled: comprehensive-testing, l
sop-checklist / na-declarations (pull_request) N/A: (none)
audit-force-merge / audit (pull_request) Has been skipped
compute_na_state() calls probe(gate_name, [user]) where gate_name is a gate
name like 'qa-review' or 'security-review' — these are not checklist item
slugs and are not in items_by_slug. probe() was doing:

    item = items_by_slug[slug]   # KeyError for 'qa-review'

This caused the sop-checklist workflow to crash on any PR that has N/A gates
configured (all 7 checklist items with /sop-n/a), producing a 30-minute
Failing status before Gitea kills the job.

Fix: add _required_teams_for() helper that falls back to na_gates lookup
when slug is not in items_by_slug. Gate names resolve to their
required_teams from the n/a_gates config section.

Adds TestProbeNaGateFallback regression test (58/58 passing).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 05:09:35 +00:00
9 changed files with 225 additions and 238 deletions
+68 -107
View File
@@ -44,15 +44,9 @@ REQUIRED_CONTEXTS_RAW = _env(
"REQUIRED_CONTEXTS",
default=(
"CI / all-required (pull_request),"
"sop-checklist / all-items-acked (pull_request),"
"E2E Chat / E2E Chat (pull_request)"
"sop-checklist / all-items-acked (pull_request)"
),
)
# E2E Chat is not in branch protection's status_check_contexts, but Gitea's
# merge gate evaluates the full combined status including it. Adding it here
# prevents the queue from attempting a merge that will be 405'd by Gitea when
# E2E Chat is failing (e.g. runner-stall Quirk #9 on a flaky test).
# See: mc#420 / molecule-core runbooks/gitea-operational-quirks.md Quirk #9.
# Required contexts for push (main/staging) runs. The push CI uses the same
# aggregator names with " (push)" suffix. Checking these explicitly instead of
# the combined state avoids false-pause when non-blocking jobs (e.g. Platform
@@ -71,11 +65,6 @@ class ApiError(RuntimeError):
pass
class MergePermissionError(ApiError):
"""Merge failed with a permanent permission error (403/404/405).
The queue should skip this PR and move to the next one."""
@dataclasses.dataclass(frozen=True)
class MergeDecision:
ready: bool
@@ -148,14 +137,25 @@ def status_state(status: dict) -> str:
def latest_statuses_by_context(statuses: list[dict]) -> dict[str, dict]:
# Gitea /statuses endpoint returns entries in ascending id order (oldest
# first). We need the LAST occurrence of each context, so iterate in
# reverse to prefer newer entries.
# Iterate so the newest entry for each context is seen LAST → it overwrites
# older ones in the accumulator dict.
# - Ascending input (oldest first, e.g. Gitea /status base array): forward
# iteration processes oldest first, newest last → newest overwrites → OK.
# - Descending input (newest first, e.g. Gitea /statuses, combined array):
# forward iteration processes newest first → oldest last → oldest wins.
# Must REVERSE so iteration is oldest→newest → newest wins.
# Guard: detect ascending by checking last_id > first_id.
if not statuses:
return {}
ids = [s.get("id", 0) for s in statuses if isinstance(s.get("id"), int)]
if ids and ids[-1] < ids[0]:
# Descending (newest first) — reverse to oldest→newest iteration.
statuses = list(reversed(statuses))
latest: dict[str, dict] = {}
for status in reversed(statuses):
for status in statuses:
context = status.get("context")
if isinstance(context, str):
latest[context] = status # overwrite: reverse order → newest wins
latest[context] = status
return latest
@@ -257,37 +257,54 @@ def get_branch_head(branch: str) -> str:
def get_combined_status(sha: str) -> dict:
"""Combined status + all individual statuses for `sha`.
The /status endpoint caps the `statuses` array at 30 entries (Gitea
default page size), so we fetch the full list via /statuses with a
higher limit. The combined `state` still comes from /status.
The /status endpoint returns a `statuses` array capped at 30 entries.
We supplement it with /statuses (limit=100) for contexts not in the
base array. The combined `state` always comes from /status.
Returns the merged list sorted ASCENDING by id. Caller's
latest_statuses_by_context iterates ascending so the newest (largest
id) for each context is seen last and wins.
"""
_, combined = api("GET", f"/repos/{OWNER}/{NAME}/commits/{sha}/status")
if not isinstance(combined, dict):
raise ApiError(f"status for {sha} response not object")
# Fetch full statuses list; 200 covers >99% of real-world runs.
# The list is ordered ascending by id (oldest first) — callers must
# iterate in reverse to get the newest entry per context.
# Best-effort: large repos (main with 550+ statuses) may time out.
# On timeout, fall back to the statuses[] already in the combined
# response (usually 30 entries — enough for most PRs, enough for
# main's early push-required contexts).
base_statuses: list[dict] = combined.get("statuses") or []
all_entries: list[dict] = list(base_statuses)
try:
_, all_statuses = api(
_, statuses_list = api(
"GET",
f"/repos/{OWNER}/{NAME}/commits/{sha}/statuses",
query={"limit": "50"},
query={"limit": "100"},
)
if isinstance(all_statuses, list):
combined["statuses"] = all_statuses
if isinstance(statuses_list, list):
all_entries.extend(statuses_list)
except (ApiError, urllib.error.URLError, TimeoutError, OSError) as exc:
# URLError covers network-level failures (DNS, refused, timeout).
# TimeoutError and OSError cover socket-level timeouts.
sys.stderr.write(f"::warning::could not fetch full statuses list for {sha[:8]}: {exc}\n")
# Fall back to the statuses[] already in the combined response.
pass
# Sort ascending by id. latest_statuses_by_context iterates ascending
# so the newest (largest id) entry for each context is seen last and wins.
all_entries.sort(key=lambda s: s.get("id") or 0)
combined["statuses"] = all_entries
return combined
def _resolve_label_id(name: str) -> str | None:
"""Return the repo label ID for `name`, or None if not found.
Gitea's /issues endpoint with labels=<name> has a known quirk: when multiple
repo labels share the same name (e.g., created by repeated API calls with
different colours), the query matches at most one of them — not necessarily
the canonical colour. Resolving to ID sidesteps the ambiguity.
"""
_, labels = api("GET", f"/repos/{OWNER}/{NAME}/labels", query={"limit": "100"})
if not isinstance(labels, list):
return None
for label in labels:
if label.get("name") == name:
return str(label["id"])
return None
def list_queued_issues() -> list[dict]:
_, body = api(
"GET",
@@ -325,31 +342,6 @@ def post_comment(pr_number: int, body: str, *, dry_run: bool) -> None:
api("POST", f"/repos/{OWNER}/{NAME}/issues/{pr_number}/comments", body={"body": body})
def add_hold_label(pr_number: int, *, dry_run: bool) -> None:
"""Add HOLD_LABEL to a PR if not already present."""
if not HOLD_LABEL:
return
# Check current labels first to avoid a no-op API call in dry-run.
_, current = api("GET", f"/repos/{OWNER}/{NAME}/issues/{pr_number}/labels")
current_names = {
l["name"] for l in (current if isinstance(current, list) else [])
}
if HOLD_LABEL in current_names:
print(f"::notice::PR #{pr_number} already has hold label; skipping add")
return
print(f"::notice::PR #{pr_number} adding hold label `{HOLD_LABEL}`")
if dry_run:
return
# Gitea accepts {"labels": ["label1", "label2"]} to append labels.
new_labels = list(current_names) + [HOLD_LABEL]
api(
"PATCH",
f"/repos/{OWNER}/{NAME}/issues/{pr_number}",
body={"labels": new_labels},
expect_json=False,
)
def update_pull(pr_number: int, *, dry_run: bool) -> None:
print(f"::notice::updating PR #{pr_number} with base branch via style={UPDATE_STYLE}")
if dry_run:
@@ -374,16 +366,7 @@ def merge_pull(pr_number: int, *, dry_run: bool) -> None:
print(f"::notice::merging PR #{pr_number}")
if dry_run:
return
try:
api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
except ApiError as exc:
# Re-raise permission-like errors so process_once can skip this PR.
# 403 = no push access, 404 = repo/pr not found, 405 = not allowed.
msg = str(exc)
for code in ("403", "404", "405"):
if code in msg:
raise MergePermissionError(msg) from exc
raise # re-raise other ApiErrors unchanged
api("POST", f"/repos/{OWNER}/{NAME}/pulls/{pr_number}/merge", body=payload, expect_json=False)
def process_once(*, dry_run: bool = False) -> int:
@@ -454,43 +437,21 @@ def process_once(*, dry_run: bool = False) -> int:
return 0
try:
merge_pull(pr_number, dry_run=dry_run)
except MergePermissionError as exc:
msg = str(exc)
is_status_check_failure = "not all required status checks successful" in msg
if is_status_check_failure:
# Gitea's merge gate failed due to a status check that passed our
# pre-flight but is failing at Gitea's side (e.g. runner-stall Quirk
# #9, or a context not in REQUIRED_CONTEXTS). Auto-add hold so the
# queue skips this PR and processes the next one. The hold can be
# removed once CI is green again.
add_hold_label(pr_number, dry_run=dry_run)
post_comment(
pr_number,
(
"merge-queue: merge blocked by Gitea's status-check gate "
"(E2E Chat or other non-required context failing). "
"Auto-held via `merge-queue-hold`. "
"Remove the hold label to requeue once CI is green. "
"If E2E Chat is stuck (runner stall / Quirk #9), CI will "
"self-recover after ~90 min and the hold can then be removed."
),
dry_run=dry_run,
)
return 0
else:
# Genuine permission error — token lacks Can-merge.
sys.stderr.write(f"::error::merge permission error for PR #{pr_number}: {exc}\n")
post_comment(
pr_number,
(
"merge-queue: merge failed with HTTP 405 'User not allowed to merge PR'. "
"No available token has Can-merge permission on this repo. "
"Fix: grant Can-merge to a token, or add a maintain/admin collaborator. "
"Skipping to next queued PR on next tick."
),
dry_run=dry_run,
)
return 0
except ApiError as exc:
# Merge API errors (405 permission denied, 422 hook block, etc.)
# are NOT transient — retrying will not help. Surface the error
# on the PR immediately so it is visible without digging into
# workflow logs, and fail the workflow so it is distinguishable
# from a successful-no-op tick.
post_comment(
pr_number,
f"merge-queue: MERGE FAILED — {exc}. "
"This is a non-transient error (permission or hook issue). "
"See SEV-1 internal#487.",
dry_run=dry_run,
)
sys.stderr.write(f"::error::PR #{pr_number} merge failed: {exc}\n")
return 2 # distinct exit code so workflow run shows failure
return 0
return 0
+11 -2
View File
@@ -830,9 +830,18 @@ def main(argv: list[str] | None = None) -> int:
# one membership lookup per team.
team_member_cache: dict[tuple[str, int], bool | None] = {}
def _required_teams_for(slug: str) -> list[str] | None:
"""Look up required_teams for a slug from checklist items OR N/A gates."""
if slug in items_by_slug:
return items_by_slug[slug]["required_teams"]
if slug in na_gates:
return na_gates[slug].get("required_teams", [])
return None
def probe(slug: str, users: list[str]) -> list[str]:
item = items_by_slug[slug]
team_names: list[str] = item["required_teams"]
team_names = _required_teams_for(slug)
if team_names is None:
raise KeyError(f"slug '{slug}' not found in items or N/A gates")
# Resolve names → ids. NOTE: orgs/{org}/teams/search may not be
# available — fall back to the list endpoint.
team_ids: list[int] = []
+74 -11
View File
@@ -1,6 +1,7 @@
import importlib.util
import sys
from pathlib import Path
from unittest.mock import patch
SCRIPT = Path(__file__).resolve().parents[1] / "gitea-merge-queue.py"
@@ -10,16 +11,37 @@ sys.modules[spec.name] = mq
spec.loader.exec_module(mq)
def test_latest_statuses_dedupes_by_context_newest_first():
def test_latest_statuses_ascending_input_newest_wins():
# Gitea /status (base array) returns ascending id order (oldest first).
# Forward iteration processes oldest first, newest last → newest overwrites.
statuses = [
{"context": "CI / all-required (pull_request)", "status": "failure"},
{"context": "sop-checklist / all-items-acked (pull_request)", "state": "success"},
{"context": "CI / all-required (pull_request)", "status": "success"},
{"id": 18, "context": "CI / all-required (pull_request)", "status": "failure"}, # oldest
{"id": 27, "context": "sop-checklist / all-items-acked (pull_request)", "state": "success"},
{"id": 54, "context": "CI / all-required (pull_request)", "status": "success"}, # newest
]
latest = mq.latest_statuses_by_context(statuses)
assert latest["CI / all-required (pull_request)"]["status"] == "failure"
assert latest["CI / all-required (pull_request)"]["status"] == "success"
assert latest["CI / all-required (pull_request)"]["id"] == 54
assert latest["sop-checklist / all-items-acked (pull_request)"]["state"] == "success"
def test_latest_statuses_guard_reverses_descending_input():
# Gitea /statuses returns descending id order (newest first: id=54 → id=1).
# Guard detects descending and reverses so we iterate ascending.
# Forward on reversed = newest (id=54) is last → overwrites oldest.
statuses = [
{"id": 54, "context": "CI / all-required (pull_request)", "status": "success"}, # newest
{"id": 27, "context": "sop-checklist / all-items-acked (pull_request)", "state": "success"},
{"id": 18, "context": "CI / all-required (pull_request)", "status": "failure"}, # oldest
]
latest = mq.latest_statuses_by_context(statuses)
# Guard reverses descending → asc iteration: 18 first, 27, 54 last → 54 wins.
assert latest["CI / all-required (pull_request)"]["status"] == "success"
assert latest["CI / all-required (pull_request)"]["id"] == 54
assert latest["sop-checklist / all-items-acked (pull_request)"]["state"] == "success"
@@ -120,11 +142,52 @@ def test_merge_decision_updates_stale_pr_before_merge():
assert decision.action == "update"
def test_MergePermissionError_inherits_from_ApiError():
assert issubclass(mq.MergePermissionError, mq.ApiError)
def test_merge_failure_returns_nonzero_and_posts_comment(monkeypatch):
"""When merge_pull raises ApiError (e.g. HTTP 405 permission denied),
process_once returns exit code 2 (non-zero) and posts a comment on the PR.
This distinguishes merge-permission errors from successful-no-op ticks."""
captured_comment = {}
def fake_post_comment(pr_number, body, *, dry_run):
captured_comment["pr_number"] = pr_number
captured_comment["body"] = body
def test_MergePermissionError_message_preserved():
exc = mq.MergePermissionError("POST /merge -> HTTP 405: User not allowed")
assert "405" in str(exc)
assert "User not allowed" in str(exc)
# Replace functions directly on the module object so process_once()
# (which looks them up by name at call time) picks up the fakes.
mq.list_queued_issues = lambda: [{
"number": 42,
"created_at": "2026-05-17T00:00:00Z",
"labels": [{"name": "merge-queue"}],
"pull_request": {},
}]
mq.get_pull = lambda n: {
"state": "open",
"base": {"ref": "main", "repo_id": 1},
"head": {"sha": "headsha", "repo_id": 1},
"merge_base": "abc123def",
}
mq.get_pull_commits = lambda n: [{"sha": "headsha"}]
mq.get_branch_head = lambda branch: "abc123def"
mq.get_combined_status = lambda sha: {
"state": "success",
"statuses": [{"context": "CI / all-required (push)", "status": "success"}],
}
mq.latest_statuses_by_context = lambda s: {
"CI / all-required (pull_request)": {"status": "success"},
"sop-checklist / all-items-acked (pull_request)": {"status": "success"},
}
mq.required_contexts_green = lambda statuses, contexts: (True, [])
mq.post_comment = fake_post_comment
# Simulate merge failing with HTTP 405 (permission denied).
# The ApiError raised by api() is caught inside process_once().
merge_error = mq.ApiError(
"POST /repos/x/y/pulls/42/merge -> HTTP 405: User not allowed to merge PR"
)
with patch.object(mq, "merge_pull", side_effect=merge_error):
exit_code = mq.process_once(dry_run=False)
assert exit_code == 2, f"Expected exit code 2, got {exit_code}"
assert captured_comment["pr_number"] == 42
assert "MERGE FAILED" in captured_comment["body"]
assert "405" in captured_comment["body"]
@@ -603,3 +603,51 @@ class TestComputeNaState(unittest.TestCase):
self.assertEqual(na_directives[0][0], "sop-n/a")
self.assertEqual(na_directives[0][1], "qa-review")
self.assertIn("no surface", na_directives[0][2])
class TestProbeNaGateFallback(unittest.TestCase):
"""Regression test: probe() must handle gate names (qa-review, security-review)
from N/A gates without raising KeyError.
mc#1389: compute_na_state calls probe(gate_name, [user]) where gate_name is
a gate name like 'qa-review' — NOT a checklist item slug. The probe must
resolve the gate's required_teams from na_gates, not raise KeyError from
items_by_slug lookup.
"""
def test_probe_resolves_gate_name_from_na_gates(self):
cfg = sop.load_config(CONFIG_PATH)
items = cfg["items"]
items_by_slug = {it["slug"]: it for it in items}
na_gates = cfg.get("n/a_gates", {})
# Reconstruct the _required_teams_for helper from sop-checklist.py
def _required_teams_for(slug):
if slug in items_by_slug:
return items_by_slug[slug]["required_teams"]
if slug in na_gates:
return na_gates[slug].get("required_teams", [])
return None
# Gate names should resolve from na_gates
self.assertEqual(
_required_teams_for("qa-review"),
["qa", "security", "engineers"],
)
self.assertEqual(
_required_teams_for("security-review"),
["security", "managers", "ceo"],
)
# Checklist item slugs should still resolve from items_by_slug
self.assertEqual(
_required_teams_for("comprehensive-testing"),
["qa", "engineers"],
)
self.assertEqual(
_required_teams_for("root-cause"),
["managers", "ceo"],
)
# Unknown slug should return None (not raise KeyError)
self.assertIsNone(_required_teams_for("nonexistent-slug"))
+1 -1
View File
@@ -57,7 +57,7 @@ permissions:
# can produce duplicate comments before the title-search dedup wins.
concurrency:
group: ci-required-drift
cancel-in-progress: true
cancel-in-progress: false
jobs:
drift:
+1 -1
View File
@@ -22,7 +22,7 @@ permissions:
concurrency:
group: gitea-merge-queue-${{ github.repository }}
cancel-in-progress: true
cancel-in-progress: false
jobs:
queue:
+1 -5
View File
@@ -56,13 +56,9 @@ permissions:
# Workflow-scoped serialisation — two simultaneous runs would race on the
# `[main-red] {SHA}` open/PATCH path. Idempotent by title, but parallel
# POSTs can produce duplicates before the title search dedup wins.
# NOTE: cancel-in-progress: true is safe here — the idempotent design means
# a cancelled run produces identical output to a completed one. This also
# prevents the Gitea scheduler freeze that occurs when a cron tick fires
# while a previous run is still executing (Quirk #8).
concurrency:
group: main-red-watchdog
cancel-in-progress: true
cancel-in-progress: false
jobs:
watchdog:
-25
View File
@@ -77,31 +77,6 @@ does not replace the queue. The queue still performs its own current-main
check immediately before merge because branch protection alone cannot
serialize two already-green PRs.
### Correct API field names (Gitea 1.22.6)
When setting branch protection via API, use these exact field names — several
intuitively-correct names are silently ignored (see `gitea-operational-quirks.md`
Quirk #7):
```json
{
"branch_name": "main",
"enable_merge_whitelist": true,
"merge_whitelist_usernames": ["devops-engineer", "hongming", "core-devops"],
"enable_status_check": true,
"status_check_contexts": ["CI / all-required"],
"required_approvals": 1,
"block_on_rejected_reviews": true
}
```
After any `POST /branch_protections`, immediately GET and verify the values
persisted — the API returns 201 even when fields are silently dropped.
If the queue returns HTTP 405 ("User not allowed to merge"), the first
diagnostic step is `GET /branch_protections/main` and checking whether
`merge_whitelist_usernames` still contains `devops-engineer`.
## Failure Handling
If `main` is not green, the queue pauses and does not merge anything.
+21 -86
View File
@@ -196,134 +196,69 @@ primary consumer of combined status and is affected.
---
## Quirk #7 — Gitea branch protection API silently ignores some field names
## Quirk #7 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
The Gitea 1.22.6 `POST /repos/{org}/{repo}/branch_protections` API accepts a
non-obvious set of field names. Several intuitively-correct names are silently
ignored — the call returns 201 but the field is dropped:
| Intended field | Correct API name | Silently ignored aliases |
|---|---|---|
| Enable merge whitelist | `enable_merge_whitelist` | `user_can_merge`, `merge_whitelist_enabled` |
| Users who can merge | `merge_whitelist_usernames` | `merge_whitelist_users`, `whitelisted_users` |
| Enable status check | `enable_status_check` | `enable_status_checks`, `require_status_checks` |
| Required status contexts | `status_check_contexts` | `required_status_checks.contexts` |
| Block on rejected reviews | `block_on_rejected_reviews` | (this one works) |
| Required approvals | `required_approvals` | `required_reviewers` |
The GET response after a POST shows the actual stored values. A naive
GET → modify → POST cycle (without using the exact GET field names) will
silently reset the merge whitelist on every call.
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
- Branch protection merge whitelist resets to empty after any API mis-invocation
- Queue AUTO_SYNC_TOKEN (`devops-engineer`) loses Can-merge permission → HTTP 405
- All queued PRs blocked until whitelist is restored
- Confirmed reset on Gitea server restart/upgrade (Gitea uses default values)
*[Which workflows or operations are affected.]*
### Workaround
1. Always GET the current protection first and use **exact** field names from the
GET response when modifying
2. After any `POST /branch_protections`, immediately GET and verify
`enable_merge_whitelist: true` and `merge_whitelist_usernames` contains
`["devops-engineer", "hongming", "core-devops"]`
3. The queue bot should verify branch protection before each merge tick
4. For queue to work: `enable_merge_whitelist: true` +
`merge_whitelist_usernames: ["devops-engineer", "hongming", "core-devops"]` +
`enable_status_check: true` + `status_check_contexts: ["CI / all-required"]`
*[How to work around this quirk.]*
### References
- SEV-1 2026-05-17: 3x branch protection resets caused 405 on all queue merges
- `feedback_gitea_branch_protection_api_field_names`
- internal#[N]: first observation
---
## Quirk #8 — Scheduled workflow with `cancel-in-progress: false` causes scheduler freeze
## Quirk #8 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
When a `schedule:` workflow has `concurrency.cancel-in-progress: false`, and a
new cron tick fires while the previous run is still executing, the Gitea Actions
scheduler stops dispatching the workflow entirely. Pending entries accumulate
indefinitely — the scheduler shows the workflow as "scheduled" but never dispatches.
This is dangerous for workflows with variable execution time (e.g., workflows that
wait for downstream CI, or workflows that run on slow/degraded runners).
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
- `gitea-merge-queue.yml` with `cancel-in-progress: false` froze on 2026-05-17
starting ~16:44Z — pending runs accumulated, no new runs dispatched
- Queue appeared stalled; all 22 queued PRs blocked
- The `gitea-merge-queue` workflow itself becomes invisible to operators
*[Which workflows or operations are affected.]*
### Workaround
**Always set `cancel-in-progress: true` on `schedule:` workflows:**
```yaml
concurrency:
group: workflow-name
cancel-in-progress: true # ← always true for schedule: workflows
```
If the freeze has already occurred: the scheduler recovers automatically after the
currently-running instance completes (Gitea dispatches the next queued tick).
*[How to work around this quirk.]*
### References
- SEV-1 2026-05-17: queue frozen since 16:44Z; fixed by setting `cancel-in-progress: true`
- PR #1358: `fix(scheduled-workflows): enable cancel-in-progress` (pending merge)
- internal#[N]: first observation
---
## Quirk #9 — Gitea Actions runner accepts runs but stalls (jobs never start)
## Quirk #9 — TBD
*[Placeholder — document here when a new Gitea Actions quirk is discovered.]*
### Finding
The Gitea Actions runner on host `5.78.80.188` can enter a degraded state where:
1. It accepts new workflow runs (shows "in_progress" in the UI)
2. It never starts any jobs — pending count grows indefinitely
3. The runner shows as "online" and accepting runs
4. After ~6090 minutes, the runner self-recovers and all pending jobs start
This is distinct from a true runner crash (which would show as offline).
*[What Gitea Actions does differently from GitHub Actions.]*
### Impact
- All CI jobs for all PRs stall — no status updates posted
- Queue waits indefinitely for CI (which never posts success)
- `sop-checklist` and other workflows time out on affected PRs
- Looks like the runner is working (green in UI) but nothing executes
### How to diagnose
Add a debug step to a known-failing workflow:
```bash
# In a stalled job:
curl -s http://localhost:8088/debug/pprof/trace?seconds=5 | head
# Check runner process CPU — if near 0% while jobs are pending, runner is stalled
```
Check runner logs on the host (`/var/log/actrunner.log` or similar).
*[Which workflows or operations are affected.]*
### Workaround
No operator workaround while stalled — the runner self-recovers. Options:
1. **Wait** — runner typically recovers within 90 minutes
2. **Restart the runner service**`systemctl restart act_runner` (requires host access)
3. **Move to a second runner** — if registered, re-route dispatch
*[How to work around this quirk.]*
### References
- SEV-1 2026-05-17: runner stalled; self-recovered ~21:33Z after ~90 min
- `feedback_gitea_runner_stall_accepted_jobs_no_execution`
- internal#[N]: first observation
---