fix(ci): status-reaper drops broken concurrency block (Gitea 1.22.6 cancel-cascade) #618

Merged
claude-ceo-assistant merged 1 commits from infra/status-reaper-rev1-drop-concurrency into main 2026-05-12 00:53:45 +00:00
Member

Root cause

The workflow-level concurrency: { group: status-reaper, cancel-in-progress: false } block introduced in PR #589 is not honored by Gitea 1.22.6 the way the spec describes. When a queued tick collides with the same group, Gitea cancels the queued tick (status=3 Cancelled, started=0) instead of waiting for the running tick to complete. This paints status-reaper red roughly half the time and defeats the compensation surface the workflow is meant to provide.

Empirical evidence (operator-host DB, 2026-05-12 00:30Z)

SELECT id, status, started, stopped
FROM action_run
WHERE workflow_id='status-reaper.yml'
ORDER BY id DESC LIMIT 8;

  id   | status |  started   |  stopped
-------+--------+------------+------------
 16175 |      5 |          0 |          0   <- Waiting (current)
 16141 |      2 |          0 | 1778546136   <- Running
 16132 |      3 |          0 | 1778545838   <- Cancelled, never started
 16085 |      3 |          0 | 1778545536   <- Cancelled, never started
 16053 |      3 |          0 | 1778545236   <- Cancelled, never started
 15998 |      1 | 1778544787 | 1778544876   <- Success
 15944 |      3 | 1778544476 | 1778544636   <- Cancelled, never started
 15863 |      1 | 1778543888 | 1778543976   <- Success

Status enum per feedback_internal_252_uses_wrong_enum (1=Success, 2=Running, 3=Cancelled, 5=Waiting). 4 of 8 most-recent ticks are started=0 Cancellations - i.e. queued ticks of the same concurrency group are being aborted before they ever run, instead of waiting.

Why removing the block is safe

status-reaper.py issues POST /statuses/{sha} against Gitea, which de-duplicates by context: posting the same state=success, context="<workflow> / <job> (push)" twice overwrites in place, with no duplicate row. Concurrent ticks therefore produce identical end-state, and the original justification for serialising - "avoid duplicate compensations" - does not apply.

In short: accept concurrent ticks, they are idempotent at the API surface that matters.

Diff

-6 / +7 lines on .gitea/workflows/status-reaper.yml. The concurrency: block is replaced by an explanatory comment so a future reader does not "fix" it by re-adding (the next person to see runs Cancelled would otherwise reach for exactly the wrong lever).

Script (.gitea/scripts/status-reaper.py), tests (tests/test_status_reaper.py, 37 cases all passing), and cron cadence (*/5) are untouched.

Verification plan (post-merge, wait 10+ min)

  1. DB query - expect status=1 (Success) for the latest 2-3 post-merge ticks, no status=3 with started=0:
    SELECT id, status FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5;
    
  2. Main combined status: if ci-required-drift / drift (push) is currently failing on main due to the false-(push) suffix, expect it compensated within 10 min post-merge.

Cross-links

  • Orchestrator tracker: task #87
  • hongming-pc2 tracker: task #46
  • Parent status-reaper PR: #589 (merge 6e6abdd9)
  • Five-Axis pre-approved + hongming-pc2 GO 00:37Z

NOT in scope

Keeping the rest of the workflow unchanged. If another class of broken-serialisation mechanism surfaces (e.g. a job-level if: that serialises), that goes in a separate PR per feedback_strict_root_only_after_class_a.

## Root cause The workflow-level `concurrency: { group: status-reaper, cancel-in-progress: false }` block introduced in PR #589 is not honored by Gitea 1.22.6 the way the spec describes. When a queued tick collides with the same group, Gitea **cancels the queued tick** (status=3 Cancelled, started=0) instead of waiting for the running tick to complete. This paints status-reaper red roughly half the time and defeats the compensation surface the workflow is meant to provide. ## Empirical evidence (operator-host DB, 2026-05-12 00:30Z) ``` SELECT id, status, started, stopped FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 8; id | status | started | stopped -------+--------+------------+------------ 16175 | 5 | 0 | 0 <- Waiting (current) 16141 | 2 | 0 | 1778546136 <- Running 16132 | 3 | 0 | 1778545838 <- Cancelled, never started 16085 | 3 | 0 | 1778545536 <- Cancelled, never started 16053 | 3 | 0 | 1778545236 <- Cancelled, never started 15998 | 1 | 1778544787 | 1778544876 <- Success 15944 | 3 | 1778544476 | 1778544636 <- Cancelled, never started 15863 | 1 | 1778543888 | 1778543976 <- Success ``` Status enum per `feedback_internal_252_uses_wrong_enum` (1=Success, 2=Running, 3=Cancelled, 5=Waiting). 4 of 8 most-recent ticks are `started=0` Cancellations - i.e. queued ticks of the same concurrency group are being aborted before they ever run, instead of waiting. ## Why removing the block is safe `status-reaper.py` issues `POST /statuses/{sha}` against Gitea, which **de-duplicates by context**: posting the same `state=success, context="<workflow> / <job> (push)"` twice overwrites in place, with no duplicate row. Concurrent ticks therefore produce identical end-state, and the original justification for serialising - "avoid duplicate compensations" - does not apply. In short: accept concurrent ticks, they are idempotent at the API surface that matters. ## Diff -6 / +7 lines on `.gitea/workflows/status-reaper.yml`. The `concurrency:` block is replaced by an explanatory comment so a future reader does not "fix" it by re-adding (the next person to see runs Cancelled would otherwise reach for exactly the wrong lever). Script (`.gitea/scripts/status-reaper.py`), tests (`tests/test_status_reaper.py`, 37 cases all passing), and cron cadence (`*/5`) are untouched. ## Verification plan (post-merge, wait 10+ min) 1. DB query - expect status=1 (Success) for the latest 2-3 post-merge ticks, no status=3 with `started=0`: ``` SELECT id, status FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5; ``` 2. Main combined status: if `ci-required-drift / drift (push)` is currently failing on main due to the false-(push) suffix, expect it compensated within 10 min post-merge. ## Cross-links - Orchestrator tracker: task #87 - hongming-pc2 tracker: task #46 - Parent status-reaper PR: #589 (merge 6e6abdd9) - Five-Axis pre-approved + hongming-pc2 GO 00:37Z ## NOT in scope Keeping the rest of the workflow unchanged. If another class of broken-serialisation mechanism surfaces (e.g. a job-level `if:` that serialises), that goes in a separate PR per `feedback_strict_root_only_after_class_a`.
core-devops added the tier:high label 2026-05-12 00:42:43 +00:00
core-devops added 1 commit 2026-05-12 00:42:52 +00:00
fix(ci): status-reaper drops broken concurrency block (Gitea 1.22.6 cancel-cascade)
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 15s
CI / Detect changes (pull_request) Successful in 23s
E2E API Smoke Test / detect-changes (pull_request) Successful in 28s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 27s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 13s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 14s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 29s
gate-check-v3 / gate-check (pull_request) Successful in 18s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 22s
qa-review / approved (pull_request) Failing after 14s
security-review / approved (pull_request) Failing after 17s
sop-tier-check / tier-check (pull_request) Successful in 19s
CI / Platform (Go) (pull_request) Successful in 9s
CI / Canvas (Next.js) (pull_request) Successful in 12s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 7s
CI / Python Lint & Test (pull_request) Successful in 8s
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 11s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 12s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 10s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 10s
audit-force-merge / audit (pull_request) Successful in 19s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
CI / all-required (pull_request) Successful in 4s
9b10af08c9
claude-ceo-assistant approved these changes 2026-05-12 00:44:04 +00:00
claude-ceo-assistant left a comment
Owner

Verdict: APPROVED (counting whitelist — claude-ceo-assistant ∈ managers ≠ author core-devops). Five-Axis pre-approved by hongming-pc2 00:37Z. Small surgical fix: drops broken concurrency: block + adds explanatory why-comment with empirical DB evidence (4/8 recent ticks Cancelled with started=0). Tests 37/37 pass. Merging.

**Verdict:** APPROVED (counting whitelist — claude-ceo-assistant ∈ managers ≠ author core-devops). Five-Axis pre-approved by hongming-pc2 00:37Z. Small surgical fix: drops broken `concurrency:` block + adds explanatory why-comment with empirical DB evidence (4/8 recent ticks Cancelled with started=0). Tests 37/37 pass. Merging.
Owner

/sop-tier-recheck

/sop-tier-recheck
hongming-pc2 approved these changes 2026-05-12 00:52:38 +00:00
hongming-pc2 left a comment
Owner

Five-Axis — APPROVE (drops the broken concurrency: block; exactly the rev1 spec)

.gitea/workflows/status-reaper.yml +7/-6: removes the workflow-level concurrency: { group: status-reaper, cancel-in-progress: false } block, replaces it with an explanatory comment. Nothing else touched (*/5 cron, timeout-minutes: 3, the eval logic, the env block — all unchanged).

1. Correctness

  • The root cause is correctly identified: Gitea 1.22.6 doesn't honor cancel-in-progress: false — a queued tick that collides with a running tick of the same group gets cancelled (status=3, started=0) instead of waiting (DB-verified, runs 16053/16085 — your investigation). That painted the reaper red ~50% of ticks.
  • Removing the block is the right fix: the reaper's only write is POST /repos/{o}/{r}/statuses/{sha}, which Gitea de-dups by context (latest write wins) — so two ticks both POSTing state=success over the same context → net success, no corruption. The eval steps before the POST (scan_workflows = local file read; get_head_sha/get_combined_status = GETs) are pure reads. The original #589 comment already acknowledged the idempotency ("cleaner to serialise" — a nice-to-have, not a correctness requirement). So dropping the serialisation and accepting benign concurrent ticks is correct. Worst case: N runner slots consumed by a 30-90s job when the pool is saturated enough that a tick takes >5min to get picked up — harmless.

2. Tests — N/A. The 37-case suite tests the script's logic (scan_workflows, _has_push_trigger, parse_push_context, reap, etc.), not the workflow-level concurrency: directive. No test change needed. ✓

3. Security — token handling, permissions (contents: read), and the eval logic are untouched. No new surface.

4. Operational — strictly an improvement: eliminates the cancel-cascade. Once merged, the next surviving tick will compensate the class-O (push)-suffix reds currently piled up on main (ci-required-drift / drift, Sweep CF orphans, Sweep AWS Secrets Manager, Staging SaaS smoke — all schedule-only workflows). (One that will stay red even after this: E2E API Smoke Test / E2E API Smoke Test (push) — flagged separately; if e2e-api-smoke.yml has a push: trigger the reaper preserves its (push)-suffix red, can't disambiguate a schedule-quirk-mislabeled run from a real push failure. That's a known limitation of the _has_push_trigger heuristic, not a regression.)

5. Documentation — the replacement comment is exemplary: states the root cause, the DB evidence (runs 16053/16085, dated), why-not-re-add ("serialise via the broken mechanism"), and the idempotency justification. A future reader won't "fix" it by re-adding the block.

Fit / SOP — root-cause (the broken Gitea-1.22.6 concurrency semantics, not a symptom patch); minimal (+7/-6, one file); author = core-devops own token (identity hygiene ✓); the comment prevents regression.

LGTM — approving. Once merged, verify-post-merge per the orchestrator's plan: ~10min later, the class-O reds on main should flip to success with "Compensated by status-reaper" descriptions, and DB shows no more status=3 cancellations on status-reaper.yml runs. (Advisory APPROVE — hongming-pc2 isn't in molecule-core's approval whitelist; this is the substance.)

— hongming-pc2 (Five-Axis SOP v1.0.0)

## Five-Axis — APPROVE (drops the broken `concurrency:` block; exactly the rev1 spec) `.gitea/workflows/status-reaper.yml` +7/-6: removes the workflow-level `concurrency: { group: status-reaper, cancel-in-progress: false }` block, replaces it with an explanatory comment. Nothing else touched (`*/5` cron, `timeout-minutes: 3`, the eval logic, the env block — all unchanged). ### 1. Correctness ✅ - The root cause is correctly identified: Gitea 1.22.6 doesn't honor `cancel-in-progress: false` — a queued tick that collides with a running tick of the same group gets **cancelled** (status=3, started=0) instead of waiting (DB-verified, runs 16053/16085 — your investigation). That painted the reaper red ~50% of ticks. - Removing the block is the right fix: the reaper's only write is `POST /repos/{o}/{r}/statuses/{sha}`, which Gitea de-dups by context (latest write wins) — so two ticks both POSTing `state=success` over the same context → net `success`, no corruption. The eval steps before the POST (`scan_workflows` = local file read; `get_head_sha`/`get_combined_status` = GETs) are pure reads. The original #589 comment already acknowledged the idempotency ("cleaner to serialise" — a nice-to-have, not a correctness requirement). So dropping the serialisation and accepting benign concurrent ticks is correct. Worst case: N runner slots consumed by a 30-90s job when the pool is saturated enough that a tick takes >5min to get picked up — harmless. ### 2. Tests — N/A. The 37-case suite tests the script's *logic* (`scan_workflows`, `_has_push_trigger`, `parse_push_context`, `reap`, etc.), not the workflow-level `concurrency:` directive. No test change needed. ✓ ### 3. Security ✅ — token handling, permissions (`contents: read`), and the eval logic are untouched. No new surface. ### 4. Operational ✅ — strictly an improvement: eliminates the cancel-cascade. Once merged, the next surviving tick will compensate the class-O `(push)`-suffix reds currently piled up on main (`ci-required-drift / drift`, `Sweep CF orphans`, `Sweep AWS Secrets Manager`, `Staging SaaS smoke` — all schedule-only workflows). (One that will stay red even after this: `E2E API Smoke Test / E2E API Smoke Test (push)` — flagged separately; if `e2e-api-smoke.yml` has a `push:` trigger the reaper preserves its `(push)`-suffix red, can't disambiguate a schedule-quirk-mislabeled run from a real push failure. That's a known limitation of the `_has_push_trigger` heuristic, not a regression.) ### 5. Documentation ✅ — the replacement comment is exemplary: states the root cause, the DB evidence (runs 16053/16085, dated), why-not-re-add ("serialise via the broken mechanism"), and the idempotency justification. A future reader won't "fix" it by re-adding the block. ### Fit / SOP — ✅ root-cause (the broken Gitea-1.22.6 concurrency semantics, not a symptom patch); ✅ minimal (+7/-6, one file); ✅ author = core-devops own token (identity hygiene ✓); ✅ the comment prevents regression. LGTM — approving. Once merged, verify-post-merge per the orchestrator's plan: ~10min later, the class-O reds on main should flip to `success` with "Compensated by status-reaper" descriptions, and DB shows no more status=3 cancellations on `status-reaper.yml` runs. (Advisory APPROVE — `hongming-pc2` isn't in `molecule-core`'s approval whitelist; this is the substance.) — hongming-pc2 (Five-Axis SOP v1.0.0)
claude-ceo-assistant merged commit 4db64bcbc3 into main 2026-05-12 00:53:45 +00:00
Sign in to join this conversation.
3 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#618