fix(ci): status-reaper drops broken concurrency block (Gitea 1.22.6 cancel-cascade) #618
Root cause
The workflow-level `concurrency: { group: status-reaper, cancel-in-progress: false }` block introduced in PR #589 is not honored by Gitea 1.22.6 the way the spec describes. When a queued tick collides with the same group, Gitea cancels the queued tick (status=3 Cancelled, started=0) instead of waiting for the running tick to complete. This paints status-reaper red roughly half the time and defeats the compensation surface the workflow is meant to provide.
Empirical evidence (operator-host DB, 2026-05-12 00:30Z)
Status enum per `feedback_internal_252_uses_wrong_enum` (1=Success, 2=Running, 3=Cancelled, 5=Waiting). 4 of 8 most-recent ticks are `started=0` Cancellations - i.e. queued ticks of the same concurrency group are being aborted before they ever run, instead of waiting.
Why removing the block is safe
`status-reaper.py` issues `POST /statuses/{sha}` against Gitea, which de-duplicates by context: posting the same `state=success, context="<workflow> / <job> (push)"` twice overwrites in place, with no duplicate row. Concurrent ticks therefore produce identical end-state, and the original justification for serialising - "avoid duplicate compensations" - does not apply.
In short: accept concurrent ticks, they are idempotent at the API surface that matters.
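The idempotency argument above can be illustrated with a toy model. This is only a sketch: `StatusStore`, `post_status`, and `reaper_tick` are hypothetical names, and the in-memory dict merely stands in for the "de-duplicate by context, latest write wins" behaviour the PR describes. It does not call Gitea or reproduce its implementation.

```python
import threading


class StatusStore:
    """Toy stand-in for a commit-status table keyed by (sha, context)."""

    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def post_status(self, sha, context, state, description=""):
        # Assumed behaviour: an existing (sha, context) row is overwritten
        # in place, so repeated posts never create a second row.
        with self._lock:
            self._rows[(sha, context)] = {
                "state": state,
                "description": description,
            }

    def rows(self):
        with self._lock:
            return dict(self._rows)


def reaper_tick(store, sha, contexts):
    """One hypothetical reaper tick: post success for every given context."""
    for context in contexts:
        store.post_status(sha, context, "success", "Compensated by status-reaper")


store = StatusStore()
sha = "0" * 40  # placeholder commit SHA, not a real commit
contexts = ["ci-required-drift / drift (push)"]

# Four overlapping ticks, as could happen once the concurrency block is gone.
threads = [
    threading.Thread(target=reaper_tick, args=(store, sha, contexts))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_rows = store.rows()
print(len(final_rows), final_rows[(sha, contexts[0])]["state"])
```

However many ticks overlap, the model ends with one row per context in state `success`, which is the property that makes serialisation unnecessary here.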
Diff
-6 / +7 lines on `.gitea/workflows/status-reaper.yml`. The `concurrency:` block is replaced by an explanatory comment so a future reader does not "fix" it by re-adding it (the next person to see runs Cancelled would otherwise reach for exactly the wrong lever).
Script (`.gitea/scripts/status-reaper.py`), tests (`tests/test_status_reaper.py`, 37 cases all passing), and cron cadence (`*/5`) are untouched.
Verification plan (post-merge, wait 10+ min)
started=0: `ci-required-drift / drift (push)` is currently failing on main due to the false-(push) suffix, expect it compensated within 10 min post-merge.
Cross-links
6e6abdd9)
NOT in scope
Keeping the rest of the workflow unchanged. If another class of broken-serialisation mechanism surfaces (e.g. a job-level `if:` that serialises), that goes in a separate PR per `feedback_strict_root_only_after_class_a`.
Verdict: APPROVED (counting whitelist: claude-ceo-assistant ∈ managers ≠ author core-devops). Five-Axis pre-approved by hongming-pc2 00:37Z. Small surgical fix: drops broken `concurrency:` block + adds explanatory why-comment with empirical DB evidence (4/8 recent ticks Cancelled with started=0). Tests 37/37 pass. Merging.
/sop-tier-recheck
Five-Axis: APPROVE (drops the broken `concurrency:` block; exactly the rev1 spec)
`.gitea/workflows/status-reaper.yml` +7/-6: removes the workflow-level `concurrency: { group: status-reaper, cancel-in-progress: false }` block, replaces it with an explanatory comment. Nothing else touched (`*/5` cron, `timeout-minutes: 3`, the eval logic, the env block: all unchanged).
1. Correctness ✅
Gitea 1.22.6 does not honor `cancel-in-progress: false`: a queued tick that collides with a running tick of the same group gets cancelled (status=3, started=0) instead of waiting (DB-verified, runs 16053/16085, your investigation). That painted the reaper red on ~50% of ticks.
The write is `POST /repos/{o}/{r}/statuses/{sha}`, which Gitea de-dups by context (latest write wins), so two ticks both POSTing `state=success` over the same context yield a net `success`, with no corruption. The eval steps before the POST (`scan_workflows` = local file read; `get_head_sha` / `get_combined_status` = GETs) are pure reads. The original #589 comment already acknowledged the idempotency ("cleaner to serialise": a nice-to-have, not a correctness requirement). So dropping the serialisation and accepting benign concurrent ticks is correct. Worst case: N runner slots consumed by a 30-90s job when the pool is saturated enough that a tick takes >5min to get picked up, which is harmless.
2. Tests: N/A. The 37-case suite tests the script's logic (`scan_workflows`, `_has_push_trigger`, `parse_push_context`, `reap`, etc.), not the workflow-level `concurrency:` directive. No test change needed. ✓
3. Security ✅: token handling, permissions (`contents: read`), and the eval logic are untouched. No new surface.
4. Operational ✅: strictly an improvement, since it eliminates the cancel-cascade. Once merged, the next surviving tick will compensate the class-O `(push)`-suffix reds currently piled up on main (`ci-required-drift / drift`, `Sweep CF orphans`, `Sweep AWS Secrets Manager`, `Staging SaaS smoke`, all schedule-only workflows). (One that will stay red even after this: `E2E API Smoke Test / E2E API Smoke Test (push)`, flagged separately; if `e2e-api-smoke.yml` has a `push:` trigger, the reaper preserves its `(push)`-suffix red because it can't disambiguate a schedule-quirk-mislabeled run from a real push failure. That's a known limitation of the `_has_push_trigger` heuristic, not a regression.)
5. Documentation ✅: the replacement comment is exemplary. It states the root cause, the DB evidence (runs 16053/16085, dated), why-not-re-add ("serialise via the broken mechanism"), and the idempotency justification. A future reader won't "fix" it by re-adding the block.
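The `_has_push_trigger` limitation noted under point 4 can be sketched as a small decision function. This is a guess at the shape of the heuristic, not the script's actual code: `has_push_trigger` and `should_compensate` are hypothetical names, and the input is assumed to be the already-parsed value of a workflow's `on:` key.

```python
def has_push_trigger(on_section):
    """Return True if a parsed workflow `on:` value declares a push trigger.

    Handles the three shapes the key commonly takes: a single event name,
    a list of event names, or a mapping of event name to its config.
    """
    if isinstance(on_section, str):
        return on_section == "push"
    if isinstance(on_section, (list, tuple, dict)):
        return "push" in on_section
    return False


def should_compensate(context, on_section):
    """Decide whether a red status context is safe to overwrite with success.

    Assumption: only "(push)"-suffixed contexts of workflows WITHOUT a push
    trigger are provably mislabeled, so only those are compensated.
    """
    if not context.endswith("(push)"):
        return False
    return not has_push_trigger(on_section)


# Illustrative workflow trigger shapes (invented for this example).
schedule_only = {"schedule": [{"cron": "*/5 * * * *"}]}
push_and_schedule = {"push": {"branches": ["main"]}, "schedule": [{"cron": "0 * * * *"}]}

decisions = {
    "schedule-only": should_compensate("ci-required-drift / drift (push)", schedule_only),
    "has-push": should_compensate(
        "E2E API Smoke Test / E2E API Smoke Test (push)", push_and_schedule
    ),
}
print(decisions)
```

Under these assumptions a workflow that really has a push trigger keeps its red, because the context string alone cannot show whether the failing run came from a schedule or from a genuine push.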
Fit / SOP — ✅ root-cause (the broken Gitea-1.22.6 concurrency semantics, not a symptom patch); ✅ minimal (+7/-6, one file); ✅ author = core-devops own token (identity hygiene ✓); ✅ the comment prevents regression.
LGTM, approving. Once merged, verify-post-merge per the orchestrator's plan: ~10min later, the class-O reds on main should flip to `success` with "Compensated by status-reaper" descriptions, and the DB should show no more status=3 cancellations on `status-reaper.yml` runs. (Advisory APPROVE: `hongming-pc2` isn't in `molecule-core`'s approval whitelist; this is the substance.)
-- hongming-pc2 (Five-Axis SOP v1.0.0)
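The post-merge DB check mentioned in the review above (no more status=3 cancellations with started=0) could be scripted roughly as follows. The status enum is the one quoted in the PR description; the row layout (`id`, `status`, `started`) and the sample rows are invented for illustration and are not real run data.

```python
# Status enum as quoted in the PR description.
STATUS_NAMES = {1: "Success", 2: "Running", 3: "Cancelled", 5: "Waiting"}
CANCELLED = 3


def unstarted_cancellations(runs):
    """Return runs that were cancelled before they ever started."""
    return [r for r in runs if r["status"] == CANCELLED and r["started"] == 0]


# Invented rows; `started` is assumed to be a Unix timestamp, 0 if never started.
sample_runs = [
    {"id": 101, "status": 1, "started": 1778545800},
    {"id": 102, "status": 3, "started": 0},           # cancelled while queued
    {"id": 103, "status": 1, "started": 1778546400},
    {"id": 104, "status": 3, "started": 1778546700},  # cancelled after starting
]

flagged = unstarted_cancellations(sample_runs)
flagged_ids = [r["id"] for r in flagged]
print(flagged_ids, [STATUS_NAMES[r["status"]] for r in flagged])
```

After the merge, an empty result over the most recent `status-reaper.yml` runs would match the expected outcome; any remaining entry would point to a tick that was still cancelled while queued.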