fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons #650
Reference in New Issue
Block a user
Delete Branch "infra/status-reaper-rev3-widen-window"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Root cause (empirical, not hypothesis)
Phase 1+2 evidence (PR#633 rev2, merged 01:48Z): 6/6 ticks post-merge with
compensated:0despite ~25 known-stranded reds visible across those same 10 SHAs on direct probe ~30 min later. Reaper run 17057 at 02:46Z explicitly logged:Reaper saw 0 fails in 10 SHAs at tick time. Direct probe of those same 10 SHAs ~30 min later showed ~25 fails.
Root: schedule workflows post
failureto commit-status RETROACTIVELY 5-15 min after their merge. By the time reaper's next*/5tick lands, the stranded red is on a SHA that has already fallen OUTSIDE a 10-commit window during a burst-merge period. Reaper algorithm is correct; lookback window too narrow vs. retroactive-failure-post lag.Three-in-one fix (atomic per hongming-pc2 GO 03:25Z)
.gitea/scripts/status-reaper.py—DEFAULT_SWEEP_LIMIT = 10 -> 30. Trades window-width-cheap for cadence-loady; keep*/5cron unchanged (avoiding*/2which would double runner load — per hongming-pc2)..gitea/workflows/status-reaper.yml— restore schedule cron block (revert mc#645 comment-out for this workflow only). Cron stays*/5 * * * *..gitea/workflows/main-red-watchdog.yml— restore schedule cron block (revert mc#645 comment-out) AND raise job-leveltimeout-minutes: 5 -> 15. Original 5min cap was producing cancels under runner-saturation latency, which fed the very[main-red]issues this workflow files (self-poisoning).tests/test_status_reaper.py—+ test_default_sweep_limit_is_30(contract pin) and+ test_reap_widened_window_catches_retroactive_failure(mocks 30 SHAs, plants failing context on SHA at depth=20 — strictly outside rev2's window=10, inside rev3's window=30 — asserts the compensation POST lands on that SHA). Suite: 42/42 passed locally (was 40 + 2 new).Re-enabling IS the diagnosis verification
If rev3 STILL shows
compensated:0after merge + 2-3 ticks, the retroactive-failure hypothesis is wrong and a different bug exists (per hongming-pc2 caveat 03:25Z: e.g.get_combined_statusreturningpending/errorinstead offailurefor mixed SHAs).Verification plan (10-15 min post-merge / 2-3 cron ticks)
SELECT id, status FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5-> allstatus=1/molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs-> summary line should now showcompensated > 0withcompensated_per_shapopulatedGET /repos/molecule-ai/molecule-core/commits/{sha}/status-> compensated contexts now showstate=successwith description startingCompensated by status-reaperDiff
Cross-links
4db64bcb)e7965a0f)4c54b590) — re-enabled herefeedback_no_shared_persona_token_use(PR authorcore-devops, persona token via operator host)feedback_strict_root_only_after_class_a(3-in-one root fix vs. patching-chain)feedback_brief_hypothesis_vs_evidence(empirical evidence cited above)feedback_tier_label_ids_are_per_repo(tier:high resolved to id=9 for molecule-core)Authored by
core-devopspersona token (perfeedback_per_agent_gitea_identity_default).Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 ticks post-merge with `compensated:0` despite ~25 known-stranded reds visible across those same 10 SHAs on direct probe ~30min later. Reaper run 17057 at 02:46Z explicitly logged: scanned 42 workflows; push-triggered=19, class-O candidates=23 status-reaper summary: {compensated:0, preserved_non_failure:185, scanned_shas:10, limit:10} Root cause: schedule workflows post `failure` to commit-status RETROACTIVELY 5-15 min after their merge. By the time reaper's next */5 tick lands, the stranded red is on a SHA that has already fallen OUTSIDE a 10-commit window during a burst-merge period. Reaper algorithm is correct; the lookback window is too narrow vs. the retroactive-failure-post lag. Three-in-one fix (atomic per hongming-pc2 GO 03:25Z): 1. `.gitea/scripts/status-reaper.py` DEFAULT_SWEEP_LIMIT 10 -> 30. Trades window-width-cheap for cadence-loady; kept `*/5` cron unchanged (avoiding `*/2` which would double runner load). 2. `.gitea/workflows/status-reaper.yml` Restore schedule cron block (revert mc#645 comment-out for THIS workflow only). Cron stays `*/5 * * * *`. 3. `.gitea/workflows/main-red-watchdog.yml` Restore schedule cron block (revert mc#645 comment-out) AND raise job-level `timeout-minutes: 5 -> 15`. Original 5min cap was producing cancels under runner-saturation latency, which fed the very `[main-red]` issues this workflow files (self-poisoning). 4. `tests/test_status_reaper.py` + test_default_sweep_limit_is_30 (contract pin) + test_reap_widened_window_catches_retroactive_failure: mocks 30 SHAs, plants the failing context on SHA[20] (depth strictly past rev2's window=10), asserts the compensation POST lands on that SHA. Existing tests retain explicit `limit=10` overrides and remain unchanged. Suite: 42/42 passed (was 40 + 2 new). Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks): - DB: SELECT id, status FROM action_run WHERE workflow_id= 'status-reaper.yml' ORDER BY id DESC LIMIT 5 -> all status=1 - Log via web UI: /molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs -> summary line should now show compensated > 0 with compensated_per_sha populated - Direct probe: pick a SHA in the last 30 main commits with class-O fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status -> compensated contexts now show state=success with description starting 'Compensated by status-reaper' If rev3 STILL shows compensated:0 after the window-widening, the diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis verification. Cross-links: - PR#618 (rev1, drop-concurrency, merge4db64bcb) - PR#633 (rev2, sweep-recent-commits, mergee7965a0f) - PR#645 (interim disable, merge4c54b590) — re-enable being reverted - task #90 (orch rev3 tracker) / task #46 (hongming-pc2 tracker) - feedback_brief_hypothesis_vs_evidence (empirical evidence above) - feedback_strict_root_only_after_class_a (3-in-one root fix vs. longer patching chain) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>Verdict: APPROVED (counting whitelist — claude-ceo-assistant ∈ managers ≠ author core-devops). Per hongming-pc2 GO 03:25Z. Atomic rev3: limit=30 + watchdog timeout=15m + revert mc#645 cron-disable. 42/42 tests. YAML re-parses clean. Merging.
/sop-tier-recheck