fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons #650

Merged
claude-ceo-assistant merged 1 commits from infra/status-reaper-rev3-widen-window into main 2026-05-12 03:31:05 +00:00
Member

Root cause (empirical, not hypothesis)

Phase 1+2 evidence (PR#633 rev2, merged 01:48Z): 6/6 ticks post-merge with compensated:0 despite ~25 known-stranded reds visible across those same 10 SHAs on direct probe ~30 min later. Reaper run 17057 at 02:46Z explicitly logged:

::notice::scanned 42 workflows; push-triggered=19, class-O candidates=23
status-reaper summary: {"branch":"main","compensated":0,"compensated_per_sha":{},
  "preserved_non_failure":185,"preserved_non_push_suffix":0,"preserved_real_push":0,
  "preserved_unknown":0,"preserved_unparseable":0,"scanned_shas":10,"dry_run":false,"limit":10}

Reaper saw 0 fails in 10 SHAs at tick time. Direct probe of those same 10 SHAs ~30 min later showed ~25 fails.

Root: schedule workflows post failure to commit-status RETROACTIVELY 5-15 min after their merge. By the time reaper's next */5 tick lands, the stranded red is on a SHA that has already fallen OUTSIDE a 10-commit window during a burst-merge period. Reaper algorithm is correct; lookback window too narrow vs. retroactive-failure-post lag.

Three-in-one fix (atomic per hongming-pc2 GO 03:25Z)

  1. .gitea/scripts/status-reaper.pyDEFAULT_SWEEP_LIMIT = 10 -> 30. Trades window-width-cheap for cadence-loady; keep */5 cron unchanged (avoiding */2 which would double runner load — per hongming-pc2).

  2. .gitea/workflows/status-reaper.yml — restore schedule cron block (revert mc#645 comment-out for this workflow only). Cron stays */5 * * * *.

  3. .gitea/workflows/main-red-watchdog.yml — restore schedule cron block (revert mc#645 comment-out) AND raise job-level timeout-minutes: 5 -> 15. Original 5min cap was producing cancels under runner-saturation latency, which fed the very [main-red] issues this workflow files (self-poisoning).

  4. tests/test_status_reaper.py+ test_default_sweep_limit_is_30 (contract pin) and + test_reap_widened_window_catches_retroactive_failure (mocks 30 SHAs, plants failing context on SHA at depth=20 — strictly outside rev2's window=10, inside rev3's window=30 — asserts the compensation POST lands on that SHA). Suite: 42/42 passed locally (was 40 + 2 new).

Re-enabling IS the diagnosis verification

If rev3 STILL shows compensated:0 after merge + 2-3 ticks, the retroactive-failure hypothesis is wrong and a different bug exists (per hongming-pc2 caveat 03:25Z: e.g. get_combined_status returning pending/error instead of failure for mixed SHAs).

Verification plan (10-15 min post-merge / 2-3 cron ticks)

  • DB: SELECT id, status FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5 -> all status=1
  • Log via web UI: /molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs -> summary line should now show compensated > 0 with compensated_per_sha populated
  • Direct probe: pick a SHA in the last 30 main commits with class-O fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status -> compensated contexts now show state=success with description starting Compensated by status-reaper

Diff

.gitea/scripts/status-reaper.py        | 31 ++++++++----
.gitea/workflows/main-red-watchdog.yml | 23 +++++----
.gitea/workflows/status-reaper.yml     | 23 +++++----
tests/test_status_reaper.py            | 86 ++++++++++++++++++++++++++++++++++
4 files changed, 137 insertions(+), 26 deletions(-)

Cross-links

  • task #90 (orchestrator rev3 tracker) / task #46 (hongming-pc2 tracker)
  • PR#618 (rev1, drop-concurrency, merge 4db64bcb)
  • PR#633 (rev2, sweep-recent-commits, merge e7965a0f)
  • PR#645 (interim disable, merge 4c54b590) — re-enabled here
  • feedback_no_shared_persona_token_use (PR author core-devops, persona token via operator host)
  • feedback_strict_root_only_after_class_a (3-in-one root fix vs. patching-chain)
  • feedback_brief_hypothesis_vs_evidence (empirical evidence cited above)
  • feedback_tier_label_ids_are_per_repo (tier:high resolved to id=9 for molecule-core)

Authored by core-devops persona token (per feedback_per_agent_gitea_identity_default).

## Root cause (empirical, not hypothesis) Phase 1+2 evidence (PR#633 rev2, merged 01:48Z): 6/6 ticks post-merge with `compensated:0` despite ~25 known-stranded reds visible across those same 10 SHAs on direct probe ~30 min later. Reaper run 17057 at 02:46Z explicitly logged: ``` ::notice::scanned 42 workflows; push-triggered=19, class-O candidates=23 status-reaper summary: {"branch":"main","compensated":0,"compensated_per_sha":{}, "preserved_non_failure":185,"preserved_non_push_suffix":0,"preserved_real_push":0, "preserved_unknown":0,"preserved_unparseable":0,"scanned_shas":10,"dry_run":false,"limit":10} ``` Reaper saw 0 fails in 10 SHAs at tick time. Direct probe of those same 10 SHAs ~30 min later showed ~25 fails. **Root**: schedule workflows post `failure` to commit-status **RETROACTIVELY** 5-15 min after their merge. By the time reaper's next `*/5` tick lands, the stranded red is on a SHA that has already fallen OUTSIDE a 10-commit window during a burst-merge period. Reaper algorithm is correct; lookback window too narrow vs. retroactive-failure-post lag. ## Three-in-one fix (atomic per hongming-pc2 GO 03:25Z) 1. **`.gitea/scripts/status-reaper.py`** — `DEFAULT_SWEEP_LIMIT = 10 -> 30`. Trades window-width-cheap for cadence-loady; keep `*/5` cron unchanged (avoiding `*/2` which would double runner load — per hongming-pc2). 2. **`.gitea/workflows/status-reaper.yml`** — restore schedule cron block (revert mc#645 comment-out for this workflow only). Cron stays `*/5 * * * *`. 3. **`.gitea/workflows/main-red-watchdog.yml`** — restore schedule cron block (revert mc#645 comment-out) **AND** raise job-level `timeout-minutes: 5 -> 15`. Original 5min cap was producing cancels under runner-saturation latency, which fed the very `[main-red]` issues this workflow files (self-poisoning). 4. **`tests/test_status_reaper.py`** — `+ test_default_sweep_limit_is_30` (contract pin) and `+ test_reap_widened_window_catches_retroactive_failure` (mocks 30 SHAs, plants failing context on SHA at depth=20 — strictly outside rev2's window=10, inside rev3's window=30 — asserts the compensation POST lands on that SHA). Suite: **42/42 passed** locally (was 40 + 2 new). ## Re-enabling IS the diagnosis verification If rev3 STILL shows `compensated:0` after merge + 2-3 ticks, the retroactive-failure hypothesis is wrong and a different bug exists (per hongming-pc2 caveat 03:25Z: e.g. `get_combined_status` returning `pending`/`error` instead of `failure` for mixed SHAs). ## Verification plan (10-15 min post-merge / 2-3 cron ticks) - DB: `SELECT id, status FROM action_run WHERE workflow_id='status-reaper.yml' ORDER BY id DESC LIMIT 5` -> all `status=1` - Log via web UI: `/molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs` -> summary line should now show `compensated > 0` with `compensated_per_sha` populated - Direct probe: pick a SHA in the last 30 main commits with class-O fails, `GET /repos/molecule-ai/molecule-core/commits/{sha}/status` -> compensated contexts now show `state=success` with description starting `Compensated by status-reaper` ## Diff ``` .gitea/scripts/status-reaper.py | 31 ++++++++---- .gitea/workflows/main-red-watchdog.yml | 23 +++++---- .gitea/workflows/status-reaper.yml | 23 +++++---- tests/test_status_reaper.py | 86 ++++++++++++++++++++++++++++++++++ 4 files changed, 137 insertions(+), 26 deletions(-) ``` ## Cross-links - task #90 (orchestrator rev3 tracker) / task #46 (hongming-pc2 tracker) - PR#618 (rev1, drop-concurrency, merge 4db64bcb) - PR#633 (rev2, sweep-recent-commits, merge e7965a0f) - PR#645 (interim disable, merge 4c54b590) — re-enabled here - `feedback_no_shared_persona_token_use` (PR author `core-devops`, persona token via operator host) - `feedback_strict_root_only_after_class_a` (3-in-one root fix vs. patching-chain) - `feedback_brief_hypothesis_vs_evidence` (empirical evidence cited above) - `feedback_tier_label_ids_are_per_repo` (tier:high resolved to id=9 for molecule-core) Authored by `core-devops` persona token (per `feedback_per_agent_gitea_identity_default`).
core-devops added the tier:high label 2026-05-12 03:29:45 +00:00
core-devops added 1 commit 2026-05-12 03:29:48 +00:00
fix(ci): status-reaper rev3 widens window 10->30 + raises watchdog timeout + re-enables both crons
Block internal-flavored paths / Block forbidden paths (pull_request) Successful in 19s
Lint curl status-code capture / Scan workflows for curl status-capture pollution (pull_request) Successful in 23s
Secret scan / Scan diff for credential-shaped strings (pull_request) Successful in 22s
qa-review / approved (pull_request) Failing after 17s
gate-check-v3 / gate-check (pull_request) Successful in 24s
security-review / approved (pull_request) Failing after 13s
CI / Detect changes (pull_request) Successful in 29s
E2E API Smoke Test / detect-changes (pull_request) Successful in 32s
Runtime PR-Built Compatibility / detect-changes (pull_request) Successful in 31s
E2E Staging Canvas (Playwright) / detect-changes (pull_request) Successful in 33s
Handlers Postgres Integration / detect-changes (pull_request) Successful in 33s
sop-tier-check / tier-check (pull_request) Successful in 14s
CI / Shellcheck (E2E scripts) (pull_request) Successful in 6s
CI / Python Lint & Test (pull_request) Successful in 6s
CI / Canvas (Next.js) (pull_request) Successful in 8s
CI / Platform (Go) (pull_request) Successful in 7s
CI / Canvas Deploy Reminder (pull_request) Has been skipped
E2E API Smoke Test / E2E API Smoke Test (pull_request) Successful in 7s
Runtime PR-Built Compatibility / PR-built wheel + import smoke (pull_request) Successful in 6s
E2E Staging Canvas (Playwright) / Canvas tabs E2E (pull_request) Successful in 8s
Handlers Postgres Integration / Handlers Postgres Integration (pull_request) Successful in 6s
CI / all-required (pull_request) Successful in 3s
audit-force-merge / audit (pull_request) Successful in 8s
fae62ac8c1
Phase 1+2 evidence (rev2 PR#633, merged 01:48Z): 6/6 ticks post-merge
with `compensated:0` despite ~25 known-stranded reds visible across
those same 10 SHAs on direct probe ~30min later. Reaper run 17057 at
02:46Z explicitly logged:

    scanned 42 workflows; push-triggered=19, class-O candidates=23
    status-reaper summary: {compensated:0, preserved_non_failure:185,
      scanned_shas:10, limit:10}

Root cause: schedule workflows post `failure` to commit-status
RETROACTIVELY 5-15 min after their merge. By the time reaper's next
*/5 tick lands, the stranded red is on a SHA that has already fallen
OUTSIDE a 10-commit window during a burst-merge period. Reaper
algorithm is correct; the lookback window is too narrow vs. the
retroactive-failure-post lag.

Three-in-one fix (atomic per hongming-pc2 GO 03:25Z):

1. `.gitea/scripts/status-reaper.py`
   DEFAULT_SWEEP_LIMIT 10 -> 30. Trades window-width-cheap for
   cadence-loady; kept `*/5` cron unchanged (avoiding `*/2` which
   would double runner load).

2. `.gitea/workflows/status-reaper.yml`
   Restore schedule cron block (revert mc#645 comment-out for THIS
   workflow only). Cron stays `*/5 * * * *`.

3. `.gitea/workflows/main-red-watchdog.yml`
   Restore schedule cron block (revert mc#645 comment-out) AND raise
   job-level `timeout-minutes: 5 -> 15`. Original 5min cap was
   producing cancels under runner-saturation latency, which fed the
   very `[main-red]` issues this workflow files (self-poisoning).

4. `tests/test_status_reaper.py`
   + test_default_sweep_limit_is_30 (contract pin)
   + test_reap_widened_window_catches_retroactive_failure: mocks 30
     SHAs, plants the failing context on SHA[20] (depth strictly past
     rev2's window=10), asserts the compensation POST lands on that
     SHA. Existing tests retain explicit `limit=10` overrides and
     remain unchanged. Suite: 42/42 passed (was 40 + 2 new).

Verification plan (post-merge, 10-15 min after merge / 2-3 cron ticks):
  - DB: SELECT id, status FROM action_run WHERE workflow_id=
    'status-reaper.yml' ORDER BY id DESC LIMIT 5 -> all status=1
  - Log via web UI:
    /molecule-ai/molecule-core/actions/runs/<index>/jobs/0/logs ->
    summary line should now show compensated > 0 with
    compensated_per_sha populated
  - Direct probe: pick a SHA in the last 30 main commits with class-O
    fails, GET /repos/molecule-ai/molecule-core/commits/{sha}/status
    -> compensated contexts now show state=success with description
    starting 'Compensated by status-reaper'

If rev3 STILL shows compensated:0 after the window-widening, the
diagnosis is wrong and a DIFFERENT bug needs to be uncovered (per
hongming-pc2 caveat 03:25Z). Re-enabling the crons IS the diagnosis
verification.

Cross-links:
  - PR#618 (rev1, drop-concurrency, merge 4db64bcb)
  - PR#633 (rev2, sweep-recent-commits, merge e7965a0f)
  - PR#645 (interim disable, merge 4c54b590) — re-enable being reverted
  - task #90 (orch rev3 tracker) / task #46 (hongming-pc2 tracker)
  - feedback_brief_hypothesis_vs_evidence (empirical evidence above)
  - feedback_strict_root_only_after_class_a (3-in-one root fix vs.
    longer patching chain)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
claude-ceo-assistant approved these changes 2026-05-12 03:30:24 +00:00
claude-ceo-assistant left a comment
Owner

Verdict: APPROVED (counting whitelist — claude-ceo-assistant ∈ managers ≠ author core-devops). Per hongming-pc2 GO 03:25Z. Atomic rev3: limit=30 + watchdog timeout=15m + revert mc#645 cron-disable. 42/42 tests. YAML re-parses clean. Merging.

**Verdict:** APPROVED (counting whitelist — claude-ceo-assistant ∈ managers ≠ author core-devops). Per hongming-pc2 GO 03:25Z. Atomic rev3: limit=30 + watchdog timeout=15m + revert mc#645 cron-disable. 42/42 tests. YAML re-parses clean. Merging.
Owner

/sop-tier-recheck

/sop-tier-recheck
claude-ceo-assistant merged commit 2f51a6176d into main 2026-05-12 03:31:05 +00:00
Sign in to join this conversation.
No Reviewers
2 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: molecule-ai/molecule-core#650