Compare commits

6 Commits

Author SHA1 Message Date
core-devops 43f00ddaec docs(runbooks): add Gitea Actions operational quirks reference
Documents four persistent Gitea 1.22.6 Actions quirks discovered
during the 2026-05-11 CI noise investigation (PR #441):

- Runner network isolation: git remote unreachable from container
- continue-on-error only at step level: job-level flag ignored
- workflow_dispatch.inputs not supported: parser rejects at load time
- fetch-depth:0 times out: use fetch-depth:1 + Compare API

Closes #457.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:09:37 +00:00
core-devops 004599750f fix(ci): replace fetch-depth: 0 with targeted shallow fetch in detect-changes
Root cause of mc#1314: detect-changes jobs in CI/E2E workflows were
running `fetch-depth: 0` (full repository history clone) before
computing the git diff. On large repositories this takes 10+ minutes,
causing the detect-changes job itself to time out and fail.

Fix: use `fetch-depth: 1` (shallow clone of HEAD only) plus explicit
`git fetch --depth=1 origin <BASE> --no-walk` to fetch the BASE commit
without its ancestry. This makes detect-changes complete in seconds
instead of minutes.

Files changed:
- ci.yml: changes job
- e2e-api.yml: detect-changes job
- e2e-staging-canvas.yml: detect-changes job
- runtime-prbuild-compat.yml: detect-changes job

Lint workflows (lint-mask-pr-atomicity, lint-required-context-exists-in-bp,
check-migration-collisions, lint-pre-flip-continue-on-error) retain
fetch-depth: 0 because they use `git show <base>:<path>` which needs
the full blob set from the base commit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 13:09:37 +00:00
devops-engineer 8e754e6b28 Merge pull request 'fix(workspace-server): inject /configs token files agent-owned, not root (P0 list_peers 401)' (#1327) from fix/workspace-token-injection-agent-owned into main
2026-05-16 12:51:57 +00:00
devops-engineer 8179ff77e9 Merge branch 'main' into fix/workspace-token-injection-agent-owned
2026-05-16 12:05:32 +00:00
infra-runtime-be 6188c6ddf3 fix(org_helpers): correct duplicate phrase in loadWorkspaceEnv comment
The comment had the phrase "the workspace-specific .env" duplicated.
Removed the redundant repetition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 10:27:13 +00:00
core-be f986444dbd fix(workspace-server): inject /configs token files agent-owned, not root
The fleet-wide list_peers 401 (Hermes et al): two workspace-server
token-injection paths wrote /configs/.auth_token (and
/configs/.platform_inbound_secret) as root:root 0600 AFTER the template
entrypoint's `chown -R agent:agent /configs` ran. The a2a_mcp_server runs
as the agent uid (1000, via `gosu agent`), so platform_auth.get_token()
hit `[Errno 13] Permission denied` → empty bearer → platform 401 on
/registry/{id}/peers (the literal tool_list_peers path).

PR#23 fixed only the entrypoint dir chown (first boot); it cannot reach
the post-entrypoint root re-injection. This covers both injection paths:

1. WriteAuthTokenToVolume (#1877, pre-start): the throwaway alpine
   container ran chmod 0600 but never chowned — alpine runs as root, so
   the file stayed root:root. Now `chown 1000:1000 /vol/.auth_token`
   (0600 preserved).
2. WriteFilesToContainer (#418, post-start re-injection): the tar headers
   left Uid/Gid unset → CopyToContainer extracted root:root. Now every
   tar entry is stamped Uid/Gid = agent. This path (re)writes BOTH
   .auth_token and .platform_inbound_secret, so both are fixed.

uid 1000:1000 verified from the templates (claude-code-default + hermes
Dockerfile `useradd -u 1000 ... agent`, entrypoint `gosu agent`), exposed
as AgentUID/AgentGID constants. Tar-build and alpine-cmd extracted into
pure helpers (mirrors buildTemplateTar) so the ownership contract is
unit-tested without a live Docker daemon; the test fails on pre-fix
root:root and passes post-fix (real tar / real command, not a mock).
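
The ownership contract described above can be checked without Docker by building a tar stream and reading its headers back. The following is a minimal sketch, not the code from this change: `buildOwnedTar` and `wrongOwnerEntries` are hypothetical names, and the 1000:1000 constants are assumed to match AgentUID/AgentGID.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// Assumed to mirror the AgentUID/AgentGID constants (1000:1000).
const (
	agentUID = 1000
	agentGID = 1000
)

// buildOwnedTar is a hypothetical stand-in for the real helper. It packs
// in-memory files into a tar stream and stamps every header with the
// agent uid/gid, so extraction does not default to root:root (0:0).
func buildOwnedTar(files map[string][]byte) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for name, data := range files {
		hdr := &tar.Header{
			Typeflag: tar.TypeReg,
			Name:     name,
			Mode:     0644,
			Size:     int64(len(data)),
			Uid:      agentUID,
			Gid:      agentGID,
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, fmt.Errorf("write tar header for %s: %w", name, err)
		}
		if _, err := tw.Write(data); err != nil {
			return nil, fmt.Errorf("write tar data for %s: %w", name, err)
		}
	}
	if err := tw.Close(); err != nil {
		return nil, fmt.Errorf("close tar writer: %w", err)
	}
	return &buf, nil
}

// wrongOwnerEntries reads a tar stream back and returns the names of all
// entries whose header is not owned by the agent uid/gid.
func wrongOwnerEntries(r io.Reader) ([]string, error) {
	var bad []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return bad, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Uid != agentUID || hdr.Gid != agentGID {
			bad = append(bad, hdr.Name)
		}
	}
}

func main() {
	buf, err := buildOwnedTar(map[string][]byte{".auth_token": []byte("example-token")})
	if err != nil {
		panic(err)
	}
	bad, err := wrongOwnerEntries(buf)
	if err != nil {
		panic(err)
	}
	fmt.Printf("entries with wrong owner: %d\n", len(bad))
}
```

Dropping the `Uid`/`Gid` fields from the header makes `wrongOwnerEntries` report every entry, which is the pre-fix root:root behaviour the message describes.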

PR#23's entrypoint chown is unchanged (still correct for the dir +
first boot). No feature flag, no backwards-compat shim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 02:19:11 -07:00
8 changed files with 342 additions and 17 deletions
+12 -1
@@ -84,7 +84,11 @@ jobs:
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
# Shallow clone: only the PR head tip. BASE commit is fetched below.
# mc#1314 fix: was fetch-depth: 0 (full history clone), which caused
# detect-changes to hang for 10+ minutes on large repos. The diff only
# needs HEAD + BASE, so we fetch those two commits explicitly.
fetch-depth: 1
- id: check
run: |
# For PR events: diff against the base branch (not HEAD~1 of the branch,
@@ -107,6 +111,13 @@ jobs:
echo "scripts=true" >> "$GITHUB_OUTPUT"
exit 0
fi
# Shallow-fetch the BASE commit explicitly. git fetch --depth=1 with
# --no-walk fetches the commit without its full ancestry (fast), then
# git diff works because both BASE and HEAD are now in the object store.
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" --no-walk 2>/dev/null || \
git fetch --depth=50 origin "$BASE" 2>/dev/null || true
fi
# Workflow-only edits are covered by the workflow lint family
# and by this workflow's always-present required jobs. Do not fan
# those edits out into Go/Canvas/Python/shellcheck work; the
+10 -2
@@ -117,7 +117,11 @@ jobs:
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
# Shallow clone: only the PR head tip. BASE commit is fetched below.
# mc#1314 fix: was fetch-depth: 0 (full history clone), which caused
# detect-changes to hang for 10+ minutes on large repos. The diff only
# needs HEAD + BASE, so we fetch those two commits explicitly.
fetch-depth: 1
- id: decide
# Inline replacement for dorny/paths-filter — same pattern PR#372's
# ci.yml port used. Diffs against the PR base or push BEFORE SHA,
@@ -131,8 +135,12 @@ jobs:
echo "api=true" >> "$GITHUB_OUTPUT"
exit 0
fi
# Shallow-fetch the BASE commit explicitly. git fetch --depth=1 with
# --no-walk fetches the commit without its full ancestry (fast), then
# git diff works because both BASE and HEAD are now in the object store.
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
git fetch --depth=1 origin "$BASE" --no-walk 2>/dev/null || \
git fetch --depth=50 origin "$BASE" 2>/dev/null || true
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
echo "api=true" >> "$GITHUB_OUTPUT"
+10 -2
@@ -77,7 +77,11 @@ jobs:
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
# Shallow clone: only the PR head tip. BASE commit is fetched below.
# mc#1314 fix: was fetch-depth: 0 (full history clone), which caused
# detect-changes to hang for 10+ minutes on large repos. The diff only
# needs HEAD + BASE, so we fetch those two commits explicitly.
fetch-depth: 1
- id: decide
# Inline replacement for dorny/paths-filter — see e2e-api.yml.
# Cron triggers always run real work (no diff context).
@@ -94,8 +98,12 @@ jobs:
echo "canvas=true" >> "$GITHUB_OUTPUT"
exit 0
fi
# Shallow-fetch the BASE commit explicitly. git fetch --depth=1 with
# --no-walk fetches the commit without its full ancestry (fast), then
# git diff works because both BASE and HEAD are now in the object store.
if ! git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
git fetch --depth=1 origin "$BASE" --no-walk 2>/dev/null || \
git fetch --depth=50 origin "$BASE" 2>/dev/null || true
fi
if ! git cat-file -e "$BASE" 2>/dev/null; then
echo "canvas=true" >> "$GITHUB_OUTPUT"
+10 -2
@@ -59,7 +59,11 @@ jobs:
steps:
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
# Shallow clone: only the PR head tip. BASE commit is fetched below.
# mc#1314 fix: was fetch-depth: 0 (full history clone), which caused
# detect-changes to hang for 10+ minutes on large repos. The diff only
# needs HEAD + BASE, so we fetch those two commits explicitly.
fetch-depth: 1
- id: decide
run: |
# Inline replacement for dorny/paths-filter — same pattern
@@ -84,8 +88,12 @@ jobs:
echo "wheel=true" >> "$GITHUB_OUTPUT"
exit 0
fi
# Shallow-fetch the BASE commit explicitly. git fetch --depth=1 with
# --no-walk fetches the commit without its full ancestry (fast), then
# git diff works because both BASE and HEAD are now in the object store.
if ! timeout 30 git cat-file -e "$BASE" 2>/dev/null; then
git fetch --depth=1 origin "$BASE" 2>/dev/null || true
git fetch --depth=1 origin "$BASE" --no-walk 2>/dev/null || \
git fetch --depth=50 origin "$BASE" 2>/dev/null || true
fi
if ! timeout 30 git cat-file -e "$BASE" 2>/dev/null; then
echo "wheel=true" >> "$GITHUB_OUTPUT"
+139
@@ -0,0 +1,139 @@
# Gitea Actions Operational Quirks
Four persistent Gitea 1.22.6 Actions quirks discovered during the 2026-05-11 CI noise investigation (PR #441). These are environment-level facts, not bugs to fix — write and review workflows with them in mind.

---

## 1. Runner Network Isolation
**Symptom**: `git fetch`, `git clone`, and other outbound TCP connections from within act_runner job containers silently time out. The git remote (`git.moleculesai.app`) is reachable from the act_runner host process but not from inside the ephemeral job containers.
**Confirmed scope**: all `molecule-runner-*` act_runner containers, which run jobs in their own network namespace (started via Docker `--network host`, but with iptables isolation inside the container).
**Impact**: any workflow step that calls `git fetch` or `git clone` inside the job container will hang and eventually time out. This was the root cause of the 2026-05-11 CI noise (PR #441).
### Workarounds
**Prefer API calls over git** (preferred):
- Use the Gitea Compare API (`/api/v1/repos/{owner}/{repo}/compare/{base}...{head}`) instead of `git diff`. The Compare API returns the list of changed files directly without needing git history in the container.
- Example from `harness-replays.yml` `detect-changes` step: `curl -sS "$GITHUB_SERVER_URL/api/v1/repos/$GITHUB_REPOSITORY/compare/$BASE...$HEAD"`
- For push events where SHA-to-branch comparison is rejected (`BaseNotExist`), use the `github.event.commits` array instead — each commit object includes its added/removed/modified file list.
- See `.gitea/scripts/compare-api-diff-files.py` and `.gitea/scripts/push-commits-diff-files.py` for existing helpers.
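
As a rough illustration of the Compare API approach in the list above, the sketch below builds the endpoint URL and reduces a response body to a de-duplicated file list. It is not one of the existing helper scripts: the function names are hypothetical, and the `commits[].files[].filename` response shape is an assumption that should be checked against the Gitea version in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// compareResponse models only the fields this sketch reads. The
// commits[].files[].filename shape is an assumption about the response.
type compareResponse struct {
	Commits []struct {
		Files []struct {
			Filename string `json:"filename"`
		} `json:"files"`
	} `json:"commits"`
}

// compareURL builds the Compare API endpoint for a base...head pair.
// repo is expected in "owner/name" form, as in $GITHUB_REPOSITORY.
func compareURL(serverURL, repo, base, head string) string {
	return fmt.Sprintf("%s/api/v1/repos/%s/compare/%s...%s", serverURL, repo, base, head)
}

// filesFromCompare decodes a Compare API response body and returns the
// sorted, de-duplicated list of file names touched by any commit.
func filesFromCompare(body []byte) ([]string, error) {
	var cr compareResponse
	if err := json.Unmarshal(body, &cr); err != nil {
		return nil, fmt.Errorf("decode compare response: %w", err)
	}
	seen := map[string]struct{}{}
	for _, c := range cr.Commits {
		for _, f := range c.Files {
			seen[f.Filename] = struct{}{}
		}
	}
	files := make([]string, 0, len(seen))
	for name := range seen {
		files = append(files, name)
	}
	sort.Strings(files)
	return files, nil
}

func main() {
	// Canned sample body; a real caller would fetch compareURL(...) over HTTP.
	sample := []byte(`{"commits":[{"files":[{"filename":"ci.yml"}]},{"files":[{"filename":"ci.yml"},{"filename":"docs/runbook.md"}]}]}`)
	files, err := filesFromCompare(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(compareURL("https://git.example.test", "owner/repo", "BASE", "HEAD"))
	fmt.Println(files)
}
```

Keeping the parsing separate from the HTTP call means the file-list logic can be exercised with canned bodies, independent of runner networking.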
**If git inside the container is unavoidable**:
- Use `actions/checkout` with `fetch-depth: 1` (shallow clone) — the checkout action runs on the host side and mounts the repo into the container via `git clone --shared`, so it does not hit the container's outbound git limitation.
- The cloned files are available inside the container at the usual path.
- **Do not** run `git fetch` inside a `run:` shell step — it will hang. Use the `actions/checkout` step instead.
**Anti-pattern (do not use)**:
```yaml
# WRONG — hangs in Gitea Actions runner containers
- name: Fetch base ref
run: git fetch origin ${{ github.event.pull_request.base.sha }}
```

---

## 2. `continue-on-error` Only at Step Level
**Symptom**: `continue-on-error: true` set at the **job level** is silently ignored by Gitea 1.22.6. A failing job still fails the overall workflow run, as if the flag were absent; only steps individually marked `continue-on-error: true` are tolerated.
**Impact**: a job-level `continue-on-error` used as an escape hatch for flaky steps will NOT work. The escape hatch must be per-step.
**Correct pattern**:
```yaml
jobs:
my-job:
runs-on: ubuntu-latest
steps:
- name: Flaky step
continue-on-error: true # ← must be here, on the step
run: ./might-fail.sh
- name: Deterministic step
run: ./always-works.sh
```
**Wrong pattern (ignored)**:
```yaml
jobs:
my-job:
runs-on: ubuntu-latest
continue-on-error: true # ← ignored by Gitea 1.22.6; do not rely on this
steps:
- name: Flaky step
run: ./might-fail.sh
```
**Historical context**: this was the root cause of mc#774-style "pre-existing continue-on-error mask" escapes. Before the bug was identified, jobs were using job-level `continue-on-error: true` as an escape hatch; when that stopped working (or was never working on Gitea), the flaky steps leaked failures through. The correct fix is step-level `continue-on-error: true` plus a `mc#314`-tagged comment with a removal date/commit reference so the escape hatch is not permanent.

---

## 3. `workflow_dispatch.inputs` Not Supported
**Symptom**: `workflow_dispatch.inputs` blocks in workflow YAML are rejected by the Gitea 1.22.6 workflow parser with an error at parse time. The workflow will not register.
**Impact**: all workflows ported from GitHub Actions (per RFC internal#219 §1 sweep) dropped their `workflow_dispatch.inputs` blocks. Any future workflow that declares manual `workflow_dispatch` inputs will fail to register.
**Workaround**: use environment variables or secrets as configuration channels instead. For path-filtered manual runs, use `workflow_dispatch` without inputs and gate logic inside the job with `if:` conditions.
**Example — replace inputs with env**:
```yaml
# GitHub Actions (what we used to write):
on:
workflow_dispatch:
inputs:
target:
type: choice
options: [platform, canvas, all]
# Gitea Actions (what we write now):
on:
workflow_dispatch:
# no inputs block — not supported
env:
TARGET: ${{ github.event.inputs.target || 'all' }} # ← undefined; handle inside steps
```
If a choice is needed, document it in the workflow comment and use a separate job or step `if:` condition.

---

## 4. `fetch-depth: 0` Times Out in Container
**Symptom**: `actions/checkout` with `fetch-depth: 0` (full history clone) hangs and times out in Gitea Actions runner containers. The act_runner host can clone fine, but the container's network isolation (see quirk #1) prevents the underlying full-history `git fetch-pack` from completing.
**Impact**: any workflow that needs both base and head SHAs locally for `git diff` must not rely on `fetch-depth: 0`.
**Workaround**: use `fetch-depth: 1` (shallow clone) combined with the Gitea Compare API or `github.event.commits` array (see quirk #1). The Compare API returns the same file-diff information without any git history in the container.
**Correct pattern**:
```yaml
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 1 # ← shallow clone only
# Then use Compare API or commits array for changed-file detection
```
**Wrong pattern (hangs)**:
```yaml
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0 # ← times out in Gitea Actions runner containers
```
**Note**: `actions/checkout` itself runs on the host side (the act_runner process) and is not subject to container network isolation — the shallow clone via `fetch-depth: 1` succeeds because the checkout action performs it on the host. The restriction applies only to `run:` shell steps that independently call git.
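
For push events, the `github.event.commits` fallback mentioned in the workaround can be sketched as follows. This is an illustrative sketch rather than the existing `push-commits-diff-files.py` helper: `filesFromPushEvent` is a hypothetical name, and it assumes the event payload is available as JSON (for example from the file named by `GITHUB_EVENT_PATH`, if the runner sets it the way GitHub Actions does).

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// pushEvent models only the per-commit file lists of a push payload.
type pushEvent struct {
	Commits []struct {
		Added    []string `json:"added"`
		Removed  []string `json:"removed"`
		Modified []string `json:"modified"`
	} `json:"commits"`
}

// filesFromPushEvent returns the sorted union of every added, removed,
// and modified path across all commits in the push payload.
func filesFromPushEvent(payload []byte) ([]string, error) {
	var ev pushEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return nil, fmt.Errorf("decode push event: %w", err)
	}
	seen := map[string]struct{}{}
	for _, c := range ev.Commits {
		for _, group := range [][]string{c.Added, c.Removed, c.Modified} {
			for _, path := range group {
				seen[path] = struct{}{}
			}
		}
	}
	files := make([]string, 0, len(seen))
	for path := range seen {
		files = append(files, path)
	}
	sort.Strings(files)
	return files, nil
}

func main() {
	// Canned payload standing in for the real event JSON.
	sample := []byte(`{"commits":[
		{"added":["docs/a.md"],"removed":[],"modified":["ci.yml"]},
		{"added":[],"removed":["old.sh"],"modified":["ci.yml"]}
	]}`)
	files, err := filesFromPushEvent(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(files)
}
```

The union is computed per path, so a file touched by several commits in one push appears once in the result.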

---

## Enforcement in CI
These quirks are captured as enforceable lint rules in `lint-workflow-yaml.yml`, which runs `.gitea/scripts/lint-workflow-yaml.py` against all `.gitea/workflows/*.yml` files. The script currently covers:
1. `workflow_dispatch.inputs` blocks (rule-1)
2. `on: workflow_run` triggers (rule-2 — Gitea 1.22.6 lacks the event)
3. Job names containing `/` (rule-3 — breaks status-context tokenization)
4. Cross-file job-name collisions (rule-4)
5. `uses: org/repo@sha` pointing at non-molecule repos (rule-5)
6. `api.github.com` URL references without `GITHUB_SERVER_URL` set (rule-6 — warning)
The `fetch-depth: 0` pattern and `git fetch` inside `run:` steps (quirks #4 and #1 above) are not yet covered by automated lint. Until they are, review workflow changes manually for these shapes. Do not add `continue-on-error: true` to the lint job as an escape hatch. If a lint fires for a legitimate reason, fix the workflow rather than suppressing the lint.
@@ -177,7 +177,7 @@ func isEnvIdentPart(c byte) bool {
return isEnvIdentStart(c) || (c >= '0' && c <= '9')
}
// loadWorkspaceEnv reads the org root .env and the workspace-specific .env .env and the workspace-specific .env
// loadWorkspaceEnv reads the org root .env and the workspace-specific .env
// (workspace overrides org root). Used by both secret injection and channel
// config expansion.
//
@@ -189,6 +189,24 @@ const containerNamePrefix = "ws-"
// (the wiped-DB case after `docker compose down -v`).
const LabelManaged = "molecule.platform.managed"
// AgentUID / AgentGID are the uid/gid of the unprivileged `agent` user that
// every workspace template creates and drops to via `gosu agent` before
// exec'ing the runtime (the a2a_mcp_server runs under this uid). The value is
// fixed at 1000:1000 across all templates — see:
// - workspace-configs-templates/claude-code-default/Dockerfile (`useradd -u 1000 ... agent`)
// - workspace-configs-templates/hermes/Dockerfile (`useradd -u 1000 ... agent`)
// - workspace/entrypoint.sh (`exec gosu agent` — "uid 1000")
//
// Files the platform injects into /configs AFTER the entrypoint's
// `chown -R agent:agent /configs` (the post-start #418 re-injection and the
// pre-start #1877 volume write) must be owned by this uid/gid, otherwise the
// agent-uid MCP server hits EACCES reading /configs/.auth_token, sends an
// empty bearer, and the platform 401s on /registry/{id}/peers (list_peers).
const (
AgentUID = 1000
AgentGID = 1000
)
// managedLabels is the canonical label map applied to every workspace
// container + volume. Pulled out so a future addition (e.g. instance
// UUID for multi-platform-shared-daemon disambiguation) is one edit.
@@ -862,8 +880,18 @@ func buildTemplateTar(templatePath string) (*bytes.Buffer, error) {
return &buf, nil
}
// WriteFilesToContainer writes in-memory files into /configs in the container.
func (p *Provisioner) WriteFilesToContainer(ctx context.Context, containerID string, files map[string][]byte) error {
// buildConfigFilesTar builds the tar stream that WriteFilesToContainer streams
// into /configs via CopyToContainer. Every entry is stamped Uid/Gid = agent
// (AgentUID/AgentGID) so the files land agent-owned after extraction. This is
// the issue #418 post-start re-injection path: it runs AFTER the template
// entrypoint's `chown -R agent:agent /configs`, so without explicit ownership
// in the tar header the files extract as root:root (tar Uid/Gid default 0) and
// the agent-uid MCP server can no longer read /configs/.auth_token (and
// /configs/.platform_inbound_secret) → empty bearer → list_peers 401.
//
// Pulled out as a pure function so the ownership contract is unit-testable
// without a live Docker daemon (mirrors buildTemplateTar).
func buildConfigFilesTar(files map[string][]byte) (*bytes.Buffer, error) {
var buf bytes.Buffer
tw := tar.NewWriter(&buf)
@@ -876,8 +904,10 @@ func (p *Provisioner) WriteFilesToContainer(ctx context.Context, containerID str
Typeflag: tar.TypeDir,
Name: dir + "/",
Mode: 0755,
Uid: AgentUID,
Gid: AgentGID,
}); err != nil {
return fmt.Errorf("failed to write tar dir header for %s: %w", dir, err)
return nil, fmt.Errorf("failed to write tar dir header for %s: %w", dir, err)
}
createdDirs[dir] = true
}
@@ -886,19 +916,30 @@ func (p *Provisioner) WriteFilesToContainer(ctx context.Context, containerID str
Name: name,
Mode: 0644,
Size: int64(len(data)),
Uid: AgentUID,
Gid: AgentGID,
}
if err := tw.WriteHeader(header); err != nil {
return fmt.Errorf("failed to write tar header for %s: %w", name, err)
return nil, fmt.Errorf("failed to write tar header for %s: %w", name, err)
}
if _, err := tw.Write(data); err != nil {
return fmt.Errorf("failed to write tar data for %s: %w", name, err)
return nil, fmt.Errorf("failed to write tar data for %s: %w", name, err)
}
}
if err := tw.Close(); err != nil {
return fmt.Errorf("failed to close tar writer: %w", err)
return nil, fmt.Errorf("failed to close tar writer: %w", err)
}
return &buf, nil
}
return p.cli.CopyToContainer(ctx, containerID, "/configs", &buf, container.CopyToContainerOptions{})
// WriteFilesToContainer writes in-memory files into /configs in the container,
// agent-owned (see buildConfigFilesTar).
func (p *Provisioner) WriteFilesToContainer(ctx context.Context, containerID string, files map[string][]byte) error {
buf, err := buildConfigFilesTar(files)
if err != nil {
return err
}
return p.cli.CopyToContainer(ctx, containerID, "/configs", buf, container.CopyToContainerOptions{})
}
// CopyToContainer exposes CopyToContainer from the Docker client for use by other packages.
@@ -988,13 +1029,28 @@ func (p *Provisioner) ReadFromVolume(ctx context.Context, volumeName, filePath s
return clean, nil
}
// writeAuthTokenVolumeCmd returns the shell command the throwaway alpine
// container runs to seed /vol/.auth_token. alpine runs it as root, so without
// the explicit chown to AgentUID:AgentGID the file is written root:root after
// the template entrypoint's `chown -R agent:agent /configs` has already run,
// and stays that way. The agent-uid (AgentUID) MCP server then gets EACCES
// reading it → empty bearer → list_peers 401. Pulled out as a pure function so
// the ownership contract is unit-testable without a live Docker daemon.
// Issue #1877.
func writeAuthTokenVolumeCmd() string {
return fmt.Sprintf(
"mkdir -p /vol && printf '%%s' \"$TOKEN\" > /vol/.auth_token && chmod 0600 /vol/.auth_token && chown %d:%d /vol/.auth_token",
AgentUID, AgentGID,
)
}
// WriteAuthTokenToVolume writes the workspace auth token into the config volume
// BEFORE the container starts, eliminating the token-injection race window where
// a restarted container could read a stale token from /configs/.auth_token before
// WriteFilesToContainer writes the new one. Issue #1877.
//
// Uses a throwaway alpine container to write directly to the named volume,
// bypassing the container lifecycle entirely.
// bypassing the container lifecycle entirely. The written file is chowned to
// the agent uid/gid (see writeAuthTokenVolumeCmd).
func (p *Provisioner) WriteAuthTokenToVolume(ctx context.Context, workspaceID, token string) error {
if p == nil || p.cli == nil {
return ErrNoBackend
@@ -1002,7 +1058,7 @@ func (p *Provisioner) WriteAuthTokenToVolume(ctx context.Context, workspaceID, t
volName := ConfigVolumeName(workspaceID)
resp, err := p.cli.ContainerCreate(ctx, &container.Config{
Image: "alpine",
Cmd: []string{"sh", "-c", "mkdir -p /vol && printf '%s' $TOKEN > /vol/.auth_token && chmod 0600 /vol/.auth_token"},
Cmd: []string{"sh", "-c", writeAuthTokenVolumeCmd()},
Env: []string{"TOKEN=" + token},
}, &container.HostConfig{
Binds: []string{volName + ":/vol"},
@@ -0,0 +1,95 @@
package provisioner
import (
"archive/tar"
"errors"
"io"
"strings"
"testing"
)
// These tests pin the P0 fix for the fleet-wide list_peers 401 (Hermes and
// every other template): the workspace-server token-injection paths wrote
// /configs/.auth_token (and /configs/.platform_inbound_secret) as root:root
// AFTER the template entrypoint's `chown -R agent:agent /configs` ran, so the
// agent-uid (1000) MCP server (a2a_mcp_server, running via `gosu agent`) hit
// `[Errno 13] Permission denied` reading the bearer → empty bearer → platform
// 401 on /registry/{id}/peers (the literal tool_list_peers path).
//
// The agent uid:gid is 1000:1000, verified from the templates:
// - workspace-configs-templates/claude-code-default/Dockerfile: `useradd -u 1000 ... agent`
// - workspace-configs-templates/hermes/Dockerfile: `useradd -u 1000 ... agent`
// - workspace/entrypoint.sh / claude-code-default/entrypoint.sh: `exec gosu agent` ("uid 1000")
//
// Both tests assert the real artifact (the tar headers Docker's CopyToContainer
// honours for ownership, and the literal shell command the throwaway alpine
// container runs), not a mock that bypasses ownership. They FAIL on pre-fix
// code (no Uid/Gid in tar headers; no chown in the alpine command → root:root)
// and PASS post-fix (agent-owned).
// TestWriteFilesToContainerTar_FilesAreAgentOwned covers the issue #418
// post-start re-injection path (WriteFilesToContainer): the tar it streams
// into /configs via CopyToContainer must carry Uid/Gid = agent (1000) so the
// extracted files land agent-readable, not root:root. This is the path that
// (re)writes BOTH .auth_token and .platform_inbound_secret on a cadence.
func TestWriteFilesToContainerTar_FilesAreAgentOwned(t *testing.T) {
files := map[string][]byte{
".auth_token": []byte("tok-abc123"),
".platform_inbound_secret": []byte("inbound-secret-xyz"),
"nested/dir/file.txt": []byte("data"),
}
buf, err := buildConfigFilesTar(files)
if err != nil {
t.Fatalf("buildConfigFilesTar: %v", err)
}
tr := tar.NewReader(buf)
seen := map[string]bool{}
for {
hdr, err := tr.Next()
if errors.Is(err, io.EOF) {
break
}
if err != nil {
t.Fatalf("read tar: %v", err)
}
if _, err := io.Copy(io.Discard, tr); err != nil {
t.Fatalf("drain %s: %v", hdr.Name, err)
}
seen[hdr.Name] = true
if hdr.Uid != AgentUID {
t.Fatalf("tar entry %q Uid = %d, want %d (agent) — root-owned injection causes the list_peers 401",
hdr.Name, hdr.Uid, AgentUID)
}
if hdr.Gid != AgentGID {
t.Fatalf("tar entry %q Gid = %d, want %d (agent)", hdr.Name, hdr.Gid, AgentGID)
}
}
for _, want := range []string{".auth_token", ".platform_inbound_secret"} {
if !seen[want] {
t.Fatalf("tar missing %q (seen: %v)", want, seen)
}
}
}
// TestWriteAuthTokenVolumeCmd_ChownsToAgent covers the issue #1877 pre-start
// volume-write path (WriteAuthTokenToVolume): the throwaway alpine container
// writes /vol/.auth_token then chmod 0600 but, pre-fix, never chowns it, so it
// stays root:root (alpine runs the command as root). The literal command must
// chown the file to the agent uid:gid so the agent-uid MCP server can read it.
func TestWriteAuthTokenVolumeCmd_ChownsToAgent(t *testing.T) {
cmd := writeAuthTokenVolumeCmd()
if !strings.Contains(cmd, "chmod 0600 /vol/.auth_token") {
t.Fatalf("alpine cmd lost the 0600 chmod (regression): %q", cmd)
}
wantChown := "chown 1000:1000 /vol/.auth_token"
if !strings.Contains(cmd, wantChown) {
t.Fatalf("alpine cmd = %q, missing %q — without it .auth_token stays root:root "+
"and the agent-uid MCP server gets EACCES → empty bearer → list_peers 401",
cmd, wantChown)
}
}