Skip to content

Listener pods crash-loop after upgrading to ARC v0.14.1 — scaleset v0.3.0 exits fatally on EOF from GitHub broker #4488

@jgutierrezglez

Description

@jgutierrezglez

Checks

Controller Version

0.14.1

Deployment Method

Helm

To Reproduce

  1. Deploy ARC v0.14.1 via Helm (gha-runner-scale-set-controller chart)
  2. Observe listener pods in the arc-systems namespace across one or more AutoscalingRunnerSet resources
  3. Within ~30 minutes, listener pods begin terminating with a non-zero exit code
  4. The ARC controller detects the termination and recreates the pod, which crashes again — producing a continuous crash-loop

Describe the bug

After upgrading ARC from v0.14.0 to v0.14.1, listener pods across multiple runner scale sets enter a continuous crash-loop. The controller repeatedly logs:

Listener pod is terminated  {"reason": "Error", "message": ""}

The listener exits with a non-zero status code when it receives an EOF response from the GitHub Actions broker service (broker.actions.githubusercontent.com). This was introduced by the scaleset library upgrade from v0.2.0 → v0.3.0, specifically the "Restore job acquisition flow" change. In v0.3.0, an EOF from the broker is treated as a fatal error instead of being handled gracefully with a retry.

The listener logs confirm scaleset@v0.3.0 is in use:

time=... level=INFO source=github.com/actions/scaleset@v0.3.0/listener/listener.go:179 msg="Getting next message" component=listener lastMessageID=<id>
[DEBUG] GET https://broker.actions.githubusercontent.com/scalesets/message?lastMessageId=<id>

After the listener exits, the ARC controller recreates the pod, which crashes again on the next EOF — resulting in an endless restart loop. From our controller logs, we observed 4 listener pods across 2 production clusters accumulating 36 error terminations within a ~40-minute window.

Note: scaleset v0.4.0 does not fix this. It addresses a different (message ordering) issue. The EOF-on-exit regression is still present in v0.4.0.

Describe the expected behavior

The listener should treat EOF responses from the GitHub broker as transient/recoverable conditions and retry — as it did in scaleset v0.2.0 (ARC v0.14.0). It should not exit with a non-zero status code on EOF.

Additional Context

# gha-runner-scale-set-controller values
image:
  tag: "0.14.1"

# gha-runner-scale-set values (representative — same pattern for all affected scale sets)
githubConfigUrl: "https://github.com/<org>"
minRunners: 0
maxRunners: <N>

Kubernetes server version: v1.35.2-eks-f69f56f

Controller Logs

https://gist.github.com/jgutierrezglez/c2889a9444c3efbe4d6b4908642d8430

Runner Pod Logs

The listener pod logs from the crashing containers are not recoverable. ARC's controller deletes and recreates the listener pod on each termination — once deleted, Kubernetes does not retain the previous container's logs for the new pod. The controller logs in the gist above are the primary evidence of the crash-loop behavior.

The currently-running listener pod logs (same ARC version, stable at the moment) are also included in the gist and confirm scaleset@v0.3.0 is in use, along with the polling loop pattern that crashes on EOF.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions