
[Bug]: Parallel file.managed states cause CPU spin-loop on Linux (fork-inherited ZeroMQ sockets) #68940

@co-cy

What happened?

Description

When parallel: True is used on file.managed states, the forked ParallelState child processes enter a CPU spin-loop (~93–98% CPU each) and never complete. The root cause is that on Linux (fork-based platforms), call_parallel() passes the parent's State instance directly to the child process, including live ZeroMQ sockets connected to the Salt master. Multiple forked children then race on the same inherited ZeroMQ connections, causing the asyncio event loop to spin indefinitely waiting for responses that were already consumed by a sibling process.

Setup

  • Salt version: 3007.13 (Chlorine)
  • OS: Ubuntu 22.04 (Linux, fork-based process creation)
  • Master topology: 3-master failover (master_type: failover)
  • Transport: ZeroMQ

Steps to Reproduce

  1. Create a state with multiple file.managed declarations using parallel: True:

```yaml
redis_exporter_binary:
  file.managed:
    - name: /usr/bin/redis_exporter
    - source: https://nexus.example.com/.../redis_exporter
    - skip_verify: True
    - parallel: True

redis_exporter_env:
  file.managed:
    - name: /etc/default/redis_exporter
    - source: salt://redis_exporter/files/redis_exporter.env.jinja
    - template: jinja
    - parallel: True

sentinel_exporter_env:
  file.managed:
    - name: /etc/default/sentinel_exporter
    - source: salt://redis_exporter/files/redis_exporter.env.jinja
    - template: jinja
    - parallel: True
```

  2. Run state.apply (even with test=True):

```
salt <minion-id> state.apply redis_exporter test=true
```

  3. Observe that the run never completes.

Expected Behavior

All file.managed states should execute in parallel and return results within seconds. With test=True, only hash comparison against the master file server should occur — no file writes.

Actual Behavior

  • The main state process logs "Started in a separate process" for each parallel state and then hangs indefinitely.
  • Two or more ParallelState(...) child processes appear, each consuming ~93–98% CPU:

```
root  2094575 98.1  1.9  610016 56528 ?  Sl  22:14  4:47  ...ParallelState(/etc/default/redis_exporter)
root  2094576 98.2  1.9  610016 56724 ?  Sl  22:14  4:47  ...ParallelState(/etc/default/sentinel_exporter)
```
  • The kernel stack for these processes shows futex_wait, but top reports near-100% CPU — characteristic of a userspace busy-loop (asyncio event loop spinning).
  • Network connections to the master (ports 4505/4506) are ESTABLISHED with Send-Q: 0 — sockets are open but idle.
  • salt-call cp.hash_file <same-file> works perfectly when run standalone (no parallelism).
  • salt-call cp.list_master works perfectly.
  • Killing the hung job with saltutil.kill_all_jobs and re-running produces the same result — the issue is 100% reproducible.

Root Cause Analysis

The issue is in salt/state.py, in the call_parallel() method (line ~2276):

```python
def call_parallel(self, cdata, low, inject_globals):
    ...
    if salt.utils.platform.spawning_platform():
        instance = None  # Windows/macOS: child will recreate State from scratch
    else:
        instance = self  # Linux: reuse parent's State object (with live sockets!)
        inject_globals = None

    proc = salt.utils.process.Process(
        target=self._call_parallel_target,
        args=(instance, self._init_kwargs, name, cdata, low, inject_globals),
        ...
    )
    proc.start()
```

On Linux, Process uses fork(). The child process inherits the parent's memory space, including:

  • ZeroMQ REQ sockets connected to the master's ret port (4506)
  • The asyncio event loop state
  • File client channel objects

When multiple children simultaneously call cp.hash_file (triggered by file.managed to compare file hashes), they all attempt to use the same inherited ZeroMQ socket to communicate with the master. ZeroMQ REQ sockets have strict request-reply ordering — if one child reads a response intended for another, the other child's event loop never receives its expected reply and spins indefinitely.
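This socket-stealing race can be reproduced in miniature with the standard library alone. The sketch below uses a plain socketpair as a stand-in for the minion's single connection to the master; it illustrates the fd-inheritance mechanism, not Salt's actual ZeroMQ transport:

```python
import os
import socket

# A UNIX socketpair stands in for the minion's one connection to the
# master: a single file descriptor, shared by parent and forked child.
minion_sock, master_sock = socket.socketpair()

# The "master" puts exactly one reply on the wire.
master_sock.sendall(b"reply-1")

pid = os.fork()
if pid == 0:
    # The forked child inherits minion_sock and consumes the reply
    # the parent was waiting for.
    minion_sock.recv(16)
    os._exit(0)

os.waitpid(pid, 0)  # the child has definitely read by now

# Without a timeout the parent would now block (or spin, under a polling
# event loop) forever: its reply was stolen by the sibling process.
minion_sock.settimeout(0.5)
try:
    minion_sock.recv(16)
    parent_got_reply = True
except socket.timeout:
    parent_got_reply = False

print("parent got reply:", parent_got_reply)  # False: the child consumed it
```

With a real ZeroMQ REQ socket the outcome is worse still: beyond the lost reply, the socket's strict send/recv state machine is corrupted for every process sharing it.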

On Windows/macOS (spawning platforms), instance is set to None, and _call_parallel_target recreates the State object from scratch with fresh connections — which is why this bug is Linux-specific.

Suggested Fix

Force fresh State instance creation for parallel children on all platforms, not just spawning ones. The simplest approach:

```python
def call_parallel(self, cdata, low, inject_globals):
    ...
    # Always create a fresh instance in the child to avoid
    # sharing ZeroMQ sockets across forked processes
    instance = None

    proc = salt.utils.process.Process(
        target=self._call_parallel_target,
        args=(instance, self._init_kwargs, name, cdata, low, inject_globals),
        ...
    )
    proc.start()
```

This trades a small startup cost (recreating the State object per parallel child) for correctness. The current "optimization" of reusing the parent instance on Linux is unsafe whenever the state function communicates with the master.
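The semantics of the fix can be sketched with a toy stand-in (FakeState and the simplified call_parallel_target below are hypothetical illustrations, not Salt's real classes or signatures): passing instance=None makes the child construct its own object, and therefore its own sockets.

```python
import socket

class FakeState:
    """Hypothetical stand-in for salt.state.State: owns one 'master' socket."""
    def __init__(self):
        self.sock, self._master_side = socket.socketpair()

def call_parallel_target(instance):
    # Mirrors the proposed behavior: with instance=None the child always
    # builds a fresh State (and fresh connections) instead of reusing
    # the parent's fork-inherited one.
    if instance is None:
        instance = FakeState()
    return instance

parent_state = FakeState()                # parent's State, created before forking
child_state = call_parallel_target(None)  # what the fixed child would do

print(child_state is parent_state)            # False: no shared instance
print(child_state.sock is parent_state.sock)  # False: no shared socket
```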

An alternative approach would be to use multiprocessing.set_start_method('forkserver') or 'spawn' for parallel state processes specifically, but this would be a larger change.

Type of salt install

Official deb

Major version

3007.x

What supported OS are you seeing the problem on? Can select multiple. (If bug appears on an unsupported OS, please open a GitHub Discussion instead)

debian-11, debian-12

salt --versions-report output

```
salt --versions-report
Salt Version:
          Salt: 3007.13

Python Version:
        Python: 3.10.19 (main, Feb  5 2026, 07:05:38) [GCC 11.2.0]

Dependency Versions:
          cffi: 2.0.0
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.6
       libgit2: 1.9.1
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 24.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.18.2
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.22.3
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.5.4
           ZMQ: 4.3.4

Salt Extensions:
 saltext.vault: 1.5.0

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: debian 12.13 bookworm
        locale: utf-8
       machine: x86_64
       release: 6.12.73+deb12-amd64
        system: Linux
       version: Debian GNU/Linux 12.13 bookworm
```

Labels

bug (broken, incorrect, or confusing behavior), needs-triage