### What happened?

**Description**

When `parallel: True` is used on `file.managed` states, the forked `ParallelState` child processes enter a CPU spin-loop (~93–98% CPU each) and never complete. The root cause is that on Linux (fork-based platforms), `call_parallel()` passes the parent's `State` instance directly to the child process, including live ZeroMQ sockets connected to the Salt master. Multiple forked children then race on the same inherited ZeroMQ connections, causing the asyncio event loop to spin indefinitely waiting for responses that were already consumed by a sibling process.
**Setup**

- Salt version: 3007.13 (Chlorine)
- OS: Ubuntu 22.04 (Linux, fork-based process creation)
- Master topology: 3-master failover (`master_type: failover`)
- Transport: ZeroMQ
**Steps to Reproduce**

1. Create a state with multiple `file.managed` declarations using `parallel: True`:

   ```yaml
   redis_exporter_binary:
     file.managed:
       - name: /usr/bin/redis_exporter
       - source: https://nexus.example.com/.../redis_exporter
       - skip_verify: True
       - parallel: True

   redis_exporter_env:
     file.managed:
       - name: /etc/default/redis_exporter
       - source: salt://redis_exporter/files/redis_exporter.env.jinja
       - template: jinja
       - parallel: True

   sentinel_exporter_env:
     file.managed:
       - name: /etc/default/sentinel_exporter
       - source: salt://redis_exporter/files/redis_exporter.env.jinja
       - template: jinja
       - parallel: True
   ```

2. Run `state.apply` (even with `test=True`):

   ```
   salt <minion-id> state.apply redis_exporter test=true
   ```

3. Observe that the run never completes.
**Expected Behavior**

All `file.managed` states should execute in parallel and return results within seconds. With `test=True`, only hash comparison against the master file server should occur; no file writes.
Actual Behavior
- The main state process logs
"Started in a separate process" for each parallel state and then hangs indefinitely.
- Two or more
ParallelState(...) child processes appear, each consuming ~93–98% CPU:
root 2094575 98.1 1.9 610016 56528 ? Sl 22:14 4:47 ...ParallelState(/etc/default/redis_exporter)
root 2094576 98.2 1.9 610016 56724 ? Sl 22:14 4:47 ...ParallelState(/etc/default/sentinel_exporter)
- The kernel stack for these processes shows
futex_wait, but top reports near-100% CPU — characteristic of a userspace busy-loop (asyncio event loop spinning).
- Network connections to the master (ports 4505/4506) are
ESTABLISHED with Send-Q: 0 — sockets are open but idle.
salt-call cp.hash_file <same-file> works perfectly when run standalone (no parallelism).
salt-call cp.list_master works perfectly.
- Killing the hung job with
saltutil.kill_all_jobs and re-running produces the same result — the issue is 100% reproducible.
**Root Cause Analysis**

The issue is in `salt/state.py`, in the `call_parallel()` method (line ~2276):

```python
def call_parallel(self, cdata, low, inject_globals):
    ...
    if salt.utils.platform.spawning_platform():
        instance = None  # Windows/macOS: will recreate State from scratch
    else:
        instance = self  # Linux: reuse parent's State object (with live sockets!)
        inject_globals = None
    proc = salt.utils.process.Process(
        target=self._call_parallel_target,
        args=(instance, self._init_kwargs, name, cdata, low, inject_globals),
        ...
    )
    proc.start()
```
On Linux, `Process` uses `fork()`. The child process inherits the parent's memory space, including:

- ZeroMQ `REQ` sockets connected to the master's ret port (4506)
- the asyncio event loop state
- file client channel objects

When multiple children simultaneously call `cp.hash_file` (triggered by `file.managed` to compare file hashes), they all attempt to use the same inherited ZeroMQ socket to communicate with the master. ZeroMQ `REQ` sockets have strict request-reply ordering: if one child reads a response intended for another, the other child's event loop never receives its expected reply and spins indefinitely.

On Windows/macOS (spawning platforms), `instance` is set to `None`, and `_call_parallel_target` recreates the `State` object from scratch with fresh connections, which is why this bug is Linux-specific.
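The failure mode can be illustrated without Salt or ZeroMQ at all. In the sketch below, a plain `socket.socketpair()` stands in for the minion-to-master channel: two forked children wait on the same inherited connection, a single reply arrives, and exactly one child receives it while the sibling waits in vain (here bounded by a timeout, where Salt's event loop spins forever). All names here are illustrative, not Salt code:

```python
import os
import socket

# Two forked children share one inherited connection (socketpair stands in
# for the ZeroMQ channel). The parent sends a single reply: exactly one
# child consumes it; the sibling gets nothing.
parent_end, child_end = socket.socketpair()
results_r, results_w = os.pipe()

pids = []
for _ in range(2):
    pid = os.fork()
    if pid == 0:
        # Child: same underlying socket as the sibling, inherited via fork().
        child_end.settimeout(1.0)  # stand-in for "hangs forever"
        try:
            child_end.recv(1024)
            got = 1
        except socket.timeout:
            got = 0
        os.write(results_w, bytes([got]))
        os._exit(0)
    pids.append(pid)

parent_end.sendall(b"reply")  # one reply, two waiting children
received = sorted(os.read(results_r, 1)[0] for _ in range(2))
for pid in pids:
    os.waitpid(pid, 0)
print(received)  # → [0, 1]: one child got the reply, the other timed out
```

With `REQ` sockets the situation is worse than a lost message: the socket's strict send/recv alternation means the "winning" child may also desynchronize the request-reply state machine for everyone sharing the socket.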
**Suggested Fix**

Force fresh `State` instance creation for parallel children on all platforms, not just spawning ones. The simplest approach:

```python
def call_parallel(self, cdata, low, inject_globals):
    ...
    # Always create a fresh instance in the child to avoid
    # sharing ZeroMQ sockets across forked processes
    instance = None
    proc = salt.utils.process.Process(
        target=self._call_parallel_target,
        args=(instance, self._init_kwargs, name, cdata, low, inject_globals),
        ...
    )
    proc.start()
```
This trades a small startup cost (recreating the `State` object per parallel child) for correctness. The current "optimization" of reusing the parent instance on Linux is unsafe whenever the state function communicates with the master.

An alternative approach would be to use `multiprocessing.set_start_method('forkserver')` or `'spawn'` for parallel state processes specifically, but this would be a larger change.
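As a rough sketch of that alternative (not Salt code): Python's `multiprocessing` exposes start methods per context, so parallel-state children could be created from a `spawn` or `forkserver` context without flipping the process-wide default via `set_start_method()`:

```python
import multiprocessing as mp

# On Linux all three start methods are available; 'fork' is the default,
# and it is what lets children inherit the parent's live sockets.
print(mp.get_all_start_methods())  # e.g. ['fork', 'spawn', 'forkserver']

# A per-call context avoids the global side effect of set_start_method():
# processes created from `ctx` start from a fresh interpreter rather than
# a copy of the parent's memory.
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # → 'spawn'
```

The trade-off is that spawn/forkserver children must re-import their targets and receive arguments by pickling, which is why the source calls this a larger change than simply passing `instance = None`.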
### Type of salt install

Official deb

### Major version

3007.x

### What supported OS are you seeing the problem on? Can select multiple. (If bug appears on an unsupported OS, please open a GitHub Discussion instead)

debian-11, debian-12
### salt --versions-report output

```
Salt Version:
          Salt: 3007.13

Python Version:
        Python: 3.10.19 (main, Feb 5 2026, 07:05:38) [GCC 11.2.0]

Dependency Versions:
          cffi: 2.0.0
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.6
       libgit2: 1.9.1
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 24.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.18.2
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.22.3
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.5.4
           ZMQ: 4.3.4

Salt Extensions:
 saltext.vault: 1.5.0

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: debian 12.13 bookworm
        locale: utf-8
       machine: x86_64
       release: 6.12.73+deb12-amd64
        system: Linux
       version: Debian GNU/Linux 12.13 bookworm
```