
Fix telnet console silent-proxy hang on non-ConnectionError exit (#2344)#2669

Open
arl1984 wants to merge 1 commit into GNS3:2.2 from arl1984:fix/telnet-proxy-reader-cleanup-2344

Conversation


@arl1984 arl1984 commented Apr 16, 2026

Summary

Fixes #2344 — "Telnet console death issue" — in the 2.2 branch.

`AsyncioTelnetServer.run()`'s cleanup was guarded by `except (ConnectionError, OSError):`, so any exit via `asyncio.CancelledError` or another exception type skipped the cleanup and left `_reader_process` pinned to a dead reader. Every subsequent client then got `None` back from `_get_reader()`, so no data flowed from the node: the exact silent-proxy symptom the issue describes.
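A minimal, self-contained sketch of the failure mode (hypothetical names; `process()` stands in for `_process()` awaiting `network_read` / `reader_read`): cleanup that lives only in an `except (ConnectionError, OSError)` clause never runs when the task is cancelled.

```python
import asyncio

cleanup_ran = False

async def process():
    # Stand-in for _process() blocked awaiting network_read / reader_read.
    await asyncio.sleep(10)

async def run():
    global cleanup_ran
    try:
        await process()
    except (ConnectionError, OSError):
        cleanup_ran = True  # cleanup lives only here; CancelledError bypasses it

async def main():
    task = asyncio.create_task(run())
    await asyncio.sleep(0.05)
    task.cancel()  # simulate the parent task being cancelled
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("cleanup ran:", cleanup_ran)  # False: the cleanup was skipped

asyncio.run(main())
```

The `CancelledError` raised inside `process()` propagates straight through `run()` because it matches neither `ConnectionError` nor `OSError`, leaving the cleanup flag untouched.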

The existing fixes on 2.2 (763ef241 wait_for on drain; ffcfa4cc catch OSError; #2347) each addressed a different failure mode of the same symptom, but didn't cover the CancelledError path. That's the specific case this PR handles.

Change

Convert the except block to try/finally so cleanup always runs regardless of how _process() exits:

  • Catch `(ConnectionError, OSError, asyncio.CancelledError)` plus a generic `Exception` (with `log.exception`) so unexpected failures are logged rather than silently swallowed, while cleanup itself runs in the `finally` block.
  • Reset _current_read = None after cancellation so _get_reader() doesn't hand a dead future to the next reader-process owner.
  • Use dict.pop(..., None) instead of del to avoid KeyError races if the broadcast loop's timeout handler already removed the entry.
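The shape of the change, sketched as a simplified stand-in for `AsyncioTelnetServer` (hypothetical class and `_process()` body, not the actual patch; attribute names follow the PR description):

```python
import asyncio
import logging

log = logging.getLogger(__name__)

class TelnetProxy:
    """Simplified illustration of the try/finally cleanup ordering."""

    def __init__(self):
        self._connections = {}
        self._reader_process = None
        self._current_read = None

    async def _process(self, network_reader, network_writer):
        await asyncio.sleep(3600)  # stand-in for the real forwarding loop

    async def run(self, network_reader, network_writer):
        self._reader_process = network_reader
        self._connections[network_writer] = object()
        try:
            await self._process(network_reader, network_writer)
        except (ConnectionError, OSError, asyncio.CancelledError):
            pass  # expected exit paths; cleanup happens in finally
        except Exception:
            log.exception("Unexpected error in telnet proxy")
        finally:
            # Always release the reader so the next client gets data.
            if self._reader_process is network_reader:
                self._reader_process = None
            if self._current_read is not None:
                self._current_read.cancel()
                self._current_read = None
            # pop() instead of del: the broadcast loop's timeout handler
            # may already have removed this entry.
            self._connections.pop(network_writer, None)

proxy = TelnetProxy()

async def demo():
    reader, writer = object(), object()
    task = asyncio.create_task(proxy.run(reader, writer))
    await asyncio.sleep(0.05)
    task.cancel()  # abrupt client disconnect / parent-task cancellation
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(demo())
print("reader released:", proxy._reader_process is None)
```

With the cleanup in `finally`, the reader slot and connection entry are released no matter which exception path `_process()` takes.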

Triggering pattern

Orchestration tool opens a console, runs a few show commands, closes abruptly (or the parent task is cancelled). If _process() was awaiting on network_read / reader_read at the moment of cancellation, the CancelledError propagates through run() and bypasses the ConnectionError-only except clause. The proxy continues to accept() new connections (listen socket is fine) but never forwards data because _reader_process never got reset.
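A hedged client-side reproduction of that pattern (hypothetical harness, not the GNS3 proxy itself): a local stand-in server that accepts and stays silent makes `read()` block, and the parent cancels the session task mid-read.

```python
import asyncio

result = {}

async def console_session(host, port):
    reader, writer = await asyncio.open_connection(host, port)
    try:
        writer.write(b"show version\n")
        await writer.drain()
        await reader.read(4096)  # parent cancels us while blocked here
    finally:
        writer.close()  # runs even when we are cancelled mid-read

async def main():
    # Silent stand-in server so the client's read() never completes.
    async def handler(reader, writer):
        await asyncio.sleep(3600)

    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    task = asyncio.create_task(console_session("127.0.0.1", port))
    await asyncio.sleep(0.1)
    task.cancel()  # abrupt close: CancelledError propagates mid-read
    try:
        await task
    except asyncio.CancelledError:
        result["cancelled_mid_read"] = True

    server.close()  # leftover handler task is cancelled at loop shutdown

asyncio.run(main())
print("cancelled mid-read:", result.get("cancelled_mid_read", False))
```

In the unpatched server, a `run()` exit taken this way would leave `_reader_process` set while the listen socket keeps accepting new clients.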

Validation

Validated against gns3/gns3-server:latest (2.2.56.1) running a 10-scenario sequential regression batch against two GNS3 projects (7-node enterprise + 5-node SP-MPLS topology), with snapshot-restore between scenarios.

  • Before this patch: reliably hung on the 4th sp_v1 scenario and required node-level stop/start or full project restart to recover. All prior batches had to be split into individual-scenario retries.
  • After this patch: first-ever clean 10-scenario sequential run, 66 min wall clock, zero hangs. Then the same batch runs successfully across container restarts (patch re-applied via entrypoint hook so recreates still get it).

Test harness that triggered the original hang: scripted Python telnet client doing enable + show + disconnect cycles with asyncio.CancelledError propagation, repeated across multiple console ports in quick succession.

Notes

  • No test changes because the existing telnet_server test surface is small and this is a cleanup-path-ordering fix that's hard to express as a focused unit test without heavy asyncio mocking. Happy to add one if you'd prefer.
  • The 3.0 branch uses telnetlib3 (PR Enhancement: Re-write current Telnet server implementation using telnetlib3 library #2645) so this specific code path doesn't apply there.
  • Replacing `del self._connections[network_writer]` with `self._connections.pop(network_writer, None)` is a defensive tweak motivated by the broadcast loop in `_process()`, which can remove entries asynchronously.
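A toy illustration (hypothetical keys) of the race that last tweak defends against: when two cleanup paths may remove the same connection entry, `pop(..., None)` makes the second removal a no-op while a second `del` raises.

```python
connections = {"writer-1": "conn-state"}

connections.pop("writer-1", None)  # e.g. the broadcast loop's timeout handler wins
connections.pop("writer-1", None)  # run()'s finally block: harmless no-op

connections["writer-2"] = "conn-state"
del connections["writer-2"]
try:
    del connections["writer-2"]    # a second del raises instead
except KeyError:
    print("del raised KeyError on the already-removed entry")
```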

…3#2344)

The run() cleanup block was guarded by `except (ConnectionError, OSError):`,
so exits via asyncio.CancelledError or any other exception type skipped
cleanup. Result: `_reader_process` stays pinned to the dead reader and
`_get_reader()` returns None for every subsequent client — the silent-proxy
symptom described in GNS3#2344.

Convert the except block to try/finally so cleanup always runs, regardless
of how `_process()` exits. Also:
- catch asyncio.CancelledError + generic Exception (with log.exception) so
  unexpected failures don't swallow the cleanup
- reset `_current_read = None` after cancellation
- use `dict.pop(..., None)` instead of `del` to avoid KeyError races if
  the broadcast loop's timeout handler already removed the entry

Triggering pattern observed in practice: a diagnostic tool opens a console,
sends a few commands, and closes abruptly (e.g. from a test harness or
orchestration script that cancels its Task). If the `_process()` task was
awaiting on one of the `network_read` / `reader_read` futures at the time
of cancellation, the CancelledError propagates up through `run()` and
bypasses the ConnectionError-only except clause. The proxy accepts future
connections (the listen socket is still alive) but never forwards any data
because `_reader_process` never got reset.

Validated against gns3/gns3-server:latest (2.2.56.1) running a 10-scenario
sequential regression batch that previously hung reliably on the 4th
sp_v1 / L3VPN scenario and now completes cleanly across all 10.
@grossmj grossmj moved this from Triage to In Progress in GNS3 tracking Apr 16, 2026