
Fix telnet console silent-proxy hang on non-ConnectionError exit (#2344)#2669

Open
arl1984 wants to merge 1 commit into GNS3:2.2 from arl1984:fix/telnet-proxy-reader-cleanup-2344

Conversation


@arl1984 arl1984 commented Apr 16, 2026

Summary

Fixes #2344 — "Telnet console death issue" — in the 2.2 branch.

`AsyncioTelnetServer.run()`'s cleanup was guarded by `except (ConnectionError, OSError):`, so any exit via `asyncio.CancelledError` or another exception type skipped the cleanup and left `_reader_process` pinned to a dead reader. Every subsequent client then got `None` back from `_get_reader()`, so no data flowed from the node: the exact silent-proxy symptom the issue describes.
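A minimal, self-contained sketch of the failure mode (hypothetical names; `process()` stands in for `_process()` awaiting `network_read` / `reader_read`): cleanup that lives only in an `except (ConnectionError, OSError)` clause never runs when the task is cancelled.

```python
import asyncio

cleanup_ran = False

async def process():
    # Stand-in for _process() blocked awaiting network_read / reader_read.
    await asyncio.sleep(10)

async def run():
    global cleanup_ran
    try:
        await process()
    except (ConnectionError, OSError):
        cleanup_ran = True  # cleanup lives only here; CancelledError bypasses it

async def main():
    task = asyncio.create_task(run())
    await asyncio.sleep(0.05)
    task.cancel()  # simulate the parent task being cancelled
    try:
        await task
    except asyncio.CancelledError:
        pass
    print("cleanup ran:", cleanup_ran)  # False: the cleanup was skipped

asyncio.run(main())
```

The `CancelledError` raised inside `process()` propagates straight through `run()` because it matches neither `ConnectionError` nor `OSError`, leaving the cleanup flag untouched.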

The existing fixes on 2.2 (763ef241 wait_for on drain; ffcfa4cc catch OSError; #2347) each addressed a different failure mode of the same symptom, but didn't cover the CancelledError path. That's the specific case this PR handles.

Change

Convert the except block to try/finally so cleanup always runs regardless of how _process() exits:

  • Catch `(ConnectionError, OSError, asyncio.CancelledError)` plus a generic `Exception` (with `log.exception`) so unexpected failures are logged rather than silently swallowed, while cleanup itself runs in the `finally` block.
  • Reset _current_read = None after cancellation so _get_reader() doesn't hand a dead future to the next reader-process owner.
  • Use dict.pop(..., None) instead of del to avoid KeyError races if the broadcast loop's timeout handler already removed the entry.
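The shape of the change, sketched as a simplified stand-in for `AsyncioTelnetServer` (hypothetical class and `_process()` body, not the actual patch; attribute names follow the PR description):

```python
import asyncio
import logging

log = logging.getLogger(__name__)

class TelnetProxy:
    """Simplified illustration of the try/finally cleanup ordering."""

    def __init__(self):
        self._connections = {}
        self._reader_process = None
        self._current_read = None

    async def _process(self, network_reader, network_writer):
        await asyncio.sleep(3600)  # stand-in for the real forwarding loop

    async def run(self, network_reader, network_writer):
        self._reader_process = network_reader
        self._connections[network_writer] = object()
        try:
            await self._process(network_reader, network_writer)
        except (ConnectionError, OSError, asyncio.CancelledError):
            pass  # expected exit paths; cleanup happens in finally
        except Exception:
            log.exception("Unexpected error in telnet proxy")
        finally:
            # Always release the reader so the next client gets data.
            if self._reader_process is network_reader:
                self._reader_process = None
            if self._current_read is not None:
                self._current_read.cancel()
                self._current_read = None
            # pop() instead of del: the broadcast loop's timeout handler
            # may already have removed this entry.
            self._connections.pop(network_writer, None)

proxy = TelnetProxy()

async def demo():
    reader, writer = object(), object()
    task = asyncio.create_task(proxy.run(reader, writer))
    await asyncio.sleep(0.05)
    task.cancel()  # abrupt client disconnect / parent-task cancellation
    try:
        await task
    except asyncio.CancelledError:
        pass

asyncio.run(demo())
print("reader released:", proxy._reader_process is None)
```

With the cleanup in `finally`, the reader slot and connection entry are released no matter which exception path `_process()` takes.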

Triggering pattern

Orchestration tool opens a console, runs a few show commands, closes abruptly (or the parent task is cancelled). If _process() was awaiting on network_read / reader_read at the moment of cancellation, the CancelledError propagates through run() and bypasses the ConnectionError-only except clause. The proxy continues to accept() new connections (listen socket is fine) but never forwards data because _reader_process never got reset.
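A hedged client-side reproduction of that pattern (hypothetical harness, not the GNS3 proxy itself): a local stand-in server that accepts and stays silent makes `read()` block, and the parent cancels the session task mid-read.

```python
import asyncio

result = {}

async def console_session(host, port):
    reader, writer = await asyncio.open_connection(host, port)
    try:
        writer.write(b"show version\n")
        await writer.drain()
        await reader.read(4096)  # parent cancels us while blocked here
    finally:
        writer.close()  # runs even when we are cancelled mid-read

async def main():
    # Silent stand-in server so the client's read() never completes.
    async def handler(reader, writer):
        await asyncio.sleep(3600)

    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]

    task = asyncio.create_task(console_session("127.0.0.1", port))
    await asyncio.sleep(0.1)
    task.cancel()  # abrupt close: CancelledError propagates mid-read
    try:
        await task
    except asyncio.CancelledError:
        result["cancelled_mid_read"] = True

    server.close()  # leftover handler task is cancelled at loop shutdown

asyncio.run(main())
print("cancelled mid-read:", result.get("cancelled_mid_read", False))
```

In the unpatched server, a `run()` exit taken this way would leave `_reader_process` set while the listen socket keeps accepting new clients.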

Validation

Validated against gns3/gns3-server:latest (2.2.56.1) running a 10-scenario sequential regression batch against two GNS3 projects (7-node enterprise + 5-node SP-MPLS topology), with snapshot-restore between scenarios.

  • Before this patch: reliably hung on the 4th sp_v1 scenario and required node-level stop/start or full project restart to recover. All prior batches had to be split into individual-scenario retries.
  • After this patch: first-ever clean 10-scenario sequential run, 66 min wall clock, zero hangs. Then the same batch runs successfully across container restarts (patch re-applied via entrypoint hook so recreates still get it).

Test harness that triggered the original hang: scripted Python telnet client doing enable + show + disconnect cycles with asyncio.CancelledError propagation, repeated across multiple console ports in quick succession.

Notes

  • No test changes because the existing telnet_server test surface is small and this is a cleanup-path-ordering fix that's hard to express as a focused unit test without heavy asyncio mocking. Happy to add one if you'd prefer.
  • The 3.0 branch uses telnetlib3 (PR Enhancement: Re-write current Telnet server implementation using telnetlib3 library #2645) so this specific code path doesn't apply there.
  • Replacing `del self._connections[network_writer]` with `self._connections.pop(network_writer, None)` is a defensive tweak motivated by the broadcast loop in `_process()`, which can remove entries asynchronously.
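A toy illustration (hypothetical keys) of the race that last tweak defends against: when two cleanup paths may remove the same connection entry, `pop(..., None)` makes the second removal a no-op while a second `del` raises.

```python
connections = {"writer-1": "conn-state"}

connections.pop("writer-1", None)  # e.g. the broadcast loop's timeout handler wins
connections.pop("writer-1", None)  # run()'s finally block: harmless no-op

connections["writer-2"] = "conn-state"
del connections["writer-2"]
try:
    del connections["writer-2"]    # a second del raises instead
except KeyError:
    print("del raised KeyError on the already-removed entry")
```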

…3#2344)

The run() cleanup block was guarded by `except (ConnectionError, OSError):`,
so exits via asyncio.CancelledError or any other exception type skipped
cleanup. Result: `_reader_process` stays pinned to the dead reader and
`_get_reader()` returns None for every subsequent client — the silent-proxy
symptom described in GNS3#2344.

Convert the except block to try/finally so cleanup always runs, regardless
of how `_process()` exits. Also:
- catch asyncio.CancelledError + generic Exception (with log.exception) so
  unexpected failures don't swallow the cleanup
- reset `_current_read = None` after cancellation
- use `dict.pop(..., None)` instead of `del` to avoid KeyError races if
  the broadcast loop's timeout handler already removed the entry

Triggering pattern observed in practice: a diagnostic tool opens a console,
sends a few commands, and closes abruptly (e.g. from a test harness or
orchestration script that cancels its Task). If the `_process()` task was
awaiting on one of the `network_read` / `reader_read` futures at the time
of cancellation, the CancelledError propagates up through `run()` and
bypasses the ConnectionError-only except clause. The proxy accepts future
connections (the listen socket is still alive) but never forwards any data
because `_reader_process` never got reset.

Validated against gns3/gns3-server:latest (2.2.56.1) running a 10-scenario
sequential regression batch that previously hung reliably on the 4th
sp_v1 / L3VPN scenario and now completes cleanly across all 10.
@grossmj grossmj moved this from Triage to In Progress in GNS3 tracking Apr 16, 2026