Fix telnet console silent-proxy hang on non-ConnectionError exit (#2344)#2669
Open
Fix telnet console silent-proxy hang on non-ConnectionError exit (#2344)#2669
Conversation
…3#2344) The run() cleanup block was guarded by `except (ConnectionError, OSError):`, so exits via asyncio.CancelledError or any other exception type skipped cleanup. Result: `_reader_process` stays pinned to the dead reader and `_get_reader()` returns None for every subsequent client — the silent-proxy symptom described in GNS3#2344. Convert the except block to try/finally so cleanup always runs, regardless of how `_process()` exits. Also: - catch asyncio.CancelledError + generic Exception (with log.exception) so unexpected failures don't swallow the cleanup - reset `_current_read = None` after cancellation - use `dict.pop(..., None)` instead of `del` to avoid KeyError races if the broadcast loop's timeout handler already removed the entry Triggering pattern observed in practice: a diagnostic tool opens a console, sends a few commands, and closes abruptly (e.g. from a test harness or orchestration script that cancels its Task). If the `_process()` task was awaiting on one of the `network_read` / `reader_read` futures at the time of cancellation, the CancelledError propagates up through `run()` and bypasses the ConnectionError-only except clause. The proxy accepts future connections (the listen socket is still alive) but never forwards any data because `_reader_process` never got reset. Validated against gns3/gns3-server:latest (2.2.56.1) running a 10-scenario sequential regression batch that previously hung reliably on the 4th sp_v1 / L3VPN scenario and now completes cleanly across all 10.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #2344 — "Telnet console death issue" — in the 2.2 branch.
AsyncioTelnetServer.run()cleanup was guarded byexcept (ConnectionError, OSError):, so exits viaasyncio.CancelledErroror any other exception type skipped the cleanup and left_reader_processpinned to a dead reader. Every subsequent client got_get_reader() = None→ no data flow from the node → the exact silent-proxy symptom the issue describes.The existing fixes on 2.2 (
763ef241wait_for on drain;ffcfa4cccatch OSError;#2347) each addressed a different failure mode of the same symptom, but didn't cover theCancelledErrorpath. That's the specific case this PR handles.Change
Convert the
exceptblock totry/finallyso cleanup always runs regardless of how_process()exits:(ConnectionError, OSError, asyncio.CancelledError)and genericException(withlog.exception) so unexpected failures don't swallow cleanup._current_read = Noneafter cancellation so_get_reader()doesn't hand a dead future to the next reader-process owner.dict.pop(..., None)instead ofdelto avoidKeyErrorraces if the broadcast loop's timeout handler already removed the entry.Triggering pattern
Orchestration tool opens a console, runs a few
showcommands, closes abruptly (or the parent task is cancelled). If_process()was awaiting onnetwork_read/reader_readat the moment of cancellation, theCancelledErrorpropagates throughrun()and bypasses the ConnectionError-only except clause. The proxy continues toaccept()new connections (listen socket is fine) but never forwards data because_reader_processnever got reset.Validation
Validated against
gns3/gns3-server:latest(2.2.56.1) running a 10-scenario sequential regression batch against two GNS3 projects (7-node enterprise + 5-node SP-MPLS topology), with snapshot-restore between scenarios.Test harness that triggered the original hang: scripted Python telnet client doing enable + show + disconnect cycles with
asyncio.CancelledErrorpropagation, repeated across multiple console ports in quick succession.Notes
del self._connections[network_writer]→self._connections.pop(network_writer, None)is a defensive tweak motivated by the broadcast loop in_process()which can remove entries asynchronously.