fix: make AgentOS SSE streaming robust to serialization failures#7564
fix: make AgentOS SSE streaming robust to serialization failures#7564
Conversation
format_sse_event had an `except json.JSONDecodeError` clause that never triggered (to_json emits JSON, it does not parse it), so any real serialization error (TypeError/ValueError from a non-JSON-serializable field on an event) propagated out of the streaming generator. Starlette then closed the socket without a terminating chunk, producing "ASGI callable returned without complete response" server-side and partial transfers client-side. Catch Exception, log the offending event class so the culprit shows up in production logs, and emit a valid SSE RunError frame so the response completes cleanly.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1aff46909a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "event": "RunError", | ||
| "content": f"Failed to serialize {type(event).__name__}: {e}", |
There was a problem hiding this comment.
Preserve error event family in SSE fallback payload
When serialization fails, the fallback always emits "event": "RunError" even for team/workflow streams, which changes the event contract from TeamRunError/WorkflowError to an agent-style error. This is observable in downstream parsers and handlers that branch on workflow/team error event names, so a workflow/team serialization failure can be misclassified and skip the normal terminal-error handling for that stream type. The fallback should derive the error event name from the original stream family instead of hardcoding RunError.
Useful? React with 👍 / 👎.
Codex review flagged that the serialization-failure fallback hardcoded event='RunError', which is the agent terminal-error name. Team and workflow streamers share this helper, so a serialization failure on those streams would be misclassified as an agent error and skip the terminal-error handling that downstream parsers branch on (TeamRunError, WorkflowError). Derive the fallback event name from the event's base class so each stream family surfaces under its own contract.
Summary
A user reported
ASGI callable returned without complete responseserver-side, andtransfer failed / partial filein Postman, when hitting AgentOS streaming endpoints in production. Works fine withstream=False.Root cause is in
libs/agno/agno/os/utils.py::format_sse_event:to_json()emits JSON, it does not parse it, so it never raisesjson.JSONDecodeError. A real failure (TypeError/ValueErrorfrom a non-JSON-serializable field inside a run event - e.g. bytes, numpy arrays, custom objects embedded in a tool result or content event) propagates out of the streaming generator. Starlette then closes the socket without a terminating chunk, which the ASGI server logs asASGI callable returned without complete responseand Postman surfaces as a partial transfer.This PR:
Exceptioninstead of the never-firingJSONDecodeError.log_errorso the culprit shows up in the user's production logs on the next failure.RunErrorframe so the streaming response always completes cleanly.event_typefallback withgetattr(..., "event", None) or "message".Tests added in
libs/agno/tests/unit/os/test_utils.py:to_jsonraises must produce a validRunErrorSSE frame rather than propagating the exception.eventattribute fall back to"message".Type of change
Checklist
./scripts/format.shand./scripts/validate.sh)Duplicate and AI-Generated PR Check
Additional Notes
The bug has existed since
6078ca9a4(release 2.3.9, Dec 9 2025), so it is not a regression from a specific recent branch - it only surfaces when a run event happens to contain a non-JSON-serializable field, which is data-dependent.Follow-up for whoever picks up the user's report: once this deploys, their AgentOS logs will show
Failed to serialize SSE event <ClassName> (event_type=...): <error>on the next failing run, pinpointing the exact event class and field causing the serialization error.