fix: make AgentOS SSE streaming robust to serialization failures #7564

Open
ysolanky wants to merge 3 commits into main from fix/agentos-sse-streaming-robustness
Conversation

@ysolanky
Member

Summary

A user reported "ASGI callable returned without complete response" server-side, and "transfer failed / partial file" in Postman, when hitting AgentOS streaming endpoints in production. The same requests work fine with stream=False.

Root cause is in libs/agno/agno/os/utils.py::format_sse_event:

try:
    clean_json = event.to_json(separators=(",", ":"), indent=None)
    return f"event: {event_type}\ndata: {clean_json}\n\n"
except json.JSONDecodeError:
    clean_json = event.to_json(separators=(",", ":"), indent=None)  # same call, will raise again
    return f"event: message\ndata: {clean_json}\n\n"

to_json() emits JSON; it does not parse it, so it never raises json.JSONDecodeError. A real failure (a TypeError or ValueError from a non-JSON-serializable field inside a run event, e.g. bytes, numpy arrays, or custom objects embedded in a tool result or content event) propagates out of the streaming generator. Starlette then closes the socket without a terminating chunk, which the ASGI server logs as "ASGI callable returned without complete response" and Postman surfaces as a partial transfer.
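
The distinction between encoding and decoding errors can be checked with the standard library alone. This is a minimal sketch, assuming to_json ultimately delegates to json.dumps (an assumption; the PR does not show its implementation). The payload is made up for illustration.

```python
import json

# Illustrative payload: a bytes value is not JSON-serializable.
payload = {"content": b"raw bytes from a tool result"}

try:
    json.dumps(payload, separators=(",", ":"))
    raised = None
except json.JSONDecodeError as exc:
    # Only decoding (json.loads) raises this; encoding never reaches here.
    raised = exc
except TypeError as exc:
    # This is what json.dumps raises for unsupported types.
    raised = exc

print(type(raised).__name__)  # TypeError
```

Because json.JSONDecodeError subclasses ValueError rather than TypeError, the original except clause cannot catch this failure.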

This PR:

  • Catches Exception instead of the never-firing JSONDecodeError.
  • Logs the offending event class and event type via log_error so the culprit shows up in the user's production logs on the next failure.
  • Emits a valid SSE RunError frame so the streaming response always completes cleanly.
  • Hardens the event_type fallback with getattr(..., "event", None) or "message".
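
The four changes above can be sketched roughly as follows. This is not the merged code: format_sse_event_sketch and BrokenEvent are hypothetical names, the standard logging module stands in for agno's log_error, and the exact fallback payload shape is assumed from the two lines quoted in the review thread.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("sse_sketch")  # stand-in for agno's log_error


def format_sse_event_sketch(event: Any) -> str:
    """Serialize an event into one SSE frame without raising on bad payloads."""
    # Events with no `event` attribute (or an empty one) fall back to "message".
    event_type = getattr(event, "event", None) or "message"
    try:
        clean_json = event.to_json(separators=(",", ":"), indent=None)
        return f"event: {event_type}\ndata: {clean_json}\n\n"
    except Exception as e:
        # Log the event class and type so the culprit is visible in server logs.
        logger.error(
            "Failed to serialize SSE event %s (event_type=%s): %s",
            type(event).__name__, event_type, e,
        )
        # Assumed fallback shape: an error payload that is always serializable.
        fallback = json.dumps(
            {
                "event": "RunError",
                "content": f"Failed to serialize {type(event).__name__}: {e}",
            },
            separators=(",", ":"),
        )
        return f"event: RunError\ndata: {fallback}\n\n"


class BrokenEvent:
    """Hypothetical event whose payload contains a non-serializable field."""

    event = "ExampleEvent"

    def to_json(self, **kwargs: Any) -> str:
        return json.dumps({"result": b"bytes"}, **kwargs)


frame = format_sse_event_sketch(BrokenEvent())
```

Since the fallback payload contains only strings, json.dumps cannot fail on it, so the generator always yields a well-formed frame.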

Tests added in libs/agno/tests/unit/os/test_utils.py:

  • Happy-path frame format.
  • Regression guard: an event whose to_json raises must produce a valid RunError SSE frame rather than propagating the exception.
  • Events missing an event attribute fall back to "message".
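
What "valid SSE frame" means in these tests can be made concrete with a small parser. This is a sketch only: parse_sse_frame is a hypothetical helper, not part of the test file, and it handles just the single-line event/data frames described in this PR.

```python
import json
from typing import Any, Dict, Tuple


def parse_sse_frame(frame: str) -> Tuple[str, Dict[str, Any]]:
    """Parse one event/data frame; raise ValueError if it is malformed."""
    if not frame.endswith("\n\n"):
        raise ValueError("frame must be terminated by a blank line")
    fields: Dict[str, str] = {}
    for line in frame[:-2].split("\n"):
        # Split at the first ": " only, so colons inside the JSON are kept.
        name, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        fields[name] = value
    if "data" not in fields:
        raise ValueError("frame has no data field")
    # json.loads raises json.JSONDecodeError (a ValueError) on invalid JSON.
    return fields.get("event", "message"), json.loads(fields["data"])


# Example frame in the shape the regression test expects.
example = 'event: RunError\ndata: {"event":"RunError","content":"Failed to serialize X: boom"}\n\n'
event_type, payload = parse_sse_frame(example)
```

A regression test could then assert that the frame produced for a failing event parses without error and carries the expected error event name.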

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Improvement
  • Model update
  • Other:

Checklist

  • Code complies with style guidelines
  • Ran format/validation scripts (./scripts/format.sh and ./scripts/validate.sh)
  • Self-review completed
  • Documentation updated (comments, docstrings)
  • Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
  • Tested in clean environment
  • Tests added/updated (if applicable)

Duplicate and AI-Generated PR Check

  • I have searched existing open pull requests and confirmed that no other PR already addresses this issue
  • If a similar PR exists, I have explained below why this PR is a better approach
  • Check if this PR was entirely AI-generated (by Copilot, Claude Code, Cursor, etc.)

Additional Notes

The bug has existed since 6078ca9a4 (release 2.3.9, Dec 9 2025), so it is not a regression from a specific recent branch - it only surfaces when a run event happens to contain a non-JSON-serializable field, which is data-dependent.

Follow-up for whoever picks up the user's report: once this deploys, their AgentOS logs will show "Failed to serialize SSE event <ClassName> (event_type=...): <error>" on the next failing run, pinpointing the exact event class and field causing the serialization error.

format_sse_event had an `except json.JSONDecodeError` clause that never
triggered (to_json emits JSON, it does not parse it), so any real
serialization error (TypeError/ValueError from a non-JSON-serializable
field on an event) propagated out of the streaming generator. Starlette
then closed the socket without a terminating chunk, producing "ASGI
callable returned without complete response" server-side and partial
transfers client-side.

Catch Exception, log the offending event class so the culprit shows up
in production logs, and emit a valid SSE RunError frame so the response
completes cleanly.
@ysolanky ysolanky requested a review from a team as a code owner April 17, 2026 14:49

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1aff46909a

Comment thread libs/agno/agno/os/utils.py Outdated
Comment on lines +201 to +202
"event": "RunError",
"content": f"Failed to serialize {type(event).__name__}: {e}",
[P2] Preserve error event family in SSE fallback payload

When serialization fails, the fallback always emits "event": "RunError" even for team/workflow streams, which changes the event contract from TeamRunError/WorkflowError to an agent-style error. This is observable in downstream parsers and handlers that branch on workflow/team error event names, so a workflow/team serialization failure can be misclassified and skip the normal terminal-error handling for that stream type. The fallback should derive the error event name from the original stream family instead of hardcoding RunError.

ysolanky and others added 2 commits April 20, 2026 16:28
Codex review flagged that the serialization-failure fallback hardcoded
event='RunError', which is the agent terminal-error name. Team and
workflow streamers share this helper, so a serialization failure on
those streams would be misclassified as an agent error and skip the
terminal-error handling that downstream parsers branch on (TeamRunError,
WorkflowError).

Derive the fallback event name from the event's base class so each
stream family surfaces under its own contract.
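
One way to derive the fallback name from the event's base class is sketched below. The three base classes are hypothetical stand-ins (the commit does not name the real ones); only the event names RunError, TeamRunError, and WorkflowError come from the review thread.

```python
from typing import Any, Tuple, Type


# Hypothetical stand-ins for the per-family event base classes.
class AgentEventBase: ...
class TeamEventBase: ...
class WorkflowEventBase: ...


# Ordered (base class, terminal error event name) pairs.
_FALLBACK_ERROR_NAMES: Tuple[Tuple[Type[Any], str], ...] = (
    (TeamEventBase, "TeamRunError"),
    (WorkflowEventBase, "WorkflowError"),
    (AgentEventBase, "RunError"),
)


def fallback_error_event_name(event: Any) -> str:
    """Pick the error event name matching the stream family of `event`."""
    for base, name in _FALLBACK_ERROR_NAMES:
        if isinstance(event, base):
            return name
    # Unknown event types keep the agent-style name as a default.
    return "RunError"


class TeamToolCallEvent(TeamEventBase):
    """Hypothetical team event used only for this example."""


name = fallback_error_event_name(TeamToolCallEvent())
```

Keeping the mapping in one table means a new stream family only needs one extra entry, and downstream parsers that branch on TeamRunError or WorkflowError still see the name they expect.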