Skip to content

Guard report_shape_db_stats against :noproc during deployments#4142

Open
alco wants to merge 1 commit intomainfrom
fix/report-shape-db-stats-noproc
Open

Guard report_shape_db_stats against :noproc during deployments#4142
alco wants to merge 1 commit intomainfrom
fix/report-shape-db-stats-noproc

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented Apr 21, 2026

Summary

Electric.StackSupervisor.Telemetry.report_shape_db_stats/2 calls ShapeDb.statistics/1, which resolves the GenServer pid and then issues a GenServer.call. During stack restarts the underlying server can exit between those two steps, producing an uncaught :exit, :noproc from the telemetry poller and flooding Sentry with errors on every deployment.

The sibling function report_retained_wal_size already handles this race. This change mirrors its try/catch :exit, {:noproc, _} pattern. Surgical, no behavioural change on the happy path.

Refs electric-sql/stratovolt#1452.

Test plan

  • mix compile --warnings-as-errors is clean
  • mix test test/electric/stack_supervisor_test.exs passes

The telemetry poller calls ShapeDb.statistics/1, which resolves the
GenServer pid and then issues a GenServer.call. During stack restarts
the Statistics server can exit between those two steps, producing an
uncaught :noproc exit from the poller.

Mirror the existing try/catch pattern in report_retained_wal_size so
these transient races no longer surface as Sentry errors.

Refs electric-sql/stratovolt#1452
@alco alco added the claude label Apr 21, 2026
@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude Code Review

Summary

This PR guards report_shape_db_stats/2 against a TOCTOU race where the ShapeDb.Statistics GenServer can exit between GenServer.whereis/1 and GenServer.call/2, causing uncaught :exit, :noproc exceptions that flood Sentry during deployments. The fix mirrors the existing pattern in the sibling report_retained_wal_size/3. The change is correct and surgical.

What's Working Well

  • The root cause diagnosis is accurate: Statistics.current/1 resolves the pid then calls it, creating a window for :noproc (packages/sync-service/lib/electric/shape_cache/shape_status/shape_db/statistics.ex:19-27).
  • The fix correctly mirrors report_retained_wal_size/3, using the same try/catch :exit, {:noproc, _} pattern (telemetry.ex:106-108).
  • report_write_buffer_size doesn't need the same treatment — pending_operations_count uses a direct ETS lookup with its own rescue ArgumentError (write_buffer.ex:157-161), not a GenServer call.
  • Changeset file is included and correctly scoped to @core/sync-service.

Issues Found

Suggestions (Nice to Have)

Inconsistent error containment vs. report_retained_wal_size

File: packages/sync-service/lib/electric/stack_supervisor/telemetry.ex:153-156

report_retained_wal_size/3 catches both :exit, {:noproc, _} and a broad type, reason -> fallback that logs a warning and returns :ok. report_shape_db_stats/2 only catches the :noproc case. A GenServer call timeout (:exit, {:timeout, {GenServer, :call, _}}) or an unexpected raise from OpenTelemetry.execute would still propagate out of the try block uncaught.

For a telemetry poller callback, silently swallowing (or logging) all errors is typically the right call — a reporting failure should never crash the poller. The existing pre-PR code also had this gap, so the PR doesn't regress anything, but it's worth making the two functions consistent:

catch
  :exit, {:noproc, _} ->
    :ok

  type, reason ->
    Logger.warning(
      "Failed to query shape DB stats\nError: #{Exception.format(type, reason)}",
      stack_id: stack_id
    )
end

Missing @spec annotation

report_retained_wal_size/3 has @spec report_retained_wal_size(Electric.stack_id(), binary(), map()) :: :ok. report_shape_db_stats/2 has none. Minor, but worth adding for consistency:

@spec report_shape_db_stats(Electric.stack_id(), map()) :: :ok

Issue Conformance

The PR references electric-sql/stratovolt#1452 (a private repo), so the linked issue isn't publicly visible. Based on the PR description, the fix is clearly scoped: match the :noproc guard pattern that already exists in report_retained_wal_size. No scope creep or missing pieces visible from the public side.


Review iteration: 1 | 2026-04-21

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.24%. Comparing base (365dd17) to head (a6819d5).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4142      +/-   ##
==========================================
+ Coverage   89.20%   89.24%   +0.03%     
==========================================
  Files          25       25              
  Lines        2520     2520              
  Branches      636      635       -1     
==========================================
+ Hits         2248     2249       +1     
+ Misses        270      269       -1     
  Partials        2        2              
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 94.36% <ø> (+0.05%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 89.24% <ø> (+0.03%) ⬆️
unit-tests 89.24% <ø> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants