Skip to content

metrics: export grpc stats metrics for batch client#1934

Open
zyguan wants to merge 3 commits intotikv:masterfrom
zyguan:dev/grpc-stats
Open

metrics: export grpc stats metrics for batch client#1934
zyguan wants to merge 3 commits intotikv:masterfrom
zyguan:dev/grpc-stats

Conversation

@zyguan
Copy link
Copy Markdown
Contributor

@zyguan zyguan commented Mar 31, 2026

This PR implements gRPC statistics monitoring for the BatchCommands streaming RPC used in TiKV's batch client. It adds a new batchClientStatsMonitor type that implements the stats.Handler interface from gRPC.

The monitor tracks per-stream metrics (send/receive timestamps) and per-target metrics (wire bytes sent/received, delayed pick count, transparent retry count). When a new stream is created, it assigns a unique global ID to track that stream's lifecycle. Metrics are emitted to Prometheus gauges and counters, and properly cleaned up when streams end.

This enables better observability for BatchCommands streams, allowing operators to monitor send/receive timing, traffic volume, and retry behavior.

Summary by CodeRabbit

  • New Features

    • Enhanced gRPC BatchCommands observability: added per-stream and per-target metrics for last send/receive times, send/receive wire bytes, delayed picks, and transparent retry attempts. Per-stream time labels are removed when a stream ends, and monitoring is enabled for batch connections.
  • Tests

    • Added tests validating distinct stream identifiers, lifecycle handling, metric updates for RPC events, and cleanup of per-stream labels.

Signed-off-by: zyguan <zhongyangguan@gmail.com>
@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 31, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c995e73d-326e-4929-9453-3a18f58902b1

📥 Commits

Reviewing files that changed from the base of the PR and between a10f135 and cd72df1.

📒 Files selected for processing (3)
  • internal/client/conn_monitor.go
  • internal/client/conn_monitor_test.go
  • internal/client/conn_pool.go
✅ Files skipped from review due to trivial changes (1)
  • internal/client/conn_pool.go

📝 Walkthrough

Walkthrough

Adds a gRPC client-side stats.Handler for the TiKV BatchCommands RPC that assigns per-RPC stream IDs, updates per-target and per-stream Prometheus metrics (send/recv times, wire bytes, delayed picks, transparent-retries), registers new metrics, and integrates the handler into batch-enabled connection dialing with tests.

Changes

Cohort / File(s) Summary
gRPC stats handler
internal/client/conn_monitor.go
New batchClientStatsMonitor implementing grpc/stats.Handler: generates a global stream ID in TagRPC, stores it in context, updates metrics on Begin/OutPayload/InPayload/DelayedPickComplete, and removes per-stream gauge entries on End. Uses sync.Map and sync.Once for per-stream metric handles.
Handler tests
internal/client/conn_monitor_test.go
New tests and helpers to reset/read Prometheus metrics; verifies unique stream ID assignment, metric updates for delayed pick, transparent retry, payload send/recv (including timestamp conversion), and cleanup of per-stream gauges after End.
Connection initialization
internal/client/conn_pool.go
When batch is enabled, create batchClientStatsMonitor and append grpc.WithStatsHandler(batchStatsMonitor) to dial options; changed to use a local opts slice when building dial options.
Prometheus metrics
metrics/metrics.go
Added exported metrics: TiKVBatchCommandsSendTime, TiKVBatchCommandsRecvTime (GaugeVec with labels target, stream) and counters TiKVBatchCommandsSendWireBytes, TiKVBatchCommandsRecvWireBytes, TiKVBatchCommandsDelayedPick, TiKVBatchCommandsTransparentRetry. Added LblStream constant and registered metrics.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant GRPC as "gRPC runtime"
    participant Handler as "batchClientStatsMonitor"
    participant Metrics as "Prometheus"

    Client->>Handler: TagRPC(full method = /tikvpb.Tikv/BatchCommands)
    Handler->>Handler: allocate global stream ID, attach to context
    Handler-->>Client: return context with stream ID

    GRPC->>Handler: DelayedPickComplete
    Handler->>Metrics: increment DelayedPick counter (target)

    GRPC->>Handler: Begin (IsTransparentRetryAttempt=true)
    Handler->>Metrics: increment TransparentRetry counter (target)

    GRPC->>Handler: OutPayload (WireLength, SentTime)
    Handler->>Metrics: add to SendWireBytes (target)
    Handler->>Metrics: set SendTime (target, stream)

    GRPC->>Handler: InPayload (WireLength, RecvTime)
    Handler->>Metrics: add to RecvWireBytes (target)
    Handler->>Metrics: set RecvTime (target, stream)

    GRPC->>Handler: End
    Handler->>Metrics: delete SendTime/RecvTime label entries for (target, stream)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped through streams and gave each one a name,

Counting bytes and moments, I logged them without shame,
Retries and delays, I nibbled and stored,
Gauges and counters now hum and accord,
A rabbit applauds—metrics tidy and tame.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and accurately describes the main change: exporting gRPC stats metrics for the batch client, which is the core purpose of adding the batchClientStatsMonitor and six new Prometheus metrics.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/client/conn_monitor.go`:
- Around line 129-135: The metrics use whole-second precision via
ev.SentTime.Unix()/ev.RecvTime.Unix(), causing precision loss; update
batchClientStreamMetrics.onOutPayload and onInPayload to compute seconds with
sub-second precision by converting UnixNano() to float64 seconds (divide
UnixNano by 1e9 or by float64(time.Second)) and set sendTime/recvTime with that
value; ensure the time package is imported if needed and reference
stats.OutPayload, stats.InPayload, sendTime, recvTime, onOutPayload and
onInPayload when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 10c4c5e0-d3b9-4dea-8cf3-897c8cf24ac3

📥 Commits

Reviewing files that changed from the base of the PR and between 3805cb7 and f4370e5.

📒 Files selected for processing (4)
  • internal/client/conn_monitor.go
  • internal/client/conn_monitor_test.go
  • internal/client/conn_pool.go
  • metrics/metrics.go

Comment thread internal/client/conn_monitor.go
zyguan added 2 commits March 31, 2026 15:35
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Signed-off-by: zyguan <zhongyangguan@gmail.com>
if i.FullMethodName != batchCommandsMethod {
return ctx
}
return context.WithValue(ctx, batchClientStreamKey{}, atomic.AddUint64(&globalBatchClientStreamID, 1))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we associate a conn index with streamID ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can merge #1931 first since the concept of conn index is introduce by #1931.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 8, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 8, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-08 10:33:50.131789317 +0000 UTC m=+952435.337149374: ☑️ agreed by lcwangchao.

@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lcwangchao
Once this PR has been reviewed and has the lgtm label, please assign you06 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zyguan
Copy link
Copy Markdown
Contributor Author

zyguan commented Apr 8, 2026

/hold

@ti-chi-bot ti-chi-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-1-more-lgtm Indicates a PR needs 1 more LGTM. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants