feat(query-backend): hedge GetRange calls to reduce symbol fetch tail latency #4976

Open
simonswine wants to merge 1 commit into grafana:main from simonswine:20260402_query-backend-hedge-getrange

@simonswine commented Apr 2, 2026

Summary

Symbol table fetches (rawTable[T].fetch, stacktraceBlock.fetch) issue GetRange calls against object storage and are the primary tail latency hotspot on the query backend read path. A single slow fetch stalls the entire fetchTx for a block.

This PR adds opt-in speculative hedging: once the delay set by --query-backend.block-read-hedge-after has elapsed, a second parallel GetRange is issued against the same object. Whichever call responds first is decoded; the other is cancelled and its connection closed.

Design decisions:

  • Hedge only GetRange, not the decode: the winning io.ReadCloser is decoded once, so no CPU work is duplicated
  • New Cleanup func(T) field on retry.Hedged closes the losing io.ReadCloser to prevent connection leaks when both calls succeed. The synctest-based test cleanup in #4975 (test(retry): use synctest for deterministic hedging tests) is a prerequisite
  • Disabled by default (hedgeAfter=0); opt-in per deployment

Changes

  • pkg/util/retry/hedged.go — add Cleanup func(T) field; called on a losing attempt's result when it succeeded but the other won
  • pkg/phlaredb/symdb/block_reader.go — add hedgeAfter to Reader; extract getRange() helpers on rawTable[T] and stacktraceBlock that wrap GetRange with retry.Hedged when configured
  • pkg/block/object.go — add hedgeAfter field + WithObjectHedgeAfter option
  • pkg/block/section_symbols.go — thread hedgeAfter into symdb.OpenObject
  • pkg/querybackend/block_reader.go — add WithBlockReaderHedgeAfter option; pass to block.NewObject
  • pkg/querybackend/backend.go — add BlockReadHedgeAfter config field + flag

Depends on

Test plan

  • go test ./pkg/util/retry/... ./pkg/phlaredb/symdb/... ./pkg/querybackend/...
  • Enable with --query-backend.block-read-hedge-after=2s and verify GetRange is issued twice for slow fetches

Note

Medium Risk
Introduces new speculative parallel GetRange calls and changes retry.Hedged cancellation/cleanup semantics, which can affect concurrency behavior and resource usage under load. Disabled by default but could increase object-store traffic when enabled and needs careful tuning.

Overview
Adds opt-in hedged object-store reads to reduce tail latency when fetching symbol tables. A new query-backend.block-read-hedge-after setting is threaded from querybackend.Config into BlockReader/block.Object and down to symdb.OpenObject, enabling a second speculative GetRange after the configured delay.

pkg/phlaredb/symdb now wraps symbol table and stacktrace GetRange reads with retry.Hedged (including closing the losing io.ReadCloser). retry.Hedged itself is updated to use per-attempt contexts (so the winning result remains usable) and adds a Cleanup hook, with new tests covering winner-context behavior and cleanup semantics.

Reviewed by Cursor Bugbot for commit 1a87ad3.

@simonswine force-pushed the 20260402_query-backend-hedge-getrange branch 2 times, most recently from 6c62bde to a432037, on April 2, 2026 15:16
@simonswine marked this pull request as ready for review on April 3, 2026 08:55

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

Comment thread pkg/util/retry/hedged.go
Symbol table fetches (locations, mappings, functions, strings,
stacktraces) call GetRange against object storage and are a primary
source of tail latency on the read path. This adds opt-in speculative
hedging: after a configurable delay, a second parallel GetRange is
issued; whichever response arrives first is used and the other is
cancelled.

The hedge wraps only the GetRange call, not the decode — the winning
response body is decoded once. A new Cleanup field on retry.Hedged
ensures the losing response body is always closed, preventing connection
leaks when both calls succeed.

Config flag (default 0 = disabled):
  --query-backend.block-read-hedge-after=<duration>
@simonswine force-pushed the 20260402_query-backend-hedge-getrange branch from a432037 to 1a87ad3 on April 7, 2026 08:07

@aleks-p left a comment

LGTM
