feat(query-backend): hedge GetRange calls to reduce symbol fetch tail latency #4976
Open
simonswine wants to merge 1 commit into grafana:main
Symbol table fetches (locations, mappings, functions, strings, stacktraces) call GetRange against object storage and are a primary source of tail latency on the read path. This adds opt-in speculative hedging: after a configurable delay, a second parallel GetRange is issued; whichever response arrives first is used and the other is cancelled.

The hedge wraps only the GetRange call, not the decode: the winning response body is decoded once. A new Cleanup field on retry.Hedged ensures the losing response body is always closed, preventing connection leaks when both calls succeed.

Config flag (default 0 = disabled): --query-backend.block-read-hedge-after=<duration>
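The hedging behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the `retry.Hedged` implementation from this PR: `hedgedCall`, `result`, and `demo` are hypothetical names, and the fake range read deliberately ignores cancellation so that both attempts succeed and the cleanup path is exercised.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
	"sync/atomic"
	"time"
)

type result[T any] struct {
	idx int // attempt index: 0 = original, 1 = hedge
	val T
	err error
}

// hedgedCall is a hypothetical helper, not the PR's retry.Hedged API. It
// starts call, issues one more attempt if the first is still running after
// hedgeAfter, and returns the first success. A later success goes to cleanup.
func hedgedCall[T any](ctx context.Context, hedgeAfter time.Duration,
	call func(context.Context) (T, error), cleanup func(T)) (val T, err error) {
	if hedgeAfter <= 0 {
		return call(ctx) // hedging disabled: a single plain call
	}
	results := make(chan result[T], 2) // buffered so attempts never block on send
	var cancels []context.CancelFunc
	start := func() {
		// Per-attempt context: cancelling the loser leaves the winner usable.
		actx, cancel := context.WithCancel(ctx)
		idx := len(cancels)
		cancels = append(cancels, cancel)
		go func() {
			v, err := call(actx)
			results <- result[T]{idx, v, err}
		}()
	}
	start()
	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()
	for finished := 0; ; {
		select {
		case <-timer.C: // fires at most once
			start()
		case r := <-results:
			finished++
			if r.err != nil {
				cancels[r.idx]()
				if finished == len(cancels) {
					return val, r.err // every started attempt failed
				}
				continue
			}
			// Cancel the other attempt, if any. The winner's context stays
			// live; a fuller version would release it when val is closed.
			if other := 1 - r.idx; other < len(cancels) {
				cancels[other]()
			}
			go func(pending int) { // drain so a late success is cleaned up
				for ; pending > 0; pending-- {
					if l := <-results; l.err == nil && cleanup != nil {
						cleanup(l.val)
					}
				}
			}(len(cancels) - finished)
			return r.val, nil
		}
	}
}

// demo runs a fake range read whose first attempt is slow and ignores
// cancellation, so both attempts succeed and cleanup receives the loser.
func demo() (winner, cleaned string) {
	var calls atomic.Int32
	fakeGetRange := func(ctx context.Context) (io.ReadCloser, error) {
		n := calls.Add(1)
		if n == 1 {
			time.Sleep(100 * time.Millisecond)
		}
		return io.NopCloser(strings.NewReader(fmt.Sprintf("attempt-%d", n))), nil
	}
	closed := make(chan string, 1)
	rc, err := hedgedCall(context.Background(), 10*time.Millisecond, fakeGetRange,
		func(loser io.ReadCloser) {
			b, _ := io.ReadAll(loser)
			loser.Close()
			closed <- string(b)
		})
	if err != nil {
		panic(err)
	}
	defer rc.Close()
	b, _ := io.ReadAll(rc)
	return string(b), <-closed
}

func main() { fmt.Println(demo()) }
```

In this sketch the decode step (here just `io.ReadAll`) runs once, on the winning body only, which mirrors the stated design choice of wrapping the fetch rather than the decode. The cleanup callback is the only place the losing body is closed, so omitting it would leave that reader open.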

Summary
Symbol table fetches (`rawTable[T].fetch`, `stacktraceBlock.fetch`) issue `GetRange` calls against object storage and are the primary tail latency hotspot on the query backend read path. A single slow fetch stalls the entire `fetchTx` for a block.

This PR adds opt-in speculative hedging: after `--query-backend.block-read-hedge-after`, a second parallel `GetRange` is issued against the same object. Whichever responds first is decoded; the other is cancelled and its connection closed.

Design decisions:

- Hedge wraps only `GetRange`, not the decode: the winning RC is decoded once (no duplicate CPU work)
- New `Cleanup func(T)` field on `retry.Hedged` closes the losing `io.ReadCloser` to prevent connection leaks when both calls succeed (see "test(retry): use synctest for deterministic hedging tests" #4975 for the `synctest` test cleanup that's a prerequisite)
- Disabled by default (`hedgeAfter=0`); opt-in per deployment

Changes

- `pkg/util/retry/hedged.go`: add `Cleanup func(T)` field; called on a losing attempt's result when it succeeded but the other won
- `pkg/phlaredb/symdb/block_reader.go`: add `hedgeAfter` to `Reader`; extract `getRange()` helpers on `rawTable[T]` and `stacktraceBlock` that wrap `GetRange` with `retry.Hedged` when configured
- `pkg/block/object.go`: add `hedgeAfter` field + `WithObjectHedgeAfter` option
- `pkg/block/section_symbols.go`: thread `hedgeAfter` into `symdb.OpenObject`
- `pkg/querybackend/block_reader.go`: add `WithBlockReaderHedgeAfter` option; pass to `block.NewObject`
- `pkg/querybackend/backend.go`: add `BlockReadHedgeAfter` config field + flag

Depends on

- "test(retry): use synctest for deterministic hedging tests" #4975 (`Cleanup` field tests use synctest)

Test plan

- `go test ./pkg/util/retry/... ./pkg/phlaredb/symdb/... ./pkg/querybackend/...`
- Set `--query-backend.block-read-hedge-after=2s` and verify `GetRange` is issued twice for slow fetches

Note
Medium Risk

Introduces new speculative parallel `GetRange` calls and changes `retry.Hedged` cancellation/cleanup semantics, which can affect concurrency behavior and resource usage under load. Disabled by default but could increase object-store traffic when enabled and needs careful tuning.

Overview

Adds opt-in hedged object-store reads to reduce tail latency when fetching symbol tables. A new `query-backend.block-read-hedge-after` setting is threaded from `querybackend.Config` into `BlockReader`/`block.Object` and down to `symdb.OpenObject`, enabling a second speculative `GetRange` after the configured delay.

`pkg/phlaredb/symdb` now wraps symbol table and stacktrace `GetRange` reads with `retry.Hedged` (including closing the losing `io.ReadCloser`). `retry.Hedged` itself is updated to use per-attempt contexts (so the winning result remains usable) and adds a `Cleanup` hook, with new tests covering winner-context behavior and cleanup semantics.

Reviewed by Cursor Bugbot for commit 1a87ad3.