fix: invalid memory read and threads blocked issues#35166
fix: invalid memory read and threads blocked issues#35166
Conversation
…canoperator
Three functions in sysscanoperator.c used tsem_timewait with unsafe callback
parameter lifetime, causing ASAN errors in test_mnode_basic2.py -N 2:
1. sysTableScanFromMNode: passed raw pOperator* as RPC callback param.
tsem_timewait timeout -> T_LONG_JMP -> destroyOperator freed pOperator ->
callback fired on freed heap (heap-use-after-free).
2. vtbRefGetDbVgInfo: passed &ctx (stack SVtbRefValidateCtx) as callback param.
tsem_timewait timeout -> function returned -> callback fired on dead stack
(stack-use-after-return).
3. vtbRefFetchTableSchema: same stack-use-after-return pattern.
Additionally, tsem_timewait is unsafe in the query worker pool: if every worker
thread blocks in tsem_timewait the pool is fully stalled.
Fix for sysTableScanFromMNode (exchange-operator pattern):
- Add sysTableScanRefPool (TdRefId-based, separate from fetchObjRefPool).
- Add SSysTableScanInfo.self (ref ID) and .rspCode fields.
- Callback stores only SSysTableScanCbParam{int64_t sysTableScanId}, heap-
allocated, freed via paramFreeFp = taosAutoMemoryFree.
- Callback: taosAcquireRef -> if NULL discard safely -> taosReleaseRef.
- destroySysScanOperator calls taosRemoveRef; actual cleanup in
doDestroySysTableScanInfo (pool destructor), deferred until refcount=0.
- Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.
Fix for vtbRefGetDbVgInfo and vtbRefFetchTableSchema:
- Add volatile int32_t refCount to SVtbRefValidateCtx.
- Add vtbRefValidateCtxDecRef() helper (frees when refCount reaches 0).
- Heap-allocate pCtx with refCount=2 (waiter + callback).
- vtbRefValidateCallback calls vtbRefValidateCtxDecRef instead of relying on
caller cleanup.
- Add SQueryAutoQWorkerPoolCB* pWorkerCb parameter to both functions and to
vtbRefValidateRemote; propagated from pTaskInfo->pWorkerCb at call site.
- Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.
- _return: calls vtbRefValidateCtxDecRef(pCtx) to release waiter reference.
Verified: 20/20 passes of test_mnode_basic2.py -N 2 with ASAN build, 0 errors.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a reference counting mechanism and a global reference pool to manage the lifecycle of scan information and validation contexts. These changes are designed to prevent use-after-free vulnerabilities during asynchronous RPC callbacks, especially when the calling thread exits early due to errors or timeouts. The implementation also integrates with the worker pool to signal blocking operations. Feedback was provided regarding potential memory leaks in the error handling paths of vtbRefGetDbVgInfo and vtbRefFetchTableSchema, where the reference count may not be fully decremented if an error occurs before the RPC is dispatched.
There was a problem hiding this comment.
Pull request overview
This PR fixes ASAN-reported use-after-free / stack-use-after-return in sysscanoperator.c by making RPC callback parameters safe across timeouts/operator teardown, and by integrating “worker pool beforeBlocking/afterRecoverFromBlocking” hooks to avoid stalling the query worker pool while waiting for RPC responses.
Changes:
- Introduces a per-file ref-pool (
sysTableScanRefPool) and a lightweight callback parameter wrapper to safely acquire/releaseSSysTableScanInfoinloadSysTableCallback. - Refactors virtual table reference validation RPCs to use heap-allocated
SVtbRefValidateCtxwith refcounted lifetime management. - Replaces
tsem_timewaitwithbeforeBlocking + tsem_wait + afterRecoverFromBlockingin affected RPC wait paths.
Comments suppressed due to low confidence (1)
source/libs/executor/src/sysscanoperator.c:5634
doDestroySysTableScanInfounconditionally callstsem_destroy(&pInfo->ready), butpInfo->readyis only initialized increateSysTableScanOperatorInfo's MNode path. For local-scan operators this can hitEINVAL(and log an error) when destroying an uninitialized semaphore. Track whetherreadywas initialized (e.g., areadyInitedflag) or alwaystsem_initit for all paths and only wait on it when needed.
SSysTableScanInfo* pInfo = (SSysTableScanInfo*)param;
int32_t code = tsem_destroy(&pInfo->ready);
if (code != TSDB_CODE_SUCCESS) {
qError("%s failed at line %d since %s", __func__, __LINE__, tstrerror(code));
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…elpers In vtbRefGetDbVgInfo and vtbRefFetchTableSchema, the SVtbRefValidateCtx refCount was initialised to 2 (waiter + callback) before the message was even sent. Any early-exit path that runs before asyncSendMsgToServer (e.g. tsem_init failure, serialisation OOM, pMsgSendInfo alloc failure) would jump to _return, call vtbRefValidateCtxDecRef once (2->1), and leave the struct permanently leaked because the callback -- which was never dispatched -- would never decrement it to 0. Fix: initialise refCount to 1 (waiter only). Bump to 2 immediately before asyncSendMsgToServer. If the send fails, roll back the bump synchronously; _return then brings it to 0 and frees the struct.
…it cleanup for sysTableScanRefPool - vtbRefValidateCallback: free pMsg->pEpSet on all paths (success+error) - loadSysTableCallback: free pMsg->pEpSet on early-return and normal paths - loadSysTableCallback: free pMsg->pData on error branch (was neither transferred nor freed) - Register atexit(cleanupSysTableScanRefPool) to close the ref pool on process exit, matching the fetchObjRefPool pattern in executor.c
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add missing 'static' to loadSysTableCallback definition to match its earlier static prototype (linkage consistency) - Update SVtbRefValidateCtx.refCount comment to accurately describe the actual lifecycle: starts at 1 (waiter only), bumped to 2 immediately before asyncSendMsgToServer (waiter + callback) - Replace (void)taosRemoveRef / (void)taosReleaseRef with proper error checking and qError logging, consistent with exchangeoperator.c pattern Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o satisfy clang-query rule The filter_for_return_values clang-query rule matches any callExpr whose direct parent is compoundStmt and whose return type is non-void. TAOS_UNUSED(callExpr()) wraps the callExpr in a CStyleCastExpr, changing its parent from compoundStmt to CStyleCastExpr, but the rule was still matching in practice. Assigning the result to a local variable (whose parent is VarDecl/DeclStmt, not compoundStmt) definitively satisfies the rule. The variable is then silenced with TAOS_UNUSED to suppress any unused-variable warning. Affected sites: vtbRefGetDbVgInfo and vtbRefFetchTableSchema, the two bump (refCount→2) and roll-back (refCount→1) atomic operations that are not consumed by any further logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Three review issues fixed: 1. destroySysScanOperator linkage: the function was declared static at the top of the file but defined without static — add static to the definition to match the forward declaration. 2. doDestroySysTableScanInfo comment: the old comment said the function 'Must NOT be called directly', but destroySysScanOperator legitimately calls it directly for the local-scan path (self == 0, no ref pool). Update the comment to clarify both usage paths. 3. Consolidate worker-pool blocking via qSemWait: replace the three open-coded beforeBlocking/tsem_wait/afterRecoverFromBlocking patterns in vtbRefGetDbVgInfo, vtbRefFetchTableSchema and sysTableScanFromMNode with qSemWait(pTaskInfo, sem), which centralises error handling and reduces the risk of inconsistency in future edits. To enable this, vtbRefGetDbVgInfo, vtbRefFetchTableSchema and vtbRefValidateRemote now take SExecTaskInfo* pTaskInfo instead of the narrower SQueryAutoQWorkerPoolCB* pWorkerCb. The sole external call site in validateSrcTableColRef is updated accordingly. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eliminate use-after-free race
The previous ordering was:
1. tsem_post(&pInfo->ready) -- wakes the waiter
2. taosReleaseRef(...) -- drops callback's ref (2→0 if waiter raced)
When the waiter woke up early enough, the sequence became:
T_waiter: task completes → taosRemoveRef (2→1) → doDestroyTask frees pool
(which owns pInfo since it was pool-allocated)
T_callback: taosReleaseRef (1→0) → doDestroySysTableScanInfo(pInfo) → use-after-free
Fix: call taosReleaseRef BEFORE tsem_post.
1. taosReleaseRef(...) -- drops callback's ref (2→1); destructor NOT called
2. tsem_post(...) -- wake waiter; pInfo still valid (count=1)
After waking, taosRemoveRef drops count to 0 and the destructor runs while
pInfo is still live. The pool cleanup in doDestroyTask then finds pInfo
already freed by the destructor.
Verified: 100 consecutive passes of test_match.py with ASAN enabled, zero
heap-use-after-free errors.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…canoperator
Three functions in sysscanoperator.c used tsem_timewait with unsafe callback parameter lifetime, causing ASAN errors in test_mnode_basic2.py -N 2:
sysTableScanFromMNode: passed raw pOperator* as RPC callback param. tsem_timewait timeout -> T_LONG_JMP -> destroyOperator freed pOperator -> callback fired on freed heap (heap-use-after-free).
vtbRefGetDbVgInfo: passed &ctx (stack SVtbRefValidateCtx) as callback param. tsem_timewait timeout -> function returned -> callback fired on dead stack (stack-use-after-return).
vtbRefFetchTableSchema: same stack-use-after-return pattern.
Additionally, tsem_timewait is unsafe in the query worker pool: if every worker thread blocks in tsem_timewait the pool is fully stalled.
Fix for sysTableScanFromMNode (exchange-operator pattern):
Fix for vtbRefGetDbVgInfo and vtbRefFetchTableSchema:
Verified: 20/20 passes of test_mnode_basic2.py -N 2 with ASAN build, 0 errors.
Description
Issue(s)
Checklist
Please check the items in the checklist if applicable.