Skip to content

fix: invalid memory read and threads blocked issues#35166

Open
dapan1121 wants to merge 7 commits into3.0from
fix/6973144195
Open

fix: invalid memory read and threads blocked issues#35166
dapan1121 wants to merge 7 commits into3.0from
fix/6973144195

Conversation

@dapan1121
Copy link
Copy Markdown
Contributor

…canoperator

Three functions in sysscanoperator.c used tsem_timewait with unsafe callback parameter lifetime, causing ASAN errors in test_mnode_basic2.py -N 2:

  1. sysTableScanFromMNode: passed raw pOperator* as RPC callback param. tsem_timewait timeout -> T_LONG_JMP -> destroyOperator freed pOperator -> callback fired on freed heap (heap-use-after-free).

  2. vtbRefGetDbVgInfo: passed &ctx (stack SVtbRefValidateCtx) as callback param. tsem_timewait timeout -> function returned -> callback fired on dead stack (stack-use-after-return).

  3. vtbRefFetchTableSchema: same stack-use-after-return pattern.

Additionally, tsem_timewait is unsafe in the query worker pool: if every worker thread blocks in tsem_timewait the pool is fully stalled.

Fix for sysTableScanFromMNode (exchange-operator pattern):

  • Add sysTableScanRefPool (TdRefId-based, separate from fetchObjRefPool).
  • Add SSysTableScanInfo.self (ref ID) and .rspCode fields.
  • Callback stores only SSysTableScanCbParam{int64_t sysTableScanId}, heap- allocated, freed via paramFreeFp = taosAutoMemoryFree.
  • Callback: taosAcquireRef -> if NULL discard safely -> taosReleaseRef.
  • destroySysScanOperator calls taosRemoveRef; actual cleanup in doDestroySysTableScanInfo (pool destructor), deferred until refcount=0.
  • Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.

Fix for vtbRefGetDbVgInfo and vtbRefFetchTableSchema:

  • Add volatile int32_t refCount to SVtbRefValidateCtx.
  • Add vtbRefValidateCtxDecRef() helper (frees when refCount reaches 0).
  • Heap-allocate pCtx with refCount=2 (waiter + callback).
  • vtbRefValidateCallback calls vtbRefValidateCtxDecRef instead of relying on caller cleanup.
  • Add SQueryAutoQWorkerPoolCB* pWorkerCb parameter to both functions and to vtbRefValidateRemote; propagated from pTaskInfo->pWorkerCb at call site.
  • Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.
  • _return: calls vtbRefValidateCtxDecRef(pCtx) to release waiter reference.

Verified: 20/20 passes of test_mnode_basic2.py -N 2 with ASAN build, 0 errors.

Description

Issue(s)

  • Close/close/Fix/fix/Resolve/resolve: Issue Link

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

…canoperator

Three functions in sysscanoperator.c used tsem_timewait with unsafe callback
parameter lifetime, causing ASAN errors in test_mnode_basic2.py -N 2:

1. sysTableScanFromMNode: passed raw pOperator* as RPC callback param.
   tsem_timewait timeout -> T_LONG_JMP -> destroyOperator freed pOperator ->
   callback fired on freed heap (heap-use-after-free).

2. vtbRefGetDbVgInfo: passed &ctx (stack SVtbRefValidateCtx) as callback param.
   tsem_timewait timeout -> function returned -> callback fired on dead stack
   (stack-use-after-return).

3. vtbRefFetchTableSchema: same stack-use-after-return pattern.

Additionally, tsem_timewait is unsafe in the query worker pool: if every worker
thread blocks in tsem_timewait the pool is fully stalled.

Fix for sysTableScanFromMNode (exchange-operator pattern):
- Add sysTableScanRefPool (TdRefId-based, separate from fetchObjRefPool).
- Add SSysTableScanInfo.self (ref ID) and .rspCode fields.
- Callback stores only SSysTableScanCbParam{int64_t sysTableScanId}, heap-
  allocated, freed via paramFreeFp = taosAutoMemoryFree.
- Callback: taosAcquireRef -> if NULL discard safely -> taosReleaseRef.
- destroySysScanOperator calls taosRemoveRef; actual cleanup in
  doDestroySysTableScanInfo (pool destructor), deferred until refcount=0.
- Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.

Fix for vtbRefGetDbVgInfo and vtbRefFetchTableSchema:
- Add volatile int32_t refCount to SVtbRefValidateCtx.
- Add vtbRefValidateCtxDecRef() helper (frees when refCount reaches 0).
- Heap-allocate pCtx with refCount=2 (waiter + callback).
- vtbRefValidateCallback calls vtbRefValidateCtxDecRef instead of relying on
  caller cleanup.
- Add SQueryAutoQWorkerPoolCB* pWorkerCb parameter to both functions and to
  vtbRefValidateRemote; propagated from pTaskInfo->pWorkerCb at call site.
- Replace tsem_timewait with beforeBlocking+tsem_wait+afterRecoverFromBlocking.
- _return: calls vtbRefValidateCtxDecRef(pCtx) to release waiter reference.

Verified: 20/20 passes of test_mnode_basic2.py -N 2 with ASAN build, 0 errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 02:32
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a reference counting mechanism and a global reference pool to manage the lifecycle of scan information and validation contexts. These changes are designed to prevent use-after-free vulnerabilities during asynchronous RPC callbacks, especially when the calling thread exits early due to errors or timeouts. The implementation also integrates with the worker pool to signal blocking operations. Feedback was provided regarding potential memory leaks in the error handling paths of vtbRefGetDbVgInfo and vtbRefFetchTableSchema, where the reference count may not be fully decremented if an error occurs before the RPC is dispatched.

Comment thread source/libs/executor/src/sysscanoperator.c Outdated
Comment thread source/libs/executor/src/sysscanoperator.c Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes ASAN-reported use-after-free / stack-use-after-return in sysscanoperator.c by making RPC callback parameters safe across timeouts/operator teardown, and by integrating “worker pool beforeBlocking/afterRecoverFromBlocking” hooks to avoid stalling the query worker pool while waiting for RPC responses.

Changes:

  • Introduces a per-file ref-pool (sysTableScanRefPool) and a lightweight callback parameter wrapper to safely acquire/release SSysTableScanInfo in loadSysTableCallback.
  • Refactors virtual table reference validation RPCs to use heap-allocated SVtbRefValidateCtx with refcounted lifetime management.
  • Replaces tsem_timewait with beforeBlocking + tsem_wait + afterRecoverFromBlocking in affected RPC wait paths.
Comments suppressed due to low confidence (1)

source/libs/executor/src/sysscanoperator.c:5634

  • doDestroySysTableScanInfo unconditionally calls tsem_destroy(&pInfo->ready), but pInfo->ready is only initialized in createSysTableScanOperatorInfo's MNode path. For local-scan operators this can hit EINVAL (and log an error) when destroying an uninitialized semaphore. Track whether ready was initialized (e.g., a readyInited flag) or always tsem_init it for all paths and only wait on it when needed.
  SSysTableScanInfo* pInfo = (SSysTableScanInfo*)param;
  int32_t            code = tsem_destroy(&pInfo->ready);
  if (code != TSDB_CODE_SUCCESS) {
    qError("%s failed at line %d since %s", __func__, __LINE__, tstrerror(code));
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/executor/src/sysscanoperator.c
Comment thread source/libs/executor/src/sysscanoperator.c
Comment thread source/libs/executor/src/sysscanoperator.c
Comment thread source/libs/executor/src/sysscanoperator.c
…elpers

In vtbRefGetDbVgInfo and vtbRefFetchTableSchema, the SVtbRefValidateCtx
refCount was initialised to 2 (waiter + callback) before the message was
even sent. Any early-exit path that runs before asyncSendMsgToServer
(e.g. tsem_init failure, serialisation OOM, pMsgSendInfo alloc failure)
would jump to _return, call vtbRefValidateCtxDecRef once (2->1), and
leave the struct permanently leaked because the callback -- which was
never dispatched -- would never decrement it to 0.

Fix: initialise refCount to 1 (waiter only). Bump to 2 immediately
before asyncSendMsgToServer. If the send fails, roll back the bump
synchronously; _return then brings it to 0 and frees the struct.
…it cleanup for sysTableScanRefPool

- vtbRefValidateCallback: free pMsg->pEpSet on all paths (success+error)
- loadSysTableCallback: free pMsg->pEpSet on early-return and normal paths
- loadSysTableCallback: free pMsg->pData on error branch (was neither
  transferred nor freed)
- Register atexit(cleanupSysTableScanRefPool) to close the ref pool on
  process exit, matching the fetchObjRefPool pattern in executor.c
Copilot AI review requested due to automatic review settings April 17, 2026 02:59
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/executor/src/sysscanoperator.c Outdated
Comment thread source/libs/executor/src/sysscanoperator.c Outdated
dapan1121 and others added 2 commits April 17, 2026 13:45
- Add missing 'static' to loadSysTableCallback definition to match its
  earlier static prototype (linkage consistency)
- Update SVtbRefValidateCtx.refCount comment to accurately describe the
  actual lifecycle: starts at 1 (waiter only), bumped to 2 immediately
  before asyncSendMsgToServer (waiter + callback)
- Replace (void)taosRemoveRef / (void)taosReleaseRef with proper error
  checking and qError logging, consistent with exchangeoperator.c pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…o satisfy clang-query rule

The filter_for_return_values clang-query rule matches any callExpr whose
direct parent is compoundStmt and whose return type is non-void.

TAOS_UNUSED(callExpr()) wraps the callExpr in a CStyleCastExpr, changing
its parent from compoundStmt to CStyleCastExpr, but the rule was still
matching in practice.  Assigning the result to a local variable (whose
parent is VarDecl/DeclStmt, not compoundStmt) definitively satisfies the
rule.  The variable is then silenced with TAOS_UNUSED to suppress any
unused-variable warning.

Affected sites: vtbRefGetDbVgInfo and vtbRefFetchTableSchema, the two
bump (refCount→2) and roll-back (refCount→1) atomic operations that are
not consumed by any further logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 06:24
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/executor/src/sysscanoperator.c Outdated
Comment thread source/libs/executor/src/sysscanoperator.c Outdated
Comment thread source/libs/executor/src/sysscanoperator.c Outdated
dapan1121 and others added 2 commits April 17, 2026 14:58
Three review issues fixed:

1. destroySysScanOperator linkage: the function was declared static at the
   top of the file but defined without static — add static to the definition
   to match the forward declaration.

2. doDestroySysTableScanInfo comment: the old comment said the function
   'Must NOT be called directly', but destroySysScanOperator legitimately
   calls it directly for the local-scan path (self == 0, no ref pool).
   Update the comment to clarify both usage paths.

3. Consolidate worker-pool blocking via qSemWait: replace the three
   open-coded beforeBlocking/tsem_wait/afterRecoverFromBlocking patterns
   in vtbRefGetDbVgInfo, vtbRefFetchTableSchema and sysTableScanFromMNode
   with qSemWait(pTaskInfo, sem), which centralises error handling and
   reduces the risk of inconsistency in future edits.
   To enable this, vtbRefGetDbVgInfo, vtbRefFetchTableSchema and
   vtbRefValidateRemote now take SExecTaskInfo* pTaskInfo instead of the
   narrower SQueryAutoQWorkerPoolCB* pWorkerCb.  The sole external call
   site in validateSrcTableColRef is updated accordingly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…eliminate use-after-free race

The previous ordering was:
  1. tsem_post(&pInfo->ready)     -- wakes the waiter
  2. taosReleaseRef(...)          -- drops callback's ref (2→0 if waiter raced)

When the waiter woke up early enough, the sequence became:
  T_waiter: task completes → taosRemoveRef (2→1) → doDestroyTask frees pool
            (which owns pInfo since it was pool-allocated)
  T_callback: taosReleaseRef (1→0) → doDestroySysTableScanInfo(pInfo) → use-after-free

Fix: call taosReleaseRef BEFORE tsem_post.
  1. taosReleaseRef(...)  -- drops callback's ref (2→1); destructor NOT called
  2. tsem_post(...)       -- wake waiter; pInfo still valid (count=1)

After waking, taosRemoveRef drops count to 0 and the destructor runs while
pInfo is still live.  The pool cleanup in doDestroyTask then finds pInfo
already freed by the destructor.

Verified: 100 consecutive passes of test_match.py with ASAN enabled, zero
heap-use-after-free errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 17, 2026 11:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/executor/src/sysscanoperator.c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants