Skip to content

func/regexp_extract: new scalar func and test cases#35191

Open
stephenkgu wants to merge 50 commits into3.0from
fix/6968250338
Open

func/regexp_extract: new scalar func and test cases#35191
stephenkgu wants to merge 50 commits into3.0from
fix/6968250338

Conversation

@stephenkgu
Copy link
Copy Markdown
Contributor

Description

Issue(s)

Checklist

Please check the items in the checklist if applicable.

  • Is the user manual updated?
  • Are the test cases passed and automated?
  • Is there no significant decrease in test coverage?

Copilot AI review requested due to automatic review settings April 21, 2026 08:40
@stephenkgu stephenkgu requested review from a team, dapan1121 and guanshengliang as code owners April 21, 2026 08:40
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the regexp_extract scalar function, which enables substring extraction using POSIX Extended Regular Expressions from both VARCHAR and NCHAR inputs. The implementation adds a translation phase for parameter validation and an execution phase that handles character set conversion and capture group indexing. Feedback points out that the translation phase does not correctly handle NCHAR patterns during validation and that regfree should only be called upon successful regex compilation to avoid undefined behavior.

Comment thread source/libs/function/src/builtins.c Outdated
Comment thread source/libs/function/src/builtins.c Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new scalar string function regexp_extract() to the function framework, plus an accompanying scalar-function test suite to validate expected matching/group semantics across constant expressions and table columns.

Changes:

  • Introduces regexp_extract(str, pattern[, group_idx]) as a builtin scalar function (planner translation + scalar execution).
  • Implements the scalar execution logic with POSIX ERE compilation via the existing threadGetRegComp() cache and per-row regexec().
  • Adds comprehensive Python test coverage for basic behavior, edge cases (NULL/no-match/empty), and error cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py New end-to-end tests for regexp_extract() behavior and error handling (including WAL/flush re-check).
source/libs/scalar/src/sclfunc.c Adds regexpExtractFunction() implementation (pattern conversion/compile + per-row extraction).
source/libs/function/src/builtins.c Registers builtin + adds translateRegexpExtract() validation and sets return type to match input string type.
include/libs/scalar/scalar.h Exposes regexpExtractFunction() prototype to the scalar/function layer.
include/libs/function/functionMgt.h Adds new enum value FUNCTION_TYPE_REGEXP_EXTRACT.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/function/src/builtins.c Outdated
Comment thread source/libs/scalar/src/sclfunc.c
Comment thread source/libs/scalar/src/sclfunc.c Outdated
Comment thread source/libs/scalar/src/sclfunc.c Outdated
Copilot AI review requested due to automatic review settings April 21, 2026 09:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c
Comment thread source/libs/scalar/src/sclfunc.c Outdated
Comment thread source/libs/function/src/builtins.c Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 09:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c Outdated
Comment thread source/libs/scalar/src/sclfunc.c Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 21, 2026 09:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/function/src/builtins.c Outdated
@stephenkgu
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the regexp_extract scalar function, which allows users to extract substrings from VARCHAR and NCHAR columns using POSIX regular expressions and capture groups. The implementation includes a translation phase for validating constant patterns and group indices, as well as the core execution logic that handles regex compilation and character set conversions. Review feedback identifies a memory leak in the NCHAR conversion error path within the main processing loop and suggests optimizing memory allocations for null-terminated strings to improve performance during row iteration.

Comment thread source/libs/scalar/src/sclfunc.c Outdated
Comment thread source/libs/scalar/src/sclfunc.c Outdated
Copilot AI review requested due to automatic review settings April 21, 2026 10:38
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c
Copilot AI review requested due to automatic review settings April 23, 2026 09:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 09:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/en/14-reference/03-taos-sql/22-function.md Outdated
Comment thread test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py
Comment thread source/libs/function/src/builtins.c Outdated
Comment thread docs/zh/14-reference/03-taos-sql/22-function.md Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 09:44
stephenkgu and others added 3 commits April 23, 2026 02:45
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c
Comment thread test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py Outdated
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 23, 2026 09:53
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/libs/scalar/src/sclfunc.c
Comment thread source/libs/scalar/src/sclfunc.c
Copilot AI review requested due to automatic review settings April 23, 2026 10:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@stephenkgu
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the REGEXP_EXTRACT scalar function, enabling substring extraction using POSIX extended regular expressions with support for specific capture groups. The changes include documentation, translation-time validation, runtime execution logic, and comprehensive Python tests. Feedback focuses on critical buffer overflow risks during NCHAR to UTF-8 conversion, where the allocation size fails to account for the potential 4-byte expansion of UCS-4 characters. Additionally, the reviewer noted that several calls to regerror were incorrectly modified to pass a NULL pointer instead of the regex object, which should be corrected to ensure portable and detailed error reporting.

if (!deferValidation) {
if (pPatVal->node.resType.type == TSDB_DATA_TYPE_NCHAR && pPatVal->datum.p != NULL) {
int32_t ncharBytes = varDataLen(pPatVal->datum.p);
utf8Pat = taosMemoryCalloc(ncharBytes + 1, 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The buffer allocation for the UTF-8 pattern is insufficient when the input is NCHAR. In TDengine, varDataLen for NCHAR returns the number of characters, but a single UCS-4 character can require up to 4 bytes in UTF-8. Allocating only ncharBytes + 1 bytes can lead to a buffer overflow during taosUcs4ToMbs.

        utf8Pat = taosMemoryCalloc(ncharBytes * 4 + 1, 1);

// For NCHAR: UTF-8 output is at most strLen bytes (UCS-4 byte count >= UTF-8 byte count),
// so strLen + 1 is a safe upper bound for both NCHAR and VARCHAR paths.
if (strLen + 1 > strNtCap) {
char *tmp = taosMemoryRealloc(strNt, strLen + 1);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The scratch buffer strNt is used to hold the NUL-terminated UTF-8 version of the input string. When the input is NCHAR, strLen represents the number of characters. Since one UCS-4 character can occupy up to 4 bytes in UTF-8, the allocation must account for this to avoid a buffer overflow in taosUcs4ToMbs.

      char *tmp = taosMemoryRealloc(strNt, (isNchar ? strLen * 4 : strLen) + 1);

int ret = regcomp(&re, regPattern, REG_EXTENDED);
if (ret != 0) {
char msgbuf[256] = {0};
(void)regerror(ret, NULL, msgbuf, sizeof(msgbuf));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the POSIX standard, regerror should be passed the pointer to the regex_t structure that was used in the regcomp call that failed. Passing NULL is not recommended as it may prevent the library from providing a specific error message.

        (void)regerror(ret, &re, msgbuf, sizeof(msgbuf));

if (ret != 0) {
char msgbuf[256] = {0};
(void)regerror(ret, &regex, msgbuf, tListLen(msgbuf));
(void)regerror(ret, NULL, msgbuf, tListLen(msgbuf));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to the POSIX standard, regerror should be passed the pointer to the regex_t structure that was used in the regcomp or regexec call that failed. While some implementations might handle a NULL pointer, it is safer and more portable to provide the actual structure to allow the library to extract more detailed error information.

    (void)regerror(ret, &regex, msgbuf, tListLen(msgbuf));

if (ret != 0) {
char msgbuf[256] = {0};
(void)regerror(ret, &pUsingRegex->pRegex, msgbuf, tListLen(msgbuf));
(void)regerror(ret, NULL, msgbuf, tListLen(msgbuf));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Passing NULL to regerror is not recommended. It should receive the pointer to the regex_t structure associated with the failure.

    (void)regerror(ret, &pUsingRegex->pRegex, msgbuf, tListLen(msgbuf));

if (ret != 0) {
char msgbuf[256] = {0};
(void)regerror(ret, &gRegex, msgbuf, tListLen(msgbuf));
(void)regerror(ret, NULL, msgbuf, tListLen(msgbuf));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Passing NULL to regerror is not recommended. It should receive the pointer to the regex_t structure associated with the failure.

    (void)regerror(ret, &gRegex, msgbuf, tListLen(msgbuf));

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants