func/regexp_extract: new scalar func and test cases#35191
func/regexp_extract: new scalar func and test cases#35191stephenkgu wants to merge 50 commits into3.0from
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces the regexp_extract scalar function, which enables substring extraction using POSIX Extended Regular Expressions from both VARCHAR and NCHAR inputs. The implementation adds a translation phase for parameter validation and an execution phase that handles character set conversion and capture group indexing. Feedback points out that the translation phase does not correctly handle NCHAR patterns during validation and that regfree should only be called upon successful regex compilation to avoid undefined behavior.
There was a problem hiding this comment.
Pull request overview
Adds a new scalar string function regexp_extract() to the function framework, plus an accompanying scalar-function test suite to validate expected matching/group semantics across constant expressions and table columns.
Changes:
- Introduces
regexp_extract(str, pattern[, group_idx])as a builtin scalar function (planner translation + scalar execution). - Implements the scalar execution logic with POSIX ERE compilation via the existing
threadGetRegComp()cache and per-rowregexec(). - Adds comprehensive Python test coverage for basic behavior, edge cases (NULL/no-match/empty), and error cases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| test/cases/11-Functions/01-Scalar/test_fun_sca_regexp_extract.py | New end-to-end tests for regexp_extract() behavior and error handling (including WAL/flush re-check). |
| source/libs/scalar/src/sclfunc.c | Adds regexpExtractFunction() implementation (pattern conversion/compile + per-row extraction). |
| source/libs/function/src/builtins.c | Registers builtin + adds translateRegexpExtract() validation and sets return type to match input string type. |
| include/libs/scalar/scalar.h | Exposes regexpExtractFunction() prototype to the scalar/function layer. |
| include/libs/function/functionMgt.h | Adds new enum value FUNCTION_TYPE_REGEXP_EXTRACT. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements the regexp_extract scalar function, which allows users to extract substrings from VARCHAR and NCHAR columns using POSIX regular expressions and capture groups. The implementation includes a translation phase for validating constant patterns and group indices, as well as the core execution logic that handles regex compilation and character set conversions. Review feedback identifies a memory leak in the NCHAR conversion error path within the main processing loop and suggests optimizing memory allocations for null-terminated strings to improve performance during row iteration.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the REGEXP_EXTRACT scalar function, enabling substring extraction using POSIX extended regular expressions with support for specific capture groups. The changes include documentation, translation-time validation, runtime execution logic, and comprehensive Python tests. Feedback focuses on critical buffer overflow risks during NCHAR to UTF-8 conversion, where the allocation size fails to account for the potential 4-byte expansion of UCS-4 characters. Additionally, the reviewer noted that several calls to regerror were incorrectly modified to pass a NULL pointer instead of the regex object, which should be corrected to ensure portable and detailed error reporting.
| if (!deferValidation) { | ||
| if (pPatVal->node.resType.type == TSDB_DATA_TYPE_NCHAR && pPatVal->datum.p != NULL) { | ||
| int32_t ncharBytes = varDataLen(pPatVal->datum.p); | ||
| utf8Pat = taosMemoryCalloc(ncharBytes + 1, 1); |
There was a problem hiding this comment.
The buffer allocation for the UTF-8 pattern is insufficient when the input is NCHAR. In TDengine, varDataLen for NCHAR returns the number of characters, but a single UCS-4 character can require up to 4 bytes in UTF-8. Allocating only ncharBytes + 1 bytes can lead to a buffer overflow during taosUcs4ToMbs.
utf8Pat = taosMemoryCalloc(ncharBytes * 4 + 1, 1);| // For NCHAR: UTF-8 output is at most strLen bytes (UCS-4 byte count >= UTF-8 byte count), | ||
| // so strLen + 1 is a safe upper bound for both NCHAR and VARCHAR paths. | ||
| if (strLen + 1 > strNtCap) { | ||
| char *tmp = taosMemoryRealloc(strNt, strLen + 1); |
There was a problem hiding this comment.
The scratch buffer strNt is used to hold the NUL-terminated UTF-8 version of the input string. When the input is NCHAR, strLen represents the number of characters. Since one UCS-4 character can occupy up to 4 bytes in UTF-8, the allocation must account for this to avoid a buffer overflow in taosUcs4ToMbs.
char *tmp = taosMemoryRealloc(strNt, (isNchar ? strLen * 4 : strLen) + 1);| int ret = regcomp(&re, regPattern, REG_EXTENDED); | ||
| if (ret != 0) { | ||
| char msgbuf[256] = {0}; | ||
| (void)regerror(ret, NULL, msgbuf, sizeof(msgbuf)); |
There was a problem hiding this comment.
| if (ret != 0) { | ||
| char msgbuf[256] = {0}; | ||
| (void)regerror(ret, ®ex, msgbuf, tListLen(msgbuf)); | ||
| (void)regerror(ret, NULL, msgbuf, tListLen(msgbuf)); |
There was a problem hiding this comment.
According to the POSIX standard, regerror should be passed the pointer to the regex_t structure that was used in the regcomp or regexec call that failed. While some implementations might handle a NULL pointer, it is safer and more portable to provide the actual structure to allow the library to extract more detailed error information.
(void)regerror(ret, ®ex, msgbuf, tListLen(msgbuf));| if (ret != 0) { | ||
| char msgbuf[256] = {0}; | ||
| (void)regerror(ret, &pUsingRegex->pRegex, msgbuf, tListLen(msgbuf)); | ||
| (void)regerror(ret, NULL, msgbuf, tListLen(msgbuf)); |
| if (ret != 0) { | ||
| char msgbuf[256] = {0}; | ||
| (void)regerror(ret, &gRegex, msgbuf, tListLen(msgbuf)); | ||
| (void)regerror(ret, NULL, msgbuf, tListLen(msgbuf)); |
Description
Issue(s)
https://project.feishu.cn/taosdata_td/feature/detail/6968250338
Checklist
Please check the items in the checklist if applicable.