fix(search): column bulk operations search not returning results at scale #27216
sonika-shah wants to merge 9 commits into main
…cale

When searching by column name pattern (e.g., "MAT") in column bulk operations, the composite aggregation returned ALL column names from matching documents, then post-filtered in Java. With 20000+ columns, the first composite page of 25 names rarely contained matches, so users saw 0 results.

Switch to terms aggregation with `include` regex when a search pattern is set. This filters at the ES/OS aggregation level, so only matching column names produce buckets.

Two-phase approach: (1) lightweight names query to get all matching names + accurate total, (2) targeted data query with top_hits for the current page only.
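The two-phase flow described in the commit message can be pictured with a small in-memory sketch. It makes no Elasticsearch or OpenSearch calls. The class and method names (`TwoPhaseSearchSketch`, `matchingNames`, `pageNames`) are hypothetical and do not come from the PR; only the shape of the logic (collect all matching names first, then slice one page by offset) follows the description above.

```java
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

/** In-memory sketch of the two-phase idea; names here are hypothetical, not the PR's code. */
final class TwoPhaseSearchSketch {

  /** Phase 1: every distinct name containing the pattern (case-insensitive), sorted ascending. */
  static List<String> matchingNames(List<String> allColumnNames, String pattern) {
    String needle = pattern.toLowerCase(Locale.ROOT);
    TreeSet<String> sorted = new TreeSet<>();
    for (String name : allColumnNames) {
      if (name.toLowerCase(Locale.ROOT).contains(needle)) {
        sorted.add(name);
      }
    }
    return List.copyOf(sorted);
  }

  /** Phase 2 input: only the names that belong to the requested page. */
  static List<String> pageNames(List<String> matching, int offset, int pageSize) {
    int from = Math.min(Math.max(offset, 0), matching.size());
    int to = Math.min(from + pageSize, matching.size());
    return matching.subList(from, to);
  }

  public static void main(String[] args) {
    List<String> names = List.of("price", "MATERIAL_ID", "format", "qty", "mat_code");
    List<String> matching = matchingNames(names, "MAT");
    // The total comes from phase 1; the page slice is what phase 2 would fetch data for.
    System.out.println("total=" + matching.size() + " page=" + pageNames(matching, 0, 2));
  }
}
```

Because the total is taken from the full phase-1 list, it stays accurate no matter which page is requested.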
a773a85 to 9f3b664
Pull request overview
Fixes column-name search in Column Bulk Operations for very wide schemas (20k+ columns) by switching the columnNamePattern path from composite aggregation + Java post-filtering to a two-phase terms aggregation that filters bucket keys server-side using an include regexp.
Changes:
- Added `ColumnAggregator.toCaseInsensitiveRegex()` to generate a Lucene-compatible, case-insensitive regexp for `terms.include`.
- Implemented a pattern-search branch in both Elasticsearch and OpenSearch column aggregators using a two-phase `terms` aggregation (names query + page data query).
- Added unit tests for regex generation and edge cases.
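For readers unfamiliar with the trick, here is a minimal sketch of how such a case-insensitive regexp could be built. It is not the PR's `toCaseInsensitiveRegex()` implementation. The class name, the set of escaped characters, and the handling of empty input are assumptions; the only behaviour taken from the PR description is that "MAT" becomes `.*[mM][aA][tT].*`.

```java
/** Hypothetical sketch; the real ColumnAggregator method may differ in escaping and edge cases. */
final class CaseInsensitiveRegexSketch {

  // Assumption: escape Lucene's reserved regexp characters, including the optional operators.
  private static final String RESERVED = ".?+*|{}[]()\"\\#@&<>~";

  static String toCaseInsensitiveRegex(String pattern) {
    StringBuilder sb = new StringBuilder(".*");
    // Works char by char, so characters outside the BMP (surrogate pairs) are passed through as-is.
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      char lower = Character.toLowerCase(c);
      char upper = Character.toUpperCase(c);
      if (Character.isLetter(c) && lower != upper) {
        // Cased letter: accept either case via a two-character class.
        sb.append('[').append(lower).append(upper).append(']');
      } else if (RESERVED.indexOf(c) >= 0) {
        // Reserved character: match it literally.
        sb.append('\\').append(c);
      } else {
        sb.append(c);
      }
    }
    return sb.append(".*").toString();
  }

  public static void main(String[] args) {
    System.out.println(toCaseInsensitiveRegex("MAT"));
  }
}
```

The leading and trailing `.*` are there because the `include` regexp has to match the whole bucket key, so a "contains" search needs wildcards on both sides.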
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/search/ColumnAggregator.java | Adds shared utility to build Lucene-compatible case-insensitive regex for terms include. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java | Adds pattern-search code path using two-phase terms aggregation and offset-based pagination cursor. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java | Mirrors the two-phase terms aggregation approach for OpenSearch and refactors bucket parsing. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java | Adds unit tests validating regex generation behavior (case handling + escaping). |
Comments suppressed due to low confidence (2)
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:68
`MAX_PATTERN_SEARCH_NAMES` is hard-capped at 10,000 for the phase-1 `terms` aggregation. On large schemas (e.g., 20k+ columns) a broad pattern (like a single character) can easily match >10k distinct column names, which will silently truncate `matchingNames`, undercount `totalUniqueColumns`, and prevent users from paging to the missing matches. Consider paging the name collection (e.g., via composite agg with `after_key`, or partitioning the `terms` agg) or raising the limit to cover worst-case table sizes and explicitly detecting/tracking truncation when the limit is hit.
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;
/** Index configuration with field mappings for each entity type. Uses aliases defined in indexMapping.json */
private static final Map<String, IndexConfig> INDEX_CONFIGS =
Map.of(
"table",
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:70
The phase-1 pattern search uses a `terms` agg with `size=MAX_PATTERN_SEARCH_NAMES` (10,000). If the pattern matches more than 10k distinct column names (common on 20k+ column tables for broad patterns), the names list and `totalUniqueColumns` will be truncated and the remaining matches become unreachable via pagination. Consider implementing a paged name scan (e.g., composite agg with cursor) or otherwise guaranteeing retrieval of all matching names (and/or surfacing a truncation indicator).
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;
/** Uses aliases defined in indexMapping.json */
private static final List<String> DATA_ASSET_INDEXES =
Arrays.asList("table", "dashboardDataModel", "topic", "searchIndex", "container");
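One possible way to surface the truncation both suppressed comments describe is to request one bucket more than the cap and check whether it comes back. The sketch below shows only that bookkeeping on a plain list of names. It is not part of the PR, `TruncationSketch` and its methods are hypothetical, and whether the extra bucket is acceptable for the real names query is an open assumption.

```java
import java.util.List;

/** Hypothetical "probe one extra" sketch; not taken from the PR. */
final class TruncationSketch {

  /** Size to request from the names query: one past the cap, so an overflow is observable. */
  static int requestSize(int cap) {
    return cap + 1;
  }

  /** True when more names came back than the cap allows, meaning some matches exist beyond it. */
  static boolean isTruncated(List<String> returnedNames, int cap) {
    return returnedNames.size() > cap;
  }

  /** The names actually exposed to pagination: at most {@code cap} entries. */
  static List<String> capped(List<String> returnedNames, int cap) {
    return isTruncated(returnedNames, cap) ? returnedNames.subList(0, cap) : returnedNames;
  }

  public static void main(String[] args) {
    int cap = 3; // stands in for MAX_PATTERN_SEARCH_NAMES
    List<String> returned = List.of("a", "b", "c", "d"); // pretend the query returned cap + 1 names
    System.out.println("truncated=" + isTruncated(returned, cap) + " names=" + capped(returned, cap));
  }
}
```

A flag like this would let the response say "more than N matches" instead of silently undercounting.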
🟡 Playwright Results: all passed (19 flaky)
✅ 3668 passed · ❌ 0 failed · 🟡 19 flaky · ⏭️ 89 skipped
🟡 19 flaky test(s) (passed on retry)
How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Code Review ✅ Approved 4 resolved / 4 findings
Bulk search column operations now correctly return results at scale. The fix replaces the case-sensitive HashMap with a case-insensitive TreeSet and aligns unit tests with the Lucene/ES regex engine.
✅ 4 resolved
✅ Bug: Unit tests validate Java regex, not Lucene/ES regex engine
✅ Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup
✅ Bug:
ColumnGridResponse page2 = getColumnGrid(client, baseQuery + "&cursor=" + page1.getCursor());
assertEquals(2, page2.getColumns().size(), "Page 2 should have exactly 2 columns");
assertNotNull(page2.getCursor(), "Page 2 should have a cursor for next page");

ColumnGridResponse page3 = getColumnGrid(client, baseQuery + "&cursor=" + page2.getCursor());
assertEquals(1, page3.getColumns().size(), "Page 3 (last) should have exactly 1 column");
The cursor is appended to the query string without URL-encoding. Since cursors are Base64, they can contain characters like +, /, and = that are not safe in a raw query param (e.g., + may be decoded as a space), which can make this pagination test flaky. Encode the cursor before concatenating it into queryParams (e.g., via URLEncoder.encode(cursor, UTF_8)).
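A minimal sketch of the suggested fix, using only the JDK. The helper name `withCursor` and the sample query and cursor values are hypothetical; `URLEncoder.encode(String, Charset)` is the standard library call (Java 10+).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper showing the encoding step; not the test's actual code. */
final class CursorEncodingSketch {

  /** Appends the cursor as a query parameter, percent-encoding '+', '/' and '='. */
  static String withCursor(String baseQuery, String cursor) {
    return baseQuery + "&cursor=" + URLEncoder.encode(cursor, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // A made-up Base64-looking cursor containing all three problematic characters.
    System.out.println(withCursor("entityTypes=table", "a+b/c=="));
  }
}
```

Without the encoding step, a `+` in the cursor may arrive at the server as a space, and the decoded cursor would no longer match what the previous page returned.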
Fixes #27227
Summary
- When `columnNamePattern` is set, switch from composite aggregation to terms aggregation with `include` regex. ES/OS filters at the aggregation level, so only matching column names produce buckets.

How it works: Two-phase terms aggregation
1. `terms` agg with `include` regex, `size=10000`, ordered by `_key` asc → returns all matching column names + accurate total count in a single fast query
2. `terms` agg with `include` = exact page names + `top_hits` → fetches full entity data for only the 25 names on the current page

Why terms agg `include(regex)` works even with flat objects (columns are not nested):
- `include(regex)` tests each ordinal independently against the regex; it doesn't matter that multiple values came from the same document

Non-search path (no `columnNamePattern`): Unchanged; still uses composite aggregation with cursor-based pagination.

Approaches considered and rejected
1. Composite agg + Java post-filter (previous approach, the bug)
   - `String.contains()` after
2. Composite agg with query-level `regexp` filter
   - `regexp` query on `columns.name.keyword` to pre-filter documents before aggregation
3. Composite agg + filter sub-agg + `bucket_selector` (elastic/elasticsearch#29079)
   - `bucket_selector` pipeline agg to drop non-matching buckets
   - `bucket_selector` is officially unsupported with composite (ES docs)
4. Composite agg with runtime field + conditional `emit()`
   - `emit()`, composite paginates with `after_key`
5. Terms agg `include(regex)` + `exclude(array)` for pagination
   - `include(regex)`. Next request adds `exclude([...previously seen names...])` to get the next batch
   - `include` with array-based `exclude` is not supported on OpenSearch. Feature was added in ES 7.11 (elastic/elasticsearch#63325), but OpenSearch forked from ES 7.10.2, before this was merged
6. Terms agg `include(partition/num_partitions)` + query-level `regexp`
   - `partition` and `regex` share the same `include` parameter, so they are mutually exclusive. And query-level regexp has the same flat-object problem as Approach 2
7. Composite agg with `include/exclude` on terms source
include/excludeon terms sourceWhy 10,000 cap on matching names
size— there is no cursor/pagination mechanismpartition/num_partitionscan't be combined withinclude(regex)(same field)Files changed
| File | Description |
|---|---|
| `ColumnAggregator.java` | `toCaseInsensitiveRegex()` utility (Lucene regex doesn't support `(?i)`, so "MAT" → `.*[mM][aA][tT].*`) |
| `ElasticSearchColumnAggregator.java` | `aggregateColumnsWithPattern()`, `executeNamesQuery()`, `executePageDataQuery()`. Extracted shared `parseBucketHits()` and `applyTagPostFilter()` to avoid duplication. Offset-based cursor for search pagination |
| `OpenSearchColumnAggregator.java` | |
| `ColumnAggregatorTest.java` | `toCaseInsensitiveRegex`: case insensitivity, special char escaping, edge cases |

Test plan
- `ColumnAggregatorTest`: 8 unit tests for regex generation (all pass)
- `ColumnMetadataGrouperTest`: 7 existing tests still pass (no regression)
- `mvn compile`: clean build

🤖 Generated with Claude Code
Summary by Gitar
- `ColumnGridResourceIT` to use `await()` polling for search index consistency
- `untilAsserted` to eliminate race conditions in `test_getColumnGrid_patternPlusGlossaryFilter` and `test_getColumnGrid_glossaryFilter_onlyReturnsGlossaryOccurrences`

This will update automatically on new commits.