fix(sampler): Respect randomizedSample flag at 100% percentage sampling (#26966)
Conversation
When profileSample is 100% with PERCENTAGE type, the sampler short-circuits and returns the raw dataset without any randomization, even when randomizedSample is True (the default). Split the combined condition so that:
- No profileSample set -> return raw dataset (no sampling configured)
- 100% PERCENTAGE + randomizedSample=False -> return raw dataset (optimization)
- 100% PERCENTAGE + randomizedSample=True -> go through the normal sampling path, which applies RandomNumFn/df.sample for proper row shuffling

Fixes open-metadata#21304
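The three-way split above can be sketched as follows. This is a minimal illustration, assuming simplified stand-ins for the real ingestion models (`SampleConfig` and `ProfileSampleType` here are not the actual classes):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class ProfileSampleType(Enum):
    PERCENTAGE = "PERCENTAGE"
    ROWS = "ROWS"


@dataclass
class SampleConfig:
    profileSample: Optional[Union[float, int]] = None
    profileSampleType: Optional[ProfileSampleType] = ProfileSampleType.PERCENTAGE
    randomizedSample: Optional[bool] = True


def should_use_raw_dataset(cfg: SampleConfig) -> bool:
    # Case 1: no sampling configured at all
    if not cfg.profileSample:
        return True
    # Case 2: 100% PERCENTAGE with randomization explicitly disabled
    if (
        cfg.profileSampleType == ProfileSampleType.PERCENTAGE
        and cfg.profileSample == 100
        and cfg.randomizedSample is False
    ):
        return True
    # Case 3: everything else goes through the sampling path
    return False
```

Note that only an explicit `False` takes the fast path; both `True` and `None` (unset) fall through to randomized sampling.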
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Let us know if you need any help!
Pull request overview
This PR aims to fix sampling behavior in ingestion samplers so that when profileSampleType=PERCENTAGE and profileSample=100, the randomizedSample flag is respected (i.e., the dataset should still be randomized when randomizedSample=True instead of short-circuiting to the raw dataset).
Changes:
- Split the “no sampling” vs “100% percentage” short-circuit checks in the SQLAlchemy sampler `get_dataset()`, gating the 100% fast-path on `randomizedSample=False`.
- Apply the same conditional split in the Pandas sampler `get_dataset()`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| ingestion/src/metadata/sampler/sqlalchemy/sampler.py | Adjusts 100% percentage short-circuit logic to consider randomizedSample. |
| ingestion/src/metadata/sampler/pandas/sampler.py | Adjusts 100% percentage short-circuit logic to consider randomizedSample. |
```python
if (
    self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
    and self.sample_config.profileSample == 100
    and not self.sample_config.randomizedSample
```
randomizedSample is typed as Optional[bool]; using not self.sample_config.randomizedSample will treat None the same as False and trigger the 100%-optimization path unexpectedly. To preserve the intended default behavior, consider checking explicitly for is False (or normalizing randomizedSample to a concrete bool when building SampleConfig).
```diff
-    and not self.sample_config.randomizedSample
+    and self.sample_config.randomizedSample is False
```
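The suggested `is False` check addresses a real `Optional[bool]` pitfall that is easy to demonstrate in isolation:

```python
from typing import Optional


def fast_path_with_not(flag: Optional[bool]) -> bool:
    # `not flag` is truthy for both False and None
    return not flag


def fast_path_with_is_false(flag: Optional[bool]) -> bool:
    # `is False` matches only an explicit False
    return flag is False


# None (the unset default) takes the fast path under `not`,
# but correctly falls through to randomized sampling under `is False`.
```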
```python
if (
    self.sample_config.profileSample == 100
    and self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
    and not self.sample_config.randomizedSample
```
randomizedSample is Optional[bool]. The condition not self.sample_config.randomizedSample will treat None as disabled randomization and return the raw dataset at 100% PERCENTAGE, which can reintroduce deterministic sampling if configs ever provide null/unset here. Consider checking self.sample_config.randomizedSample is False (or coercing None to the desired default when building SampleConfig).
```diff
-    and not self.sample_config.randomizedSample
+    and self.sample_config.randomizedSample is False
```
```python
if not self.sample_config.profileSample:
    if self.partition_details:
        return self._partitioned_table()
    return self.raw_dataset
if (
    self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
    and self.sample_config.profileSample == 100
    and not self.sample_config.randomizedSample
```
This PR changes behavior specifically for profileSampleType=PERCENTAGE with profileSample=100 based on randomizedSample. There are existing unit tests for both samplers, but none appear to cover the 100% PERCENTAGE edge case; adding tests for (a) 100% + randomizedSample=True producing shuffled/non-deterministic sample rows and (b) 100% + randomizedSample=False keeping the fast path would help prevent regressions.
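A sketch of what those regression tests could look like. `FakeSampler` is a hypothetical stand-in for illustration only; the real tests would exercise `SQASampler`/`DatalakeSampler` with mocks:

```python
import random
from typing import Optional


class FakeSampler:
    """Stand-in mimicking the three-way get_dataset() branch under review."""

    def __init__(self, rows, profile_sample=None, randomized: Optional[bool] = True):
        self.rows = rows
        self.profile_sample = profile_sample
        self.randomized = randomized

    def get_dataset(self, rng: random.Random):
        if not self.profile_sample:
            return self.rows  # no sampling configured
        if self.profile_sample == 100 and self.randomized is False:
            return self.rows  # 100% fast path, natural order preserved
        shuffled = list(self.rows)
        rng.shuffle(shuffled)  # randomized sampling path
        return shuffled


def test_100_pct_randomized_true_shuffles():
    rows = list(range(100))
    out = FakeSampler(rows, profile_sample=100, randomized=True).get_dataset(
        random.Random(7)
    )
    assert sorted(out) == rows and out != rows  # same rows, different order


def test_100_pct_randomized_false_keeps_fast_path():
    rows = list(range(100))
    out = FakeSampler(rows, profile_sample=100, randomized=False).get_dataset(
        random.Random(7)
    )
    assert out == rows  # natural order, no shuffle
```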
- Fix randomizedSample check from 'not' to 'is False' in both SQASampler and DatalakeSampler to correctly handle None (Optional[bool] default=True)
- Add unit tests verifying 100% PERCENTAGE behavior for randomizedSample values True, False, and None
…tion The get_dataset() fix ensures 100% PERCENTAGE + randomizedSample routes through get_sample_query() which produces a CTE with a random column. Now fetch_sample_data() detects that column and applies ORDER BY before LIMIT, so each call returns a different subset of rows. Also add real-DB integration tests using SQLite for the 100% PERCENTAGE edge case (True, False, None).
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
ingestion/src/metadata/sampler/sqlalchemy/sampler.py:213
- With this change, `profileSampleType=PERCENTAGE` and `profileSample=100` will route through `get_sample_query()` whenever `randomizedSample` is not explicitly `False` (default is `True`). `get_sample_query()` adds a RandomNumFn/modulo expression + extra CTE layers even though the effective sample is still 100% of rows, and `get_dataset()` is used broadly for profiler metrics and data quality runners (not just sample-data display). Consider keeping the full-table fast-path for metric computation and applying randomization only when extracting sample rows (e.g., handle this edge case inside `fetch_sample_data()` or via a dedicated flag/entrypoint for “sample data” vs “profiling dataset”).
```python
if not self.sample_config.profileSample:
    if self.partition_details:
        return self._partitioned_table()
    return self.raw_dataset
if (
    self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
    and self.sample_config.profileSample == 100
    and self.sample_config.randomizedSample is False
):
    if self.partition_details:
        return self._partitioned_table()
    return self.raw_dataset
return self.get_sample_query(column=column)
```
```python
# Add new RandomNumFn column
ds = self.get_dataset()
ds_columns = inspect(ds).c
random_column = next(
    (col for col in ds_columns if col.name == RANDOM_LABEL), None
)
```
The comment # Add new RandomNumFn column is no longer accurate here: get_dataset() may return the raw dataset (no random column) or a sampling CTE depending on config. Updating the comment to reflect that we’re detecting an existing random column (and only ordering when present) would avoid confusion for future maintenance.
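The intended behavior, detect an existing random column and order by it before limiting, can be seen end to end with a tiny SQLite example. This is illustrative only: the PR builds the query with SQLAlchemy and a `RANDOM_LABEL` column, while this sketch uses raw SQL and SQLite's `RANDOM()`:

```python
import sqlite3

# Build a sampling CTE with a random column, then ORDER BY that column before
# LIMIT so the sampled rows differ between calls.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER)")
conn.executemany("INSERT INTO users VALUES (?)", [(i,) for i in range(1000)])

query = """
WITH sampled AS (
    SELECT id, RANDOM() AS random_col FROM users
)
SELECT id FROM sampled ORDER BY random_col LIMIT 5
"""
first = [row[0] for row in conn.execute(query)]
second = [row[0] for row in conn.execute(query)]
# With 1000 rows, two draws of 5 are overwhelmingly likely to differ.
```

Random-number syntax varies by dialect (`RANDOM()`, `RAND()`, etc.), which is exactly why the PR keys off the already-labeled random column rather than re-emitting a random expression in the outer query.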
```python
import pytest
```
pytest is imported but never used in this test module. Removing the unused import will keep the test file clean and avoid potential lint failures if/when test linting is enabled.
```diff
-import pytest
```
🛡️ TRIVY SCAN RESULT 🛡️
Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.12.7 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-core | CVE-2025-52999 | 🚨 HIGH | 2.13.4 | 2.15.0 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42003 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4.2 |
| com.fasterxml.jackson.core:jackson-databind | CVE-2022-42004 | 🚨 HIGH | 2.12.7 | 2.12.7.1, 2.13.4 |
| com.google.code.gson:gson | CVE-2022-25647 | 🚨 HIGH | 2.2.4 | 2.8.9 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.3.0 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.3.0 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.3.0 | 3.25.5, 4.27.5, 4.28.2 |
| com.google.protobuf:protobuf-java | CVE-2021-22569 | 🚨 HIGH | 3.7.1 | 3.16.1, 3.18.2, 3.19.2 |
| com.google.protobuf:protobuf-java | CVE-2022-3509 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2022-3510 | 🚨 HIGH | 3.7.1 | 3.16.3, 3.19.6, 3.20.3, 3.21.7 |
| com.google.protobuf:protobuf-java | CVE-2024-7254 | 🚨 HIGH | 3.7.1 | 3.25.5, 4.27.5, 4.28.2 |
| com.nimbusds:nimbus-jose-jwt | CVE-2023-52428 | 🚨 HIGH | 9.8.1 | 9.37.2 |
| com.squareup.okhttp3:okhttp | CVE-2021-0341 | 🚨 HIGH | 3.12.12 | 4.9.2 |
| commons-beanutils:commons-beanutils | CVE-2025-48734 | 🚨 HIGH | 1.9.4 | 1.11.0 |
| commons-io:commons-io | CVE-2024-47554 | 🚨 HIGH | 2.8.0 | 2.14.0 |
| dnsjava:dnsjava | CVE-2024-25638 | 🚨 HIGH | 2.1.7 | 3.6.0 |
| io.airlift:aircompressor | CVE-2025-67721 | 🚨 HIGH | 0.27 | 2.0.3 |
| io.netty:netty-codec-http | CVE-2026-33870 | 🚨 HIGH | 4.1.96.Final | 4.1.132.Final, 4.2.10.Final |
| io.netty:netty-codec-http2 | CVE-2025-55163 | 🚨 HIGH | 4.1.96.Final | 4.2.4.Final, 4.1.124.Final |
| io.netty:netty-codec-http2 | CVE-2026-33871 | 🚨 HIGH | 4.1.96.Final | 4.1.132.Final, 4.2.11.Final |
| io.netty:netty-codec-http2 | GHSA-xpw8-rcwv-8f8p | 🚨 HIGH | 4.1.96.Final | 4.1.100.Final |
| io.netty:netty-handler | CVE-2025-24970 | 🚨 HIGH | 4.1.96.Final | 4.1.118.Final |
| net.minidev:json-smart | CVE-2021-31684 | 🚨 HIGH | 1.3.2 | 1.3.3, 2.4.4 |
| net.minidev:json-smart | CVE-2023-1370 | 🚨 HIGH | 1.3.2 | 2.4.9 |
| org.apache.avro:avro | CVE-2024-47561 | 🔥 CRITICAL | 1.7.7 | 1.11.4 |
| org.apache.avro:avro | CVE-2023-39410 | 🚨 HIGH | 1.7.7 | 1.11.3 |
| org.apache.derby:derby | CVE-2022-46337 | 🔥 CRITICAL | 10.14.2.0 | 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0 |
| org.apache.ivy:ivy | CVE-2022-46751 | 🚨 HIGH | 2.5.1 | 2.5.2 |
| org.apache.mesos:mesos | CVE-2018-1330 | 🚨 HIGH | 1.4.3 | 1.6.0 |
| org.apache.spark:spark-core_2.12 | CVE-2025-54920 | 🚨 HIGH | 3.5.6 | 3.5.7 |
| org.apache.thrift:libthrift | CVE-2019-0205 | 🚨 HIGH | 0.12.0 | 0.13.0 |
| org.apache.thrift:libthrift | CVE-2020-13949 | 🚨 HIGH | 0.12.0 | 0.14.0 |
| org.apache.zookeeper:zookeeper | CVE-2023-44981 | 🔥 CRITICAL | 3.6.3 | 3.7.2, 3.8.3, 3.9.1 |
| org.eclipse.jetty:jetty-server | CVE-2024-13009 | 🚨 HIGH | 9.4.56.v20240826 | 9.4.57.v20241219 |
| org.lz4:lz4-java | CVE-2025-12183 | 🚨 HIGH | 1.8.0 | 1.8.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (24)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| Authlib | CVE-2026-27962 | 🔥 CRITICAL | 1.6.6 | 1.6.9 |
| Authlib | CVE-2026-28490 | 🚨 HIGH | 1.6.6 | 1.6.9 |
| Authlib | CVE-2026-28498 | 🚨 HIGH | 1.6.6 | 1.6.9 |
| Authlib | CVE-2026-28802 | 🚨 HIGH | 1.6.6 | 1.6.7 |
| PyJWT | CVE-2026-32597 | 🚨 HIGH | 2.11.0 | 2.12.0 |
| Werkzeug | CVE-2024-34069 | 🚨 HIGH | 2.2.3 | 3.0.3 |
| aiohttp | CVE-2025-69223 | 🚨 HIGH | 3.12.12 | 3.13.3 |
| apache-airflow | CVE-2026-26929 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| apache-airflow | CVE-2026-28779 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| apache-airflow | CVE-2026-30911 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| apache-airflow-providers-http | CVE-2025-69219 | 🚨 HIGH | 5.6.4 | 6.0.0 |
| cryptography | CVE-2026-26007 | 🚨 HIGH | 42.0.8 | 46.0.5 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 5.3.0 | 6.1.0 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 6.0.1 | 6.1.0 |
| protobuf | CVE-2026-0994 | 🚨 HIGH | 4.25.8 | 6.33.5, 5.29.6 |
| pyOpenSSL | CVE-2026-27459 | 🚨 HIGH | 24.1.0 | 26.0.0 |
| pyasn1 | CVE-2026-30922 | 🚨 HIGH | 0.6.2 | 0.6.3 |
| ray | CVE-2025-62593 | 🔥 CRITICAL | 2.47.1 | 2.52.0 |
| starlette | CVE-2025-62727 | 🚨 HIGH | 0.48.0 | 0.49.1 |
| tornado | CVE-2026-31958 | 🚨 HIGH | 6.5.4 | 6.5.5 |
| urllib3 | CVE-2025-66418 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2025-66471 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2026-21441 | 🚨 HIGH | 1.26.20 | 2.6.3 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: usr/bin/docker
Vulnerabilities (2)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| stdlib | CVE-2025-68121 | 🔥 CRITICAL | v1.25.6 | 1.24.13, 1.25.7, 1.26.0-rc.3 |
| stdlib | CVE-2026-25679 | 🚨 HIGH | v1.25.6 | 1.25.8, 1.26.1 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target:
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| libpng-dev | CVE-2026-33416 | 🚨 HIGH | 1.6.39-2+deb12u3 | 1.6.39-2+deb12u4 |
| libpng-dev | CVE-2026-33636 | 🚨 HIGH | 1.6.39-2+deb12u3 | 1.6.39-2+deb12u4 |
| libpng16-16 | CVE-2026-33416 | 🚨 HIGH | 1.6.39-2+deb12u3 | 1.6.39-2+deb12u4 |
| libpng16-16 | CVE-2026-33636 | 🚨 HIGH | 1.6.39-2+deb12u3 | 1.6.39-2+deb12u4 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: Node.js
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: Python
Vulnerabilities (13)
| Package | Vulnerability ID | Severity | Installed Version | Fixed Version |
|---|---|---|---|---|
| apache-airflow | CVE-2026-26929 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| apache-airflow | CVE-2026-28779 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| apache-airflow | CVE-2026-30911 | 🚨 HIGH | 3.1.7 | 3.1.8 |
| cryptography | CVE-2026-26007 | 🚨 HIGH | 42.0.8 | 46.0.5 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 5.3.0 | 6.1.0 |
| jaraco.context | CVE-2026-23949 | 🚨 HIGH | 6.0.1 | 6.1.0 |
| pyOpenSSL | CVE-2026-27459 | 🚨 HIGH | 24.1.0 | 26.0.0 |
| starlette | CVE-2025-62727 | 🚨 HIGH | 0.48.0 | 0.49.1 |
| urllib3 | CVE-2025-66418 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2025-66471 | 🚨 HIGH | 1.26.20 | 2.6.0 |
| urllib3 | CVE-2026-21441 | 🚨 HIGH | 1.26.20 | 2.6.3 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
| wheel | CVE-2026-24049 | 🚨 HIGH | 0.45.1 | 0.46.2 |
🛡️ TRIVY SCAN RESULT 🛡️
Target: /etc/ssl/private/ssl-cert-snakeoil.key
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/extended_sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/lineage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_data_aut.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.json
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage.yaml
No Vulnerabilities Found
🛡️ TRIVY SCAN RESULT 🛡️
Target: /ingestion/pipelines/sample_usage_aut.yaml
No Vulnerabilities Found
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
TeddyCr
left a comment
Thanks for the PR, could you take a look at the comment?
```python
)
query = client.query(*select_columns).select_from(ds)
if random_column is not None:
    query = query.order_by(random_column)
```
this should be moved to the sampling query itself, so that this method's responsibility is only to fetch the sample data. See comment here: https://github.com/open-metadata/OpenMetadata/pull/26966/changes#r3029273293
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
ingestion/src/metadata/sampler/sqlalchemy/sampler.py:234
- `fetch_sample_data()` always applies `LIMIT` without an `ORDER BY`. Even if `get_dataset()` returns a sampled CTE with a random column, relying on ordering inside the CTE is not sufficient to guarantee randomized results across SQL dialects. Consider detecting the presence of the random column in `ds` and applying an outer `ORDER BY` on it before the `LIMIT` is applied.
```python
ds = self.get_dataset()
if not columns:
    sqa_columns = [col for col in inspect(ds).c if col.name != RANDOM_LABEL]
else:
    # we can't directly use columns as it is bound to self.raw_dataset and not the rnd table.
```
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
ingestion/src/metadata/sampler/sqlalchemy/sampler.py:233
- `fetch_sample_data()` applies `LIMIT` on the dataset returned by `get_dataset()` but does not apply an `ORDER BY` on the random column when the dataset is a sampling CTE. Relying on an `ORDER BY` inside the CTE (from `get_sample_query()`) is not guaranteed to affect the outer query's row order, so the `LIMIT` may still consistently return the first N rows for some dialects/optimizers.
To make the behavior deterministic across dialects, apply `ORDER BY <cte>.random` (or `ORDER BY RANDOM_LABEL`) in `fetch_sample_data()` when `RANDOM_LABEL` is present on `ds` and `randomizedSample` is enabled, before applying the `LIMIT`.
```python
ds = self.get_dataset()
if not columns:
    sqa_columns = [col for col in inspect(ds).c if col.name != RANDOM_LABEL]
else:
```
```diff
 fullyQualifiedName: FullyQualifiedEntityName
 profileSample: Optional[Union[float, int]] = None
 profileSampleType: Optional[ProfileSampleType] = None
 samplingMethodType: Optional[SamplingMethodType] = None
 sampleDataCount: Optional[int] = 100
-randomizedSample: Optional[bool] = True
+randomizedSample: Optional[bool] = False
```
This PR changes randomizedSample defaults from True to False in the ingestion Python models (BaseProfileConfig, TableConfig, SampleConfig). That’s a behavior change well beyond “respect the flag at 100% percentage sampling” and also conflicts with the PR description/issue context that states the default is True.
If the goal is only to fix the 100% edge case, these defaults should remain True and the logic should be adjusted to respect explicit False. If the goal is to change the product default, the PR title/description and upgrade notes should explicitly call that out (and ensure UI + server-side defaults remain consistent).
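One low-risk way to keep the documented `True` default while still handling `Optional[bool]` cleanly is to normalize `None` once at the boundary, so downstream checks never need to distinguish `None` from `True`. This is an illustration, not the PR's code:

```python
from typing import Optional


def normalize_randomized_sample(value: Optional[bool], default: bool = True) -> bool:
    """Collapse an Optional[bool] to a concrete bool: None means 'use default'."""
    return default if value is None else value
```

With this in place, short-circuit logic can safely use plain truthiness, because `None` never reaches it.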
```diff
 "randomizedSample": {
     "description": "Whether to randomize the sample data or not.",
     "type": "boolean",
-    "default": true
+    "default": false
 },
```
The JSON Schema default for randomizedSample is changed from true to false. This flips the default behavior for profiler sampling configuration and contradicts the PR description that calls out randomizedSample=True as the default.
If you’re only fixing the 100% sampling bug, keep the schema default as true and adjust the short-circuit logic accordingly. If you intend to change the default platform behavior, please update the PR description and ensure corresponding migrations/release notes cover the breaking behavior change.
```diff
 "randomizedSample": {
     "description": "Whether to randomize the sample data or not.",
     "type": "boolean",
-    "default": true
+    "default": false
 },
```
The schema default for tableProfilerConfig.randomizedSample is changed from true to false, which is a breaking behavioral change and not implied by the PR title (“Respect randomizedSample flag at 100% percentage sampling”). This also conflicts with the PR description stating the default is true.
Please either revert this default change (and keep the fix scoped to 100% sampling), or explicitly document this default flip and ensure it’s consistently applied across all config sources and migrations.
```sql
-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
  AND pipelineType = 'profiler';

UPDATE table_entity
SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;

UPDATE database_entity
SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;

UPDATE database_schema_entity
SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;
```
This migration flips any stored randomizedSample=true to false, which will override existing user choices and silently change runtime behavior. If you only want a new default, avoid rewriting explicit values; migrate only missing/null fields if needed.
If the intent is to globally disable randomization, please treat this as a breaking change and document it clearly instead of silently mutating config JSON.
```diff
--- Set randomizedSample to false where it was true (old default behavior)
-UPDATE ingestion_pipeline_entity
-SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
-WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
-  AND pipelineType = 'profiler';
-UPDATE table_entity
-SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
-WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;
-UPDATE database_entity
-SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
-WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;
-UPDATE database_schema_entity
-SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
-WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;
+-- Preserve existing randomizedSample values during migration.
+-- Any new default for randomizedSample must be applied only to missing/null
+-- fields in application logic or a separate, explicitly documented migration.
```
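The "backfill only missing/null fields" alternative can be prototyped on SQLite (the real migrations target MySQL/PostgreSQL JSON functions; the table and path names here follow the PR's migration scripts):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_entity (json TEXT)")
rows = [
    {"tableProfilerConfig": {"randomizedSample": True}},  # explicit user choice
    {"tableProfilerConfig": {}},                          # legacy row, field unset
]
conn.executemany(
    "INSERT INTO table_entity VALUES (?)", [(json.dumps(r),) for r in rows]
)
# Only rows where the field is missing get the new default; explicit values survive.
conn.execute(
    """
    UPDATE table_entity
    SET json = json_set(json, '$.tableProfilerConfig.randomizedSample', false)
    WHERE json_extract(json, '$.tableProfilerConfig.randomizedSample') IS NULL
    """
)
values = [
    row[0]
    for row in conn.execute(
        "SELECT json_extract(json, '$.tableProfilerConfig.randomizedSample')"
        " FROM table_entity"
    )
]
# values[0] == 1: explicit True preserved; values[1] == 0: unset row backfilled
```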
```python
def test_full_percentage_none_randomized_skips_sample_query(self, sampler_mock):
    """100% PERCENTAGE + randomizedSample=None should short-circuit
    (only explicit True enables randomization)."""
    with patch.object(SQASampler, "build_table_orm", return_value=User):
        sampler = SQASampler(
            service_connection_config=self.sqlite_conn,
            ometa_client=None,
            entity=None,
            sample_config=SampleConfig(
                profileSampleType=ProfileSampleType.PERCENTAGE,
                profileSample=100,
                randomizedSample=None,
            ),
```
These new tests assert that randomizedSample=None should short-circuit (i.e., behave like False). This conflicts with the PR description’s stated intent of handling Optional[bool] correctly by only skipping randomization when randomizedSample is False (so None follows the default behavior).
Either adjust the implementation to treat None as “use default” (and update these tests accordingly), or explicitly document and enforce the new “opt-in randomization” contract across schema defaults, models, and migration strategy.
```diff
 "randomizedSample": {
     "description": "Whether to randomize the sample data or not.",
     "type": "boolean",
-    "default": true
+    "default": false
 }
```
databaseProfilerConfig.randomizedSample default is changed from true to false, which flips default sampling behavior for database-level profiler configs. This is a breaking behavior change and is not reflected in the PR title/description (which states the default is true).
Please either revert this default flip (keeping the PR scoped to the 100% sampling bug) or explicitly document the behavior change and ensure all layers (schemas, ingestion models, UI defaults, migrations) are updated consistently.
```diff
 "randomizedSample": {
     "description": "Whether to randomize the sample data or not.",
     "type": "boolean",
-    "default": true
+    "default": false
 }
```
databaseSchemaProfilerConfig.randomizedSample default is changed from true to false, which flips default sampling behavior for schema-level profiler configs. This is a breaking behavior change and conflicts with the PR description’s claim that the default is true.
Please clarify whether the PR is intended to change defaults platform-wide; if not, revert this schema default change and keep the fix limited to respecting explicit randomizedSample at 100% PERCENTAGE.
Hi @TeddyCr! I've addressed all your requested changes. However, flipping the default from `True` to `False` surfaced three follow-up issues:

1. Unit tests: two test cases contain duplicate assertions. Proposed fix: remove the duplicated assertions.
2. TypeScript snapshot: the TS type-generation test is failing because changing the schema default alters the generated types. Proposed fix: regenerate and update the affected snapshot files.
3. SQL migration: a data integrity concern (not blocking CI, but important). The migration scripts for both MySQL and PostgreSQL currently have `WHERE JSON_EXTRACT(json, '$.randomizedSample') = true`. The intent is presumably to backfill legacy rows (rows created before this field existed) with the new default, but matching `= true` rewrites explicit user choices instead of only filling missing values. Proposed fix: change the condition to match only missing/null fields.

Planning to apply all three fixes. Let me know if you'd prefer a different approach for any of them, especially the migration; happy to drop it entirely if you think it's out of scope for this PR.
@RajdeepKushwaha5 can you address the test failures please
BigQuery sampling tests create SampleConfig without setting randomizedSample, which now defaults to False. Since ORDER BY is only added when randomizedSample is True, the expected query strings should not include ORDER BY. Also fix inaccurate docstring in test_sample.py.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (2)
ingestion/src/metadata/sampler/models.py:132
- `SampleConfig.randomizedSample` default is changed to `False`, which alters sampling behavior for any call sites that rely on `SampleConfig()` defaults (including when no per-entity config is found). If randomization is still meant to be the default, revert this to `True` and treat `None` as default behavior rather than as `False`.
```python
class SampleConfig(ConfigModel):
    """Profile Sample Config"""

    profileSample: Optional[Union[float, int]] = None
    profileSampleType: Optional[ProfileSampleType] = ProfileSampleType.PERCENTAGE
    samplingMethodType: Optional[SamplingMethodType] = None
    randomizedSample: Optional[bool] = False
```
ingestion/src/metadata/sampler/sqlalchemy/sampler.py:214
- The 100% PERCENTAGE short-circuit uses `randomizedSample is not True`, which treats `None` the same as `False`. This contradicts the PR description's stated `Optional[bool]` semantics (only skip when explicitly disabled). If `None` should behave like the default, use an explicit `is False` check for the short-circuit (and treat `None` as "randomize").
```python
if (
    self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
    and self.sample_config.profileSample == 100
    and self.sample_config.randomizedSample is not True
):
    if self.partition_details:
        return self._partitioned_table()
    return self.raw_dataset
```
Increase fetch_sample_data loop from 10 to 20 iterations to further reduce the theoretical probability of a false failure in the randomized ordering test.
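The iteration-count tradeoff can be sanity-checked with a back-of-envelope model. This is an illustration under a simple independence assumption, not a statement about the actual test's distribution:

```python
# If each fetch returns one of `num_orderings` equally likely orderings, the
# chance that all `num_fetches` fetches coincide (a false failure of the
# non-determinism test) shrinks geometrically with the number of fetches.
def prob_all_identical(num_orderings: int, num_fetches: int) -> float:
    return (1 / num_orderings) ** (num_fetches - 1)


# Even with just two distinguishable orderings, moving from 10 to 20 fetches
# drops the false-failure probability from roughly 2e-3 to roughly 2e-6.
```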
Code Review 👍 Approved with suggestions (3 resolved / 4 findings)

Fixes the randomizedSample flag to be respected at 100% sampling by resolving duplicate assertions, integration test contradictions, and docstring mismatches. Consider refactoring the non-determinism test to reduce flakiness inherent in its probabilistic assertions.

💡 Edge Case: Non-determinism test is inherently flaky
📄 ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py:451-454

✅ 3 resolved
- ✅ Quality: Duplicate assertions in test methods
- ✅ Bug: Integration test contradicts code: None case will fail
- ✅ Quality: Test docstring contradicts model default for randomizedSample

(Gitar)
Describe your changes:
Fixes #21304
When `profileSample` is set to 100% with `PERCENTAGE` type, both the SQLAlchemy and Pandas samplers short-circuit in `get_dataset()` and return the raw dataset directly, ignoring the `randomizedSample` flag entirely. This means that even when `randomizedSample=True` (the default), data is returned in its natural order instead of being shuffled, defeating the purpose of randomized sampling for AutoClassifier.

Root Cause:
The original condition combined the "no sample configured" and "100% percentage" checks into a single `or` clause, which always bypassed the sampling path at 100% regardless of the `randomizedSample` setting.

What changed:
- `get_dataset()` (both SQLAlchemy and Pandas samplers): split the 100% short-circuit condition to gate on `randomizedSample is False` (not `not randomizedSample`), so that `True` and `None` (the default) correctly fall through to the randomized sampling path.
- `fetch_sample_data()` (SQLAlchemy sampler): when `get_dataset()` returns a sampling CTE with a random column, `fetch_sample_data()` now detects it and applies `ORDER BY` on that column before `LIMIT`, so each call returns a different subset of rows instead of the same first N.
- `Optional[bool]` correctness: `randomizedSample` is typed as `Optional[bool]` with default `True`. Using `not` would treat `None` the same as `False`, incorrectly triggering the optimization path. We use `is False` to only skip randomization when explicitly disabled.

Key advantages of this approach:
- Avoids dialect-specific pitfalls (e.g., `ORDER BY RandomNumFn()`, which compiles to a constant `"0"` on Snowflake/Teradata)
- Correct `Optional[bool]` handling: `is False` instead of `not`, preserving backward compatibility when `randomizedSample` is unset (`None`)

How I tested:
- Unit tests covering `randomizedSample=True`, `False`, `None`, verifying `get_sample_query` is called/skipped correctly
- Unit tests for the `get_dataset()` short-circuit logic
- Tests covering the `is False` vs `not` behavior difference

Files changed:
- `ingestion/src/metadata/sampler/sqlalchemy/sampler.py`: `get_dataset()` condition fix + `fetch_sample_data()` ORDER BY
- `ingestion/src/metadata/sampler/pandas/sampler.py`: `get_dataset()` condition fix
- `ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py`: 3 real-DB integration tests
- `ingestion/tests/unit/sampler/test_sampler_100_pct.py`: 6 mock-based unit tests (3 SQA + 3 Pandas)