Skip to content

fix(sampler): Respect randomizedSample flag at 100% percentage sampling#26966

Merged
TeddyCr merged 23 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/21304-sampler-randomization
Apr 14, 2026
Merged

fix(sampler): Respect randomizedSample flag at 100% percentage sampling#26966
TeddyCr merged 23 commits intoopen-metadata:mainfrom
RajdeepKushwaha5:fix/21304-sampler-randomization

Conversation

@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor

@RajdeepKushwaha5 RajdeepKushwaha5 commented Apr 2, 2026

Describe your changes:

Fixes #21304

When profileSample is set to 100% with PERCENTAGE type, both the SQLAlchemy and Pandas samplers short-circuit in get_dataset() and return the raw dataset directly, ignoring the randomizedSample flag entirely. This means that even when randomizedSample=True (the default), data is returned in its natural order instead of being shuffled — defeating the purpose of randomized sampling for AutoClassifier.

Root Cause:
The original condition combined the "no sample configured" and "100% percentage" checks into a single or clause, which always bypassed the sampling path at 100% regardless of the randomizedSample setting.

What changed:

  1. get_dataset() — both SQLAlchemy and Pandas samplers: Split the 100% short-circuit condition to gate on randomizedSample is False (not not randomizedSample), so that True and None (the default) correctly fall through to the randomized sampling path.

  2. fetch_sample_data() — SQLAlchemy sampler: When get_dataset() returns a sampling CTE with a random column, fetch_sample_data() now detects it and applies ORDER BY on that column before LIMIT, so each call returns a different subset of rows instead of the same first N.

  3. Optional[bool] correctness: randomizedSample is typed as Optional[bool] with default True. Using not would treat None the same as False, incorrectly triggering the optimization path. We use is False to only skip randomization when explicitly disabled.

Key advantages of this approach:

  • Fixes both SQLAlchemy AND Pandas samplers — the Pandas sampler had the same bug but is overlooked by other PRs targeting this issue
  • ORDER BY uses the CTE's existing random column — no dialect compatibility issues (unlike ORDER BY RandomNumFn() which compiles to a constant "0" on Snowflake/Teradata)
  • Minimal, focused changes — no new methods or abstractions, just corrected conditions and proper ORDER BY
  • Correct Optional[bool] handlingis False instead of not, preserving backward compatibility when randomizedSample is unset (None)

How I tested:

  • Real-DB integration tests using SQLite for the 100% PERCENTAGE edge case (randomizedSample=True, False, None) verifying get_sample_query is called/skipped correctly
  • Mock-based unit tests for both SQASampler and DatalakeSampler get_dataset() short-circuit logic
  • Standalone algorithm validation for the is False vs not behavior difference

Files changed:

  • ingestion/src/metadata/sampler/sqlalchemy/sampler.pyget_dataset() condition fix + fetch_sample_data() ORDER BY
  • ingestion/src/metadata/sampler/pandas/sampler.pyget_dataset() condition fix
  • ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py — 3 real-DB integration tests
  • ingestion/tests/unit/sampler/test_sampler_100_pct.py — 6 mock-based unit tests (3 SQA + 3 Pandas)

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

When profileSample is 100% with PERCENTAGE type, the sampler
short-circuits and returns the raw dataset without any randomization,
even when randomizedSample is True (the default).

Split the combined condition so:
- No profileSample set -> return raw dataset (no sampling configured)
- 100% PERCENTAGE + randomizedSample=False -> return raw dataset (optimization)
- 100% PERCENTAGE + randomizedSample=True -> go through normal sampling path
  which applies RandomNumFn/df.sample for proper row shuffling

Fixes open-metadata#21304
Copilot AI review requested due to automatic review settings April 2, 2026 09:34
@RajdeepKushwaha5 RajdeepKushwaha5 requested a review from a team as a code owner April 2, 2026 09:34
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to fix sampling behavior in ingestion samplers so that when profileSampleType=PERCENTAGE and profileSample=100, the randomizedSample flag is respected (i.e., the dataset should still be randomized when randomizedSample=True instead of short-circuiting to the raw dataset).

Changes:

  • Split the “no sampling” vs “100% percentage” short-circuit checks in the SQLAlchemy sampler get_dataset(), gating the 100% fast-path on randomizedSample=False.
  • Apply the same conditional split in the Pandas sampler get_dataset().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
ingestion/src/metadata/sampler/sqlalchemy/sampler.py Adjusts 100% percentage short-circuit logic to consider randomizedSample.
ingestion/src/metadata/sampler/pandas/sampler.py Adjusts 100% percentage short-circuit logic to consider randomizedSample.

Comment thread ingestion/src/metadata/sampler/sqlalchemy/sampler.py
if (
self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
and self.sample_config.profileSample == 100
and not self.sample_config.randomizedSample
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomizedSample is typed as Optional[bool]; using not self.sample_config.randomizedSample will treat None the same as False and trigger the 100%-optimization path unexpectedly. To preserve the intended default behavior, consider checking explicitly for is False (or normalizing randomizedSample to a concrete bool when building SampleConfig).

Suggested change
and not self.sample_config.randomizedSample
and self.sample_config.randomizedSample is False

Copilot uses AI. Check for mistakes.
if (
self.sample_config.profileSample == 100
and self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
and not self.sample_config.randomizedSample
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

randomizedSample is Optional[bool]. The condition not self.sample_config.randomizedSample will treat None as disabled randomization and return the raw dataset at 100% PERCENTAGE, which can reintroduce deterministic sampling if configs ever provide null/unset here. Consider checking self.sample_config.randomizedSample is False (or coercing None to the desired default when building SampleConfig).

Suggested change
and not self.sample_config.randomizedSample
and self.sample_config.randomizedSample is False

Copilot uses AI. Check for mistakes.
Comment on lines +197 to +206
if not self.sample_config.profileSample:
if self.partition_details:
return self._partitioned_table()

return self.raw_dataset

if (
self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
and self.sample_config.profileSample == 100
and not self.sample_config.randomizedSample
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes behavior specifically for profileSampleType=PERCENTAGE with profileSample=100 based on randomizedSample. There are existing unit tests for both samplers, but none appear to cover the 100% PERCENTAGE edge case; adding tests for (a) 100% + randomizedSample=True producing shuffled/non-deterministic sample rows and (b) 100% + randomizedSample=False keeping the fast path would help prevent regressions.

Copilot uses AI. Check for mistakes.
- Fix randomizedSample check from 'not' to 'is False' in both SQASampler
  and DatalakeSampler to correctly handle None (Optional[bool] default=True)
- Add unit tests verifying 100%% PERCENTAGE behavior for randomizedSample
  values True, False, and None
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…tion

The get_dataset() fix ensures 100% PERCENTAGE + randomizedSample routes
through get_sample_query() which produces a CTE with a random column.
Now fetch_sample_data() detects that column and applies ORDER BY before
LIMIT, so each call returns a different subset of rows.

Also add real-DB integration tests using SQLite for the 100% PERCENTAGE
edge case (True, False, None).
Copilot AI review requested due to automatic review settings April 2, 2026 12:49
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

ingestion/src/metadata/sampler/sqlalchemy/sampler.py:213

  • With this change, profileSampleType=PERCENTAGE and profileSample=100 will route through get_sample_query() whenever randomizedSample is not explicitly False (default is True). get_sample_query() adds a RandomNumFn/modulo expression + extra CTE layers even though the effective sample is still 100% of rows, and get_dataset() is used broadly for profiler metrics and data quality runners (not just sample-data display). Consider keeping the full-table fast-path for metric computation and applying randomization only when extracting sample rows (e.g., handle this edge case inside fetch_sample_data() or via a dedicated flag/entrypoint for “sample data” vs “profiling dataset”).
        if not self.sample_config.profileSample:
            if self.partition_details:
                return self._partitioned_table()

            return self.raw_dataset

        if (
            self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
            and self.sample_config.profileSample == 100
            and self.sample_config.randomizedSample is False
        ):
            if self.partition_details:
                return self._partitioned_table()

            return self.raw_dataset

        return self.get_sample_query(column=column)

Comment on lines +227 to +232
# Add new RandomNumFn column
ds = self.get_dataset()
ds_columns = inspect(ds).c
random_column = next(
(col for col in ds_columns if col.name == RANDOM_LABEL), None
)
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment # Add new RandomNumFn column is no longer accurate here: get_dataset() may return the raw dataset (no random column) or a sampling CTE depending on config. Updating the comment to reflect that we’re detecting an existing random column (and only ordering when present) would avoid confusion for future maintenance.

Copilot uses AI. Check for mistakes.
Comment on lines +19 to +20
import pytest

Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest is imported but never used in this test module. Removing the unused import will keep the test file clean and avoids potential lint failures if/when test linting is enabled.

Suggested change
import pytest

Copilot uses AI. Check for mistakes.
Comment thread ingestion/tests/unit/sampler/test_sampler_100_pct.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (24)

Package Vulnerability ID Severity Installed Version Fixed Version
Authlib CVE-2026-27962 🔥 CRITICAL 1.6.6 1.6.9
Authlib CVE-2026-28490 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28498 🚨 HIGH 1.6.6 1.6.9
Authlib CVE-2026-28802 🚨 HIGH 1.6.6 1.6.7
PyJWT CVE-2026-32597 🚨 HIGH 2.11.0 2.12.0
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.7 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.7 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.7 3.1.8
apache-airflow-providers-http CVE-2025-69219 🚨 HIGH 5.6.4 6.0.0
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5, 5.29.6
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
pyasn1 CVE-2026-30922 🚨 HIGH 0.6.2 0.6.3
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
tornado CVE-2026-31958 🚨 HIGH 6.5.4 6.5.5
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: usr/bin/docker

Vulnerabilities (2)

Package Vulnerability ID Severity Installed Version Fixed Version
stdlib CVE-2025-68121 🔥 CRITICAL v1.25.6 1.24.13, 1.25.7, 1.26.0-rc.3
stdlib CVE-2026-25679 🚨 HIGH v1.25.6 1.25.8, 1.26.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 2, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpng-dev CVE-2026-33416 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng-dev CVE-2026-33636 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng16-16 CVE-2026-33416 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4
libpng16-16 CVE-2026-33636 🚨 HIGH 1.6.39-2+deb12u3 1.6.39-2+deb12u4

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (37)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.airlift:aircompressor CVE-2025-67721 🚨 HIGH 0.27 2.0.3
io.netty:netty-codec-http CVE-2026-33870 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.10.Final
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 CVE-2026-33871 🚨 HIGH 4.1.96.Final 4.1.132.Final, 4.2.11.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.spark:spark-core_2.12 CVE-2025-54920 🚨 HIGH 3.5.6 3.5.7
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (13)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2026-26929 🚨 HIGH 3.1.7 3.1.8
apache-airflow CVE-2026-28779 🚨 HIGH 3.1.7 3.1.8
apache-airflow CVE-2026-30911 🚨 HIGH 3.1.7 3.1.8
cryptography CVE-2026-26007 🚨 HIGH 42.0.8 46.0.5
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
pyOpenSSL CVE-2026-27459 🚨 HIGH 24.1.0 26.0.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 2, 2026 17:15
Copy link
Copy Markdown
Collaborator

@TeddyCr TeddyCr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, could you take a look at the comment?

Comment thread ingestion/src/metadata/sampler/sqlalchemy/sampler.py
)
query = client.query(*select_columns).select_from(ds)
if random_column is not None:
query = query.order_by(random_column)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to the sampling query itself, so this method responsibility is to fetch the sample data. see comment here https://github.com/open-metadata/OpenMetadata/pull/26966/changes#r3029273293

Comment thread ingestion/tests/unit/sampler/test_sampler_100_pct.py Outdated
Copilot AI review requested due to automatic review settings April 3, 2026 18:09
TeddyCr
TeddyCr previously approved these changes Apr 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

ingestion/src/metadata/sampler/sqlalchemy/sampler.py:234

  • fetch_sample_data() always applies LIMIT without an ORDER BY. Even if get_dataset() returns a sampled CTE with a random column, relying on ordering inside the CTE is not sufficient to guarantee randomized results across SQL dialects. Consider detecting the presence of the random column in ds and applying an outer ORDER BY on it before the LIMIT is applied.
        ds = self.get_dataset()
        if not columns:
            sqa_columns = [col for col in inspect(ds).c if col.name != RANDOM_LABEL]
        else:
            # we can't directly use columns as it is bound to self.raw_dataset and not the rnd table.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

ingestion/src/metadata/sampler/sqlalchemy/sampler.py:233

  • fetch_sample_data() applies LIMIT on the dataset returned by get_dataset() but does not apply an ORDER BY on the random column when the dataset is a sampling CTE. Relying on an ORDER BY inside the CTE (from get_sample_query()) is not guaranteed to affect the outer query’s row order, so the LIMIT may still consistently return the first N rows for some dialects/optimizers.

To make the behavior deterministic across dialects, apply ORDER BY <cte>.random (or ORDER BY RANDOM_LABEL) in fetch_sample_data() when RANDOM_LABEL is present on ds and randomizedSample is enabled, before applying the LIMIT.

        ds = self.get_dataset()
        if not columns:
            sqa_columns = [col for col in inspect(ds).c if col.name != RANDOM_LABEL]
        else:

Comment on lines 41 to 47
fullyQualifiedName: FullyQualifiedEntityName
profileSample: Optional[Union[float, int]] = None
profileSampleType: Optional[ProfileSampleType] = None
samplingMethodType: Optional[SamplingMethodType] = None
sampleDataCount: Optional[int] = 100
randomizedSample: Optional[bool] = True
randomizedSample: Optional[bool] = False

Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR changes randomizedSample defaults from True to False in the ingestion Python models (BaseProfileConfig, TableConfig, SampleConfig). That’s a behavior change well beyond “respect the flag at 100% percentage sampling” and also conflicts with the PR description/issue context that states the default is True.

If the goal is only to fix the 100% edge case, these defaults should remain True and the logic should be adjusted to respect explicit False. If the goal is to change the product default, the PR title/description and upgrade notes should explicitly call that out (and ensure UI + server-side defaults remain consistent).

Copilot uses AI. Check for mistakes.
Comment on lines 103 to 107
"randomizedSample": {
"description": "Whether to randomize the sample data or not.",
"type": "boolean",
"default": true
"default": false
},
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON Schema default for randomizedSample is changed from true to false. This flips the default behavior for profiler sampling configuration and contradicts the PR description that calls out randomizedSample=True as the default.

If you’re only fixing the 100% sampling bug, keep the schema default as true and adjust the short-circuit logic accordingly. If you intend to change the default platform behavior, please update the PR description and ensure corresponding migrations/release notes cover the breaking behavior change.

Copilot uses AI. Check for mistakes.
Comment on lines 859 to 863
"randomizedSample": {
"description": "Whether to randomize the sample data or not.",
"type": "boolean",
"default": true
"default": false
},
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema default for tableProfilerConfig.randomizedSample is changed from true to false, which is a breaking behavioral change and not implied by the PR title (“Respect randomizedSample flag at 100% percentage sampling”). This also conflicts with the PR description stating the default is true.

Please either revert this default change (and keep the fix scoped to 100% sampling), or explicitly document this default flip and ensure it’s consistently applied across all config sources and migrations.

Copilot uses AI. Check for mistakes.
Comment on lines +6 to +23
-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
AND pipelineType = 'profiler';

UPDATE table_entity
SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;

UPDATE database_entity
SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;

UPDATE database_schema_entity
SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;

Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This migration flips any stored randomizedSample=true to false, which will override existing user choices and silently change runtime behavior. If you only want a new default, avoid rewriting explicit values; migrate only missing/null fields if needed.

If the intent is to globally disable randomization, please treat this as a breaking change and document it clearly instead of silently mutating config JSON.

Suggested change
-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
AND pipelineType = 'profiler';
UPDATE table_entity
SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;
UPDATE database_entity
SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;
UPDATE database_schema_entity
SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;
-- Preserve existing randomizedSample values during migration.
-- Any new default for randomizedSample must be applied only to missing/null
-- fields in application logic or a separate, explicitly documented migration.

Copilot uses AI. Check for mistakes.
Comment on lines +413 to +425
def test_full_percentage_none_randomized_skips_sample_query(self, sampler_mock):
"""100% PERCENTAGE + randomizedSample=None should short-circuit
(only explicit True enables randomization)."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=None,
),
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new tests assert that randomizedSample=None should short-circuit (i.e., behave like False). This conflicts with the PR description’s stated intent of handling Optional[bool] correctly by only skipping randomization when randomizedSample is False (so None follows the default behavior).

Either adjust the implementation to treat None as “use default” (and update these tests accordingly), or explicitly document and enforce the new “opt-in randomization” contract across schema defaults, models, and migration strategy.

Copilot uses AI. Check for mistakes.
Comment on lines 175 to 179
"randomizedSample": {
"description": "Whether to randomize the sample data or not.",
"type": "boolean",
"default": true
"default": false
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

databaseProfilerConfig.randomizedSample default is changed from true to false, which flips default sampling behavior for database-level profiler configs. This is a breaking behavior change and is not reflected in the PR title/description (which states the default is true).

Please either revert this default flip (keeping the PR scoped to the 100% sampling bug) or explicitly document the behavior change and ensure all layers (schemas, ingestion models, UI defaults, migrations) are updated consistently.

Copilot uses AI. Check for mistakes.
Comment on lines 171 to 175
"randomizedSample": {
"description": "Whether to randomize the sample data or not.",
"type": "boolean",
"default": true
"default": false
}
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

databaseSchemaProfilerConfig.randomizedSample default is changed from true to false, which flips default sampling behavior for schema-level profiler configs. This is a breaking behavior change and conflicts with the PR description’s claim that the default is true.

Please clarify whether the PR is intended to change defaults platform-wide; if not, revert this schema default change and keep the fix limited to respecting explicit randomizedSample at 100% PERCENTAGE.

Copilot uses AI. Check for mistakes.
@RajdeepKushwaha5
Copy link
Copy Markdown
Contributor Author

Hi Sir @TeddyCr ! I've addressed all your requested changes:

  • randomizedSample now defaults to False across BaseProfileConfig, TableConfig, and SampleConfig
  • Randomization only activates on an explicit True — the guard uses is not True, so both None and False are treated the same and skip randomization
  • ORDER BY rnd.c.random is placed inside get_sample_query(), gated behind randomizedSample is True, as you suggested
  • JSON schemas updated with "default": false on all affected entities (table.json, database.json, databaseSchema.json, databaseServiceProfilerPipeline.json)
  • Behavioral tests added for the True / False / None cases and for non-determinism verification

However, flipping the default from TrueFalse cascaded into a few breakages I want to flag before pushing further. Three things currently failing:


1. test_bigquery_sampling.py — Python unit tests failing

Two test cases have ORDER BY "...rnd".random added to their expected query strings. But both of those tests construct a plain SampleConfig(profileSampleType=..., profileSample=50.0) with no randomizedSample argument — meaning they inherit the new default of False. With False, get_sample_query() does not emit the ORDER BY, so the expected strings don't match.

Proposed fix: Remove the ORDER BY "...rnd".random line from both expected strings in those two test cases. Those tests are not testing randomization behavior — they were testing sampling percentages — so the expected query should not include ORDER BY.


2. TypeScript snapshot — TS type-generation test failing

Changing "default": true"default": false in the JSON schemas caused a TypeScript snapshot test to fail because the snapshot was previously generated with the old default of true and no longer matches.

Proposed fix: Regenerate and update the affected .snap file so it reflects "default": false. This is a mechanical update with no logic change.


3. SQL migration — data integrity concern (not blocking CI, but important)

The migration scripts for both MySQL and PostgreSQL currently have:

WHERE JSON_EXTRACT(json, '$.randomizedSample') = true

The intent is presumably to backfill legacy rows (rows that were created before this field existed) with the new default. However, using = true means this UPDATE would also overwrite rows where a user had explicitly set randomizedSample: true. That silently destroys their configuration.

Proposed fix: Change the condition to IS NULL in both the MySQL and PostgreSQL migration scripts, so only rows that have no existing value for randomizedSample are touched. Rows that already carry an explicit value — whether true or false — are left untouched.


Planning to apply all three fixes. Let me know if you'd prefer a different approach for any of them, especially the migration — happy to drop it entirely if you think it's out of scope for this PR.

@TeddyCr
Copy link
Copy Markdown
Collaborator

TeddyCr commented Apr 9, 2026

@RajdeepKushwaha5 can you address the test failures please

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.

RajdeepKushwaha5 and others added 2 commits April 12, 2026 00:34
BigQuery sampling tests create SampleConfig without setting
randomizedSample, which now defaults to False. Since ORDER BY
is only added when randomizedSample is True, the expected query
strings should not include ORDER BY.

Also fix inaccurate docstring in test_sample.py.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Comments suppressed due to low confidence (2)

ingestion/src/metadata/sampler/models.py:132

  • SampleConfig.randomizedSample default is changed to False, which alters sampling behavior for any call sites that rely on SampleConfig() defaults (including when no per-entity config is found). If randomization is still meant to be the default, revert this to True and treat None as default behavior rather than as False.
class SampleConfig(ConfigModel):
    """Profile Sample Config"""

    profileSample: Optional[Union[float, int]] = None
    profileSampleType: Optional[ProfileSampleType] = ProfileSampleType.PERCENTAGE
    samplingMethodType: Optional[SamplingMethodType] = None
    randomizedSample: Optional[bool] = False

ingestion/src/metadata/sampler/sqlalchemy/sampler.py:214

  • The 100% PERCENTAGE short-circuit uses randomizedSample is not True, which treats None the same as False. This contradicts the PR description’s stated Optional[bool] semantics (only skip when explicitly disabled). If None should behave like the default, use an explicit is False check for the short-circuit (and treat None as "randomize").
        if (
            self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
            and self.sample_config.profileSample == 100
            and self.sample_config.randomizedSample is not True
        ):
            if self.partition_details:
                return self._partitioned_table()

            return self.raw_dataset

Comment thread ingestion/src/metadata/sampler/models.py
Comment thread ingestion/src/metadata/sampler/pandas/sampler.py
Comment thread ingestion/src/metadata/sampler/sqlalchemy/sampler.py
Comment thread ingestion/src/metadata/sampler/sqlalchemy/sampler.py
Increase fetch_sample_data loop from 10 to 20 iterations to further
reduce the theoretical probability of a false failure in the
randomized ordering test.
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 11, 2026

Code Review 👍 Approved with suggestions 3 resolved / 4 findings

Fixes the randomizedSample flag to be respected at 100% sampling by resolving duplicate assertions, integration test contradictions, and docstring mismatches. Consider refactoring the non-determinism test to reduce flakiness inherent in its probabilistic assertions.

💡 Edge Case: Non-determinism test is inherently flaky

📄 ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py:451-454

The test test_randomized_true_produces_non_deterministic_rows (line 451-454) asserts that at least one of 5 fetch_sample_data() calls returns a different row ordering. With only 5 rows and 5 iterations, there's a small but non-zero probability that all iterations happen to produce the same ordering, especially since ABS(RANDOM()) % 100 can produce duplicate values. This would cause a spurious test failure. Consider increasing the iteration count or using a larger dataset to reduce flakiness.

✅ 3 resolved
Quality: Duplicate assertions in test methods

📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:104-107 📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:121-122
Two test methods contain copy-pasted duplicate assertions:

  • test_100_pct_randomized_true_delegates_to_sampled_dataframe: lines 106-107 are exact duplicates of lines 104-105
  • test_100_pct_randomized_none_delegates_to_sampled_dataframe: line 122 duplicates line 121

These don't cause failures but are unnecessary noise.

Bug: Integration test contradicts code: None case will fail

📄 ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py:413-427 📄 ingestion/src/metadata/sampler/sqlalchemy/sampler.py:206-209 📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:63-68 📄 ingestion/src/metadata/sampler/sqlalchemy/sampler.py:209 📄 ingestion/src/metadata/sampler/pandas/sampler.py:116
The integration test test_full_percentage_none_randomized_uses_sample_query asserts that randomizedSample=None should call get_sample_query (i.e., NOT short-circuit), with the docstring stating "None is not False". However, the actual get_dataset() condition uses not self.sample_config.randomizedSample, and not None evaluates to True — meaning None IS treated the same as False and WILL short-circuit to the raw dataset.

This directly contradicts the assertion assert mock_gsq.called, so this test will fail at runtime.

The unit test test_100_pct_randomized_none_returns_raw_dataset correctly expects short-circuit behavior for None, confirming the integration test is the one that's wrong.

Either:

  1. The integration test should be fixed to match the not behavior (assert not mock_gsq.called), or
  2. The condition should use is False instead of not to distinguish None from False — but then the unit test needs updating.
Quality: Test docstring contradicts model default for randomizedSample

📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:15-16 📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:63 📄 ingestion/tests/unit/sampler/test_sampler_100_pct.py:115
The test docstrings in test_sampler_100_pct.py (lines 15-16, 63, 115) state "None defaults to False" but SampleConfig.randomizedSample is defined as Optional[bool] = True in models.py:130. When None is explicitly passed, not None evaluates to True (short-circuits), which is the tested behavior — but the comment suggests None is the default, which is misleading since the actual default is True. Consider updating the docstrings to say something like "None is treated as non-randomized (falsy)" to avoid confusion.

🤖 Prompt for agents
Code Review: Fixes the randomizedSample flag to be respected at 100% sampling by resolving duplicate assertions, integration test contradictions, and docstring mismatches. Consider refactoring the non-determinism test to reduce flakiness inherent in its probabilistic assertions.

1. 💡 Edge Case: Non-determinism test is inherently flaky
   Files: ingestion/tests/unit/observability/profiler/sqlalchemy/test_sample.py:451-454

   The test `test_randomized_true_produces_non_deterministic_rows` (line 451-454) asserts that at least one of 5 `fetch_sample_data()` calls returns a different row ordering. With only 5 rows and 5 iterations, there's a small but non-zero probability that all iterations happen to produce the same ordering, especially since `ABS(RANDOM()) % 100` can produce duplicate values. This would cause a spurious test failure. Consider increasing the iteration count or using a larger dataset to reduce flakiness.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Randomization of AutoClassifier Sample Data

4 participants