Skip to content
Merged
Show file tree
Hide file tree
Changes from 16 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
7054e63
fix(sampler): respect randomizedSample flag at 100% percentage sampling
RajdeepKushwaha5 Apr 2, 2026
389974e
Address review: use 'is False' for Optional[bool] and add unit tests
RajdeepKushwaha5 Apr 2, 2026
b665d8a
Add ORDER BY on random column in fetch_sample_data for true randomiza…
RajdeepKushwaha5 Apr 2, 2026
34a6c8d
Address review: remove stale comment, unused import, add return asser…
RajdeepKushwaha5 Apr 2, 2026
f746c76
Apply suggestion from @Copilot
TeddyCr Apr 2, 2026
6197729
Address review: move ORDER BY to get_sample_query, clean up fetch_sam…
RajdeepKushwaha5 Apr 2, 2026
68118b1
Address review: None defaults to False for randomizedSample
RajdeepKushwaha5 Apr 2, 2026
0180946
Fix integration test: None should skip sample_query (matches truthine…
RajdeepKushwaha5 Apr 2, 2026
8324863
fix(tests): update BigQuery view sampling expected queries with ORDER BY
RajdeepKushwaha5 Apr 2, 2026
fb8c05a
refactor: use explicit is False for randomizedSample checks
RajdeepKushwaha5 Apr 2, 2026
cf253ac
ci: re-trigger checks after SIGSEGV flake
RajdeepKushwaha5 Apr 2, 2026
53df3cb
refactor: only explicit True randomizes, add non-determinism tests
RajdeepKushwaha5 Apr 3, 2026
bfd09e8
test: increase non-determinism iterations to reduce flakiness
RajdeepKushwaha5 Apr 3, 2026
692f27b
chore: added randomize as false
TeddyCr Apr 3, 2026
e31a335
Merge remote-tracking branch 'upstream/main' into fix/21304-sampler-r…
TeddyCr Apr 3, 2026
3909d3c
Merge remote-tracking branch 'upstream/main' into fix/21304-sampler-r…
TeddyCr Apr 3, 2026
a084392
fix: align randomizedSample defaults with schema (false)
RajdeepKushwaha5 Apr 3, 2026
61b0678
Merge branch 'main' into fix/21304-sampler-randomization
TeddyCr Apr 7, 2026
7636fe3
Merge branch 'main' into fix/21304-sampler-randomization
RajdeepKushwaha5 Apr 7, 2026
3dcfe85
Merge branch 'main' into fix/21304-sampler-randomization
RajdeepKushwaha5 Apr 11, 2026
06f70ea
fix: remove ORDER BY from BigQuery test expectations
RajdeepKushwaha5 Apr 11, 2026
8667db8
Merge branch 'main' into fix/21304-sampler-randomization
RajdeepKushwaha5 Apr 11, 2026
b3bd457
test: increase non-determinism test iterations to reduce flakiness
RajdeepKushwaha5 Apr 11, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,24 @@ SET json = JSON_REMOVE(json, '$.sourceConfig.config.computeMetrics')
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.computeMetrics') IS NOT NULL
AND pipelineType = 'profiler';

-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
AND pipelineType = 'profiler';

UPDATE table_entity
SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;

UPDATE database_entity
SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;

UPDATE database_schema_entity
SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;

Comment thread
TeddyCr marked this conversation as resolved.
Comment on lines +6 to +23
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This migration flips any stored randomizedSample=true to false, which will override existing user choices and silently change runtime behavior. If you only want a new default, avoid rewriting explicit values; migrate only missing/null fields if needed.

If the intent is to globally disable randomization, please treat this as a breaking change and document it clearly instead of silently mutating config JSON.

Suggested change
-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = JSON_SET(json, '$.sourceConfig.config.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.sourceConfig.config.randomizedSample') = true
AND pipelineType = 'profiler';
UPDATE table_entity
SET json = JSON_SET(json, '$.tableProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.tableProfilerConfig.randomizedSample') = true;
UPDATE database_entity
SET json = JSON_SET(json, '$.databaseProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseProfilerConfig.randomizedSample') = true;
UPDATE database_schema_entity
SET json = JSON_SET(json, '$.databaseSchemaProfilerConfig.randomizedSample', false)
WHERE JSON_EXTRACT(json, '$.databaseSchemaProfilerConfig.randomizedSample') = true;
-- Preserve existing randomizedSample values during migration.
-- Any new default for randomizedSample must be applied only to missing/null
-- fields in application logic or a separate, explicitly documented migration.

Copilot uses AI. Check for mistakes.
-- Hard-delete ingestion pipelines for Iceberg services (must run before service migration)
DELETE ipe FROM ingestion_pipeline_entity ipe
JOIN dbservice_entity dse
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,24 @@ SET json = (json::jsonb #- '{sourceConfig,config,computeMetrics}')::json
WHERE json::jsonb -> 'sourceConfig' -> 'config' -> 'computeMetrics' IS NOT NULL
AND pipelineType = 'profiler';

-- Set randomizedSample to false where it was true (old default behavior)
UPDATE ingestion_pipeline_entity
SET json = jsonb_set(json::jsonb, '{sourceConfig,config,randomizedSample}', 'false'::jsonb)::json
WHERE json::jsonb #>> '{sourceConfig,config,randomizedSample}' = 'true'
AND pipelineType = 'profiler';

UPDATE table_entity
SET json = jsonb_set(json::jsonb, '{tableProfilerConfig,randomizedSample}', 'false'::jsonb)::json
WHERE json::jsonb #>> '{tableProfilerConfig,randomizedSample}' = 'true';

UPDATE database_entity
SET json = jsonb_set(json::jsonb, '{databaseProfilerConfig,randomizedSample}', 'false'::jsonb)::json
WHERE json::jsonb #>> '{databaseProfilerConfig,randomizedSample}' = 'true';

UPDATE database_schema_entity
SET json = jsonb_set(json::jsonb, '{databaseSchemaProfilerConfig,randomizedSample}', 'false'::jsonb)::json
WHERE json::jsonb #>> '{databaseSchemaProfilerConfig,randomizedSample}' = 'true';

Comment thread
TeddyCr marked this conversation as resolved.
-- Hard-delete ingestion pipelines for Iceberg services (must run before service migration)
DELETE FROM ingestion_pipeline_entity ipe
USING dbservice_entity dse
Expand Down
6 changes: 5 additions & 1 deletion ingestion/src/metadata/sampler/pandas/sampler.py
Original file line number Diff line number Diff line change
Expand Up @@ -107,9 +107,13 @@ def get_dataset(self, **__):
if self.partition_details:
raw_dataset = self._partitioned_table()

if not self.sample_config.profileSample or (
if not self.sample_config.profileSample:
return raw_dataset

if (
self.sample_config.profileSample == 100
and self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
and self.sample_config.randomizedSample is not True
Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
):
return raw_dataset
return self.get_sampled_dataframe(raw_dataset, self.sample_config)
Expand Down
21 changes: 15 additions & 6 deletions ingestion/src/metadata/sampler/sqlalchemy/sampler.py
Original file line number Diff line number Diff line change
Expand Up @@ -164,22 +164,25 @@ def get_sample_query(self, *, column=None) -> Query:
(ModuloFn(RandomNumFn(), 100)).label(RANDOM_LABEL),
).cte(f"{self.get_sampler_table_name()}_rnd")
session_query = client.query(rnd)
return session_query.where(
query = session_query.where(
rnd.c.random <= self.sample_config.profileSample
).cte(f"{self.get_sampler_table_name()}_sample")
)
if self.sample_config.randomizedSample is True:
query = query.order_by(rnd.c.random)
Comment thread
TeddyCr marked this conversation as resolved.
return query.cte(f"{self.get_sampler_table_name()}_sample")
Comment thread
TeddyCr marked this conversation as resolved.

table_query = client.query(self.raw_dataset)
if self.partition_details:
table_query = self.get_partitioned_query(table_query)
session_query = self._base_sample_query(
column,
(ModuloFn(RandomNumFn(), table_query.count())).label(RANDOM_LABEL)
if self.sample_config.randomizedSample
if self.sample_config.randomizedSample is True
else None,
)
query = (
session_query.order_by(RANDOM_LABEL)
if self.sample_config.randomizedSample
if self.sample_config.randomizedSample is True
else session_query
)
Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
return query.limit(self.sample_config.profileSample).cte(
Expand All @@ -194,9 +197,16 @@ def get_dataset(self, column=None, **__) -> Union[type, AliasedClass]:
if self.sample_query:
return self._rdn_sample_from_user_query()

if not self.sample_config.profileSample or (
if not self.sample_config.profileSample:
if self.partition_details:
return self._partitioned_table()

return self.raw_dataset

if (
self.sample_config.profileSampleType == ProfileSampleType.PERCENTAGE
and self.sample_config.profileSample == 100
and self.sample_config.randomizedSample is not True
):
Comment thread
TeddyCr marked this conversation as resolved.
if self.partition_details:
return self._partitioned_table()
Expand All @@ -217,7 +227,6 @@ def fetch_sample_data(self, columns: Optional[List[Column]] = None) -> TableData
if self.sample_query:
return self._fetch_sample_data_from_user_query()

# Add new RandomNumFn column
ds = self.get_dataset()
if not columns:
sqa_columns = [col for col in inspect(ds).c if col.name != RANDOM_LABEL]
Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,7 @@ def test_sampling_for_views(self, sampler_mock):
'WITH "9bc65c2abec141778ffaa729489f3e87_rnd" AS \n(SELECT users.id AS id, ABS(RANDOM()) * 100 %% 100 AS random \n'
'FROM users)\n SELECT "9bc65c2abec141778ffaa729489f3e87_rnd".id, "9bc65c2abec141778ffaa729489f3e87_rnd".random \n'
'FROM "9bc65c2abec141778ffaa729489f3e87_rnd" \nWHERE "9bc65c2abec141778ffaa729489f3e87_rnd".random <= 50.0'
' ORDER BY "9bc65c2abec141778ffaa729489f3e87_rnd".random'
)
Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
assert (
expected_query.casefold()
Expand Down Expand Up @@ -201,6 +202,7 @@ def test_sampling_view_with_partition(self, sampler_mock):
'WITH "9bc65c2abec141778ffaa729489f3e87_rnd" AS \n(SELECT users.id AS id, ABS(RANDOM()) * 100 %% 100 AS random \n'
"FROM users \nWHERE id in ('1', '2'))\n SELECT \"9bc65c2abec141778ffaa729489f3e87_rnd\".id, \"9bc65c2abec141778ffaa729489f3e87_rnd\".random \n"
'FROM "9bc65c2abec141778ffaa729489f3e87_rnd" \nWHERE "9bc65c2abec141778ffaa729489f3e87_rnd".random <= 50.0'
' ORDER BY "9bc65c2abec141778ffaa729489f3e87_rnd".random'
)
assert (
expected_query.casefold()
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,12 @@
from sqlalchemy.orm import DeclarativeBase

from metadata.generated.schema.entity.data.table import Column as EntityColumn
from metadata.generated.schema.entity.data.table import ColumnName, DataType, Table
from metadata.generated.schema.entity.data.table import (
ColumnName,
DataType,
ProfileSampleType,
Table,
)
from metadata.generated.schema.entity.services.connections.database.sqliteConnection import (
SQLiteConnection,
SQLiteScheme,
Expand Down Expand Up @@ -361,6 +366,114 @@ def test_sample_from_user_query(self, sampler_mock):
names = [col.root for col in sample_data.columns]
assert names == ["id", "name"]

def test_full_percentage_randomized_uses_sample_query(self, sampler_mock):
"""100% PERCENTAGE + randomizedSample=True should go through
get_sample_query so fetch_sample_data can ORDER BY the random column."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=True,
),
sample_data_count=5,
)

with patch.object(
sampler, "get_sample_query", wraps=sampler.get_sample_query
) as mock_gsq:
sampler.fetch_sample_data()
assert mock_gsq.called
Comment on lines +369 to +389
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new “full percentage randomized” tests only assert that get_sample_query was invoked, but they don’t validate the actual bug report: that fetch_sample_data() returns a non-deterministic subset (or at least that the generated SQL applies ORDER BY on the random column before LIMIT). Consider asserting on the compiled query (contains ORDER BY random) or verifying that repeated calls can yield different first rows under SQLite to ensure the regression is truly covered.

Copilot uses AI. Check for mistakes.

def test_full_percentage_not_randomized_skips_sample_query(self, sampler_mock):
"""100% PERCENTAGE + randomizedSample=False should short-circuit
to raw dataset and NOT call get_sample_query."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=False,
),
sample_data_count=5,
)

with patch.object(
sampler, "get_sample_query", wraps=sampler.get_sample_query
) as mock_gsq:
sampler.fetch_sample_data()
assert not mock_gsq.called

def test_full_percentage_none_randomized_skips_sample_query(self, sampler_mock):
"""100% PERCENTAGE + randomizedSample=None should short-circuit
(only explicit True enables randomization)."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=None,
),
Comment on lines +413 to +425
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new tests assert that randomizedSample=None should short-circuit (i.e., behave like False). This conflicts with the PR description’s stated intent of handling Optional[bool] correctly by only skipping randomization when randomizedSample is False (so None follows the default behavior).

Either adjust the implementation to treat None as “use default” (and update these tests accordingly), or explicitly document and enforce the new “opt-in randomization” contract across schema defaults, models, and migration strategy.

Copilot uses AI. Check for mistakes.
sample_data_count=5,
)
Comment on lines +413 to +427
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests expect randomizedSample=None to behave like False (short-circuit). Since the schema defines randomizedSample as boolean default true (non-null), consider removing the None scenario or asserting it follows the default behavior (randomize) to avoid locking in an inconsistent contract.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TeddyCr explicitly requested this behavior: "None or False should not randomize. only explicit True should randomize." The None test validates that edge case. The schema default is True, so in practice users always get True unless they explicitly pass None or False.


with patch.object(
sampler, "get_sample_query", wraps=sampler.get_sample_query
) as mock_gsq:
sampler.fetch_sample_data()
assert not mock_gsq.called
Comment on lines +413 to +433
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test treats randomizedSample=None as equivalent to disabling randomization, but SampleConfig.randomizedSample defaults to True and None should not be coerced to False if we want correct Optional[bool] semantics. Adjust the test to match the intended behavior (i.e., None should still use the sampling path), or change the model/schema if None is meant to disable randomization.

Copilot uses AI. Check for mistakes.

def test_randomized_true_produces_non_deterministic_rows(self, sampler_mock):
"""With randomizedSample=True at 100% PERCENTAGE, multiple
fetch_sample_data calls should return rows in different orders."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=True,
),
sample_data_count=5,
)

results = [sampler.fetch_sample_data().rows for _ in range(10)]
assert any(
results[i] != results[0] for i in range(1, len(results))
), "Expected non-deterministic row ordering with randomizedSample=True"

Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
def test_randomized_false_produces_deterministic_rows(self, sampler_mock):
"""With randomizedSample=False at 100% PERCENTAGE, multiple
fetch_sample_data calls should return rows in the same order."""
with patch.object(SQASampler, "build_table_orm", return_value=User):
sampler = SQASampler(
service_connection_config=self.sqlite_conn,
ometa_client=None,
entity=None,
sample_config=SampleConfig(
profileSampleType=ProfileSampleType.PERCENTAGE,
profileSample=100,
randomizedSample=False,
),
sample_data_count=5,
)

results = [sampler.fetch_sample_data().rows for _ in range(5)]
assert all(
results[i] == results[0] for i in range(1, len(results))
), "Expected deterministic row ordering with randomizedSample=False"

@classmethod
def tearDownClass(cls) -> None:
os.remove(cls.db_path)
Expand Down
120 changes: 120 additions & 0 deletions ingestion/tests/unit/sampler/test_sampler_100_pct.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,120 @@
# Copyright 2025 Collate
# Licensed under the Collate Community License, Version 1.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/LICENSE
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Tests for 100% PERCENTAGE sampling edge case (#21304).

Verifies that the get_dataset() short-circuit at 100% correctly
respects the randomizedSample flag. Only an explicit True enables
randomization; None and False both skip randomization.
"""
Comment on lines +11 to +17
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This module docstring says randomizedSample=None “defaults to False (no randomization)”, but SampleConfig.randomizedSample defaults to True, and the new integration test in test_sample.py asserts that None should behave like default randomized (i.e., not False). Please align the documentation (and the expectations below) with the actual intended semantics: treat None as default True unless there’s a deliberate backward-compat reason.

Copilot uses AI. Check for mistakes.
Comment on lines +11 to +17
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docstring states that randomizedSample=None defaults to False, but SampleConfig.randomizedSample is defined with default True in ingestion/src/metadata/sampler/models.py (and the JSON schema defaults to true). This docstring should be corrected to avoid encoding an incorrect contract in tests.

Copilot uses AI. Check for mistakes.
from unittest.mock import MagicMock, patch

from metadata.generated.schema.entity.data.table import ProfileSampleType
from metadata.sampler.models import SampleConfig


class TestSQASampler100Pct:
"""Test SQASampler.get_dataset() at 100% PERCENTAGE sampling."""

def _make_sampler(self, randomized_sample):
"""Create a SQASampler mock with the given randomizedSample value."""
with patch(
"metadata.sampler.sqlalchemy.sampler.SQASampler.__init__",
return_value=None,
):
from metadata.sampler.sqlalchemy.sampler import SQASampler

sampler = SQASampler()
sampler.sample_config = SampleConfig(
profileSample=100,
profileSampleType=ProfileSampleType.PERCENTAGE,
randomizedSample=randomized_sample,
)
sampler.sample_query = None
sampler.partition_details = None
sampler._table = MagicMock(name="raw_table")
sampler.get_sample_query = MagicMock(
name="get_sample_query", return_value=MagicMock(name="sample_cte")
)
return sampler

def test_100_pct_randomized_true_delegates_to_sample_query(self):
"""100% + randomizedSample=True should NOT short-circuit."""
sampler = self._make_sampler(randomized_sample=True)
result = sampler.get_dataset()
sampler.get_sample_query.assert_called_once()
assert result == sampler.get_sample_query.return_value

def test_100_pct_randomized_false_returns_raw_dataset(self):
"""100% + randomizedSample=False should short-circuit to raw dataset."""
sampler = self._make_sampler(randomized_sample=False)
result = sampler.get_dataset()
sampler.get_sample_query.assert_not_called()
assert result == sampler._table

def test_100_pct_randomized_none_returns_raw_dataset(self):
"""100% + randomizedSample=None should short-circuit (only explicit True randomizes)."""
sampler = self._make_sampler(randomized_sample=None)
result = sampler.get_dataset()
sampler.get_sample_query.assert_not_called()
assert result == sampler._table
Comment on lines +63 to +68
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The randomizedSample=None test cases currently expect a 100% PERCENTAGE short-circuit (no randomization). That contradicts both SampleConfig.randomizedSample’s default (True) and the new integration test that asserts None should still go through the randomized sampling path. These unit tests should be updated so that None behaves like the default randomized setting (i.e., delegates to get_sample_query() / get_sampled_dataframe()).

Copilot uses AI. Check for mistakes.


class TestDatalakeSampler100Pct:
"""Test DatalakeSampler.get_dataset() at 100% PERCENTAGE sampling."""

def _make_sampler(self, randomized_sample):
"""Create a DatalakeSampler mock with the given randomizedSample value."""
with patch(
"metadata.sampler.pandas.sampler.DatalakeSampler.__init__",
return_value=None,
):
from metadata.sampler.pandas.sampler import DatalakeSampler

sampler = DatalakeSampler()
sampler.sample_config = SampleConfig(
profileSample=100,
profileSampleType=ProfileSampleType.PERCENTAGE,
randomizedSample=randomized_sample,
)
sampler.sample_query = None
sampler.partition_details = None
table_mock = MagicMock(name="table_wrapper")
table_mock.dataframes = MagicMock(name="raw_dataframes")
sampler._table = table_mock
sampler.get_sampled_dataframe = MagicMock(
name="get_sampled_dataframe",
return_value=MagicMock(name="sampled_df"),
)
sampler.service_connection_config = MagicMock()
sampler.connection = MagicMock()
return sampler

def test_100_pct_randomized_true_delegates_to_sampled_dataframe(self):
"""100% + randomizedSample=True should NOT short-circuit."""
sampler = self._make_sampler(randomized_sample=True)
result = sampler.get_dataset()
sampler.get_sampled_dataframe.assert_called_once()
assert result == sampler.get_sampled_dataframe.return_value

Comment on lines +101 to +107
Copy link

Copilot AI Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In test_100_pct_randomized_true_delegates_to_sampled_dataframe, the assertions are duplicated (assert_called_once and the assert result == ... block appears twice). This is redundant and makes the test harder to read/maintain; keep a single assertion block.

Copilot uses AI. Check for mistakes.
def test_100_pct_randomized_false_returns_raw_dataset(self):
"""100% + randomizedSample=False should short-circuit to raw dataset."""
sampler = self._make_sampler(randomized_sample=False)
result = sampler.get_dataset()
sampler.get_sampled_dataframe.assert_not_called()
assert result == sampler._table.dataframes

def test_100_pct_randomized_none_returns_raw_dataset(self):
"""100% + randomizedSample=None should short-circuit (only explicit True randomizes)."""
sampler = self._make_sampler(randomized_sample=None)
result = sampler.get_dataset()
sampler.get_sampled_dataframe.assert_not_called()
assert result == sampler._table.dataframes
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@
"randomizedSample": {
"description": "Whether to randomize the sample data or not.",
"type": "boolean",
"default": true
"default": false
Comment thread
TeddyCr marked this conversation as resolved.
}
Comment on lines 175 to 179
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

databaseProfilerConfig.randomizedSample default is changed from true to false, which flips default sampling behavior for database-level profiler configs. This is a breaking behavior change and is not reflected in the PR title/description (which states the default is true).

Please either revert this default flip (keeping the PR scoped to the 100% sampling bug) or explicitly document the behavior change and ensure all layers (schemas, ingestion models, UI defaults, migrations) are updated consistently.

Copilot uses AI. Check for mistakes.
Comment thread
RajdeepKushwaha5 marked this conversation as resolved.
}
},
Expand Down
Loading
Loading