Randomization of AutoClassifier Sample Data

I am working with self-hosted OpenMetadata (OSS) v1.7.0.
When a table's Profiler Settings are left unchanged (i.e., Percentage-based sampling at 100% of all values), the sample data extracted by the AutoClassifier appears to be deterministic rather than randomized.
However, changing "Percentage" to "Rows" and providing a row count equal to or larger than the number of rows in the dataset seems to randomize the extracted sample data. So does providing any value other than 100 for Percentage-based sampling.
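To make the reported symptom concrete, here is a toy model in Python. It is only a sketch of the behavior I am observing, not OpenMetadata's actual code: the function `take_sample`, its parameters, and the idea that the 100% case skips the sampling step entirely are all assumptions made for illustration.

```python
import random


def take_sample(rows, method, value, sample_size, rng):
    """Toy model of the observed behavior (hypothetical, not OpenMetadata code)."""
    rows = list(rows)
    if method == "PERCENTAGE" and value == 100:
        # Assumption: at 100% no sampling step runs, so source order is kept.
        pool = rows
    elif method == "PERCENTAGE":
        # Any other percentage draws a random subset of the table.
        k = max(1, round(len(rows) * value / 100))
        pool = rng.sample(rows, k)
    else:
        # "ROWS": a random draw, even when the count covers the whole table.
        k = min(value, len(rows))
        pool = rng.sample(rows, k)
    # The sample data shown in the UI is the first `sample_size` rows of the pool.
    return pool[:sample_size]


rows = range(1, 101)  # one column with the values 1-100, as in the report
rng = random.Random()

print(take_sample(rows, "PERCENTAGE", 100, 5, rng))  # always the first five rows
print(take_sample(rows, "ROWS", 100, 5, rng))        # varies between runs
print(take_sample(rows, "PERCENTAGE", 50, 5, rng))   # varies between runs
```

If something like this is what happens internally, it would explain why only the Percentage 100% setting returns the same rows every time.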

Steps to Reproduce:
I have a simple Parquet file stored in a MinIO bucket. It has a single column whose rows contain the values 1-100. A corresponding table has been created in Trino.
In OpenMetadata, I have created and configured a Trino connector and run all the agents.
When I keep the default profiler settings for the table (i.e., PERCENTAGE 100%), change only the sample data size (5 rows in this case), and run the AutoClassifier, the sample data is always the rows with values 1-5, no matter how many times the AutoClassifier is executed.
However, after editing the profiler settings to a lower percentage, or changing the sample method to Row Count while still including all rows (i.e., 100 rows), the extracted sample data is indeed randomized.
Restoring the settings to Percentage 100% and re-running the AutoClassifier yields the values 1-5 once again.
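For anyone reproducing this, a small helper like the one below can compare the sample data copied out after each AutoClassifier run. It is a minimal sketch: `describe_runs` is a hypothetical name, and the values in `runs_row_count` are made up purely to show the randomized case; only the 1-5 result comes from my actual runs.

```python
def describe_runs(runs):
    """Return 'deterministic' if every run produced the same rows in the same order."""
    if not runs:
        raise ValueError("need at least one run to compare")
    distinct = {tuple(run) for run in runs}
    return "deterministic" if len(distinct) == 1 else "randomized"


# Observed with PERCENTAGE 100% and a sample data size of 5: always 1-5.
runs_percentage_100 = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]

# Illustrative placeholder values for the Row Count setting (not real output).
runs_row_count = [[42, 7, 93, 15, 60], [3, 88, 21, 54, 70], [12, 99, 36, 5, 47]]

print(describe_runs(runs_percentage_100))  # deterministic
print(describe_runs(runs_row_count))       # randomized
```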

Randomization of AutoClassifier Sample Data #21304