Randomization of AutoClassifier Sample Data #21304

@KarimTZiad

Description

I am working with self-hosted OpenMetadata (OSS) v1.7.0.
It seems that when a table's Profiler Settings are left unchanged (i.e., percentage-based sampling at 100% of all values), the sample data extracted by the AutoClassifier is deterministic rather than randomized.
However, changing "Percentage" to "Rows" and providing a row count equal to or larger than the number of rows in the dataset seems to randomize the extracted sample data. So does setting the percentage to any value other than 100.
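
To make the pattern concrete, here is a toy model that reproduces it. It is only an illustration of what I observe from the outside, not OpenMetadata's code: `toy_sample`, its parameters, and the "skip randomization at 100%" shortcut are all hypothetical, and I have not confirmed that this is the actual cause.

```python
import random


def toy_sample(rows, sample_size, method="PERCENTAGE", value=100.0, rng=None):
    """Toy model of the observed behavior (hypothetical, NOT OpenMetadata code)."""
    if rng is None:
        rng = random.Random()
    rows = list(rows)
    if method == "PERCENTAGE" and value >= 100:
        # Assumed shortcut: the whole table is requested, so nothing is
        # shuffled and the first `sample_size` rows come back every time.
        return rows[:sample_size]
    if method == "PERCENTAGE":
        pool_size = max(1, round(len(rows) * value / 100))
    else:
        # "ROWS": cap the requested row count at the table size.
        pool_size = min(len(rows), int(value))
    # random.Random.sample returns the selection in random order.
    pool = rng.sample(rows, pool_size)
    return pool[:sample_size]


if __name__ == "__main__":
    data = list(range(1, 101))
    print(toy_sample(data, 5))                             # same first rows on every run
    print(toy_sample(data, 5, method="ROWS", value=100))   # varies between runs
    print(toy_sample(data, 5, value=50))                   # varies between runs
```

In this model, only the PERCENTAGE/100 path is deterministic, which matches what I see in the UI.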

Steps to Reproduce:
I have a simple Parquet file stored in a MinIO bucket. It has a single column whose rows are enumerated 1-100. A corresponding table has been created in Trino.
In OpenMetadata, I created and configured a Trino connector and ran all the agents.
With the table's default profiler settings (i.e., PERCENTAGE 100%) and the sample data size changed to 5 rows, running the AutoClassifier always returns the rows with values 1-5, no matter how many times it is executed.
However, after lowering the percentage in the profiler settings, or switching the sample method to Row Count while still including all 100 rows, the extracted sample data is indeed randomized.
Restoring settings to Percentage 100% then re-running the AutoClassifier yields the values 1-5 once again.
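
For anyone repeating these steps, a small helper like the one below can compare the samples from repeated runs. It is a sketch only: `summarize_runs` is a hypothetical name, and the samples are assumed to be copied by hand from the table's sample data view after each AutoClassifier run (it does not call any OpenMetadata API).

```python
from collections import Counter


def summarize_runs(runs):
    """Report how many distinct samples appear across repeated runs.

    `runs` is a list of samples, one per AutoClassifier run; each sample is
    a sequence of the values shown as sample data after that run.
    """
    counts = Counter(tuple(run) for run in runs)
    return {
        "runs": len(runs),
        "distinct_samples": len(counts),
        # Deterministic only if several runs all produced the same sample.
        "deterministic": len(runs) > 1 and len(counts) == 1,
    }


if __name__ == "__main__":
    # What I observe with PERCENTAGE 100% and a sample size of 5 rows.
    observed = [[1, 2, 3, 4, 5]] * 3
    print(summarize_runs(observed))
```

Comparing samples as ordered tuples means that the same values in a different order count as a different sample, which is what matters when checking for randomization.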

Status

Done ✅