feat: Add unique rule to dy.Column#325
feat: Add unique rule to dy.Column#325Andreas Albert (AndreasAlbertQC) merged 3 commits intoQuantco:mainfrom
unique rule to dy.Column#325Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a per-column unique constraint to dy.Column/dy.Schema, aligning validation and SQLAlchemy output, and updates array/object primary key behavior.
Changes:
- Introduce
uniqueas a first-class column attribute and emitunique=Truein SQLAlchemy column definitions. - Add schema validation rules for
uniquecolumns (and tighten primary key validation viais_unique()). - Expand tests for unique constraints and enable primary keys on
Arraycolumns while disallowing them forObjectcolumns.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/schema/test_validate.py | Adds validation tests for unique columns and Schema.unique_columns(). |
| tests/schema/test_sample.py | Adds sampling tests ensuring generated data respects unique constraints. |
| tests/column_types/test_array.py | Adds coverage for primary keys on Array columns. |
| dataframely/columns/_base.py | Adds unique attribute to Column and passes it to SQLAlchemy columns. |
| dataframely/_base_schema.py | Adds unique rules into schema validation; uses is_unique() for primary keys. |
| dataframely/columns/array.py | Allows primary_key on arrays and threads through unique. |
| dataframely/columns/string.py | Threads unique through to Column. |
| dataframely/columns/integer.py | Threads unique through to Column. |
| dataframely/columns/float.py | Threads unique through to Column. |
| dataframely/columns/decimal.py | Threads unique through to Column. |
| dataframely/columns/datetime.py | Threads unique through to Column for date/time/datetime/timedelta. |
| dataframely/columns/enum.py | Threads unique through to Column. |
| dataframely/columns/categorical.py | Threads unique through to Column. |
| dataframely/columns/list.py | Threads unique through to Column. |
| dataframely/columns/struct.py | Threads unique through to Column. |
| dataframely/columns/object.py | Removes primary_key kwarg from Object column constructor. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #325 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 56 56
Lines 3399 3408 +9
=========================================
+ Hits 3399 3408 +9 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Andreas Albert (AndreasAlbertQC)
left a comment
There was a problem hiding this comment.
Thanks gab23r ! This looks quite nice to me already. Small suggestions.
is_unique rule to dy.Columnunique rule to dy.Column
Andreas Albert (AndreasAlbertQC)
left a comment
There was a problem hiding this comment.
Thanks gab23r !
7126c1e
into
Quantco:main
| primary_key = _primary_key(columns) | ||
| if len(primary_key) > 0: | ||
| rules["primary_key"] = Rule(~pl.struct(primary_key).is_duplicated()) | ||
| rules["primary_key"] = Rule(pl.struct(primary_key).is_unique()) |
There was a problem hiding this comment.
Meta-comment because I just noticed: pl.struct(primary_key).is_unique() is likely much more inefficient than pl.col(primary_key).is_unique() if we only have a single primary key. We might want to introduce an optimization for this after benchmarking 😄
| # Add unique column validation rules | ||
| unique_columns = _unique_columns(columns) | ||
| for col_name in unique_columns: | ||
| # wrap the column in a struct to make `is_unique` work with list/arrays | ||
| # https://github.com/pola-rs/polars/issues/27286 | ||
| rules[f"{col_name}|unique"] = Rule(pl.struct(col_name).is_unique()) |
There was a problem hiding this comment.
Sorry I didn't review previously but I find this implementation suboptimal. Why is it on the schema if we do not check composite uniqueness but uniqueness of individual columns? This should be on the column which would also allow for much more efficient evaluation of is_unique for primitive types (because we can very easily skip the struct-wrapping).
There was a problem hiding this comment.
Besides, this also breaks for nested types; for example setting unique on a list element is simply ignored.
Motivation
Closes #313
Changes
Add the new rule using the same logic than primary_keys.
Drive by:
primary_keysfor array dtype as it now works in polars, I added a test of it.primary_keysfor object dtype as it never worked. (technically breaking but IMHO shouldn't break anything)