Update dependency datasets to v2.21.0#4456

Open
renovate-bot wants to merge 1 commit into GoogleCloudPlatform:main from renovate-bot:renovate/datasets-2.x

Conversation

@renovate-bot
Contributor

@renovate-bot renovate-bot commented Mar 5, 2026

This PR contains the following updates:

Package | Change
datasets | ==2.18.0 → ==2.21.0
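
A pin bump like the one above can be sanity-checked before merging. The following is a minimal sketch, assuming the requirements file uses plain `name==X.Y.Z` pins; the helper names `parse_pin` and `is_upgrade` are hypothetical and not part of Renovate or pip.

```python
import re

# Matches simple pins such as "datasets==2.21.0" (optionally followed by a comment).
PIN_RE = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*==\s*([0-9]+(?:\.[0-9]+)*)\s*(?:#.*)?$")


def parse_pin(line):
    """Return (name, version_tuple) for a 'name==X.Y.Z' line, else None."""
    match = PIN_RE.match(line)
    if match is None:
        return None
    name, version = match.groups()
    return name.lower(), tuple(int(part) for part in version.split("."))


def is_upgrade(old_line, new_line):
    """True if both lines pin the same package and the new version is higher."""
    old, new = parse_pin(old_line), parse_pin(new_line)
    if old is None or new is None or old[0] != new[0]:
        raise ValueError("both lines must pin the same package with ==")
    return new[1] > old[1]


print(is_upgrade("datasets==2.18.0", "datasets==2.21.0"))  # True
```

Comparing integer tuples avoids the string-comparison trap where "2.9.0" would sort after "2.21.0". Pre-release or local version suffixes are deliberately out of scope for this sketch.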

Release Notes

huggingface/datasets (datasets)

v2.21.0

Compare Source

Features

  • Support pyarrow large_list by @​albertvillanova in #​7019
    • Support Polars round trip:
      import polars as pl
      from datasets import Dataset
      
      df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
      df2 = Dataset.from_polars(df1).to_polars()
      assert df1.equals(df2)

What's Changed

New Contributors

Full Changelog: huggingface/datasets@2.20.0...2.21.0

v2.20.0

Compare Source

Important

  • Remove default trust_remote_code=True by @​lhoestq in #​6954
    • datasets with a python loading script now require passing trust_remote_code=True to be used
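
One way to gauge the impact of this change on a codebase is to list every `load_dataset` call that does not pass `trust_remote_code`. The sketch below uses only the standard-library `ast` module; the function name `find_untrusted_load_calls` and the dataset names in the sample are hypothetical. A flagged call is not necessarily broken: only calls that target a dataset with a Python loading script need the flag.

```python
import ast


def find_untrusted_load_calls(source):
    """Return line numbers of load_dataset(...) calls without trust_remote_code."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both `load_dataset(...)` and `datasets.load_dataset(...)`.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "load_dataset":
            continue
        keywords = {kw.arg for kw in node.keywords}
        if "trust_remote_code" not in keywords:
            hits.append(node.lineno)
    return sorted(hits)


# Hypothetical source file contents used only for illustration.
sample = '''
from datasets import load_dataset
a = load_dataset("json", data_files="train.jsonl")
b = load_dataset("some_user/script_dataset", trust_remote_code=True)
'''
print(find_untrusted_load_calls(sample))  # [3]
```

In the sample, the call on line 3 uses a built-in loader on local files, so it would keep working without the flag; the report is a list of places to review, not a list of failures.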

Datasets features

  • [Resumable IterableDataset] Add IterableDataset state_dict by @​lhoestq in #​6658
    • checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> from datasets import Dataset
      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print("restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)

      Returns:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
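
The checkpoint-and-resume pattern shown above can be illustrated without the library. The following is a toy, standard-library analogue: the class `ResumableRange` is hypothetical and is not how `datasets` implements `IterableDataset.state_dict`; it only mirrors the `state_dict()` / `load_state_dict()` contract from the example.

```python
class ResumableRange:
    """Toy iterable that can checkpoint and resume its position."""

    def __init__(self, n):
        self.n = n
        self._next = 0  # index of the next example to yield

    def __iter__(self):
        while self._next < self.n:
            example = {"a": self._next}
            # Advance before yielding so a checkpoint taken while the caller
            # holds this example points at the one after it.
            self._next += 1
            yield example

    def state_dict(self):
        return {"next": self._next}

    def load_state_dict(self, state):
        self._next = state["next"]


stream = ResumableRange(6)
seen = []
for idx, example in enumerate(stream):
    seen.append(example)
    if idx == 2:
        state = stream.state_dict()  # checkpoint after three examples
        break

# Resume in a fresh object, as a restarted training job would.
resumed = ResumableRange(6)
resumed.load_state_dict(state)
seen.extend(resumed)
print(seen)
```

The key design point is that the saved state describes the next example to produce, so resuming neither repeats nor skips an example.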
      

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@2.19.0...2.20.0

v2.19.2

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@2.19.1...2.19.2

v2.19.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@2.19.0...2.19.1

v2.19.0

Compare Source

Dataset Features

  • Add Polars compatibility by @​psmyth94 in #​6531
    • convert to a Polars dataframe using .to_polars():
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
  • Add fsspec support for to_json, to_csv, and to_parquet by @​alvarobartt in #​6096
    • Save on HF in any file format:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
  • Add mode parameter to Image feature by @​mariosasko in #​6735
    • Set images to be read in a certain mode like "RGB":
      from datasets import Image
      dataset = dataset.cast_column("image", Image(mode="RGB"))
  • Add CLI function to convert script-dataset to Parquet by @​albertvillanova in #​6795
    • run command to open a PR in script-based dataset to convert it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
      
  • Add Dataset.take and Dataset.skip by @​lhoestq in #​6813
    • same as IterableDataset.take and IterableDataset.skip
      ds = ds.take(10)  # take only the first 10 examples

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@2.18.0...2.19.0


Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate-bot renovate-bot requested a review from a team as a code owner March 5, 2026 17:51
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the datasets Python package from version 2.18.0 to 2.21.0. This upgrade incorporates a range of new features and improvements, notably enhancing integration with Polars for data manipulation and introducing advanced functionalities for IterableDataset like checkpointing and batching. The update also includes numerous bug fixes, security patches, and ensures broader compatibility with other ecosystem libraries, contributing to a more robust and efficient data handling experience. Users should be aware of a change in the default behavior of trust_remote_code for datasets with Python loading scripts.

Highlights

  • Dependency Update: The datasets Python package has been updated from version 2.18.0 to 2.21.0.
  • Polars Integration: Enhanced compatibility with Polars, including support for pyarrow large_list, from_polars(), to_polars() methods, and Polars formatting for dataset objects.
  • IterableDataset Improvements: Introduced resumable IterableDataset with state_dict for checkpointing and resuming, and added batching capabilities to IterableDataset and Dataset classes.
  • Security and Stability: Addressed several bug fixes, vulnerability updates for tqdm and requests, and improved compatibility with various libraries including NumPy 2.0, fsspec, and PyArrow.
  • Breaking Change Notice: The default trust_remote_code parameter for datasets with Python loading scripts was changed from True to False in v2.20.0, requiring explicit opt-in for such datasets.

Changelog

  • skills/vertex-tuning/references/requirements.txt
    • Updated the datasets package version from 2.18.0 to 2.21.0.

Activity

  • This pull request was automatically generated by Renovate Bot to update the datasets dependency.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the datasets dependency from 2.18.0 to 2.21.0. I've reviewed the release notes for the new versions and analyzed how the datasets library is used in prepare_dataset.py. The most significant change is trust_remote_code now defaulting to False in v2.20.0, which enhances security. This change does not affect the project's current implementation, as it loads local data using built-in loaders. The update appears safe to merge.

Note: Security Review has been skipped due to the limited scope of the PR.

@renovate-bot renovate-bot force-pushed the renovate/datasets-2.x branch from 33da84f to 40f5def Compare March 11, 2026 19:46
@renovate-bot renovate-bot force-pushed the renovate/datasets-2.x branch from 40f5def to b2bc9b6 Compare March 26, 2026 16:43
@renovate-bot renovate-bot changed the title from "chore(deps): update dependency datasets to v2.21.0" to "Update dependency datasets to v2.21.0" Apr 14, 2026