Commit 58b2604

Merge pull request #1061 from betolink/fsspec
* fsspec data fetching improvements and exposing the kwargs so users can tweak it further
* retry logic if a download fails.
* multi-threaded download in the cloud (fixed)
* show_progress=True by default if the session is interactive
2 parents fd326d5 + 8237759 commit 58b2604

11 files changed

Lines changed: 3531 additions & 1779 deletions

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -9,6 +9,11 @@ and this project uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
 
 ### Changed
 
+- Change default cache behavior in fsspec from `readahead` to `blockcache`.
+  Allow user defined config with `open_kwargs` in the `.open()` method.
+  This improves performance by an order of magnitude.
+  ([#251](https://github.com/nsidc/earthaccess/discussions/251))([#771](https://github.com/nsidc/earthaccess/discussions/771))
+  ([@betolink](https://github.com/betolink))
 - Add `show_progress` argument to `earthaccess.download()` to let the user control display of progress bars. Defaults to true for interactive sessions, otherwise false.
   ([#612](https://github.com/nsidc/earthaccess/issues/612))
   ([#1065](https://github.com/nsidc/earthaccess/pull/1065))
@@ -25,6 +30,9 @@ and this project uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
 
 ### Added
 
+- Added `tenacity` to retry downloads up to 3 times with exponential backoff time, replaces #1016
+  ([#481](https://github.com/nsidc/earthaccess/issues/481))
+  ([@betolink](https://github.com/betolink))
 - Add notebook demonstrating workflow with TEMPO Level 3 data as a virtual dataset
   ([#924](https://github.com/nsidc/earthaccess/pull/924))
   ([@danielfromearth](https://github.com/danielfromearth))
@@ -60,6 +68,7 @@ and this project uses [Semantic Versioning](https://semver.org/spec/v2.0.0.html)
 
 ### Fixed
 
+- Files can be downloaded in the cloud([#1009](https://github.com/nsidc/earthaccess/issues/1009))([betolink](https://github.com/betolink))
 - Corrected Harmony typo in notebooks/Demo.ipynb([#995](https://github.com/nsidc/earthaccess/issues/995))([stelios-c](https://github.com/stelios-c))
 - Resolved an error in virtual dataset tutorial notebook ([#1044](https://github.com/nsidc/earthaccess/issues/1044))([danielfromearth](https://github.com/danielfromearth))
 - Issue when `FileDistributionInformation` did not exist for a collection
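
The "Added" entry above describes retrying downloads up to 3 times with exponential backoff. The commit uses `tenacity` for this, and its actual decorator configuration is not shown on this page. The following is only a standard-library sketch of the same idea, not the project's code: `retry_with_backoff`, `base_delay`, and the injectable `sleep` parameter are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func` up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper: a stand-in for what a retry library such as
    tenacity provides, written with the standard library only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. With `tenacity` itself, the equivalent building blocks are `stop_after_attempt` and `wait_exponential`.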

docs/user_guide/fsspec.ipynb

Lines changed: 1684 additions & 0 deletions
Large diffs are not rendered by default.

earthaccess/api.py

Lines changed: 10 additions & 0 deletions
@@ -284,6 +284,7 @@ def download(
     threads: int = 8,
     *,
     show_progress: Optional[bool] = None,
+    credentials_endpoint: Optional[str] = None,
     pqdm_kwargs: Optional[Mapping[str, Any]] = None,
 ) -> List[Path]:
     """Retrieves data granules from a remote storage system. Provide the optional `local_path` argument to prevent repeated downloads.
@@ -300,6 +301,8 @@ def download(
             month, and day of the current date, and `UUID` is the last 6 digits
             of a UUID4 value.
         provider: if we download a list of URLs, we need to specify the provider.
+        credentials_endpoint: S3 credentials endpoint to be used for obtaining temporary S3 credentials. This is only required if
+            the metadata doesn't include it, or we pass urls to the method instead of `DataGranule` instances.
         threads: parallel number of threads to use to download the files, adjust as necessary, default = 8
         show_progress: whether or not to display a progress bar. If not specified, defaults to `True` for interactive sessions
             (i.e., in a notebook or a python REPL session), otherwise `False`.
@@ -326,6 +329,7 @@ def download(
             local_path,
             provider,
             threads,
+            credentials_endpoint=credentials_endpoint,
             show_progress=show_progress,
             pqdm_kwargs=pqdm_kwargs,
         )
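
The `download()` docstring above says that `show_progress`, when left as `None`, defaults to `True` in interactive sessions and `False` otherwise. How earthaccess detects an interactive session is not shown in this diff. A minimal sketch of one common heuristic follows; `is_interactive` and `resolve_show_progress` are hypothetical helpers, and the `interactive` override exists only to make the sketch easy to test.

```python
import sys
from typing import Optional


def is_interactive() -> bool:
    """Best-effort check for an interactive session (hypothetical heuristic).

    The standard interactive interpreter defines `sys.ps1`, and `python -i`
    sets `sys.flags.interactive`. Other front ends may need extra checks.
    """
    return hasattr(sys, "ps1") or bool(sys.flags.interactive)


def resolve_show_progress(
    show_progress: Optional[bool],
    interactive: Optional[bool] = None,
) -> bool:
    """Return the explicit setting if given, else fall back to the session type."""
    if show_progress is not None:
        return show_progress  # an explicit True/False always wins
    return is_interactive() if interactive is None else interactive
```

An explicit `show_progress=False` therefore still silences progress bars in a notebook, which matches the behavior the docstring describes.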
@@ -341,8 +345,10 @@ def open(
     granules: Union[List[str], List[DataGranule]],
     provider: Optional[str] = None,
     *,
+    credentials_endpoint: Optional[str] = None,
     show_progress: Optional[bool] = None,
     pqdm_kwargs: Optional[Mapping[str, Any]] = None,
+    open_kwargs: Optional[Dict[str, Any]] = None,
 ) -> List[AbstractFileSystem]:
     """Returns a list of file-like objects that can be used to access files
     hosted on S3 or HTTPS by third party libraries like xarray.
@@ -356,15 +362,19 @@ def open(
         pqdm_kwargs: Additional keyword arguments to pass to pqdm, a parallel processing library.
             See pqdm documentation for available options. Default is to use immediate exception behavior
             and the number of jobs specified by the `threads` parameter.
+        open_kwargs: Additional keyword arguments to pass to `fsspec.open`, such as `cache_type` and `block_size`.
+            Defaults to using `blockcache` with a block size determined by the file size (4 to 16MB).
 
     Returns:
         A list of "file pointers" to remote (i.e. s3 or https) files.
     """
     return earthaccess.__store__.open(
         granules=granules,
         provider=_normalize_location(provider),
+        credentials_endpoint=credentials_endpoint,
         show_progress=show_progress,
         pqdm_kwargs=pqdm_kwargs,
+        open_kwargs=open_kwargs,
     )
 
 
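
The `open_kwargs` docstring above states that the default is `blockcache` with a block size between 4 and 16MB chosen from the file size, but the selection rule itself lives outside this diff. The sketch below shows one way such a default could be derived and then overridden by user-supplied values. The "roughly 1/100 of the file, clamped to 4 to 16 MiB" rule and the names `default_open_kwargs` and `merge_open_kwargs` are assumptions for illustration, not the library's logic.

```python
from typing import Any, Dict, Optional

MiB = 1024 * 1024


def default_open_kwargs(file_size: int) -> Dict[str, Any]:
    """Pick a block size from the file size (hypothetical rule).

    Assumption: aim for about 1/100 of the file, clamped to 4..16 MiB.
    """
    block_size = min(max(file_size // 100, 4 * MiB), 16 * MiB)
    return {"cache_type": "blockcache", "block_size": block_size}


def merge_open_kwargs(
    file_size: int,
    open_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Start from the defaults and let user-supplied keys take precedence."""
    merged = default_open_kwargs(file_size)
    merged.update(open_kwargs or {})
    return merged
```

Based on the signature in this diff, a caller would pass overrides as `earthaccess.open(granules, open_kwargs={"cache_type": "blockcache", "block_size": 8 * 1024 * 1024})`, where `granules` comes from an earlier search.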