No specific reproducer attached, but this is observable on a variety of workloads using large DFs.
Modin on Ray uses excessive amounts of memory for a wide range of operations, and we should investigate how this can be reduced. Consider this diagram from the authors of Dias [paper + GH] and PandasBench, where Modin's memory consumption is extremely high for some notebooks:
While testing an implementation for parallel writes from Ray datasets to Snowflake, we also observed that uploading a ~3GB in-memory dataframe used well over 100GB of RAM.
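Since no reproducer is attached, a first step could be a small harness that reports peak RSS growth around a workload. This is a hedged sketch using only the stdlib `resource` module (Unix-only); `run_workload` is a placeholder to be replaced with the actual Modin-on-Ray operation being investigated:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Return the process's peak resident set size in MB.

    ru_maxrss is reported in KB on Linux and in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def run_workload():
    # Placeholder: substitute the Modin/Ray operation under test,
    # e.g. a read + groupby + write pipeline on a large DataFrame.
    data = [list(range(1_000)) for _ in range(1_000)]
    return len(data)

before = peak_rss_mb()
run_workload()
after = peak_rss_mb()
print(f"peak RSS grew by ~{after - before:.1f} MB")
```

Note that peak RSS only covers the driver process; for Modin on Ray, per-worker memory (e.g. via `ray memory` or the Ray dashboard) would also need to be tracked to see the full picture.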
Further areas of investigation:
- Is this entirely a Ray issue, or are there things Modin can do to address it?
- Is excessive memory usage also an issue for Modin on Dask? The PandasBench paper indicates that Dask DataFrames' peak memory consumption is relatively low.
See also: #5524