We recently added out-of-core batching to the single-GPU K-means, which allows it to accept a host matrix for training, along with a batch size, so that it can break the dataset up into batches that fit in device memory.
We should extend the multi-node multi-GPU C++ NCCL implementation to allow each rank to specify a host matrix.
Since most distributed systems will have already partitioned their datasets, we should consider accepting the data pre-partitioned and setting the batch sizes automatically. The difference is that each batch could then be a different size, so the single-GPU version may need a small update to accept an array of batch sizes rather than a single value.