Skip to content

mrpottermusic/nccl-mesh-plugin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

128 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

🌐 nccl-mesh-plugin - Simplifying Your Distributed ML Setup

Download Release

🌟 Overview

The NCCL Mesh Plugin allows you to use NVIDIA's Collective Communications Library (NCCL) with unique mesh topologies. If you work with direct RDMA (Remote Direct Memory Access) connections, this plugin is designed for you. It ensures seamless communication, even when nodes are on different networks. This can significantly improve the performance of distributed machine learning tasks.

πŸš€ Getting Started

πŸ’Ύ System Requirements

Before you begin, ensure your system meets the following requirements:

  • Operating System: Linux (Ubuntu preferred)
  • Hardware:
    • At least 3 nodes with direct RDMA connections
    • Each node should have an NVIDIA GPU
  • Network: 100Gbps RDMA links are recommended

πŸ“₯ Download & Install

  1. Visit the Releases Page: Click the link below to access the latest version of the NCCL Mesh Plugin. Download Release

  2. Choose Your Version: Once on the releases page, look for the latest version. You will see various assets available for download.

  3. Download the Plugin: Click on the appropriate file to download it to your computer.

  4. Unpack the Files: If your download is compressed (like a .zip or .tar file), extract it to a folder on your computer.

  5. Install Dependencies: Make sure you have the following installed on your system:

    • NCCL v2.7 or later
    • cuDNN and CUDA compatible with your GPU
  6. Run the Plugin: Follow the instructions in the README included in the plugin folder to start using it.

πŸ“Š Supported Topologies

The NCCL Mesh Plugin supports various topologies to fit your needs:

  • Full Mesh (3 nodes): Every node connects directly to every other node. This topology offers the best performance for small clusters.

  • Ring (4+ nodes): Each node connects to two neighbors. This setup uses relay routing for non-adjacent nodes, balancing speed and efficiency for larger groups.

  • Line (any number): A simple chain of nodes allows for easy setup. It uses relay routing for multi-hop communication, suitable for simple configurations.

πŸ€– Configuration

For optimal use, configure your environment as follows:

  1. Setup Network Interfaces: Ensure each node can communicate over your RDMA network.

  2. Install NCCL: Verify you have NCCL installed using your package manager or by following NVIDIA's installation instructions.

  3. Build the Plugin: Follow the build instructions in the README to compile the plugin for your system.

  4. Test Your Setup: Consider running the provided test cases to check connectivity and performance.

πŸ”§ Troubleshooting

If you encounter issues, check these common problems:

  • Connectivity Issues: Ensure all nodes are correctly connected and configured. Use tools like ping to verify connections.

  • Performance Drops: Check your hardware usage and ensure no single node is overloaded. Optimize network settings if necessary.

  • Compatibility Problems: Verify that NCCL, CUDA, and cuDNN versions match the requirements listed earlier.

πŸ“˜ Additional Resources

For more information, consider these links:

πŸ“… Future Updates

We plan to release regular updates to improve functionality and support additional topologies. Always check the releases page to stay informed.

To download the latest version, visit: Download Release

About

🌐 Enable distributed ML with the NCCL Mesh Plugin for efficient communication in direct-connect RDMA mesh topologies, overcoming traditional networking limits.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors