Journal of Lightwave Technology / 2021
X-NEST: A Scalable, Flexible, and High-Performance Network Architecture for Distributed Machine Learning
In a large-scale distributed machine learning system, the interconnection network between computing devices significantly affects the performance of neural network training. The ongoing growth of training data and model size has rapidly increased the number of computing devices in distributed machine learning systems, placing higher demands on network scalability. In addition, the synchronization algorithms used for data exchange between computing devices have different communication topologies, which traditional electrical networks struggle to match because of their fixed topology. The neural network model and the model-partitioning method also affect the communication volume between devices, so the overprovisioned bandwidth of traditional electrical networks incurs unnecessary cost. To address these issues, we propose a scalable, flexible, and high-performance network architecture called X-NEST. The flexibility of optical switching devices allows X-NEST to dynamically change its topology and the number of links between devices as traffic patterns vary, improving network performance and resource utilization. Although reconfiguring the connections between devices depends on the controller, X-NEST's simple and flexible control plane can respond quickly to communication demands. Extensive simulations with different traffic patterns demonstrate that X-NEST copes well with the communication characteristics of various synchronization algorithms.
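To make concrete why different synchronization algorithms call for different topologies, the following is a minimal Python sketch contrasting the traffic demand of a ring all-reduce (a cycle of neighbor-to-neighbor transfers) with that of a parameter server (a star centered on one node), and a greedy circuit-assignment step standing in for the controller's matching of optical links to demand. The demand formulas, the `greedy_circuit_assignment` helper, and the per-node port budget are illustrative assumptions, not X-NEST's actual control-plane algorithm.

```python
# Hypothetical sketch: mapping a synchronization algorithm's traffic pattern
# onto optical circuits. Not the paper's actual control-plane method.
import numpy as np

def ring_allreduce_demand(n_workers: int, grad_bytes: float) -> np.ndarray:
    """Ring all-reduce: each worker sends only to its ring neighbor,
    so the demand matrix is a single cycle of uniform loads."""
    d = np.zeros((n_workers, n_workers))
    per_link = 2.0 * (n_workers - 1) / n_workers * grad_bytes
    for i in range(n_workers):
        d[i, (i + 1) % n_workers] = per_link
    return d

def parameter_server_demand(n_workers: int, grad_bytes: float,
                            ps: int = 0) -> np.ndarray:
    """Parameter server: every worker exchanges the full gradient with one
    node, producing a star-shaped (hotspot) demand matrix."""
    d = np.zeros((n_workers, n_workers))
    for i in range(n_workers):
        if i != ps:
            d[i, ps] = grad_bytes  # push gradients to the server
            d[ps, i] = grad_bytes  # pull the updated model back
    return d

def greedy_circuit_assignment(demand: np.ndarray, ports_per_node: int):
    """Greedily grant optical circuits to the heaviest demands, subject to a
    per-node port budget -- a stand-in for the controller's matching step."""
    n = demand.shape[0]
    used = [0] * n
    circuits = []
    # Consider all (src, dst) pairs, heaviest demand first.
    pairs = sorted(((demand[i, j], i, j)
                    for i in range(n) for j in range(n)
                    if i != j and demand[i, j] > 0), reverse=True)
    for load, i, j in pairs:
        if used[i] < ports_per_node and used[j] < ports_per_node:
            circuits.append((i, j, load))
            used[i] += 1
            used[j] += 1
    return circuits

if __name__ == "__main__":
    for name, demand in [("ring all-reduce", ring_allreduce_demand(8, 1e9)),
                         ("parameter server", parameter_server_demand(8, 1e9))]:
        circuits = greedy_circuit_assignment(demand, ports_per_node=2)
        print(f"{name}: {len(circuits)} circuits granted, e.g. {circuits[:3]}")
```

Running the sketch shows the contrast the abstract alludes to: the ring pattern is fully served with one circuit per node, while the star pattern exhausts the hub's ports almost immediately, so a fixed topology must overprovision for the worst case whereas a reconfigurable optical fabric can reshape itself to each pattern.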