
USENIX Symposium on Networked Systems Design and Implementation (NSDI) / 2021

Scaling Distributed Machine Learning with In-Network Aggregation

Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

Foundation Models · Large Language Models · ML Systems · Popular and Landmark Papers

Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20%, for a number of real-world benchmark models.
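To make the abstract's core idea concrete, here is a minimal Python sketch of in-network aggregation: instead of workers exchanging full gradients with each other, each worker sends its update to a central aggregation point (the switch), which sums contributions chunk by chunk and returns only the aggregate. This is an illustration under stated assumptions, not SwitchML's implementation: the fixed-point quantization step reflects that switch dataplanes operate on integers, and the names `to_fixed`, `switch_aggregate`, `SCALE`, and `CHUNK` are hypothetical.

```python
# Illustrative sketch of in-network gradient aggregation.
# Not SwitchML's actual protocol; all names and constants are hypothetical.
import numpy as np

SCALE = 1 << 16   # hypothetical fixed-point scaling factor
CHUNK = 64        # hypothetical number of elements per aggregation slot

def to_fixed(grad: np.ndarray) -> np.ndarray:
    """Workers quantize float gradients to integers, since programmable
    switch dataplanes compute on integer values."""
    return np.round(grad * SCALE).astype(np.int64)

def switch_aggregate(worker_chunks: list) -> np.ndarray:
    """The 'switch': sums one chunk from every worker and multicasts the
    aggregate back, replacing an all-to-all gradient exchange."""
    return np.sum(worker_chunks, axis=0)

def allreduce(worker_grads: list) -> np.ndarray:
    n = len(worker_grads[0])
    quantized = [to_fixed(g) for g in worker_grads]
    out = np.empty(n, dtype=np.int64)
    # Stream the gradient through the switch one chunk (slot) at a time.
    for start in range(0, n, CHUNK):
        out[start:start + CHUNK] = switch_aggregate(
            [q[start:start + CHUNK] for q in quantized])
    return out.astype(np.float64) / SCALE  # de-quantize the aggregate

# Example: each of 4 workers contributes a gradient; every worker receives
# the element-wise sum while each element crosses the network only once.
workers = [np.random.randn(256) for _ in range(4)]
agg = allreduce(workers)
assert np.allclose(agg, np.sum(workers, axis=0), atol=1e-3)
```

The bandwidth saving comes from the topology of the exchange: with `n` workers, each gradient element traverses the network once per worker toward the switch and once back, rather than being exchanged among all workers.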

526 citations · 83 influential
