Symposium on Networked Systems Design and Implementation / 2019
Scaling Distributed Machine Learning with In-Network Aggregation
Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20%, for a number of real-world benchmark models.
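The abstract only sketches the aggregation primitive at a high level. The following is a minimal illustration, not the SwitchML implementation and not a switch dataplane program: it shows, in plain Python/NumPy, how a single aggregation point can sum gradient chunks from all workers and hand back one combined update, so each worker sends and receives a single copy of the model update instead of exchanging full gradients with every peer. The names Aggregator, in_network_allreduce, and CHUNK_SIZE are hypothetical and chosen only for this example.

```python
# Illustrative sketch of in-network aggregation semantics (hypothetical names,
# not the SwitchML switch program or its packet format).
import numpy as np

CHUNK_SIZE = 256  # elements aggregated at a time; real switches work on small packets


class Aggregator:
    """Stands in for the switch: sums one chunk from every worker."""

    def __init__(self, num_workers):
        self.num_workers = num_workers

    def reduce_chunk(self, chunks):
        # chunks: one array per worker, all covering the same gradient offsets
        assert len(chunks) == self.num_workers
        return np.sum(chunks, axis=0)


def in_network_allreduce(worker_grads, aggregator):
    """Aggregate equal-length gradients from all workers, chunk by chunk."""
    length = len(worker_grads[0])
    aggregated = np.empty(length, dtype=worker_grads[0].dtype)
    for start in range(0, length, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, length)
        chunks = [g[start:end] for g in worker_grads]
        aggregated[start:end] = aggregator.reduce_chunk(chunks)
    # Every worker receives the same aggregated model update.
    return [aggregated.copy() for _ in worker_grads]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(1000).astype(np.float32) for _ in range(4)]
    agg = Aggregator(num_workers=4)
    results = in_network_allreduce(grads, agg)
    # The chunked, in-network sum matches a plain all-reduce of the gradients.
    np.testing.assert_allclose(results[0], np.sum(grads, axis=0), rtol=1e-5)
```

In this toy setting the aggregator is an ordinary Python object; in the paper's design that role is played by the programmable switch, which is why the volume of data crossing the network drops compared with worker-to-worker exchange.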
Full paper
Read the original paper
A direct open-access PDF is not yet available in the database. Use the source page or the learning resources below to open the complete paper from the publisher or index.