Conference on Machine Learning and Systems / 2020

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

S. Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xuelin Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang

Foundation Models, Large Language Models, ML Systems, Popular and Landmark Papers

Distributed training techniques have been widely deployed for training large-scale deep neural networks (DNNs) on dense-GPU clusters. However, on public cloud clusters, the moderate interconnect bandwidth between instances prevents traditional state-of-the-art distributed training systems from scaling well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
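To make the term "top-k sparsification" concrete, here is a minimal sketch of the general technique, not the paper's library. It assumes the common formulation in which each worker sends only the k largest-magnitude gradient entries and keeps the unsent remainder as a local residual that is added to the next step's gradient. The function names (`topk_sparsify`, `sparsify_with_residual`, `densify`) are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import heapq


def topk_sparsify(grad, k):
    """Return sorted indices and values of the k largest-magnitude entries."""
    if k <= 0:
        return [], []
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    idx.sort()
    return idx, [grad[i] for i in idx]


def sparsify_with_residual(grad, residual, k):
    """Add the carried-over residual, select top-k, keep the rest locally.

    Assumption: residual accumulation is a common companion to top-k
    sparsification; the abstract does not say whether the paper uses it.
    """
    acc = [g + r for g, r in zip(grad, residual)]
    idx, vals = topk_sparsify(acc, k)
    new_residual = list(acc)
    for i in idx:
        new_residual[i] = 0.0  # sent entries leave no residual behind
    return idx, vals, new_residual


def densify(idx, vals, n):
    """Rebuild a dense length-n vector from (index, value) pairs."""
    out = [0.0] * n
    for i, v in zip(idx, vals):
        out[i] = v
    return out


grad = [0.1, -2.0, 0.3, 1.5]
idx, vals, residual = sparsify_with_residual(grad, [0.0] * len(grad), k=2)
```

With k = 2 of 4 entries, a worker would exchange two (index, value) pairs instead of the full vector. That reduction in traffic is why sparsification helps when the bandwidth between instances is the bottleneck.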

60 citations, 7 influential

Learning resources:
- Google Scholar: Google Scholar references (https://scholar.google.com/scholar?q=Towards%20Scalable%20Distributed%20Training%20of%20Deep%20Learning%20on%20Public%20Cloud%20Clusters)
- Papers with Code: Papers with Code search (https://paperswithcode.com/search?q=Towards%20Scalable%20Distributed%20Training%20of%20Deep%20Learning%20on%20Public%20Cloud%20Clusters)
- Semantic Scholar: Semantic Scholar paper page (https://www.semanticscholar.org/paper/f57db358459390590bc838663025dae0f8d51ebf)
- YouTube: YouTube explanations (https://www.youtube.com/results?search_query=Towards%20Scalable%20Distributed%20Training%20of%20Deep%20Learning%20on%20Public%20Cloud%20Clusters+paper+explained)