Research Paper ML Hub

AI research atlas / v2

Learn AI papers in the right order.

Start with landmark ideas, move through foundations, then branch into LLMs, GenAI, agents, systems, and safety with a reading path that keeps the field from feeling random.

10 learning tracks · Full-paper reader · ChatGPT handoff
Recommended first: Landmark papers

Build the mental timeline before going deep.

Then specialize: LLMs, GenAI, safety

Move from foundations to modern systems.

Read mode: PDF + resources
Path-first: No more random paper hopping
Research-native: arXiv links, PDFs, resources
Study loop: Track reading and discuss in ChatGPT

Learning path

Where to start, and what to read next

Start with landmarks
01

Orientation / 1-2 weeks

Start Here

Read the papers everyone keeps referencing so the rest of the map has anchors.

Know the landmark names · Build historical context · Pick a direction
Open papers
02

Foundations / 2-4 weeks

Classical ML

Learn the statistical and probabilistic ideas that still sit under modern models.

Bayesian thinking · Model evaluation · Uncertainty
Open papers
03

Foundations / 1-2 weeks

Optimization

Understand the training mechanics behind gradient-based learning.

Gradient descent · Generalization · Training stability
Open papers
04

Builder / 3-5 weeks

Deep Learning Core

Move through representation learning, CNNs, residual networks, and scaling patterns.

CNN intuition · Representation learning · Benchmark culture
Open papers
05

Builder / 3-6 weeks

Sequence Models and LLMs

Study attention, transformers, language modeling, instruction tuning, and evaluation.

Attention · Pretraining · Instruction following
Open papers
06

Specialist / 3-6 weeks

Generative AI

Compare GANs, diffusion, autoregressive generation, and modern GenAI workflows.

Diffusion · GANs · Generation tradeoffs
Open papers
07

Specialist / 2-4 weeks

Multimodal and Retrieval

Connect language with images, retrieval, embeddings, and real-world knowledge access.

Vision-language · Embeddings · Retrieval
Open papers
08

Specialist / 3-5 weeks

RL and Agents

Learn decision making, feedback, policy learning, and agent-style systems.

Policies · Rewards · Exploration
Open papers
09

Practitioner / 2-4 weeks

Systems and Scaling

Understand the infrastructure and engineering papers behind large-scale training.

Distributed training · Serving · Efficiency
Open papers
10

Practitioner / 2-4 weeks

Safety and Interpretability

Study robustness, alignment, transparency, and how to reason about model behavior.

Alignment · Robustness · Interpretability
Open papers

Research library

ML Systems

Showing papers for this learning path. Open any paper card to read the full paper and related resources.

40 papers shown
Unread · 2015

Distilling the Knowledge in a Neural Network

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.

Geoffrey E. Hinton, Oriol Vinyals, Jeff Dean 13,925
ML Systems
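
As a concrete illustration of the recipe in the abstract above, here is a minimal PyTorch-style sketch of a temperature-scaled distillation loss. It is a toy under our own assumptions (the temperature, mixing weight, and function name are illustrative, not values from the paper), but it shows the two ingredients: soft targets from the teacher and the usual hard-label loss.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target KL divergence (at temperature T) with hard-label cross-entropy.
    T and alpha are illustrative hyperparameters, not the paper's values."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_preds = F.log_softmax(student_logits / T, dim=-1)
    # Scaling by T**2 keeps the soft-target gradients comparable as T changes.
    soft_loss = F.kl_div(soft_preds, soft_targets, reduction="batchmean") * (T * T)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```
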
Unread · 2016

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.

Martín Abadi, Ashish Agarwal, P. Barham 11,666
ML Systems
Unread · 2021

LoRA: Low-Rank Adaptation of Large Language Models

An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

Edward J. Hu, Yelong Shen, Phillip Wallis 2,410
ML Systems
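
The low-rank update described in the abstract can be sketched in a few lines of PyTorch: the pretrained weight stays frozen and only two small factors are trained. This is a minimal sketch under our own naming and scaling conventions, not the released microsoft/LoRA implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable rank-r update: y = x W^T + (x A^T B^T) * scale."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```

Because only A and B receive gradients, the trainable parameter count drops from in_features × out_features to r × (in_features + out_features), which is the source of the memory savings the abstract quotes.
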
Unread · 2018

The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance. We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective. We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.

Jonathan Frankle, Michael Carbin 1,323
ML Systems
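
The iterative procedure implied by the abstract (train, prune the smallest-magnitude weights, rewind the survivors to their original initialization, repeat) can be summarized in a short sketch. The training callback and pruning schedule below are placeholders of ours, not the paper's exact experimental setup.

```python
import copy
import torch

def find_winning_ticket(model, train_fn, prune_fraction=0.2, rounds=5):
    """Iterative magnitude pruning with weight rewinding (simplified sketch)."""
    init_state = copy.deepcopy(model.state_dict())        # remember the original initialization
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(rounds):
        train_fn(model, masks)                             # hypothetical: train with masks applied
        for name, param in model.named_parameters():
            alive = param[masks[name].bool()].abs()
            threshold = alive.quantile(prune_fraction)     # prune smallest fraction of surviving weights
            masks[name] = masks[name] * (param.abs() > threshold).float()
        model.load_state_dict(init_state)                  # rewind survivors to their initial values
    return masks
```
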
Unread · 2019

A Survey on Distributed Machine Learning

The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available.

Joost Verbraeken, Matthijs Wolting, J. Katzy 851
ML Systems
Unread · 2019

Scaling Distributed Machine Learning with In-Network Aggregation

Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and at least by 20% for a number of real-world benchmark models.

Amedeo Sapio, Marco Canini, Chen-Yu Ho 526
ML Systems
Unread · 2023

The applications of machine learning techniques in medical data processing based on distributed computing and the Internet of Things

Medical data processing has grown into a prominent topic in the latest decades with the primary goal of maintaining patient data via new information technologies, including the Internet of Things (IoT) and sensor technologies, which generate patient indexes in hospital data networks. Innovations like distributed computing, Machine Learning (ML), blockchain, chatbots, wearables, and pattern recognition can adequately enable the collection and processing of medical data for decision-making in the healthcare era. Particularly, to assist experts in the disease diagnostic process, distributed computing is beneficial by digesting huge volumes of data swiftly and producing personalized smart suggestions. On the other side, the current globe is confronting an outbreak of COVID-19, so an early diagnosis technique is crucial to lowering the fatality rate. ML systems are beneficial in aiding radiologists in examining the incredible amount of medical images. Nevertheless, they demand a huge quantity of training data that must be unified for processing. Hence, developing Deep Learning (DL) confronts multiple issues, such as conventional data collection, quality assurance, knowledge exchange, privacy preservation, administrative laws, and ethical considerations. In this research, we intend to convey an inclusive analysis of the most recent studies in distributed computing platform applications based on five categorized platforms, including cloud computing, edge, fog, IoT, and hybrid platforms. So, we evaluated 27 articles regarding the usage of the proposed framework, deployed methods, and applications, noting the advantages, drawbacks, and the applied dataset and screening the security mechanism and the presence of the Transfer Learning (TL) method. As a result, it was proved that most recent research (about 43%) used the IoT platform as the environment for the proposed architecture, and most of the studies (about 46%) were done in 2021. In addition, the most popular utilized DL algorithm was the Convolutional Neural Network (CNN), with a percentage of 19.4%. Hence, despite how technology changes, delivering appropriate therapy for patients is the primary aim of healthcare-associated departments. Therefore, further studies are recommended to develop more functional architectures based on DL and distributed environments and better evaluate the present healthcare data analysis models.

Sarina Aminizadeh, Arash Heidari, Shiva Toumaj 151
ML Systems
Unread · 2021

Towards Demystifying Serverless Machine Learning Training

The appeal of serverless (FaaS) has triggered a growing interest on how to use it in data-intensive applications such as ETL, query processing, or machine learning (ML). Several systems exist for training large-scale ML models on top of serverless infrastructures (e.g., AWS Lambda) but with inconclusive results in terms of their performance and relative advantage over "serverful" infrastructures (IaaS). In this paper we present a systematic, comparative study of distributed ML training over FaaS and IaaS. We present a design space covering design choices such as optimization algorithms and synchronization protocols, and implement a platform, LambdaML, that enables a fair comparison between FaaS and IaaS. We present experimental results using LambdaML, and further develop an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure. Our results indicate that ML training pays off in serverless only for models with efficient (i.e., reduced) communication and that quickly converge. In general, FaaS can be much faster but it is never significantly cheaper than IaaS.

Jiawei Jiang, Shaoduo Gan, Yue Liu 147
ML Systems
Unread · 2021

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

The recent many-fold increase in the size of deep neural networks makes efficient distributed training challenging. Many proposals exploit the compressibility of the gradients and propose lossy compression techniques to speed up the communication stage of distributed training. Nevertheless, compression comes at the cost of reduced model quality and extra computation overhead. In this work, we design an efficient compressor with minimal overhead. Noting the sparsity of the gradients, we propose to model the gradients as random variables distributed according to some sparsity-inducing distributions (SIDs). We empirically validate our assumption by studying the statistical characteristics of the evolution of gradient vectors over the training process. We then propose Sparsity-Inducing Distribution-based Compression (SIDCo), a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC) while being faster by imposing lower compression overhead. Our extensive evaluation of popular machine learning benchmarks involving both recurrent neural network (RNN) and convolutional neural network (CNN) models shows that SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.

A. M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini 99
ML Systems
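
For orientation, the sparsification operation that SIDCo accelerates looks roughly like the toy sketch below: keep only the largest-magnitude gradient entries and ship them as (indices, values). Here the threshold comes from an exact top-k for clarity; the paper's contribution is estimating it cheaply from a fitted sparsity-inducing distribution.

```python
import torch

def sparsify(grad: torch.Tensor, ratio: float = 0.01):
    """Keep the top `ratio` fraction of entries by magnitude (toy version, exact top-k)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]            # send these instead of the dense tensor

def desparsify(indices, values, shape):
    """Rebuild a dense gradient from the sparse message on the receiving side."""
    out = torch.zeros(shape).flatten()
    out[indices] = values
    return out.reshape(shape)
```
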
Unread · 2021

Breaking the computation and communication abstraction barrier in distributed machine learning workloads

Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time consuming and error-prone. Therefore, we present CoCoNet, which contains (i) a domain specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model- and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.

Abhinav Jangda, Jun Huang, Guodong Liu 95
ML Systems
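
The kind of computation/communication overlap that CoCoNet generates automatically can be approximated by hand with asynchronous collectives: start reducing one gradient bucket while the next one is still being computed. The sketch below assumes an initialized torch.distributed process group; `buckets`, `compute_grad`, and `apply_update` are hypothetical placeholders, not CoCoNet's API.

```python
import torch.distributed as dist

def overlapped_backward(buckets):
    """Overlap all-reduce of bucket i with local computation of bucket i+1 (sketch)."""
    pending = []
    for bucket in buckets:
        grad = bucket.compute_grad()                    # local computation for this bucket
        work = dist.all_reduce(grad, async_op=True)     # communication starts immediately
        pending.append((work, bucket, grad))            # keep computing the next bucket meanwhile
    for work, bucket, grad in pending:
        work.wait()                                     # reduction finished
        bucket.apply_update(grad)                       # hypothetical per-bucket optimizer step
```
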
Unread · 2020

KungFu: Making Training in Distributed Machine Learning Adaptive

No abstract available yet.

Luo Mai, Guo Li, Marcel Wagenländer 91
ML Systems
Unread · 2024

[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

No abstract available yet.

Qizhe Zhang, Aosong Cheng, Ming Lu 85
ML Systems
Unread · 2021

GRACE: A Compressed Communication Framework for Distributed Machine Learning

Powerful computer clusters are used nowadays to train complex deep neural networks (DNN) on large datasets. Distributed training increasingly becomes communication bound. For this reason, many lossy compression techniques have been proposed to reduce the volume of transferred data. Unfortunately, it is difficult to argue about the behavior of compression methods, because existing work relies on inconsistent evaluation testbeds and largely ignores the performance impact of practical system configurations. In this paper, we present a comprehensive survey of the most influential compressed communication methods for DNN training, together with an intuitive classification (i.e., quantization, sparsification, hybrid and low-rank). Next, we propose GRACE, a unified framework and API that allows for consistent and easy implementation of compressed communication on popular machine learning toolkits. We instantiate GRACE on TensorFlow and PyTorch, and implement 16 such methods. Finally, we present a thorough quantitative evaluation with a variety of DNNs (convolutional and recurrent), datasets and system configurations. We show that the DNN architecture affects the relative performance among methods. Interestingly, depending on the underlying communication library and computational cost of compression / decompression, we demonstrate that some methods may be impractical. GRACE and the entire benchmarking suite are available as open-source.

Hang Xu, Chen-Yu Ho, A. M. Abdelmoniem 85
ML Systems
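
The "unified framework and API" idea can be pictured with a small interface sketch: every compression method reduces a gradient to a compact message and can restore it on the receiving side. The class and method names below are ours for illustration, not the GRACE API itself.

```python
import torch

class Compressor:
    """Generic interface: compress a gradient into a payload, decompress it later."""
    def compress(self, tensor):
        raise NotImplementedError
    def decompress(self, payload, context):
        raise NotImplementedError

class SignCompressor(Compressor):
    """1-bit style quantization: send signs plus a single per-tensor scale."""
    def compress(self, tensor):
        scale = tensor.abs().mean()
        return torch.sign(tensor), scale
    def decompress(self, signs, scale):
        return signs * scale
```
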
Unread · 2024

Privacy-Preserving in Blockchain-based Federated Learning Systems

Federated Learning (FL) has recently arisen as a revolutionary approach to the collaborative training of Machine Learning models. According to this novel framework, multiple participants train a global model collaboratively, coordinating with a central aggregator without sharing their local data. As FL gains popularity in diverse domains, security and privacy concerns arise due to the distributed nature of this solution. Therefore, integrating this strategy with Blockchain technology has been consolidated as a preferred choice to ensure the privacy and security of participants. This paper explores the research efforts carried out by the scientific community to define privacy solutions in scenarios adopting Blockchain-Enabled FL. It comprehensively summarizes the background related to FL and Blockchain, evaluates existing architectures for their integration, and examines the primary attacks and possible countermeasures to guarantee privacy in this setting. Finally, it reviews the main application scenarios where Blockchain-Enabled FL approaches have been proficiently applied. This survey can help academia and industry practitioners understand which theories and techniques exist to improve the performance of FL through Blockchain to preserve privacy and which are the main challenges and future directions in this novel and still under-explored context. We believe this work provides a novel contribution with respect to previous surveys and is a valuable tool to explore the current landscape, understand perspectives, and pave the way for advancements or improvements in this amalgamation of Blockchain and Federated Learning.

Sameera K.M., S. Nicolazzo, Marco Arazzi 75
ML Systems
Unread · 2021

Clairvoyant Prefetching for Distributed Machine Learning I/O

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems, and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning I/O middleware, which provides a scalable, flexible, and easy-to-use solution to the I/O bottleneck. NoPFS uses clairvoyance: Given the seed generating the random access pattern for training with SGD, it can exactly predict when and where a sample will be accessed. We combine this with an analysis of access patterns and a performance model to provide distributed caching policies that adapt to different datasets and storage hierarchies. NoPFS reduces I/O times and improves end-to-end training by up to 5.4× on the ImageNet-1k, ImageNet-22k, and CosmoFlow datasets.

Roman Böhringer, Nikoli Dryden, Tal Ben-Nun 72
ML Systems
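
The "clairvoyance" in the abstract rests on a simple observation, sketched below: if every node knows the shuffling seed, the entire epoch-by-epoch access order is computable in advance, so future sample accesses can be staged ahead of time. This NumPy sketch is our own illustration of the idea, not NoPFS code.

```python
import numpy as np

def access_order(seed: int, num_samples: int, epochs: int):
    """Recreate the full shuffled access schedule from the seed alone."""
    rng = np.random.default_rng(seed)
    # Same seed on every node => identical, fully predictable schedule.
    return [rng.permutation(num_samples) for _ in range(epochs)]

schedule = access_order(seed=42, num_samples=8, epochs=2)
# A prefetcher can now stage sample schedule[e][i] before step i of epoch e.
```
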
Unread · 2023

Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

Scaling models to large sizes to improve performance has led to a trend in deep learning, and sparsely activated Mixture-of-Expert (MoE) is a promising architecture to scale models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers. All-to-All communication originates from the expert-centric paradigm: keeping experts in-place and exchanging intermediate data to feed experts. We propose the novel data-centric paradigm: keeping data in-place and moving experts between GPUs. Since experts' size can be smaller than the size of data, the data-centric paradigm can reduce communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which can overlap computation and communication. Janus implements a hierarchical communication to further reduce cross-node traffic by sharing the fetched experts in the same machine. Second, when scheduling the "fetching expert" requests, Janus implements a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, which allows the downstream computation to start immediately once the previous step completes. Evaluated on a 32-A100 cluster, Janus can reduce the traffic up to 16× and achieves up to 2.06× speedup compared with the current MoE training system.

Juncai Liu, Jessie Hui Wang, Yimin Jiang 69
ML Systems
Unread · 2019

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

Deep learning is a popular machine learning technique and has been applied to many real-world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data. It has become difficult for a single machine to train a large model over large datasets. A popular solution is to distribute and parallelize the training process across multiple machines using the parameter server framework. In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP) which improves the state-of-the-art Stale Synchronous Parallel (SSP) paradigm by dynamically determining the staleness threshold at run time. Conventionally, to run distributed training in SSP, the user needs to specify a particular staleness threshold as a hyper-parameter. However, a user does not usually know how to set the threshold and thus often finds a threshold value through trial and error, which is time-consuming. Based on workers' recent processing time, our approach DSSP adaptively adjusts the threshold per iteration at running time to reduce the waiting time of faster workers for synchronization of the globally shared parameters (the weights of the model), and consequently increases the frequency of parameter updates (increases iteration throughput), which speeds up the convergence rate. We compare DSSP with other paradigms such as Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and SSP by running deep neural network (DNN) models over GPU clusters in both homogeneous and heterogeneous environments. The results show that in a heterogeneous environment where the cluster consists of mixed models of GPUs, DSSP converges to a higher accuracy much earlier than SSP and BSP and performs similarly to ASP. In a homogeneous distributed cluster, DSSP has more stable and slightly better performance than SSP and ASP, and converges much faster than BSP.

Xing Zhao, Aijun An, Junfeng Liu 66
ML Systems
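
The stale-synchronous rule that DSSP builds on fits in one line, sketched below: a worker may run ahead of the slowest worker by at most a bounded number of iterations before it must wait. This is our own toy illustration; DSSP's contribution is adjusting that bound at run time from workers' recent processing times.

```python
def may_proceed(my_iteration: int, slowest_iteration: int, staleness: int) -> bool:
    """Stale Synchronous Parallel gate: proceed only while within the staleness bound."""
    return my_iteration - slowest_iteration <= staleness

# staleness = 0 recovers BSP, staleness = "infinity" recovers ASP;
# DSSP picks and adapts a value in between per iteration.
```
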
Unread · 2024

Deep Learning or Classical Machine Learning? An Empirical Study on Log-Based Anomaly Detection

While deep learning (DL) has emerged as a powerful technique, its benefits must be carefully considered in relation to computational costs. Specifically, although DL methods have achieved strong performance in log anomaly detection, they often require extended time for log preprocessing, model training, and model inference, hindering their adoption in online distributed cloud systems that require rapid deployment of log anomaly detection service. This paper investigates the superiority of DL methods compared to simpler techniques in log anomaly detection. We evaluate basic algorithms (e.g., KNN, SLFN) and DL approaches (e.g., CNN) on five public log anomaly detection datasets (e.g., HDFS). Our findings demonstrate that simple algorithms outperform DL methods in both time efficiency and accuracy. For instance, on the Thunderbird dataset, the K-nearest neighbor algorithm trains 1,000 times faster than NeuralLog while achieving a higher F1-Score by 0.0625. We also identify three factors contributing to this phenomenon, which are: (1) redundant log preprocessing strategies, (2) dataset simplicity, and (3) the nature of binary classification in log anomaly detection. To assess the necessity of DL, we propose LightAD, an architecture that optimizes training time, inference time, and performance score. With automated hyper-parameter tuning, LightAD allows fair comparisons among log anomaly detection models, enabling engineers to evaluate the suitability of complex DL methods. Our findings serve as a cautionary tale for the log anomaly detection community, highlighting the need to critically analyze datasets and research tasks before adopting DL approaches. Researchers proposing computationally expensive models should benchmark their work against lightweight algorithms to ensure a comprehensive evaluation.

Boxi Yu, Jiayi Yao, Qiuai Fu 63
ML Systems
Unread · 2020

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Distributed training techniques have been widely deployed in large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well in training large-scale models. In this paper, we propose a new computing and communication efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system trains 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench on training ResNet-50 to 93% top-5 accuracy on ImageNet.

S. Shi, Xianhao Zhou, Shutao Song 60
ML Systems
Unread · 2022

A Distributed Intrusion Detection System using Machine Learning for IoT based on ToN-IoT Dataset

The internet of things (IoT) is a collection of common physical things which can communicate and synthesize data utilizing network infrastructure by connecting to the internet. IoT networks are increasingly vulnerable to security breaches as their popularity grows. Cyber security attacks are among the most popular severe dangers to IoT security. Many academics are increasingly interested in enhancing the security of IoT systems. Machine learning (ML) approaches were employed to function as intrusion detection systems (IDSs) to provide better security capabilities. This work proposed a novel distributed detection system based on ML approaches to detect attacks in IoT and mitigate malicious occurrences. Furthermore, NSL-KDD or KDD-CUP99 datasets are used in the great majority of current studies. These datasets are not updated with new attacks. As a consequence, the ToN-IoT dataset was used for training and testing. It was created from a large-scale, diverse IoT network. The ToN-IoT dataset reflects data from each layer of the IoT system, such as cloud, fog, and edge layers. Various ML methods were tested on each specific partition of the ToN-IoT dataset. The proposed model is the first suggested model based on the collected data from the same IoT system from all layers. The Chi2 technique was used to pick features in the network dataset. It reduced the number of features to 20. Another feature selection tool employed on the Windows dataset was the correlation matrix, which was used to extract the most relevant features from the whole dataset. To balance the classes, the SMOTE method was used. This paper tests numerous ML approaches on both binary and multi-class classification problems. According to the findings, the XGBoost approach is superior to other ML algorithms for each node in the suggested model.

Abdallah R. Gad, M. Haggag, Ahmed A. Nashat 57
ML Systems
Unread · 2021

On the Utility of Gradient Compression in Distributed Training Systems

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 different setups. Surprisingly, we observe that only in 6 cases out of more than 200, gradient compression methods provide speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide a meaningful end-to-end speedup.

Saurabh Agarwal, Hongyi Wang, S. Venkataraman 57
ML Systems
Unread · 2022

Dynamic GPU Energy Optimization for Machine Learning Training Workloads

GPUs are widely used to accelerate the training of machine learning workloads. As the machine learning models become increasingly larger, they require a longer time to train, which in turn leads to higher GPU energy consumption. This paper presents GPOEO, an online GPU energy optimization framework for machine learning training workloads. GPOEO dynamically determines the optimal energy configuration by employing a set of novel techniques for online measurement, multi-objective prediction modeling, and search optimization. To characterize the target workload behavior, GPOEO utilizes GPU performance counters. To reduce the performance counter profiling overhead, it uses an analytical model to detect the change of training iteration and only reprofile the performance counter when an iteration shift is detected. Then we use multi-objective models, based on the gradient boosting method, and a local search algorithm, to find a trade-off between execution time and energy consumption. We evaluate the GPOEO by applying it to 71 machine learning workloads from two AI benchmark suites on an NVIDIA RTX3080Ti GPU. Compared with the NVIDIA default scheduling strategy, GPOEO delivers a mean energy saving of 16.2% with an average execution time increase of 5.1%.

Farui Wang, Weizhe Zhang, Shichao Lai 48
ML Systems
Unread · 2019

Distributed Machine Learning through Heterogeneous Edge Systems

Many emerging AI applications request distributed machine learning (ML) among edge systems (e.g., IoT devices and PCs at the edge of the Internet), where data cannot be uploaded to a central venue for model training, due to their large volumes and/or security/privacy concerns. Edge devices are intrinsically heterogeneous in computing capacity, posing significant challenges to parameter synchronization for parallel training with the parameter server (PS) architecture. This paper proposes ADSP, a parameter synchronization model for distributed machine learning (ML) with heterogeneous edge systems. Eliminating the significant waiting time occurring with existing parameter synchronization models, the core idea of ADSP is to let faster edge devices continue training, while committing their model updates at strategically decided intervals. We design algorithms that decide time points for each worker to commit its model update, and ensure not only global model convergence but also faster convergence. Our testbed implementation and experiments show that ADSP outperforms existing parameter synchronization models significantly in terms of ML model convergence time, scalability and adaptability to large heterogeneity.

Han Hu, Dan Wang, Chuan Wu 48
ML Systems
Unread · 2021

DC2: Delay-aware Compression Control for Distributed Machine Learning

Distributed training performs data-parallel training of DNN models which is a necessity for increasingly complex models and large datasets. Recent works are identifying major communication bottlenecks in distributed training. These works seek possible opportunities to speed-up the training in systems supporting distributed ML workloads. As communication reduction, compression techniques are proposed to speed up this communication phase. However, compression comes at the cost of reduced model accuracy, especially when compression is applied arbitrarily. Instead, we advocate a more controlled use of compression and propose DC2, a delay-aware compression control mechanism. DC2 couples compression control and network delays in applying compression adaptively. DC2 not only compensates for network variations but can also strike a better trade-off between training speed and accuracy. DC2 is implemented as a drop-in module to the communication library used by the ML toolkit and can operate in a variety of network settings. We empirically evaluate DC2 in network environments exhibiting low and high delay variations. Our evaluation of different popular CNN models and datasets shows that DC2 improves training speed-ups of up to 41× and 5.3 × over baselines with no-compression and uniform compression, respectively.

A. M. Abdelmoniem, Marco Canini 44
ML Systems
Unread · 2019

Attention Is All You Need for Chinese Word Segmentation

Taking greedy decoding algorithm as it should be, this work focuses on further strengthening the model itself for Chinese word segmentation (CWS), which results in an even more fast and more accurate CWS model. Our model consists of an attention only stacked encoder and a light enough decoder for the greedy segmentation plus two highway connections for smoother training, in which the encoder is composed of a newly proposed Transformer variant, Gaussian-masked Directional (GD) Transformer, and a biaffine attention scorer. With the effective encoder design, our model only needs to take unigram features for scoring. Our model is evaluated on SIGHAN Bakeoff benchmark datasets. The experimental results show that with the highest segmentation speed, the proposed model achieves new state-of-the-art or comparable performance against strong baselines in terms of strict closed test setting.

Sufeng Duan, Hai Zhao 40
ML Systems
Unread · 2019

Model poisoning attacks against distributed machine learning systems

Future military coalition operations will increasingly rely on machine learning (ML) methods to improve situational awareness. The coalition context presents unique challenges for ML: the tactical environment creates significant computing and communications limitations while also having to deal with an adversarial presence. Further, coalition operations must operate in a distributed manner, while coping with the constraints posed by the operational environment. Envisioned ML deployments in military assets must be resilient to these challenges. Here, we focus on the susceptibility of ML models to be poisoned (during training) or fooled (after training) by adversarial inputs. We review recent work on distributed adversarial ML, and present new results from our own investigations into model poisoning attacks on distributed learning systems without a central parameter aggregation node.

Richard J. Tomsett, K. Chan, Supriyo Chakraborty 35
ML Systems
Unread · 2019

Performance Analysis and Comparison of Distributed Machine Learning Systems

Deep learning has permeated through many aspects of computing/processing systems in recent years. While distributed training architectures/frameworks are adopted for training large deep learning models quickly, there has not been a systematic study of the communication bottlenecks of these architectures and their effects on the computation cycle time and scalability. In order to analyze this problem for synchronous Stochastic Gradient Descent (SGD) training of deep learning models, we developed a performance model of computation time and communication latency under three different system architectures: Parameter Server (PS), peer-to-peer (P2P), and Ring allreduce (RA). To complement and corroborate our analytical models with quantitative results, we evaluated the computation and communication performance of these system architectures via experiments performed with the TensorFlow and Horovod frameworks. We found that the system architecture has a very significant effect on the performance of training. RA-based systems achieve scalable performance as they successfully decouple network usage from the number of workers in the system. In contrast, 1PS systems suffer from low performance due to network congestion at the parameter server side. While P2P systems fare better than 1PS systems, they still suffer from a significant network bottleneck. Finally, RA systems also excel by virtue of overlapping computation time and communication time, which PS and P2P architectures fail to achieve.

S. Alqahtani, Murat Demirbas 32
ML Systems
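
A back-of-the-envelope version of the comparison in the abstract: per training step, a single parameter server must move traffic proportional to the number of workers, while ring allreduce keeps each worker's traffic roughly constant. The cost model below is an illustrative simplification of ours (bandwidth terms only, no latency or overlap), not the paper's full performance model.

```python
def parameter_server_bottleneck(model_bytes: float, workers: int) -> float:
    """Bytes through the single server per step: it receives N gradients and sends N models."""
    return workers * model_bytes

def ring_allreduce_per_worker(model_bytes: float, workers: int) -> float:
    """Bytes each worker sends in ring allreduce: ~2M regardless of N, hence scalable."""
    return 2 * (workers - 1) / workers * model_bytes

M = 100e6  # 100 MB of gradients (illustrative)
for n in (4, 16, 64):
    print(n, parameter_server_bottleneck(M, n) / 1e6, ring_allreduce_per_worker(M, n) / 1e6)
```

Running this shows the server-side load growing linearly with the worker count while the ring's per-worker traffic stays near 2M, which is the intuition behind the scalability result reported above.
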
Unread · 2024

Rule-Based With Machine Learning IDS for DDoS Attack Detection in Cyber-Physical Production Systems (CPPS)

Recent advancements in communication technology have transformed the way the industrial system works. This digitalization has improved the way of communication between different actors involved in cyber physical production systems (CPPS), such as users, suppliers, and manufacturers, thus making the whole process transparent. The utilization of emerging new technologies in CPPS can cause vulnerable spots that can be exploited by attackers to launch sophisticated distributed denial of service (DDoS) attacks, hence threatening the availability of the production systems. Existing machine learning based intrusion detection systems (IDS) often rely on unrealistic datasets for training and validation, thus missing the crucial testing phase with real-time scenarios. The results generated by the ML models are based on predictions at each flow level and cannot provide summarized information about malicious entities. To address this limitation, this study proposed an efficient IDS system that uses both rule-based detection and ML-based approaches to detect DDoS attacks damaging the infrastructure of CPPS. For training and validation of the system, we use real-time network traffic extracted from a real industrial scenario, referred to as Farm-to-Fork (F2F) supply chain system. Both, attacks and normal traffic were captured, and bidirectional features were extracted through CIC-FLOWMETER. We make use of 8 ML supervised and unsupervised approaches to detect the malicious flows; and then a rule-based detection mechanism is used to calculate the frequency of the malicious flows and to assign different severity levels based on the computed frequency. The overall results show that supervised models outperform unsupervised approaches and achieve an accuracy 99.97% and TPR 99.96%. Overall, the weighted accuracy when tested and deployed in a real-time scenario is around 98.71%. The results prove that the system works better when considering real-time scenarios and provides comprehensive information about the detected results that can be used to take different mitigation actions.

Ayaz Hussain, Eva Marín Tordera, X. Masip-Bruin 30
ML Systems
Unread · 2023

AMPeD: An Analytical Model for Performance in Distributed Training of Transformers

Transformers are a class of machine learning models that have piqued high interest recently due to a multitude of reasons. They can process multiple modalities efficiently and have excellent scalability. Despite these obvious advantages, training these large models is very time-consuming. Hence, there have been efforts to speed up the training process using efficient distributed implementations. Many different types of parallelism have been identified that can be employed standalone or in combination. However, naively combining different parallelization schemes can incur significant communication overheads, thereby potentially defeating the purpose of distributed training. Thus, it becomes vital to predict the right mapping of different parallelisms to the underlying system architecture. In this work, we propose AMPeD, an analytical model for performance in distributed training of transformers. It exposes all the transformer model parameters, potential parallelism choices (along with their mapping onto the system), the accelerator as well as system architecture specifications as tunable knobs, thereby enabling hardware-software co-design. With the help of 3 case studies, we show that the combinations of parallelisms predicted to be efficient by AMPeD conform with the results from the state-of-the-art literature. Using AMPeD, we also show that future distributed systems consisting of optical communication substrates can train large models up to 4× faster as compared to the current state-of-the-art systems without modifying the peak computational power of the accelerators. Finally, we validate AMPeD with in-house experiments on real systems and via published literature. The maximum observed error is limited to 12%. The model is available here: https://github.com/CSA-infra/AMPeD

Diksha Moolchandani, Joyjit Kundu, F. Ruelens 30
ML Systems
Unread · 2020

Apache Mahout: Machine Learning on Distributed Dataflow Systems

No abstract available yet.

Robin Anil, Gökhan Çapan, Isabel Drost-Fromm 27
ML Systems
Unread · 2021

Mixing Activations and Labels in Distributed Training for Split Learning

Split Learning (SL) is a distributed machine learning setting that allows several nodes to train neural networks based on model parallelism. Since SL avoids sharing raw data among training nodes, it can protect data privacy by nature. However, recent studies show that, raw data may be reconstructed from activations in training, which may cause data privacy leakage. Besides raw data, label sharing in SL may also cause privacy problems. In order to address these issues, we propose a novel mechanism called multiple activations and labels mix (MALM). By taking advantage of the diversity of sample categories, MALM generates mixed activations that preserve a low distance correlation with the raw data so as to reduce the risk of reconstruction attacks. To protect label information, MALM creates obfuscated labels associated with the raw data so as to prevent adversaries from inferring ground-truth labels. Since clients with few sample categories may not effectively generate mixed activations and obfuscated labels, we propose a bipartite graph based assistant client match technique for MALM, which lets clients with a large number of categories provide mixed activations and obfuscated labels for clients with few categories. Those clients with few categories can mix the obtained mixed activations and obfuscated labels with their own activations and labels. Experimental results show that, compared with baselines, MALM can reduce the risk of raw data and label information leakage with lower cost, while achieving comparable even better model performance.

Danyang Xiao, Chengang Yang, Weigang Wu 26
ML Systems
Unread · 2024

Objective metrics for ethical AI: a systematic literature review

The field of AI Ethics has recently gained considerable attention, yet much of the existing academic research lacks practical and objective contributions for the development of ethical AI systems. This systematic literature review aims to identify and map objective metrics documented in literature between January 2018 and June 2023, specifically focusing on the ethical principles outlined in the Ethics Guidelines for Trustworthy AI. The review was based on 66 articles retrieved from the Scopus and Web of Science databases. The articles were categorized based on their alignment with seven ethical principles: Human Agency and Oversight, Technical Robustness and Safety, Privacy and Data Governance, Transparency, Diversity, Non-Discrimination and Fairness, Societal and Environmental Well-being, and Accountability. Of the identified articles, only a minority presented objective metrics to assess AI ethics, with the majority being purely theoretical works. Moreover, existing metrics are primarily concentrated on Diversity, Non-Discrimination and Fairness, with a clear under-representation of the remaining principles. This lack of practical contributions makes it difficult for Data Scientists to devise systems that can be deemed Ethical, or to monitor the alignment of existing systems with current guidelines and legislation. With this work, we lay out the current panorama concerning objective metrics to quantify AI Ethics in Data Science and highlight the areas in which future developments are needed to align Data Science projects with the human values widely posited in the literature.

Guilherme Palumbo, Davide Carneiro, Victor Alves 21
ML Systems
Unread · 2023

DC-SHAP Method for Consistent Explainability in Privacy-Preserving Distributed Machine Learning

Ensuring the transparency of machine learning models is vital for their ethical application in various industries. There has been a concurrent trend of distributed machine learning designed to limit access to training data for privacy concerns. Such models, trained over horizontally or vertically partitioned data, present a challenge for explainable AI because the explaining party may have a biased view of background data or a partial view of the feature space. As a result, explanations obtained from different participants of distributed machine learning might not be consistent with one another, undermining trust in the product. This paper presents an Explainable Data Collaboration Framework based on a model-agnostic additive feature attribution algorithm (KernelSHAP) and Data Collaboration method of privacy-preserving distributed machine learning. In particular, we present three algorithms for different scenarios of explainability in Data Collaboration and verify their consistency with experiments on open-access datasets. Our results demonstrated a significant (by at least a factor of 1.75) decrease in feature attribution discrepancies among the users of distributed machine learning. The proposed method improves consistency among explanations obtained from different participants, which can enhance trust in the product and enable ethical application in various industries.

A. Bogdanova, A. Imakura, T. Sakurai 20
ML Systems
Unread · 2023

MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs. Our analysis, grounded in real-world large model training on datacenter-scale infrastructures, reveals that 14~32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. Through the application of MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput enhancements of up to 2.24 × for pretraining and up to 5.27 × for inference scenarios, respectively.

Samuel Hsia, Alicia Golden, Bilge Acun 19
ML Systems
Unread · 2022

Preemptive Scheduling for Distributed Machine Learning Jobs in Edge-Cloud Networks

No abstract available yet.

Ne Wang, Ruiting Zhou, Lei Jiao 17
ML Systems
Unread · 2018

Communication Scheduling as a First-Class Citizen in Distributed Machine Learning Systems

No abstract available yet.

Sayed Hadi Hashemi, S. Jyothi, R. Campbell 17
ML Systems
Unread · 2024

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpora (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpora of high-quality Windows PE binaries in training modern learning-based binary analyses. Assemblage code is open sourced under the MIT license, and the dataset can be downloaded from https://assemblage-dataset.net

Chang Liu, Rebecca Saul, Yihao Sun 16
ML Systems
Unread · 2024

Improving Discharge Predictions in Ungauged Basins: Harnessing the Power of Disaggregated Data Modeling and Machine Learning

Current machine learning methods for discharge prediction often employ aggregated basin‐wide hydrometeorological data (lumped modeling) for parametric and non‐parametric training. This approach may overlook the spatial heterogeneity of river systems and their impact on discharge patterns. We hypothesize that integrating spatiotemporal hydrologic knowledge into the data modeling process (distributed/disaggregated modeling) can improve the performance of discharge prediction models. To test this hypothesis, we designed experiments comparing the performance of identical Long Short‐Term Memory Recurrent Neural Network (LSTM‐RNN) models forced with either lumped or distributed features. We gather meteorological forcing and static attributes for the Mackenzie basin in Canada‐ a large and unique basin. Importantly, discharge performance is assessed out‐of‐sample with k‐fold replication across gauges. Training LSTMs with disaggregated data significantly improved model accuracy. Specifically, there was a 9.6% increase in the mean Nash‐Sutcliffe Efficiency and a 4.6% increase in the mean Kling‐Gupta Efficiency, indicating a better agreement between predicted and actual observations in terms of mean, variability, and correlation. These experiments and results demonstrate the importance of integrating topologically guided geomorphologic and hydrologic information (distributed modeling) in data‐driven discharge predictions.

Aggrey Muhebwa, C. Gleason, D. Feng 16
ML Systems
Unread · 2021

X-NEST: A Scalable, Flexible, and High-Performance Network Architecture for Distributed Machine Learning

In a large-scale distributed machine learning system, the interconnection network between computing devices has an important impact on performance in the training of neural network models. The current expansion of training data and model size has led to a rapid increase in the number of computing devices used in distributed machine learning systems, which places higher demands on network scalability. In addition, the synchronization algorithms used for data exchange between computing devices have different communication topologies, and traditional electrical networks have difficulty matching them due to their fixed network topology. Neural network models and model partitioning methods can also affect the amount of communication between devices, but the overprovisioned bandwidth of traditional electric networks incurs unnecessary costs. To address these issues, we propose a scalable, flexible, and high-performance network architecture called X-NEST. The flexibility of optical switching devices allows X-NEST to dynamically change its topology and the number of links between devices according to traffic pattern variations, thereby improving network performance and resource utilization. Although changes in the connection relationships between devices depend on the controller, the simple and flexible control plane of X-NEST can quickly respond to network communication requirements. Extensive analytical simulations using different traffic patterns demonstrate that X-NEST copes well with the communication characteristics of various synchronization algorithms.

Yunfeng Lu, Huaxi Gu, Xiaoshan Yu 16
ML Systems
Unread · 2024

When In-Network Computing Meets Distributed Machine Learning

The emerging In-Network Computing (INC) technique provides a new opportunity to improve application performance by using the network programmability, computational capability, and storage capacity enabled by programmable switches. One typical application is Distributed Machine Learning (DML), which accelerates machine learning training by employing multiple workers to train a model in parallel. This paper introduces INC-based DML systems, analyzes the performance improvement from using INC, and overviews current studies of INC-based DML systems. We also propose potential research directions for applying INC to DML systems.

Haowen Zhu, Wenchao Jiang, Qi Hong 15
ML Systems