CIF: Small: Numerically-Stable Large-Scale Coded Distributed Computing

abstract

In this data-centric age, staggering amounts of data is being collected and processed using computers at unprecedented scales of complexity. In many situations, a single computer/server does not possess enough computational power to process such a large volume of data fast enough to execute the task at hand. A useful approach to speed up the computations is to break the computing task into smaller tasks and to distribute the smaller tasks among many worker nodes. The results from worker nodes can then be assembled at a master node to complete the original computing task. Such a distributed approach has the potential to provide a dramatic speed-up; however, a major limitation of this approach is that even a single straggling worker node can slow down the entire process. Similarly, a single faulty or malicious worker node can introduce errors in the computation, affecting the overall computation task. Coded distributed computing is therefore used to speed up computations even in the presence of stragglers and faulty or malicious nodes. This project will advance the state of the art in the domain of coded distributed computing by designing robust algorithms for distributed computation that provide (i) resilience to straggling workers, (ii) resilience to malicious or faulty workers, and (iii) scalability, i.e., the numerical accuracy and implementation complexity of the designed algorithms will scale efficiently with the number of workers. This project has the potential to have a broad impact on the development of modern computing infrastructures. This project will design and analyze numerically-stable and computationally-efficient coded distributed computing schemes for matrix-matrix multiplication and multivariate polynomial evaluation - two essential tasks frequently used in machine-learning and deep-learning algorithms. The focus will be on systems with a moderate to large number of worker nodes. Novel coding-theoretic ideas will be pursued, and powerful mathematical tools from the rich literature on random linear codes, rateless fountain codes, multi-dimensional product codes, as well as collaborative decoding of algebraic codes will be leveraged. The performance of the schemes developed in the project will be analyzed in terms of the number of straggling worker nodes that can be tolerated, the number of adversarial or random errors in the computations of the worker nodes that can be detected/corrected, the implementation complexity and the numerical stability. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

date/time interval

2020 - 2023

awarded by

Directorate for Computer & Information Science & Engineering - (Arlington, Virginia, United States)

CIF: Small: Numerically-Stable Large-Scale Coded Distributed Computing Grant

Overview

abstract

date/time interval

awarded by

Affiliation

contributor