Large-scale machine learning and data mining applications require computer systems to perform massive computations that need to be parallelized across multiple nodes, for example, massive matrix-vector and matrix-matrix multiplication. The presence of straggling nodes -- computing nodes that unpredictably slowdown or fail -- is a major bottleneck in such distributed computations. We propose a rateless fountain coding strategy to alleviate the problem of stragglers in distributed matrix-vector multiplication. Our algorithm generates linear combinations of the 'm' rows of the matrix, and assigns them to different worker nodes, which then perform row-vector products with the encoded rows. The original matrix-vector product can be decoded as soon as slightly more than ‘m' row-vector products are collectively completed by the nodes. This strategy enables fast nodes to steal work from slow nodes, without requiring the knowledge of node speeds. Compared to recently proposed fixed-rate erasure coding strategies which ignore partial work done by straggling nodes, rateless coding achieves significantly lower overall delay, as well as small computational overhead.
Ankur Mallick is currently a 2nd year PhD student in the ECE Dept. at CMU where he works on developing algorithms for distributed computing and deep learning. He graduated from the Indian Institute of Technology (IIT) Bombay in 2016 with a B.Tech. in Electrical Engineering and a M.Tech. in Communications and Signal Processing. He received the IIT Bombay Undergraduate Research Award (URA 03) for his Master’s Thesis. Between IIT Bombay and CMU, he also spent a year at Sony Corporation, Japan, as an R&D engineer working on signal processing and machine learning algorithms for image sensors.