Implementing a Distributed Deep Learning Network over Spark

Authors: Dr. Vijay Srinivas Agneeswaran, Director and Head, Big Data Labs, Impetus {[email protected]}

and Ghousia Parveen Taj, Lead Software Engineer, Impetus {[email protected]}

Deep learning is becoming an important AI paradigm for pattern recognition, image/video processing, and fraud detection applications in finance. The computational complexity of a deep learning network dictates the need for a distributed realization. Our intention is to parallelize the training phase of the network and consequently reduce training time. We have built the first prototype of our distributed deep learning network over Spark, which has emerged as a de facto standard for realizing machine learning at scale.


Geoffrey Hinton presented a fast learning algorithm for deep belief networks [Hinton 2006]. This paper, together with the advent of GPUs and the widespread availability of computing power, led to the breakthrough in this field. Consequently, every big software technology company is working on deep learning, and startups are adopting it widely. A number of applications are being realized over it across various fields, including credit card fraud detection (see, for example, Deep Learning Analytics from FICO) and multi-modal information processing. This is in addition to areas such as speech recognition and image processing, which have already been transformed by the application of deep learning [Deng 2013].

The team at Google led by Jeffrey Dean produced the first implementation of distributed deep learning [Dean 2012]. Architecturally, it was a pseudo-centralized realization, with a parameter server acting as the single source of parameter values across the distributed system. 0xdata has recently released its H2O software, which includes a deep learning network in addition to several other machine learning algorithms. They have also made H2O work over Spark, as evident from the blog on Sparkling Water. To the extent we have explored, only Microsoft's Project Adam comes close to a fully distributed realization of a deep learning network.

Distributed Deep Learning over Spark

Spark is the next-generation Hadoop framework from the UC Berkeley and Databricks teams; even the Hadoop vendors have started bundling and distributing Spark with their Hadoop distributions. Currently, there is no deep learning implementation that we are aware of, either in MLlib, the machine learning library on top of Spark, or outside of it.

We have implemented stacked Restricted Boltzmann Machines (RBMs) as a deep belief network, similar to the approach in [Roux 2008]. The architecture of our deep learning network over Spark is given in the diagram below.
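To make the training step inside each RBM layer concrete, here is a minimal sketch of a mean-field CD-1 (single-step contrastive divergence) weight update in plain Java. The class and method names, the learning rate, and the omission of bias terms are illustrative assumptions, not our actual implementation.

```java
// Minimal mean-field CD-1 weight update for one RBM layer.
// Bias terms and stochastic sampling are omitted for brevity; names are
// illustrative, not taken from our actual implementation.
public class RbmSketch {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // One CD-1 update for a single visible vector v0.
    // w is nVisible x nHidden; returns an updated copy of w.
    static double[][] cd1Update(double[][] w, double[] v0, double lr) {
        int nV = w.length, nH = w[0].length;
        // Up pass: hidden activations driven by the data.
        double[] h0 = new double[nH];
        for (int j = 0; j < nH; j++) {
            double a = 0;
            for (int i = 0; i < nV; i++) a += v0[i] * w[i][j];
            h0[j] = sigmoid(a);
        }
        // Down pass: reconstruct the visible layer.
        double[] v1 = new double[nV];
        for (int i = 0; i < nV; i++) {
            double a = 0;
            for (int j = 0; j < nH; j++) a += h0[j] * w[i][j];
            v1[i] = sigmoid(a);
        }
        // Up pass on the reconstruction.
        double[] h1 = new double[nH];
        for (int j = 0; j < nH; j++) {
            double a = 0;
            for (int i = 0; i < nV; i++) a += v1[i] * w[i][j];
            h1[j] = sigmoid(a);
        }
        // Gradient step: positive phase minus negative phase.
        double[][] out = new double[nV][nH];
        for (int i = 0; i < nV; i++)
            for (int j = 0; j < nH; j++)
                out[i][j] = w[i][j] + lr * (v0[i] * h0[j] - v1[i] * h1[j]);
        return out;
    }
}
```

Stacking then amounts to training one such layer at a time and feeding each layer's hidden activations as the visible input of the next.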

To achieve our goal of a fully distributed deep learning implementation, we have relied on Hadoop's distributed file system (HDFS) and Spark's in-memory computation for parallel training.

The input dataset is stored as an HDFS file and is thus distributed across the cluster. Each node in the cluster runs an Akka actor, whose role is to share the training results on that node with every other node in the cluster. On receiving a request to train the network, the deep learning framework initializes the weight matrix and makes it available on every node's local file system. The training phase is a Spark application that loads the input file from HDFS into a Spark RDD. Once training for a single RDD partition is complete, the results (the weight matrix) are written to HDFS, and the local actor publishes an update message to every other node in the cluster. On receiving the update message, the actors on the other nodes copy the weight matrix and update their local weight matrices accordingly. For subsequent partitions, the updated weight matrix is used. The training output is the final weight matrix after every dataset block has been trained on.
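Stripped of the Spark and Akka plumbing, the flow above can be sketched as a loop that trains on one data block with the latest weights and folds the result back in before the next block runs. The `trainPartition` body here is a hypothetical placeholder for the real RBM training step, not our actual update rule.

```java
import java.util.List;

// Sequential sketch of the per-partition flow: each block is trained with the
// latest weights, and its result is carried forward to the next block.
// In the real system this hand-off happens via HDFS plus Akka actor messages.
public class PartitionFlow {
    // Hypothetical stand-in for training one pass on a data block: nudge each
    // weight toward the block's mean value (placeholder for real CD updates).
    static double[][] trainPartition(double[][] w, double[] block) {
        double mean = 0;
        for (double x : block) mean += x;
        mean /= block.length;
        double[][] out = new double[w.length][w[0].length];
        for (int i = 0; i < w.length; i++)
            for (int j = 0; j < w[0].length; j++)
                out[i][j] = w[i][j] + 0.01 * (mean - w[i][j]);
        return out;
    }

    // Train over every block in order, carrying the updated weights forward;
    // the final matrix is the training output described above.
    static double[][] trainAll(double[][] w, List<double[]> blocks) {
        for (double[] block : blocks) {
            // In the distributed system, this result would be written to HDFS
            // and announced to peer nodes by the local Akka actor.
            w = trainPartition(w, block);
        }
        return w;
    }
}
```

In the actual implementation the partitions are trained by Spark tasks rather than a local loop, so a block may start from whatever merged weights its node has received so far.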

We have eliminated the central parameter server from deeplearning4j, which is itself a realization of Jeffrey Dean's paper [Dean 2012]. We have built a publish-subscribe system, implemented using the Akka framework over Spark, which is responsible for distributing the learning across the different nodes. The weight matrix represents this learning and is shared at a location in HDFS in our current implementation; we may augment the Akka distributed queue to take a larger file in the future.
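With no central parameter server, each node has to reconcile an incoming peer matrix with its own local copy. One simple reconciliation policy, element-wise averaging, is sketched below; the class name is illustrative and our actual update rule may differ.

```java
// Element-wise averaging of a peer's published weight matrix into the local
// one: a simple stand-in for the reconciliation step a node performs when an
// update message arrives (the actual update rule may differ).
public class WeightMerge {
    static double[][] merge(double[][] local, double[][] remote) {
        int rows = local.length, cols = local[0].length;
        double[][] out = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                out[i][j] = 0.5 * (local[i][j] + remote[i][j]);
        return out;
    }
}
```

Because every node applies the same symmetric merge, no single node is the authoritative source of parameter values, which is what lets us drop the central parameter server.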

Concluding Remarks

This is, to the best of our knowledge, the first attempt at realizing a distributed deep learning network directly over Spark. We shall be augmenting this first-cut implementation with further work, especially with respect to achieving high accuracy in the deep learning network. We shall also be building a few applications, including image search and NLP (to provide a natural language interface to relational queries), to showcase the power of our deep learning platform.



[Dean 2012]  Dean, Jeffrey, et al. “Large scale distributed deep networks.” Advances in Neural Information Processing Systems. 2012.

[Deng 2013] Li Deng and Dong Yu. "Deep Learning: Methods and Applications." Foundations and Trends in Signal Processing, vol. 7, nos. 3-4, pages 197-387, 2014.

[Hinton 2006] Hinton, G. E., Osindero, S., and Teh, Y.-W. "A fast learning algorithm for deep belief nets." Neural Computation, 18(7), pages 1527-1554, 2006.

[Roux 2008] Le Roux, Nicolas, and Yoshua Bengio. “Representational power of restricted boltzmann machines and deep belief networks.” Neural Computation 20.6 (2008): 1631-1649.



Originally posted on Data Science Central by Dr. Vijay Srinivas Agneeswaran.
