Data and Analytics Feb. 28, 2018

Disease ID: How We Scaled Our Deep Learning Model with Distributed Training

At Climate LLC, we are applying deep learning to solve problems across various technical and scientific domains. As my colleague Wei describes in his recent post Some Deep Learnings about Applying Deep Learning, one way we are using deep learning is to identify plant disease in farmers’ fields.

We have found that the same bottleneck can arise regardless of domain: training a neural network is slow when the model has many parameters or the dataset is large, which limits how quickly we can iterate. One approach we took to speed up training is distributed training. Our Data Science Platform facilitates analytics across all our Climate Fieldview™ data, so enabling distributed training would allow researchers to create scalable pipelines that go from raw data to predictions. Keeping data fetching, pre-processing, and machine learning on the same platform also encourages a standard set of practices among researchers on different teams who otherwise would not have shared information.

Introduction to Distributed Training

We integrated the open source package Distributed Keras, created by graduate student Joeri Hermans, into our Data Science Platform. Distributed Keras implements several data parallel training procedures, which speed up training by distributing copies of a model's weights across multiple worker nodes. Each worker trains its own replica of the model and, after a set number of iterations, feeds its updated parameters back to a central set of weights.

If updates are sent to the driver asynchronously, a node commits its new weights as soon as it finishes computing, rather than waiting for all workers to finish so their gradients can be averaged. The result: training is faster because there is no bottleneck of waiting for a slow worker to finish. The trade-off is that some nodes compute updates based on stale parameters, so there is a limit to the number of nodes that can contribute updates before the model starts to suffer a performance loss.
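
To make the asynchronous scheme concrete, here is a minimal sketch in Python of a central parameter store with workers that pull the current weights, compute a gradient, and commit it immediately. Everything in it (the ParameterServer class, the dummy gradient function) is a hypothetical simplification for illustration, not the actual Distributed Keras implementation.

    # Illustrative sketch of asynchronous data-parallel updates. This is a
    # hypothetical simplification, not the Distributed Keras internals.
    import threading
    import numpy as np

    class ParameterServer:
        """Holds the central set of weights that workers pull from and commit to."""

        def __init__(self, num_params, learning_rate=0.01):
            self.weights = np.zeros(num_params)
            self.lr = learning_rate
            self.lock = threading.Lock()

        def pull(self):
            # A worker fetches a copy of the central weights; by the time its
            # gradient is computed, these may already be slightly stale.
            with self.lock:
                return self.weights.copy()

        def push(self, gradient):
            # A worker commits its update as soon as it finishes, without
            # waiting for the other workers.
            with self.lock:
                self.weights -= self.lr * gradient

    def worker(server, batches, compute_gradient):
        for batch in batches:
            local_weights = server.pull()      # copy of the central weights
            gradient = compute_gradient(local_weights, batch)
            server.push(gradient)              # asynchronous commit

    def dummy_gradient(weights, batch):
        # Stand-in for a real backpropagation computation.
        return np.random.randn(weights.shape[0])

    # Two worker threads committing dummy gradients asynchronously.
    server = ParameterServer(num_params=5)
    threads = [threading.Thread(target=worker, args=(server, [None] * 100, dummy_gradient))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()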

Top: training accuracy over time for each worker node. Bottom: overall validation accuracy.

Approach

The main steps we took to operationalize Distributed Keras at Climate were:

  • Optimizing hardware for the task. We enabled use of CPU-optimized nodes over the memory-optimized nodes that make up our cluster, and created a custom queue that can spin up GPU nodes with the correct configurations on demand. We also sped up CPU computations by building TensorFlow to use the AVX instructions available on our nodes.
  • Ensuring Distributed Keras can scale to large amounts of data by enhancing the package so that it only keeps batches of data in memory, rather than the full dataset (a sketch of this pattern follows the list).
  • Providing adequate documentation and examples to enable teams to use the package with minimal setup. We created sample projects that users could run out-of-the-box.
  • Incorporating other improvements to enable optimal functionality and flexibility on our data science platform.
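
For example, rather than loading every image into memory up front, the training loop can consume a generator that materializes one batch at a time. The sketch below is a generic, hypothetical illustration of that pattern, not the actual change we made inside Distributed Keras.

    # Hypothetical sketch of batch-wise data loading: only one batch of images
    # is held in memory at a time instead of the entire training set.
    import numpy as np
    from keras.preprocessing.image import load_img, img_to_array

    def batch_generator(image_paths, labels, batch_size=32, target_size=(224, 224)):
        """Yield (images, labels) arrays one shuffled batch at a time.

        image_paths: list of file paths; labels: numpy array of encoded labels.
        """
        num_samples = len(image_paths)
        while True:  # Keras-style infinite generator
            indices = np.random.permutation(num_samples)
            for start in range(0, num_samples, batch_size):
                batch_idx = indices[start:start + batch_size]
                images = np.stack([
                    img_to_array(load_img(image_paths[i], target_size=target_size))
                    for i in batch_idx
                ])
                yield images / 255.0, labels[batch_idx]

A model can then train directly from such a generator, keeping memory usage bounded by the batch size regardless of how large the dataset grows.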

Experiments

To benchmark the performance of Distributed Keras against GPU training, we trained our Geospatial Science team’s disease identification model. We hoped that Distributed Keras would provide comparable speed-ups, be cost effective in terms of AWS hourly prices, and achieve accuracy similar to non-distributed training. We trained the model using:

  • Approximately 10,000 images of corn leaves with multiple diseases
  • Data augmentation through random cropping, shearing, and rotation
  • Transfer learning on the final layer of ResNet50
  • 50 epochs
  • The Asynchronous Distributed Adaptive Gradients (ADAG) trainer from Distributed Keras (a usage sketch follows this list)
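
Putting the pieces above together, here is a rough sketch of what the training setup looks like: transfer learning on the final layer of ResNet50, handed to the ADAG trainer from Distributed Keras. Exact argument names (such as communication_window or the feature and label column names) can differ between dist-keras versions, and the Spark DataFrame training_df of preprocessed features and labels is assumed to exist, so treat this as an outline rather than a drop-in script.

    # Sketch of the training setup: ResNet50 transfer learning plus the ADAG
    # trainer from Distributed Keras. Argument names follow the dist-keras API
    # but may vary by version; training_df is an assumed Spark DataFrame of
    # preprocessed image features and labels.
    from keras.applications.resnet50 import ResNet50
    from keras.layers import Dense, GlobalAveragePooling2D
    from keras.models import Model
    from distkeras.trainers import ADAG

    num_classes = 4  # hypothetical; the real model covers several corn leaf diseases

    # Transfer learning: freeze the pretrained convolutional base and train only
    # the final classification layer.
    base = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    for layer in base.layers:
        layer.trainable = False

    pooled = GlobalAveragePooling2D()(base.output)
    predictions = Dense(num_classes, activation="softmax")(pooled)
    model = Model(inputs=base.input, outputs=predictions)

    # ADAG: each Spark worker trains a replica of the model and asynchronously
    # commits its updates to the central set of weights.
    trainer = ADAG(keras_model=model,
                   worker_optimizer="adam",
                   loss="categorical_crossentropy",
                   num_workers=10,
                   batch_size=32,
                   num_epoch=50,
                   communication_window=12,
                   features_col="features",
                   label_col="label")

    trained_model = trainer.train(training_df)  # training_df: assumed Spark DataFrame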

We trained the model on two different CPU node types. With ten or more workers, training is faster than on a single GPU. Test set accuracy drops by 2-5% compared to non-distributed GPU training, a consequence of the asynchronous training procedure.

Cost effectiveness is also a consideration with distributed training. Do the gains in training speed offset the added cost of the extra nodes? Does this approach cost more than simply paying for a multi-GPU instance? We analyzed the monetary cost of training this model using Amazon’s on-demand node prices as well as the spot prices for the nodes averaged over three months.

Spot prices for c4.2xlarge (CPU) and p2.8xlarge (8 GPU) instances (Note: Spot prices are from August to November 2017)

At the time of this experiment, spot prices for the r4 and c4 nodes were consistently around $0.12/hr (on-demand prices start at $0.39/hr). The spot prices for the GPU instances were too volatile for spot pricing to be feasible.

We see that, when using spot pricing, distributed training costs about the same as training on a CPU but finishes faster.

The recent EMR 5.10 update also gave us the option of using P2 GPU nodes on our cluster. We found that training performance with GPU worker nodes is more consistent than with the CPU nodes, possibly because the GPU nodes compute faster and therefore communicate with the central set of weights more often.

By the time we could train on GPU worker nodes, the spot price for a single GPU was around $0.32/hr (the on-demand price is $0.90/hr). The total cost of training varies by only about $1 across configurations. Interestingly, using more workers and training the model faster sometimes ends up costing less.
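
The comparison itself is simple arithmetic: total cost is the per-node hourly price times the number of nodes times the wall-clock training time, which is why adding workers can end up costing less when the speed-up is large enough. A minimal sketch, using the hourly prices quoted in this post and purely hypothetical wall-clock times:

    # Minimal cost-comparison sketch. Hourly prices are the figures quoted in
    # this post; the wall-clock hours are hypothetical placeholders, not
    # measured values.
    def training_cost(price_per_node_hour, num_nodes, wall_clock_hours):
        """Total cost = per-node hourly price x number of nodes x training time."""
        return price_per_node_hour * num_nodes * wall_clock_hours

    # Ten c4 spot workers at ~$0.12/hr: cheap per node, and adding workers
    # shrinks the wall-clock time, so total cost can stay flat or even drop.
    cpu_spot = training_cost(price_per_node_hour=0.12, num_nodes=10, wall_clock_hours=2.5)

    # A single GPU at the ~$0.90/hr on-demand price: one node, faster per step,
    # but every hour costs more.
    gpu_on_demand = training_cost(price_per_node_hour=0.90, num_nodes=1, wall_clock_hours=4.0)

    print("distributed CPU (spot): $%.2f" % cpu_spot)
    print("single GPU (on demand): $%.2f" % gpu_on_demand)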

Distributed training is a valuable tool for us because it fits with the infrastructure our team already uses. Being able to build a pipeline on our Data Science Platform is critical, given that data scientists often spend more time collecting and processing data than training models. This played a big role in our exploration of Distributed Keras, alongside considerations of training time, performance, and cost.

Join us as we explore and build Deep Learning tools to help all the world’s farmers sustainably increase their productivity with digital tools.


About the Author

Christina Bogdan is a Data Scientist on the Data Analytics team at Climate LLC where she focuses on solving interesting challenges with machine learning at scale. She is especially passionate about applying deep learning to understand images. Before coming to Climate, Christina received her MS in Data Science from New York University.
