Deep learning has become increasingly important to academic research, and cloud computing makes it cheaper, more efficient, and more accessible to institutions of all sizes. But most research on using cloud computing for deep learning has focused on on-demand servers, which let researchers reserve processing units for their exclusive use for as long as they need them.
A team of computer scientists at Worcester Polytechnic Institute (WPI) set out to investigate whether it is feasible to train deep learning models on transient servers, which are cheaper but can be pre-empted, or revoked, at any time. Using Google’s pre-emptible virtual machine (VM) instances, Tian Guo and Robert Walls, both Assistant Professors of Computer Science at WPI, teamed up with Ph.D. student Shijian Li and their collaborator Lijie Xu to conduct one of the first large-scale empirical studies of how to use transient servers to get the benefits of distributed training while avoiding the pitfalls of revocation. "Our high-level goal is to provide more efficient training for deep learning researchers. We know that training can take a long time and cost a lot of money if you don’t do it carefully," says Guo.
"We want to choose the tools that people use the most so we chose TensorFlow as the gateway to our project."
Tian Guo, Assistant Professor of Computer Science and Data Science, Worcester Polytechnic Institute
Making distributed deep learning training more efficient
The team conducted their study on Google Compute Engine with three GPU configurations: K80, P100, and V100, using the maximum available memory and CPU values for each. Guo reports that they chose TensorFlow as their platform “because of its popularity with the deep learning community. We want to choose the tools that people use the most so we chose TensorFlow as the gateway to our project." For Li, it was important that "Google's pre-emptible VMs are customizable based on the type and amount of resources you need. That's a really good feature." To control for variation, the team hosted the parameter server on the same on-demand instance and used the same training dataset throughout, measuring training time, training cost, and model accuracy to find the best combination of results. Their first experiment showed that a cluster of eight pre-emptible K80 VMs was up to 7.7 times faster and on average 62.9% cheaper than running the same distributed training on a single on-demand K80 server. This held even when some servers were revoked during training.
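As a rough illustration of the economics, the sketch below plugs the study's reported 7.7x speedup into a cost comparison. The hourly prices and the ten-hour baseline are hypothetical placeholders chosen for illustration, not figures from the study or Google's actual rates:

```python
# Back-of-the-envelope comparison of one on-demand K80 vs. a cluster of
# eight pre-emptible K80s. Only the 7.7x speedup comes from the study;
# the prices and baseline duration below are HYPOTHETICAL placeholders.

ON_DEMAND_K80_PER_HOUR = 0.45     # hypothetical $/hour, one on-demand K80
PREEMPTIBLE_K80_PER_HOUR = 0.135  # hypothetical $/hour, one pre-emptible K80

def training_cost(hours: float, num_gpus: int, price_per_hour: float) -> float:
    """Total dollar cost of a training run on a homogeneous cluster."""
    return hours * num_gpus * price_per_hour

baseline_hours = 10.0                 # hypothetical single-GPU training time
cluster_hours = baseline_hours / 7.7  # reported best-case speedup

baseline_cost = training_cost(baseline_hours, 1, ON_DEMAND_K80_PER_HOUR)
cluster_cost = training_cost(cluster_hours, 8, PREEMPTIBLE_K80_PER_HOUR)

savings = 1 - cluster_cost / baseline_cost
print(f"on-demand: ${baseline_cost:.2f}, transient cluster: ${cluster_cost:.2f}")
print(f"savings: {savings:.1%}")
```

The point the numbers make is structural: even with eight times as many GPUs billed, the combination of a large per-GPU discount and a shorter wall-clock run can leave the transient cluster both faster and cheaper.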
The team noticed something else, though: revocations negatively affected only training time; they had little impact on accuracy or cost. If the distributed training frameworks were redesigned, they reasoned, the results could improve even further. In their next experiments they investigated how to optimize the process by varying other factors, such as adding more GPUs or mixing K80, P100, and V100 servers in the same training cluster.
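One natural way to mix GPU generations in a single cluster is to size each worker's share of the global batch in proportion to its throughput, so fast and slow GPUs finish each step at roughly the same time. A minimal sketch; the relative throughput ratios are illustrative assumptions, not measurements from the study:

```python
# Split a global batch across heterogeneous workers in proportion to each
# GPU's relative training throughput. The ratios below are illustrative
# assumptions, not measurements from the WPI study.

RELATIVE_THROUGHPUT = {"K80": 1.0, "P100": 3.0, "V100": 6.0}

def assign_batches(global_batch: int, workers: list[str]) -> list[int]:
    """Return per-worker batch sizes summing exactly to global_batch."""
    total = sum(RELATIVE_THROUGHPUT[w] for w in workers)
    # Ideal fractional shares, rounded down...
    shares = [int(global_batch * RELATIVE_THROUGHPUT[w] / total) for w in workers]
    # ...with leftover samples handed to the fastest workers first.
    leftover = global_batch - sum(shares)
    order = sorted(range(len(workers)),
                   key=lambda i: -RELATIVE_THROUGHPUT[workers[i]])
    for i in order[:leftover]:
        shares[i] += 1
    return shares

cluster = ["V100", "P100", "K80", "K80"]
print(assign_batches(256, cluster))  # e.g. [140, 70, 23, 23]
```

Without some scheme like this, a synchronous step is gated by the slowest GPU, which erases much of the benefit of adding faster hardware to the cluster.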
Making distributed deep learning training more responsive
The key, the team discovered, is to make the framework more responsive: if customers could control which cloud servers were revoked, transient servers would be even more effective for distributed training. They argue that if cloud providers specified only how many servers they needed to reclaim from a particular customer, and left the choice of which servers to give up to that customer, researchers would have more flexibility to trade accuracy against the other dimensions of training performance. "With more information on both sides we can design a distributed training system that works more efficiently," Walls says. By dynamically adding and removing GPUs during training, researchers could tailor the configuration to their workload.
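The flexibility the team describes can be sketched as a training loop that re-averages gradients over whichever workers survive each step, so a revocation stretches wall-clock time without changing what is being optimized. The `Worker` class and revocation schedule below are toy stand-ins, not the team's actual framework:

```python
# Toy sketch of transient-aware data-parallel training: each step averages
# gradients over the workers currently alive, so losing a server slows
# training down but leaves the averaged update unchanged. The Worker class
# and revocation schedule are illustrative, not the study's real system.

class Worker:
    def __init__(self, name: str):
        self.name = name

    def compute_gradient(self, step: int) -> float:
        # Stand-in for a real backward pass; every worker optimizes the
        # same objective, so the average is unaffected by pool size.
        return 0.1 * step

def train(workers: list[Worker], revoked_at: dict[str, int],
          steps: int) -> list[float]:
    """Run `steps` averaged-gradient updates, dropping revoked workers."""
    history = []
    for step in range(steps):
        # Cloud-side revocations remove workers between steps.
        workers = [w for w in workers if revoked_at.get(w.name, steps) > step]
        if not workers:
            break  # every server revoked; training must resume elsewhere
        grads = [w.compute_gradient(step) for w in workers]
        history.append(sum(grads) / len(grads))  # averaged update
    return history

pool = [Worker(f"gpu{i}") for i in range(4)]
updates = train(pool, revoked_at={"gpu3": 2}, steps=4)
print(len(updates))
```

In this sketch the cluster shrinks from four workers to three at step 2 yet still completes all four updates, which mirrors the study's observation that revocations hurt training time far more than accuracy.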
The team expects to continue testing and asking new questions, such as how best to use geographically diverse servers for distributed training. "We've just scratched the surface," Walls says. "There's a lot left to do but we’ve identified some of the challenges. For example, how can researchers access the most efficient resources when we don’t know in advance what those will be?" For Guo, their study shows that "a transient-aware system could help the day-to-day lives of the people designing and training models."
"With more information on both sides we can design a distributed training system that works more efficiently."
Robert Walls, Assistant Professor of Computer Science, Worcester Polytechnic Institute