Deep Learning and Compute: Can we just keep Scaling?
Updated: Mar 24
Shreshth Malik examines the unrelenting growth in data and computational requirements of state-of-the-art deep learning models, and explores its implications for the future of the industry.
At a Glance
The size of state-of-the-art (SOTA) deep learning models continues to grow. The computational and data requirements to develop and train these models are immense, which raises the question - can we just keep scaling? Historical trends suggest we can, as we have seen rapid advances in computational infrastructure. However, as the low-hanging fruit for increasing efficiency is picked, we should be wary of the energy requirements of these large models. More fundamentally, we must ask whether accuracy gains from scaling have a limit, and when to trade off accuracy against compute as returns diminish.
Deep Learning is Data and Power Hungry
Deep learning models involve a complex set of parameterised mathematical operations. They transform an input (e.g. an image) to predict an output (e.g. a classification). Training involves passing data through the model and updating the parameters so as to push the model's output closer to the target value. Recent deep learning models have many millions or even billions of parameters; in recent years the size of SOTA models has grown roughly tenfold year on year. Calculating the updates for each of these parameters therefore requires an enormous amount of computational power. Coupled with the increasing size of training datasets, this results in a soaring computational bill.
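The training loop described above can be sketched in a few lines. This toy example fits a single parameter by gradient descent; the point is that every update requires a gradient for every parameter, so the cost scales with model size (the data, learning rate, and iteration count here are illustrative):

```python
import numpy as np

# Toy example: fit y = w*x with gradient descent. Each update needs a
# gradient for every parameter -- real models have billions of them.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x  # ground-truth relationship

w = 0.0       # single parameter, initialised at zero
lr = 0.1      # learning rate
for _ in range(100):
    pred = w * x
    grad = np.mean(2 * (pred - y) * x)  # d(squared error)/dw
    w -= lr * grad                      # push output closer to the target

print(round(w, 2))  # converges near the true value 3.0
```

A billion-parameter model repeats this gradient computation for every parameter, on every batch, over many epochs - hence the soaring bill.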
Figure: The increase in computational requirements for SOTA machine learning models; note the step change at the onset of the deep learning revolution.
To put things into perspective, the latest SOTA language model, GPT-3, boasts an eye-watering 175 billion parameters. Trained on roughly 500 billion words scraped from the public web, it would cost an estimated $4-5M to train yourself on existing cloud infrastructure. Training would have required around 190,000 kWh which, at the average carbon intensity of the US grid, would have produced around 85,000 kg of CO2 - about the same as driving a car to the moon and back.
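A quick back-of-envelope check of these figures (the carbon intensity value is an approximation of the US grid average, not taken from the original estimate):

```python
# Back-of-envelope check of the GPT-3 training emissions quoted above.
energy_kwh = 190_000      # estimated training energy
us_intensity = 0.45       # approx. US grid average, kg CO2 per kWh (assumption)

emissions_kg = energy_kwh * us_intensity
print(f"{emissions_kg:,.0f} kg CO2")  # close to the ~85,000 kg figure
```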
But the calculations above are only for training the final model. The actual cost of developing a model such as GPT-3 can be tens or hundreds of times larger due to the extensive experimentation required to find the best architectures and hyperparameters. Researchers often resort to trial and error or expensive grid searches over possible hyperparameters to find the optimal setup and maximise their final metrics. These marginal gains for greatly increased energy usage should be carefully considered - is it worth the extra 20 days of GPU usage to squeeze out that extra fraction of a per cent in accuracy?
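The cost of grid search grows multiplicatively with every hyperparameter added, which is why development can dwarf the final training run. A sketch with a hypothetical search space (the parameter names and values are purely illustrative):

```python
from itertools import product

# Hypothetical hyperparameter grid -- each new axis multiplies the cost.
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [32, 64, 128],
    "num_layers": [12, 24, 48],
    "dropout": [0.0, 0.1, 0.3],
}

configs = list(product(*grid.values()))
print(len(configs))  # 3^4 = 81 full training runs, for a modest grid
```

Even three values per axis across four axes implies 81 complete training runs; at GPT-3 scale, a single one of those runs is already a multi-million-dollar exercise.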
Cloud Computing: Keeping up with the Demand
Cloud services have boomed over the last few years as a result of the need for increased data and compute. Amazon’s cloud offering, AWS, has become Amazon’s most profitable business unit and is the largest international provider of cloud services.
Data centres now account for around 1% of global electricity usage. In response, big tech has invested heavily in their services - pushing the limits of scale and efficiency. As a result, training and running deep learning models on cloud servers is far more energy-efficient than running them on local machines. Initiatives such as Google Colab also provide (limited) free access to GPU/TPU resources, which has greatly contributed to democratising access to deep learning tools.
Processing chips specifically designed for machine learning continue to be developed, improving efficiency. Notably, algorithmic efficiency improvements have yielded greater gains than classical hardware improvements (Moore's law). On top of this, machine learning itself can optimise processes such as cooling by capturing the complex relationships between different factors and adjusting controls accordingly, adapting to changes in weather or usage. For example, DeepMind has enabled up to a 40% reduction in the energy used for cooling Google data centres by automating it with machine learning. These efficiency gains have more or less counteracted the growing computational requirements: despite a six-fold increase in data centre usage over the past few years, there has been only a 6% increase in overall energy consumption. The question remains whether these efficiency improvements can continue as deep learning scales further, or whether fundamental limits will be reached. If the gains do not continue, cloud computing could in the worst case quickly account for 20% of global electricity usage.
There is thus a strong consumer push for providers to shift to carbon-neutral operations. Many centres use renewable energy to offset their carbon emissions, but this is strongly location-dependent. It is therefore best for practitioners to run their models at efficient, carbon-neutral sites, even ones on the other side of the world - something made possible by cloud services. Google is aiming to be net-zero in all locations by 2030.
Is Bigger always Better?
Three key factors push the AI frontier forward: algorithmic innovation, data, and compute. Some argue that all we need are the latter two. Richard Sutton, one of the pioneers of reinforcement learning, writes in his short essay The Bitter Lesson that computational improvements are by far the most significant factor in advancing the field. At the current rate of growth, it is not far-fetched to contemplate models with 100 trillion parameters soon - the same as the number of connections in the human brain. Will we then have solved intelligence?
While computation has enabled performance gains, I do not believe that computation alone can produce results. It is human ingenuity that has added inductive biases, such as translation invariance via convolutional layers in neural networks, and this has enabled the progress we have seen in computer vision. It is these subtle but very important design choices that have given deep learning its success.
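The value of such an inductive bias can be made concrete by counting parameters. The sketch below (layer sizes are illustrative, biases omitted) compares a fully connected layer with a weight-sharing convolutional layer mapping a small image to 16 feature maps:

```python
# Parameter counts: dense layer vs convolutional layer mapping a
# 64x64 single-channel image to 16 same-size feature maps.
h, w = 64, 64            # input image dimensions
out_channels, k = 16, 3  # feature maps and kernel size

dense_params = (h * w) * (h * w * out_channels)  # every pixel pair connected
conv_params = out_channels * (k * k)             # one shared 3x3 kernel per map

print(dense_params, conv_params)  # ~268 million vs 144
```

Translation invariance lets the convolution reuse one small kernel everywhere, cutting the parameter count by six orders of magnitude - a gain no amount of raw compute delivers by itself.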
There is also the case of practicality. In an increasingly connected world of Internet of Things devices, we will need to deploy machine learning on lightweight, decentralised hardware, so reducing the size of models is vital. Neural networks are typically overparameterised, meaning that many of their parameters are not strictly necessary. Work on pruning or distilling networks to reduce their size has shown recent success, and I believe this work will be of increasing practical importance as SOTA models continue to scale.
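The simplest form of pruning just zeroes out the smallest-magnitude weights. A minimal sketch of unstructured magnitude pruning (in practice, pruned models are usually fine-tuned afterwards to recover accuracy):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude weights (unstructured pruning)."""
    k = int(weights.size * sparsity)
    threshold = np.sort(np.abs(weights).ravel())[k]
    return np.where(np.abs(weights) < threshold, 0.0, weights)

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 100))   # mock weight matrix
pruned = magnitude_prune(w, 0.9)  # remove 90% of the weights
print(np.mean(pruned == 0))       # ~0.9 of entries are now zero
```

With sparse storage and kernels, a 90%-pruned network can be far cheaper to ship and run on small devices, often at little cost to accuracy.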
On a more philosophical note, while deep learning as a field has shown great promise, I do not believe it has the capacity for the end goal of 'true' generalisation in AI. Deep learning models will of course play a very important part, but simply adding more data and computation does not remove the intrinsic limitations to generalisability. In his famous book on artificial intelligence, Stuart Russell says that this is like "… trying to get to the moon by climbing a tree; one can report steady progress, all the way to the top of the tree."
The research community is already moving to address the issue of increasing computational requirements. New directions have emerged, such as active learning, that aim directly at reducing the volume of training data required to achieve the same accuracy. Bayesian hyperparameter optimisation techniques, meanwhile, seek to reduce the need for extensive grid searches. Moreover, initiatives to calculate the energy and carbon footprint of models have been set up, enabling researchers to keep tabs on their usage.
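The core idea of active learning can be sketched very simply: rather than labelling (and training on) everything, only query the points the model is least sure about. A minimal uncertainty-sampling sketch, using mock predicted probabilities (the pool size and budget are illustrative):

```python
import numpy as np

def uncertainty_sample(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the unlabelled points the model is least confident about."""
    confidence = probs.max(axis=1)          # top predicted class probability
    return np.argsort(confidence)[:budget]  # least confident first

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=1000)  # mock predictions over 3 classes
chosen = uncertainty_sample(probs, budget=10)
print(len(chosen))  # only 10 points sent for labelling, not all 1000
```

Spending the labelling and training budget on the most informative examples is exactly the kind of efficiency-first thinking the paragraph above advocates.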
In general, as we approach human and super-human level performance across machine learning tasks, I believe a shift towards SOTA efficiency over SOTA accuracy is required. The data and computational efficiency of models should be of equal, if not greater, importance than their final accuracy after extensive hyperparameter tuning. This would enable research to be more readily translated into important real-world applications and reduce unnecessarily large energy consumption.
The UCL Finance and Technology Review (UCL FTR) is the official publication of the UCL FinTech Society. We aim to publish opinions from the student body and industry experts with accuracy and journalistic integrity. While every care is taken to ensure that the information posted on this publication is correct, UCL FTR can accept no liability for any consequential loss or damage arising as a result of using the information printed. Opinions expressed in individual articles do not necessarily represent the views of the editorial team, society, Students’ Union UCL or University College London. This applies to all content posted on the UCL FTR website and related social media pages.