How to build Large AI Models like ChatGPT efficiently
The techniques you can use to integrate large AI models into your systems without breaking the bank
Large Models have captured a lot of attention from people. By adding more parameters and data to the model, we can add more capabilities to a system. Additional parameters allow for more kinds of connections between your neurons, giving your neural networks both better performance on existing tasks and the ability to develop new kinds of skills.
People have been excited about these models and what they can achieve. Pretty much everyone I’ve talked to recently has told me that they are looking to integrate these models into their system. However, there is a huge problem with using these models- they are extremely expensive. According to the researchers who wrote Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model, it took roughly 50.5 tons of CO2 equivalent to train the large-language model BLOOM. Training GPT-3 released over 500 tons of CO2 equivalent.
So in this article, we will be covering various techniques to scale up your training efficiency. Without further ado, let’s go-
Batch Size
The right batch size can be one of the most impactful decisions in your model training. Too many data scientists end up ignoring the power that setting the batch size can have on the performance of their AI Models. So how can you use the batch size to train your large deep-learning models efficiently? The first step is to increase your batch size (I know some of you are screaming about generalization. Read on, I cover that later on).
Larger Batch Sizes mean fewer updates to your model while training. This leads to lower computational costs. But this is not the only thing that makes a difference. In my breakdown of the phenomenal report, “Scaling TensorFlow to 300 million predictions per second”, I was surprised by a statement that the authors made. The authors said that they halved their training costs by increasing batch size. This happens because larger batch sizes mean fewer batches are needed to cover your full dataset.
There’s a cost associated with moving batches from RAM/disk to GPU memory. Using larger batches means less moving around of your data, which means less training time. The trade-off is that by using larger batches you start missing out on the advantages of stochastic gradient descent.
-This is something a reader taught me. If you have any insights to share with me, please do comment/reach out. I’m always excited to learn from your experiences.
This technique works across a variety of data types, including statistical, textual, and even image data. This is a big advantage if you’re looking to build a multi-modal system, since this one optimization can take care of multiple components.
Here we show one can usually obtain the same learning curve on both training and test sets by instead increasing the batch size during training. This procedure is successful for stochastic gradient descent (SGD), SGD with momentum, Nesterov momentum, and Adam. It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times.
- Don’t Decay the Learning Rate, Increase the Batch Size (Smith et al.)
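As a rough sketch of what that looks like in practice (in PyTorch, with a toy model and placeholder sizes of my own choosing), you grow the batch size on the same schedule where you would otherwise decay the learning rate:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset, model, and schedule - all placeholder values for illustration.
dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 2, (10_000,)))
model = torch.nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

# Where you would normally decay the learning rate, grow the batch size instead.
batch_schedule = {0: 256, 10: 512, 20: 1024}
batch_size = batch_schedule[0]

for epoch in range(30):
    batch_size = batch_schedule.get(epoch, batch_size)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```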
Now to address the elephant in the room- what about generalization and accuracy? It has been well-noted by AI Researchers that increasing batch size can hurt your accuracy and generalization. There is even a well-known term for the lower generalization of large batch training- the generalization gap. About that- it’s not as unavoidable as it’s made out to be. The gap certainly does show up if you increase the batch size and do nothing else. However, there are steps you can take to avoid this issue.
The authors of the phenomenal paper Train longer, generalize better: closing the generalization gap in large batch training of neural networks propose a great alternative training regimen. They realized that the fewer updates needed by large-batch models acted as a double-edged sword, reducing performance while improving costs. However, by implementing “Ghost Batch Normalization” we can get some amazing results (also think of how cool you would sound if you told people that you implemented Ghost Batch Normalization).
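For illustration, here is a minimal PyTorch-style sketch of the idea (not the paper’s exact implementation): the large batch is split into small “ghost” batches, and batch norm statistics are computed per ghost batch, so the normalization behaves like small-batch training.

```python
import torch
import torch.nn as nn

class GhostBatchNorm(nn.Module):
    """Normalizes small 'ghost' batches inside a large training batch."""
    def __init__(self, num_features: int, ghost_batch_size: int = 32):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        if self.training:
            # Split the large batch into ghost batches so the normalization
            # statistics look like those of small-batch training.
            chunks = x.split(self.ghost_batch_size, dim=0)
            return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)
        return self.bn(x)  # eval: use running statistics as usual

layer = GhostBatchNorm(num_features=64)
out = layer(torch.randn(1024, 64))  # one large batch, normalized in chunks of 32
```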
Similarly, you can maintain your accuracy by adjusting your learning rate and batch size proportionally. If you want more details on Batch Size and Model Performance, read this post.
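A tiny sketch of that heuristic (often called the linear scaling rule- scale the learning rate by the same factor as the batch size; the numbers and model here are placeholders):

```python
import torch

# Placeholder model; the point is the learning-rate arithmetic.
model = torch.nn.Linear(128, 10)

base_batch_size, base_lr = 256, 0.1
new_batch_size = 2048

# Batch size grew by 8x, so scale the learning rate by 8x as well.
scaled_lr = base_lr * (new_batch_size / base_batch_size)  # 0.1 -> 0.8
optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr, momentum=0.9)
```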
Now moving on to another data-related technique that I personally use quite a bit.
Active Learning
Active Learning is based on a simple concept. From the perspective of a Machine Learning Model, all data points are not created equal. Some points are easy to handle, while others require more finesse. If you have a lot of data, then chances are that there is a lot of overlap in data points. So you can discard a significant portion with no problems.
This raises another question- how do we identify the data points that our model would benefit from? There are a few compelling approaches. The one I’ve been experimenting with most recently uses semi-supervised clustering to pick the samples that are furthest from the centroids. This was inspired by the Meta AI publication Beyond neural scaling laws: beating power law scaling via data pruning. For those of you interested, I went over this publication in more detail here.
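Here is a minimal sketch of that kind of pruning (a simplification of the paper’s approach: the sizes are placeholders, and in practice the embeddings would come from a pretrained encoder rather than random numbers):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder "embeddings" - in practice these come from an encoder run over your data.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 128))

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)

# Distance of each point to its assigned cluster centroid.
distances = np.linalg.norm(
    embeddings - kmeans.cluster_centers_[kmeans.labels_], axis=1
)

# Keep the "hard" examples: the points furthest from their centroid.
keep_fraction = 0.4
keep_indices = np.argsort(distances)[-int(len(embeddings) * keep_fraction):]
pruned_embeddings = embeddings[keep_indices]
```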
So far, I’ve had great results with it. However, that is far from the only thing I use. In my work, I rely on ensembles of probabilistic and standard models, randomness, and some adversarial training to build a data-point filtering system. This might sound expensive, but each individual piece doesn’t have to be too heavyweight, and together they work very well to reduce dataset size (and the amount of retraining we need).
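To give a flavor of what one piece of such a filter can look like (a generic, hypothetical sketch, not my exact system), here is a simple committee-disagreement filter: points where an ensemble of models disagrees the most are the ones kept for training.

```python
import numpy as np

def disagreement_filter(prob_predictions: np.ndarray, threshold: float) -> np.ndarray:
    """Return indices of points where an ensemble of models disagrees the most.

    prob_predictions has shape (n_models, n_points, n_classes) and holds each
    model's predicted class probabilities for every point.
    """
    # Variance across the ensemble, averaged over classes: high variance means
    # the models disagree, so the point is likely informative to keep/label.
    variance = prob_predictions.var(axis=0).mean(axis=-1)
    return np.where(variance > threshold)[0]

# Hypothetical usage: 3 models, 1,000 points, 5 classes (random stand-in data).
preds = np.random.default_rng(0).dirichlet(np.ones(5), size=(3, 1000))
informative_idx = disagreement_filter(preds, threshold=0.02)
```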
A paper that ties in nicely here is When Deep Learners Change Their Mind: Learning Dynamics for Active Learning. It introduced Label Dispersion, a new metric for quantifying a neural network’s confidence in a prediction. To learn more about it, check out this video. Fair warning, I’m much better at writing than I am at videos🥶🥶
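As a rough sketch of the idea (my simplified reading; the paper’s exact formulation may differ), you record the predicted label for each sample across training epochs and measure how often the network changes its mind:

```python
import numpy as np
from collections import Counter

def label_dispersion(predictions_per_epoch: np.ndarray) -> np.ndarray:
    """Dispersion-style score per sample from predicted labels across epochs.

    predictions_per_epoch: integer array of shape (n_epochs, n_samples) holding
    the predicted class for every sample at each recorded epoch.
    Returns a score in [0, 1): 0 means the network never changed its mind;
    higher values mean the prediction kept flipping (the model is unsure).
    """
    n_epochs, n_samples = predictions_per_epoch.shape
    scores = np.empty(n_samples)
    for i in range(n_samples):
        most_common_count = Counter(predictions_per_epoch[:, i]).most_common(1)[0][1]
        scores[i] = 1.0 - most_common_count / n_epochs
    return scores

# Hypothetical usage: predictions for 4 samples recorded over 5 epochs.
preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 2, 2],
    [0, 1, 0, 2],
    [0, 1, 2, 2],
    [0, 1, 1, 2],
])
print(label_dispersion(preds))  # samples 0-1 are stable, samples 2-3 less so
```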
Now to move on to something that has shown a lot of promise with Large Language Models.
Increasing the Number of Tokens
With ChatGPT, Bard, and now Meta’s Llama all being language models, it’s always good to cover ideas that can help you design more scalable language models. To do so, I will be quoting a very interesting publication, Training Compute-Optimal Large Language Models, by the mad lads over at Deepmind. In their research, they were able to develop Chinchilla, a language model with ‘only’ 70 Billion Parameters. However, the performance of Chinchilla was on another level-
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks.
This is even more impressive when you consider that Chinchilla had the same computing budget as Gopher. How do they achieve this performance? They had one key insight- too much focus was put on the number of parameters in a model and factors like the number of tokens were overlooked.
Given a computing budget, it makes sense to scale up your training tokens and parameters in equal proportions. This leads to better performance, but also lower costs. Since we have a lower number of parameters, we will see a much lower inference cost when running the model. Don’t overlook this advantage. Amazon Web Services estimates that “In deep learning applications, inference accounts for up to 90% of total operational costs”.
…for every doubling of model size the number of training tokens should also be doubled
- Direct Quote from the paper.
To drill this point home, let’s compare the model sizes and training tokens of Chinchilla with some other standards like LaMDA, Gopher, and GPT-3.
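Below is a small sketch that prints the tokens-per-parameter ratio for each model (parameter and token counts as reported in the Chinchilla paper) and applies the roughly 20-tokens-per-parameter rule of thumb that falls out of it:

```python
# Parameter and token counts as reported in "Training Compute-Optimal Large
# Language Models"; ratios are rounded.
models = {
    # name: (parameters, training tokens)
    "LaMDA":      (137e9, 168e9),
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9,  1.4e12),
}

for name, (params, tokens) in models.items():
    print(f"{name:11s} {params/1e9:6.0f}B params  {tokens/1e9:7.0f}B tokens  "
          f"{tokens/params:5.1f} tokens/param")

# Chinchilla's takeaway: scale tokens in proportion to parameters, which works
# out to roughly 20 training tokens per parameter at these budgets.
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

print(f"A 10B-parameter model would want ~{compute_optimal_tokens(10e9)/1e9:.0f}B tokens")
```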
Now we’re going to move on to one of the ideas that I’m personally most excited about going forward.
Sparse Activation
Think back to how Neural Networks work. When we train them, input flows through all the neurons, in both the forward and backward passes. This is why adding more parameters to a Neural Network adds so much to the cost- every parameter participates in every pass.
Adding more neurons to our network allows our model to learn from more complex data (like data from multiple tasks and data from multiple senses). However, this adds a lot of computational overhead.
Sparse Activation allows for a best-of-both-worlds scenario. Adding a lot of parameters allows our model to learn more tasks effectively (and make deeper connections). Sparse Activation lets you use only a portion of your network for any given input, cutting down your inference cost. This allows the network to learn and get good at multiple tasks, without being too costly.
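As a simple illustration of the general sparse-activation idea (this is not SWAT itself, just a minimal top-k “expert routing” layer in the spirit of mixture-of-experts models; all names and sizes are illustrative), only 2 of 8 expert sub-networks run for any given input:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSparseLayer(nn.Module):
    """Many 'expert' sub-networks, but only the top-k run for each input."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):  # x: (batch, dim)
        scores = self.router(x)                          # (batch, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)  # pick k experts per input
        weights = F.softmax(top_vals, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, so most parameters stay idle per input.
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

layer = TopKSparseLayer(dim=64)
y = layer(torch.randn(32, 64))  # 8 experts exist, but each input only uses 2
```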
Of the sparsity algorithms I’ve come across, my favorite has been Sparse Weight Activation Training (SWAT). It has had great results, and its use of zombie neurons allows for more diverse networks than dropout or other sparsity algorithms.
For ResNet-50 on ImageNet SWAT reduces total floating-point operations (FLOPS) during training by 80% resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms while incurring only 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights.
To learn more about this algorithm, watch this video. Now onto the final technique. This is a technique that Deep Learning Engineers in particular would really benefit from integrating into their work process.
Applying Filters and letting simple models do most of your tasks
The best way to build and use huge models efficiently- don’t use them a lot. Instead let simple models/filters do most of your tasks, and use your large AI model only when it is absolutely needed. This is technically cheating, but worth mentioning. Too many Data Scientists and Machine Learning Engineers get caught up in trying to build the perfect model to accomplish the task. Even if it is achieved, this model is likely going to be extremely expensive, since it has to account for lots of edge cases. A better alternative is to sometimes give up on solving the whole task with the model, and use fixed-rule models/filters to handle these edge cases instead. I will illustrate this with a personal story.
Earlier on in my journey into Machine Learning, I was asked to help a team improve their information mining bot. The bot used Deep Learning to extract the key features from the text but was struggling to extract one key piece of information. Their performance was around 68%. The bot had hit a wall, and they weren’t sure how to proceed. So they brought me on.
I threw the kitchen sink of Machine Learning techniques at it. The better models were out of budget (the bot was called a lot) and the smaller models got tripped up. In the end, I gave up and presented a regex for it. The regex handled the normal cases perfectly and failed on all the weird exceptions. So we were dealing with a performance of 77%. I expected them to be disappointed, since my solution did not use Machine Learning (I had been hired as a Machine Learning Engineer).
Instead, the clients who hired me were excited. They liked the performance and low computational costs. This helped me realize one thing- no one really cared how I solved the problem, they cared that it was solved well. This opened up my mind. In the end, I used Regex systems to handle the general cases (+a few of the edge cases). I layered this with the more costly model if these filters did not work. The end result- 89% performance with lower average computational costs than the original system.
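Here is a minimal, hypothetical sketch of that cascade pattern: a cheap regex answers first, and the costly model is only called when the regex comes up empty. The field, pattern, and fallback function are made up for illustration, not the actual system from the story.

```python
import re
from typing import Optional

# Hypothetical field and pattern; the cheap path covers the common cases.
INVOICE_PATTERN = re.compile(r"Invoice\s*#?\s*(\d{4,10})", re.IGNORECASE)

def expensive_model_extract(text: str) -> Optional[str]:
    # Placeholder for a call to your large, costly extraction model.
    return None

def extract_invoice_number(text: str) -> Optional[str]:
    match = INVOICE_PATTERN.search(text)
    if match:                              # cheap path: handles the bulk of inputs
        return match.group(1)
    return expensive_model_extract(text)   # expensive path: only the odd cases

print(extract_invoice_number("Payment due for Invoice #48213 by Friday."))  # 48213
```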
If you’re building powerful models, it is prudent to reserve their use to when needed. Most problems can be reasonably solved with simpler techniques. Reducing the number of times you have to use your juggernaut is generally easier than making your juggernaut more efficient.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people.
For those of you interested in taking your skills to the next level, keep reading. I have something that you will love.
Upgrade your tech career with my newsletter ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here.
To help me understand you, fill out this survey (anonymous)
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819