Everything You Need to Know About How Generative AI Works
A deep dive into generative AI and its ability to craft lifelike images, text, and more, including how ChatGPT, driven by vast training data and careful fine-tuning, works under the hood.
By now, you’ve probably heard about how generative AI is taking over the world. But what is generative AI? Not so long ago (around 2010), when modern AI was still in its infancy, AI was talked about in other, simpler contexts: separating cat images from dog images, recommending movies on Netflix, or classifying Amazon reviews by sentiment.
But Generative AI is a bit different
Here, the end result is an object in the real world - be it text, image, audio, or video. Unlike the previous cases, this output isn’t a single 0/1 bit - it can be hundreds or even millions of numbers for a single generation. That might mean drawing a lifelike human face, painting an image in the style of Van Gogh, or even producing a Hollywood-level movie trailer. Generative AI is also what powers the most famous AI product on earth today - ChatGPT.
So what goes on behind the scenes in a system like ChatGPT?
For starters, it’s basically a chat client hooked up to a powerful language engine called GPT. Ok then, so what’s GPT? GPT stands for Generative Pre-trained Transformer. Let’s take a deeper look at what each of these words means:
1) “Generative” because it starts with a small amount of text and extends it by generating new words.
2) “Transformer” is a type of neural network architecture developed at Google in 2017. It is now used in almost every AI model, whether for text, vision, or audio. We’ll cover it in detail in a later article.
3) “Pre-trained” refers to the fact that the model is trained on a vast body of text on a single broad, universal language task. This is different from the olden days, when you would build a separate language model for every task: sentiment detection, question answering, summarization, etc.
Let’s dive deeper into the pre-training part
You might be wondering, how can a model trained on just one task capture so many diverse aspects of language? Well, the secret sauce consists of 2 key ingredients - the broad nature of the task, which for GPT is “next word prediction”, and the size of the training data.
Consider this sentence: “Mary decided to bake an apple ____”. In trying to predict the next word after apple (i.e. in place of the blank), we can think of several options:
1) Reasonable food items like pie, bread, tart, muffin. These are viable options because they are baked dishes made from apples.
2) We can also think of other continuations of “apple” that don’t work because they are not baked food products: juice, cider, seed, tree.
3) We can think of baked foods that aren’t made with apples: pizza, lasagna, brownies, croissants.
4) And lastly, we can come up with nonsense words that don’t fit at all: book, steel, kitchen, sink.
Even a simple bigram model (one that looks only at the previous word) can discard the words in cases 3 and 4, as these words would rarely appear right after “apple”. However, weeding out the words in case 2 is not so easy: here the model must understand “apple” and “bake” simultaneously. Fortunately, advanced models like GPT-4 are powerful enough to learn such associations, so given enough time and training data, the model will learn to answer correctly.
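To make the bigram idea concrete, here is a minimal sketch in Python. The tiny corpus and its counts are invented purely for illustration; the point is that bigram statistics easily reject “apple book” but cannot choose between “apple pie” and “apple juice” - the disambiguating word “bake” sits further back than a bigram model can see.

```python
from collections import Counter

# A tiny made-up corpus, just for illustration.
corpus = (
    "mary decided to bake an apple pie . "
    "he drank a glass of apple juice . "
    "we ate apple pie and apple tart . "
    "she poured more apple juice ."
).split()

# Count bigrams: P(next | prev) is proportional to count(prev, next).
bigram_counts = Counter(zip(corpus, corpus[1:]))

def score_candidates(prev_word, candidates):
    """Score each candidate word by how often it follows prev_word."""
    return {w: bigram_counts[(prev_word, w)] for w in candidates}

# "book" is rejected outright, but "pie" and "juice" tie - only the
# earlier word "bake" can break the tie, and a bigram model has no
# way to look that far back.
print(score_candidates("apple", ["pie", "juice", "book"]))
# {'pie': 2, 'juice': 2, 'book': 0}
```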
Ok great, now we have a model that can predict the next word efficiently. How does that help us?
Well, first of all, the internet has a ton of text, which is great for us since these models tend to be data-hungry. Secondly, the hope is that by consuming mountains of data written by humans, and learning to complete their sentences, the language model will itself start to talk like a human. Thirdly, and most importantly, running next-word prediction over and over can generate sentences, answers, poems, essays, and even entire books.
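Here is a minimal sketch of that “over and over” loop. The next_word_distribution function is a hypothetical stand-in for a trained model (real systems predict sub-word tokens rather than whole words), but the generation loop itself works the same way:

```python
import random

def next_word_distribution(words):
    """Hypothetical stand-in for a trained language model: maps the
    text so far to a probability distribution over the next word."""
    # Hard-coded toy probabilities, just to make the loop runnable.
    return {"pie": 0.6, "tart": 0.3, ".": 0.1}

def generate(prompt, max_words=20):
    words = prompt.split()
    for _ in range(max_words):
        dist = next_word_distribution(words)
        # Sample the next word in proportion to its probability.
        next_word = random.choices(list(dist), weights=list(dist.values()))[0]
        words.append(next_word)
        if next_word == ".":  # stop at the end of the sentence
            break
    return " ".join(words)

print(generate("Mary decided to bake an apple"))
# e.g. "Mary decided to bake an apple pie ."
```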
It is very important to realize the scale of data these models consume
The largest GPT models have read nearly every book ever written, in addition to a significant chunk of all the text on the internet - code, news articles, blog posts, social media. To get a sense of the scale: the average human reads about 110 million words in a lifetime, whereas GPT-4 is reportedly trained on more than 10 trillion words - the equivalent of roughly a hundred thousand human reading lifetimes. So these models are far more knowledgeable than any single human on earth, and this does make them “smarter” in a sense.
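As a quick back-of-the-envelope check on that figure, using the rough estimates above:

```python
human_lifetime_words = 110e6  # ~110 million words read per human lifetime
gpt4_training_words = 10e12   # ~10 trillion training words (reported estimate)

lifetimes = gpt4_training_words / human_lifetime_words
print(f"{lifetimes:,.0f} human reading lifetimes")  # -> 90,909
```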
Ok great, you say. I can now make my dreams of talking to the average internet nerd 24/7 come true.
But how does that help me?
After all, as anybody who has used the internet knows, the people there are not always the most helpful, knowledgeable, or accurate. This is addressed by the miracle of fine-tuning, especially in its most famous form, RLHF (Reinforcement Learning from Human Feedback). The math behind this process is fairly complex, but we can look at a simplified representation. Essentially, something like this happens during RLHF:
Case 1: Reclusive bot.
User: Help me bake a birthday cake.
Model: Buzz off! Do it yourself!
Trainer: Bad model (-10 behavior points).
Case 2: Helpful bot.
User: Help me clean the dishes.
Model: First we need to make sure we have soap or cleaning liquid, a scrubber, a water source. Clean the dishes one at a time…
Trainer: Good job! (+10 behavior points)
The outputs of the above two queries are shown to human judges. These judges have been told to evaluate models on various criteria like accuracy and helpfulness while minimizing bias, harm, and unfair behavior. In the above example, these judges would give the dishwashing bot a proverbial pat on the back, while the cake bot would be asked to mend its ways.
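In code, you can picture the training signal as a cartoon like the one below. This is not the real RLHF math - actual systems train a separate reward model on human preferences and then optimize the language model against it with reinforcement learning (e.g. PPO) - and update_model here is a hypothetical stand-in for the optimizer:

```python
# Human judges score model outputs; the scores nudge the model
# toward behavior that earns higher ratings.
feedback = [
    # (user request, model response, judge's score)
    ("Help me bake a birthday cake.", "Buzz off! Do it yourself!", -10),
    ("Help me clean the dishes.",
     "First we need soap, a scrubber, and a water source...", +10),
]

def update_model(model, prompt, response, score):
    """Hypothetical stand-in: raise the model's probability of giving
    `response` to `prompt` when score > 0, lower it when score < 0."""
    action = "reinforce" if score > 0 else "discourage"
    print(f"{action}: {response!r} (score {score:+d})")
    return model

model = {}  # placeholder for the language model's parameters
for prompt, response, score in feedback:
    model = update_model(model, prompt, response, score)
```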
After enough rounds of this training, the language model starts behaving like a nice, polite, and helpful assistant! It can now help you with almost any task, from finding the capital of France to implementing merge sort in JavaScript.
And one day soon, it might even bake you an entire cake!