
Large Language Model Algorithms in Plain English

On Friday, May 12th, we had a tremendous turnout for the "Large Language Models in Plain English" event. As promised, it was designed especially for product managers and non-technical professionals to gain a solid grasp of the mechanics of Large Language Models (LLMs), such as the one behind OpenAI's ChatGPT.


At its core, an LLM like the one behind ChatGPT is a mathematical model trained to predict the next word in a sentence. The training phase is like the intricate process of teaching a child to speak and understand language: the model goes through tons of textual data, learning the nuances of language, grammar, syntax, context, and even cultural references. This is a big part of why ChatGPT is so good at understanding and generating human-like text. During the event, I unraveled these complex threads of LLMs.


For those who missed the live event, don't worry! I've made sure to record it for you. The recording and the event notes are below, and they are also accessible at https://www.linkedin.com/video/event/urn:li:ugcPost:7061853853268250625/. You can also download the PDF version of the notes here:

Large Language Model Algorithms in Plain English.pdf (3.28 MB)

To further your career and help you deliver unprecedented value to your customers through Generative AI and AI more broadly, my company, the AI Product Institute, offers workshops on Generative AI products, training programs on the AI product development lifecycle, and AI business strategy, tailored to both individual product managers and corporate product management teams. For more details on our workshops and training programs, please visit https://www.aiproductinstitute.com/generative-ai.


Remember, the future is not something that just happens to us. It's something we create together. Let's continue to learn, innovate, and lead responsibly.


If you have any questions, you can reach out to me via LinkedIn at https://linkedin.com/in/adnanboz or email me at adnan@aiproductinstitute.com.





Event Notes

The ChatGPT chatbot uses a type of LLM called GPT (Generative Pre-trained Transformer); it “generates”, it does not “answer”. It works with a prompt and a completion1. The prompt is the text you enter; the completion is the result you receive. Even the OpenAI API endpoints are named “https://api.openai.com/v1/chat/completions”8 and “https://api.openai.com/v1/completions”9.
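For readers who like to see it concretely, here is a minimal sketch of a prompt-and-completion call against the completions endpoint named above, using Python's requests library. The model name and prompt are illustrative placeholders, and you would supply your own API key.

```python
# Minimal sketch of a prompt -> completion call against the completions endpoint.
# Assumes the requests package and a valid OPENAI_API_KEY environment variable.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "text-davinci-003",   # illustrative completion model name
        "prompt": "I enjoy reading",   # the prompt: the text you send
        "max_tokens": 10,              # upper bound on the completion length
    },
)

print(response.json()["choices"][0]["text"])  # the completion: the text you get back
```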


OpenAI has released GPT foundation models that have been sequentially numbered to comprise its "GPT-n" series. However, this doesn't mean that there is an actual deployed model named "GPT-n"; in fact, the default LLMs that are deployed on the cloud and available under the GPT-3 version are ada, babbage, curie, davinci, text-ada-001, text-babbage-001, and text-curie-001. Similarly, the GPT-3.5 and GPT-4 versions have multiple models available for usage. Each of these actual models has different capabilities.12


Google Bard, another popular conversational generative AI chatbot, is based on the LaMDA (Language Model for Dialogue Applications)2 family of large language models, as well as PaLM 2 for improved multilingual capabilities3.


Most LLMs are trained through a process known as generative pretraining4, where the model learns to predict text tokens from a given training dataset. This training can generally be categorized into two primary methods.


GPT-style (Generative Pre-trained Transformer), aka autoregressive ("predict the next word"): the model is presented with text tokens such as "I enjoy reading" and is trained to predict the upcoming tokens, for example, "a book in the park".


BERT5-style (Bidirectional Encoder Representations from Transformers), aka masked ("fill in the blank"): the model is provided with a text segment in which certain tokens are masked, like "I enjoy reading [MASK] [MASK] in the park", and it is expected to predict the concealed tokens, in this case, "a book".
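To make the contrast concrete, here is a hand-written sketch of what training examples for the two styles might look like for the sentences above. Real pipelines work on token IDs and at vastly larger scale, so treat this purely as an illustration.

```python
# Illustrative (input, target) pairs for the two pre-training styles described above.
# Real models operate on token IDs, not raw strings; this only shows the objectives.

# GPT-style (autoregressive): predict the next token given everything before it.
gpt_examples = [
    ("I enjoy reading", "a"),
    ("I enjoy reading a", "book"),
    ("I enjoy reading a book", "in"),
    ("I enjoy reading a book in", "the"),
    ("I enjoy reading a book in the", "park"),
]

# BERT-style (masked): predict the hidden tokens from the full surrounding context.
bert_example = ("I enjoy reading [MASK] [MASK] in the park", ["a", "book"])
```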


GPT's and BERT's foundation lies in the transformer6 neural network architecture, a novel concept that revolutionized language processing tasks. Transformers depart from traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs); instead, they use a mechanism called 'attention' to understand the context of words in a sentence. In simple terms, the attention mechanism allows the model to focus on the important parts of the input sequence when producing an output, rather than processing the input in a fixed order or looking at each word in isolation.
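As a rough sketch of the idea (not the full transformer architecture), scaled dot-product attention can be written in a few lines of Python with NumPy. The matrices below are random stand-ins for the learned query, key, and value projections.

```python
# Toy scaled dot-product attention: each output position is a weighted mix of the
# value vectors, where the weights ("attention") come from query-key similarity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant each position is to each other
    weights = softmax(scores, axis=-1)        # turn scores into probability distributions
    return weights @ V                        # blend the values according to the weights

# 4 positions (e.g. 4 tokens), 8-dimensional vectors; random stand-ins for learned projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8): one context-aware vector per position
```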


The training data for generative pre-training on text is called the corpus. In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts.7 BERT was trained on 3.3 billion words, GPT-2 on 10 billion tokens, GPT-3 on 499 billion tokens (410B from Common Crawl, 19B from WebText2, 12B from Books1, 12B from Books2, and 3B from Wikipedia), LaMDA on 1.56T words (168 billion tokens), and PaLM on 768 billion tokens.4


In natural language processing, a token is a piece of a whole, so a "token" could be a word or part of a word. The way a body of text is split into tokens can vary. For English language models, tokens are often individual words and punctuation. In GPT-3, a token is more accurately a subword unit.10 In ChatGPT, the model reads in one token at a time and tries to predict the next token, given the previous ones.


OpenAI provides a web tool called “Tokenizer” at https://platform.openai.com/tokenizer that allows anyone to see and count the tokens of a prompt.



LLMs are mathematical functions whose input and output are lists of numbers. Consequently, words must be converted to numbers. In general, an LLM uses a separate tokenizer, which maps between texts and lists of integers.11 Different models use different tokenizers. GPT-3 (OpenAI) uses Byte Pair Encoding (BPE), a form of subword tokenization that can break words down into smaller parts. BERT (Google) uses WordPiece tokenization, another form of subword tokenization similar to BPE; it splits words into smaller units and prefixes all but the first token of a word with '##' to indicate that they are subparts of a larger word. RoBERTa (Facebook), a variant of BERT, uses Byte-Level BPE, which operates at the byte level, allowing it to handle any possible byte sequence rather than being limited to Unicode characters.
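If you prefer code to the web tool, the snippet below shows the same mapping between text and integer token IDs using OpenAI's open-source tiktoken package; I am assuming here that you have it installed (pip install tiktoken).

```python
# Text <-> token IDs with OpenAI's tiktoken package.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")  # picks the tokenizer used by this model

token_ids = enc.encode("I enjoy reading a book in the park.")
print(token_ids)              # a list of integers, one per token
print(len(token_ids))         # the token count, which pricing and context limits refer to
print(enc.decode(token_ids))  # back to the original text
```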


The term "max token" is also used to refer to the maximum length of text the model can handle in one pass, often referred to as the "maximum sequence length" or "context window." For instance, GPT-3.5 has a maximum sequence length of 4,096 tokens total for prompt and completion, GPT-4 has 8,192 and GPT-4-32k has 32,768.12


The number of tokens in the prompt and completion together also determines the price of using the OpenAI API. For example, as of February 2023, the rate for using Davinci is $0.06 per 1,000 tokens, while the rate for using Ada is $0.0008 per 1,000 tokens.
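A back-of-the-envelope cost estimate is then just the token count times the rate; the sketch below uses the per-1,000-token rates quoted above.

```python
# Rough API cost estimate using the per-1,000-token rates quoted above.
def estimate_cost(prompt_tokens: int, completion_tokens: int, rate_per_1k: float) -> float:
    return (prompt_tokens + completion_tokens) / 1000 * rate_per_1k

print(estimate_cost(900, 100, 0.06))    # Davinci: 1,000 tokens -> $0.06
print(estimate_cost(900, 100, 0.0008))  # Ada: 1,000 tokens -> $0.0008
```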


Parameters are the learned parts of the model. For GPT-style models, which are transformer-based, the parameters are the weights and biases in the various layers of the model. During training, the model learns the best values for these parameters to predict the next token in the input text. The number of parameters in a model often correlates with its capacity to learn and represent complex patterns. For instance, GPT-3 has 175 billion parameters, BERT has 340 million, LaMDA has 137 billion, PaLM has 540 billion, and GPT-4 is estimated to have around 1 trillion.13


Bi-gram language models (LMs), that is, n-gram language models with n=2, provide a good way to understand the inner workings of GPT-style LLMs.14 They also predict the next element in a sequence, whether a word, a token, or even a character, based on the previous one, essentially considering pairs, or "bi-grams." A character-based bi-gram model, for instance, would generate the next character of a text.



Take, for example, the corpus “sunny day.”; a bi-gram model would create the following bi-grams:

s -> u (u comes after s),

u -> n (n comes after u),

n -> n (n comes after n),

n -> y (y comes after n),

y -> (space) (a space comes after y)

(space) -> d (d comes after the space)

d -> a (a after d)

a -> y (y after a)

y -> . (. after y)

This complete logic is a language model.
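Here is a small Python sketch that builds exactly this bi-gram table from the corpus “sunny day.”; the variable names are mine, but the counting logic is the one listed above.

```python
# Build a character-level bi-gram model from the corpus "sunny day."
# For every character, count which characters follow it.
from collections import defaultdict, Counter

corpus = "sunny day."

bigram_counts = defaultdict(Counter)
for current_char, next_char in zip(corpus, corpus[1:]):
    bigram_counts[current_char][next_char] += 1

print(dict(bigram_counts["s"]))  # {'u': 1}           -> u comes after s
print(dict(bigram_counts["n"]))  # {'n': 1, 'y': 1}   -> both n and y come after n
```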



The unique set of these characters is called the “vocabulary”. In this example the vocabulary consists of the unique characters “s”, “u”, “n”, “y”, “d”, “a”, the space, and the period, so the “vocabulary size” is 8 characters. BERT has a vocabulary size of 30K tokens, the GPT-3 ada model 50K tokens, davinci 60K tokens, and babbage and curie 50K tokens each.15



This information is enough to predict and generate the next character. If I started a prompt with an “s”, what would you complete it with if you were a bi-gram LM? “u”, of course.



You can keep generating until “sun” without any doubt. Once you hit “n”, you encounter a problem because, given “n”, our simple model tells us that there are two possible options: either another “n” or a “y”. Which one would you choose if your logic were limited to the model?



This is where the probabilistic behavior of LLMs comes into the picture. In these cases the algorithm rolls a die to pick one. As you can imagine, it would land on the correct letter only 50% of the time. However, if we used a larger corpus such as “sunny day in ny.”, where “y” follows “n” two times while “n” follows “n” only once (and a space follows “n” once), the probability distribution shifts: roughly 50% for n->y, 25% for n->n, and 25% for n->space, making “y” the most likely choice.


As you can imagine, with a much larger corpus of tens of thousands of words, the characters that could come after “n” would range from “a” all the way to “z”, plus other characters such as “:”, “.” and more. These possibilities can be expressed as probabilities and represented in what we call a probability distribution. Then we again need to decide which one to pick. Because of how our language is formed, it is highly likely that some probabilities will always be larger than others, which would allow us to simply pick the biggest. However, what if our corpus is biased, or the LLM architecture we picked really cannot store enough logic? This is why LLMs roll the dice over the probability distribution, weighting each option by its probability. For example, if n->y happens 4% of the time and n->n 1% of the time, then every time I run the LLM with this prompt there will be a higher chance of getting “y” after “n” rather than another “n”.
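In code, “rolling the dice over the probability distribution” is just weighted random sampling; the counts below are illustrative stand-ins for what a larger corpus might produce.

```python
# Weighted sampling: pick the next character in proportion to how often it followed "n".
import random

next_char_counts = {"y": 4, "n": 1}          # illustrative counts of what followed "n"
total = sum(next_char_counts.values())
probabilities = [count / total for count in next_char_counts.values()]  # [0.8, 0.2]

# random.choices draws in proportion to the weights, so "y" comes up far more often than "n".
samples = random.choices(list(next_char_counts), weights=probabilities, k=10)
print(samples)
```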


This is the fundamental reason why you can get a different completion from ChatGPT for the same prompt. But this issue is not specific to LLMs. Every ML model that has to cross the boundary from the probabilistic world to a deterministic one, where only one choice can exist, encounters the same problem. Since we are not dealing with the quantum world or performing the Schrödinger's cat experiment, we will always have this problem of choosing one final output over another. In cancer detection from x-ray images, for instance, although the ML classification algorithm outputs a probability, something or someone has to make the final binary decision of benign or malignant. Would you pick the benign output if it were showing 50.0001%?


To provide some level of control to users, OpenAI exposes two parameters: temperature and top_p. You can set them in your API call or test them out in the Playground at https://platform.openai.com/playground.




A low temperature makes the output more focused, potentially causing repetition by favoring the most likely next word. At a temperature of 1, the model uses the raw values directly, striking a balance between diversity and coherence. High temperatures increase output diversity but may lead to nonsensical outputs by giving more weight to less likely words. This happens because the raw values (logits) are divided by the temperature before the exponential function is applied, so larger temperatures yield smaller values pre-exponentiation and hence a flatter probability distribution, which allows multiple options to have similar probabilities. This is captured by the formula: softmax(x_i / T) = exp(x_i / T) / Σ_j exp(x_j / T).
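Here is a small NumPy sketch of that formula with made-up logits, showing how temperature flattens or sharpens the distribution.

```python
# Temperature scaling: softmax(x_i / T). Lower T sharpens the distribution, higher T flattens it.
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = np.asarray(logits) / temperature
    e = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.1]                # made-up raw scores for three candidate tokens

print(softmax_with_temperature(logits, 0.5))  # low T: almost all weight on the top token
print(softmax_with_temperature(logits, 1.0))  # T = 1: the raw softmax
print(softmax_with_temperature(logits, 2.0))  # high T: probabilities much closer together
```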


Top_p, also known as nucleus sampling, adds an element of randomness instead of always selecting the most probable next word, making the generated text more diverse. A low top_p value means that the model will only consider a small subset of the most probable next words, leading to more focused and coherent, but potentially repetitive, outputs. At a top_p of 1, the model considers all possible next words, leading to more diverse outputs. When the top_p value is set to a specific fraction (say 0.9), the model dynamically selects the smallest set of next words whose cumulative probability exceeds this fraction, so the candidate set may be larger or smaller depending on the individual probabilities. This approach allows for more randomness than temperature scaling alone, but still places a higher likelihood on more probable words.
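Conceptually, nucleus sampling looks like the sketch below; this is a simplified illustration with made-up probabilities, not OpenAI's actual implementation. It keeps the smallest set of words whose cumulative probability exceeds top_p, renormalizes, and samples from that set.

```python
# Simplified nucleus (top_p) sampling over an already-computed probability distribution.
import numpy as np

def nucleus_sample(tokens, probs, top_p):
    rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]                                # most probable first
    sorted_probs = np.asarray(probs)[order]
    cutoff = np.searchsorted(np.cumsum(sorted_probs), top_p) + 1   # smallest set covering top_p
    kept = order[:cutoff]
    kept_probs = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize within the nucleus
    return tokens[rng.choice(kept, p=kept_probs)]

tokens = ["book", "newspaper", "lot", "banana"]
probs = [0.55, 0.30, 0.10, 0.05]                 # made-up next-word probabilities
print(nucleus_sample(tokens, probs, top_p=0.9))  # sampled from {"book", "newspaper", "lot"}
```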


Emergent abilities of large language models are unpredictable behaviors or skills that are not present in smaller models but emerge in larger ones, and cannot be easily predicted by simply extrapolating the performance of smaller models. For example, large language models have been shown to perform well on tasks that require reasoning or inference, such as question-answering, even with limited or no training data. The existence of emergent abilities in large language models raises important questions about the potential for further expansion of the range of capabilities of these models through additional scaling. It also motivates research into why such abilities are acquired and how they can be optimized. If you want to learn more about this topic please read “Emergent Abilities of Large Language Models”16.


Language model "hallucination" refers to when a model generates information not present or implied in the input. This can occur due to biases in training data or limitations in model architecture. For example, if you ask a language model "What is the color of George Washington's smartphone?" it might respond, "George Washington had a yellow smartphone." In reality, smartphones didn't exist in Washington's time, so this is a hallucination: the model is inventing details that aren't factual, based on its training on modern language and lack of deep understanding of historical context. Hallucination is often defined as "generated content that is nonsensical or unfaithful to the provided source content".17





References

1 https://platform.openai.com/docs/guides/completion

2 https://en.wikipedia.org/wiki/Bard_(chatbot)

3 https://blog.google/technology/ai/google-palm-2-ai-large-language-model/, https://blog.google/technology/ai/bard-google-ai-search-updates/

4 https://en.wikipedia.org/wiki/Large_language_model

5 https://en.wikipedia.org/wiki/BERT_(language_model)

6 https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)

7 https://en.wikipedia.org/wiki/Text_corpus

8 https://platform.openai.com/docs/api-reference/chat

9 https://platform.openai.com/docs/api-reference/completions

10 https://en.wikipedia.org/wiki/Lexical_analysis#Token

11 https://en.wikipedia.org/wiki/Large_language_model#Tokenization

12 https://platform.openai.com/docs/models/gpt-4

13 https://en.wikipedia.org/wiki/Large_language_model#List_of_large_language_models

14 https://en.wikipedia.org/wiki/N-gram_language_model

15 https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/tokens

16 https://openreview.net/forum?id=yzkSU5zdwD

17 https://en.wikipedia.org/wiki/Hallucination_(artificial_intelligence)



