Introduction to Building an LLM

- Intuition for decoder-only LLMs
- Tokens, embeddings, transformer pipeline
- Autoregressive next-token generation
- Generative AI modalities overview
- Diffusion vs transformer model families
- Inference flow and prompt processing
- Build a real LLM inference API
- Architecture: attention, context, decoding
- Training phases: pretrain to RLHF
- Vertical vs generic LLM design
- Distillation, quantization, efficient scaling
- Reasoning models: Chain of Thought and Test Time Compute
- Hands-on Exercises

  • [00:00 - 00:04] What we're going over is two things. We're going over some high-level generative AI technical concepts.

    [00:05 - 00:12] And then we're going to go a little bit into the high-level internals of foundational models. We want to introduce you to the capabilities.

    [00:13 - 00:27] And the other thing we're going to talk about is building up intuition for decoder-only large language models, and talk about inference, model quality, reasoning, and the large language model life cycle as well. What is generative AI exactly?

    [00:28 - 00:35] So generative AI is merging language with a particular set of data. So you could call it merging prompts with data.

    [00:36 - 00:48] So whether it's language with images, language with music, language with code. And because a lot of you are working on business applications, this is the merger of language with SQL, language with sports, language with PDFs, language with documents.

    [00:49 - 00:56] Some of you want to do different things with different modalities like 3D or video. And it's a merger of language with video and 3D as well.

    [00:57 - 01:04] So what is generative AI exactly? Basically, at the very base layer, generative AI is a neural network.

    [01:05 - 01:09] So neural networks are not new. They started a long time ago, basically in the 1950s.

    [01:10 - 01:33] But it wasn't until around 2016, when GPUs were good enough, that a lot of parallel computation could actually train these neural networks to get state-of-the-art results in particular areas. So you may have heard me talk in the webinar about large language models coming pre-built with, so to speak, eyes, ears, nose, and mouth.

    [01:34 - 01:44] So that actually started around 2016 to 2017. And the earlier breakthrough was a convolutional neural network, which is one of the architectures on this slide.

    [01:45 - 01:54] It's a CNN, basically. It's a convolutional neural network that achieved state-of-the-art results in image processing, being able to recognize images.

    [01:55 - 02:01] And then transformers were invented at Google, but Google didn't really believe in them. One of the inventors of transformers left and basically founded Character.AI.

    [02:02 - 02:13] And then because of the OpenAI threat in 2022, Google's traffic actually dipped. And this is basically what caused the huge panic inside Google to go after transformers and large language models.

    [02:14 - 02:26] And to be specific, in this course we're really focused on one type of architecture. You'll see other courses, especially deep learning courses, try to cover a lot of different architectures.

    [02:27 - 02:35] We're really focused on transformer-based large language models. And I'll be even more specific: it's the decoder-only architecture of large language models.

    [02:36 - 02:39] It's not an encoder-decoder architecture. Then we're gonna go into specifics.

    [02:40 - 02:48] As you get into different types of functionality, there are different types of architectures underneath. So multimodal language models use vision transformers underneath.

    [02:49 - 02:58] Other systems like video generation use what's known as a diffusion transformer underneath. Avatar generation basically uses different architectures again.

    [02:59 - 03:08] So we really had to focus. We can't cover avatar generation and all these different things in one course and still be able to teach you how to really adapt transformer-based language models.

    [03:09 - 03:16] So what is a large language model, exactly? The heart of large language models was basically invented in 2016, 2017.

    [03:17 - 03:19] It's called the transformer. It's the heart, the core data structure.

    [03:20 - 03:32] It defines what we know as ChatGPT, basically. It was invented then, but really large-scale data wasn't applied to it until around 2021, 2022.

    [03:33 - 03:47] That's when you got GPT-3 and then ChatGPT; what you know as ChatGPT is GPT-3.5. And what they really did is they downloaded the internet's data, used a decoder-only architecture, and then they mixed it with reinforcement learning.

    [03:48 - 03:52] So there are actually different flavors of reinforcement learning, and we're going to go into the different flavors.

    [03:53 - 03:58] But at a very high level, large language models are the merge of language models and reinforcement learning.

    [03:59 - 04:09] So what you basically have is two core models at the heart of everything. One is the transformer-based language model.

    [04:10 - 04:14] The second is diffusion models. So we're going to go into a little bit of the diffusion models.

    [04:15 - 04:22] And then you basically have video models, which are a hybrid between diffusion and transformers. They're known as diffusion transformer (DiT) systems.

    [04:23 - 04:29] That came out in 2022. And there are state-of-the-art video systems that use it as a backbone.

    [04:30 - 04:35] Basically this came out this year, basically three months ago or something. And then after that, you have an optimization problem.

    [04:36 - 04:45] So you'll basically see in the media people talking about different architectures, for example, Mamba. Recently, NVIDIA released a hybrid Mamba-transformer architecture.

    [04:46 - 04:50] Why? Basically because a lot of the labs basically require a hundred million dollars to train one of these models.

    [04:51 - 04:57] And so what they want to do is they want more people to train models and use their GPUs. So they actually open source the entire pipeline.

    [04:58 - 05:02] And what they use is they use a transformer Mamba architecture. Why?

    [05:03 - 05:10] Because basically the internals of the transformer are not very efficient. And Mamba optimizes both time and space to make it much more efficient.

    [05:11 - 05:15] This came out like maybe two weeks ago, three weeks ago. And then you can take a look at it.

    [05:16 - 05:25] So all of it is open source. So you can actually take the concepts that we cover right now, look at the NVIDIA architecture, and go into the actual specifics of the code.

    [05:26 - 05:40] Traditionally, you had a lot of people referring to generative models as first generation or second generation. If you look at the academic literature, the first generative technology is called GANs, but I'll refer to something that's really practical.

    [05:41 - 05:46] The first technology in my mind that's really practical is diffusion models. You may know this as Midjourney, basically.

    [05:47 - 05:59] So what a diffusion model basically does is it takes something that's pure noise, and then you have a reference language prompt or a reference image. And then what it does is it systematically de-noises it.

    [06:00 - 06:15] Basically it goes from a grainy picture, like what you see on old televisions, and systematically moves toward a natural image. So if you've ever used Midjourney or some interior-design AI applications, you'll basically see this happening.

    [06:16 - 06:24] And this, underneath, is called diffusion-based technology. So one of the core things about generative AI is that you're able to mimic and transfer style.

    [06:25 - 06:36] And so you're able to take a Picasso or another artist's style and transfer that style. You can also see this with prompts: when you're writing prompts, you have a lot of role-based prompting.

    [06:37 - 06:47] "Imagine you're a marketer, you're an expert marketer," and then you write the prompt. So this is a core part of the technology, and it's what's known as style transfer.

    [06:48 - 06:54] And you have different ways of generating different examples. You have more artistic, and then you can also generate photorealistic examples.

    [06:55 - 07:10] So diffusion models are basically great for artistic style transfer and creative remixing, but they're not very precise. They sometimes run into problems where you have text in the prompt and you get different artifacts in the image, and then you have to keep trying it over and over again.

    [07:11 - 07:16] So this is actually the reason why ChatGPT's image generation is actually based on transformers. It's actually not based on diffusion.

    [07:17 - 07:25] So if you use ChatGPT's image generation system, it's much more granular. You're able to place text, you're able to do more things.

    [07:26 - 07:31] It's basically because it's transformer-based, not diffusion-based. So you have different kinds of artifact examples.

    [07:32 - 07:44] So yes, you can get a picture like this, but you may have to keep experimenting with the parameters and tuning it and so forth. And you can do product placement, but what you see here is you have text problems.

    [07:45 - 07:57] You see the Coca-Cola label, and it's mixed up. So it's hard to get shape consistency or text-generation consistency, or to really adjust it and be specific to a given object.

    [07:58 - 08:04] However, you can fine-tune it, basically. So there's a number of AI applications where you can fine-tune it to do this.

    [08:05 - 08:15] So if you looked at this on the very left, you would actually probably think, this is a DSLR camera that shot it. And then if you looked at this, you would basically think, oh, that's a good artist that basically was able to create it through digital tools.

    [08:16 - 08:23] So it's able to create good results. And in fact, some of the examples I put in the webinar are fine-tuned diffusion models, basically.

    [08:24 - 08:35] So if you want to do headshot applications or other things like that, you can also fine-tune it more specifically, and there are different technologies to fine-tune it more specifically. For example, there's something called ControlNet.

    [08:36 - 08:41] Basically, you can also fine-tune it with a particular style. You can fine-tune it for interior design or other things if you want.

    [08:42 - 08:49] So you can also basically adjust it for fashion as well, basically to be able to generate this. I'm sure you guys have seen basically fashion models now.

    [08:50 - 08:56] There's like fashion models, avatars are being created. The problem with basically a lot of fashion model avatars is it depends on the technology.

    [08:57 - 09:04] Sometimes it's diffusion-based, but sometimes it's other architectures. So the second generation was really transformer-based, multimodal, large-language models.

    [09:05 - 09:09] And, yeah, Kevin mentioned Nano Banana made improvements as well. Yeah, that's true.

    [09:10 - 09:20] And so what you basically see is that what people now refer to as AI is really transformer-based large language models. Multimodal transformer-based language models have both image and language attached to them.

    [09:21 - 09:32] So you can do code, you can do music, you can do voice, you can do a lot of different applications. One of the fundamental concepts of transformers is tokens.

    [09:33 - 09:41] Basically, tokens come from a dictionary, a vocabulary. So when you're adapting systems, a lot of times you're not dealing with tokens directly, but you are dealing with embeddings.

    [09:42 - 09:53] And so transformers don't see words; they primarily see tokens. And tokens are chunks of words or sub-words, converted into numbers so the model is able to manipulate them.

    [09:54 - 09:59] So, transformers are able to predict the next token, basically one at a time. Anything can basically become a token.

    [10:00 - 10:11] And you can have code, you can have legal information, medical, design, and everything. So you can basically convert everything into a token, which can become an embedding, and then you can store it into a vector database.
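
    To make the tokenization step concrete, here is a minimal sketch in code. It assumes the Hugging Face transformers library and uses the GPT-2 tokenizer purely as an illustration; it is not necessarily the tokenizer of the models discussed in this lesson.

    ```python
    # Minimal tokenization sketch, assuming the Hugging Face `transformers`
    # library and the GPT-2 tokenizer as an illustrative choice.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    text = "AI is cool"
    token_ids = tokenizer.encode(text)                     # integers indexing the vocabulary
    tokens = tokenizer.convert_ids_to_tokens(token_ids)    # the sub-word chunks the model "sees"

    print(tokens)
    print(token_ids)
    ```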

    [10:12 - 10:23] There are a lot of different modalities that you can control, but the reality is once you get into text-based modalities, you can control pretty much anything.

    [10:24 - 10:33] As long as it can be described in text, it can be controlled via language. So what that means is that tokens are a simple transformation.

    [10:34 - 10:51] I'm not sure if you've ever done silly cryptography games where you're trying to create a cipher. In some sense, tokens are like a basic cipher, where you're just translating text into some type of dictionary that can represent the information.

    [10:52 - 11:01] So in this case, this would be more of a token rather than an embedding. An embedding is basically taking this information and adding more and more meaning to it.

    [11:02 - 11:15] So what we see is that the entire process here is basically taking something simple and enriching it. What we know as a cat is not as simple as understanding that it's the letters C-A-T.

    [11:16 - 11:19] Basically, we understand a cat as multidimensional. We understand that it's an animal.

    [11:20 - 11:23] It could be a pet, it's furry. Basically, it's not violent to us.

    [11:24 - 11:32] Basically, it's almost like family to us. So there are a lot of meanings embedded inside the word that the artificial intelligence has to pick up.

    [11:33 - 11:50] So the pipeline where you go from a token to an embedding to adding attention is effectively enriching it with meaning, further and further, so that it captures the meanings that are native to the actual word. That's basically the process.

    [11:51 - 11:55] So tokens are a relatively simple representation. Embeddings are more complex.

    [11:56 - 11:59] Embeddings are typically a tensor. Basically, it looks like a multidimensional matrix.
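
    To make the token-to-embedding step concrete, here is a minimal sketch assuming PyTorch; the vocabulary size, embedding dimension, and token IDs are made-up illustration values, not those of any particular model.

    ```python
    # Minimal sketch of turning token IDs into embedding vectors, assuming PyTorch.
    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50_000, 768                   # illustrative sizes
    embedding = nn.Embedding(vocab_size, embed_dim)

    token_ids = torch.tensor([[314, 1101, 257, 3797]])    # hypothetical token IDs
    vectors = embedding(token_ids)                        # shape: (1, 4, 768)
    print(vectors.shape)   # each token is now a 768-dimensional vector that can carry meaning
    ```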

    [12:00 - 12:09] So in general, a lot of your applications can be described as text-to-text. Some of you have mentioned to me that you may want to do 3D or graphics manipulation.

    [12:10 - 12:20] There are two ways to do a lot of these things. One is you create a neural network architecture, or you fine-tune an existing architecture, and then you're able to describe the problem in that architecture.

    [12:21 - 12:31] The other way is you describe it in text. So anything that's 3D, describe it in a textual format, and then you're able to map the language to the textual format and train the model to do it.

    [12:32 - 12:42] So that's why you're able to train things that are text-to-SQL, text-to-document, text-to-code. Anything that can be described text-to-text works: you basically map the input to the output.

    [12:43 - 12:51] We're going to go into specifics later, but I want to mention it now. Specifically, a lot of the applications now are chains of models.

    [12:52 - 12:55] So for example, you build a voice receptionist. It interacts with Twilio.

    [12:56 - 13:01] It interacts with an IVR system. And so it does real-time voice processing, voice-to-text processing.

    [13:02 - 13:12] Then the text processing has a reasoning component, and it then calls something like a CRM or scheduling application. So effectively what you have is voice-to-text transcription.

    [13:13 - 13:20] Then you have text-to-text tool calls, and there's a chain of applications. Sometimes it's the same model, sometimes it's different ones.

    [13:21 - 13:30] Generally speaking, you want the right tool for the right kind of use. So there are certain voice-based applications that are specifically trained for voice that are better.

    [13:31 - 13:41] And then there are other solutions that are designed for tasks that are better for those kinds of things, like for example using tools. So transformers output step by step.

    [13:42 - 13:49] It goes from the prompt to token prediction to the final output. That entire process is called inference.

    [13:50 - 13:57] So you'll hear people talk about pre-training, fine-tuning, and then inference. Inference is basically where the model is serving you.

    [13:58 - 14:10] So think of it as an API call to the model. When you hear people talk about pre-training, they're downloading the entire internet's data and training a large language model on it.

    [14:11 - 14:26] When you're talking about fine-tuning, you're refining the model through reinforcement learning or for your specific use case. And then at inference time the model converges on an answer, going step by step until the output is meaningful or correct.

    [14:27 - 14:32] Transformers don't use traditional databases. With transformer-based models, you have to store everything as vectors.

    [14:33 - 14:42] I'm sure you saw in the workshop that everything is done using vectors, and so you have to store everything in a vector database. How does a model turn a prompt into a response?

    [14:43 - 14:51] There are multiple steps. You have tokenization, then you have the embeddings, then you have the model architecture, and then you have the output generation.

    [14:52 - 15:07] So generally speaking, when you're adapting foundational models, you primarily deal with giving examples for the embeddings and fine-tuning the different layers in step three. So from an adaptation standpoint, you're primarily dealing with steps two and three.

    [15:08 - 15:18] If you're building the foundational model, obviously you're building step one as well. And output generation you deal with too, but it's primarily steps two and three where you do what they call fine-tuning.

    [15:19 - 15:29] So, to recap, tokenization converts text into tokens, and tokens are chunks of language. You take something and split it into individual components.

    [15:30 - 15:40] So "AI is cool" can become "AI", "is", "cool", or it can be split in different ways. Then embeddings are more complex.

    [15:41 - 15:56] So instead of splitting everything into "AI is cool" and converting each token to just one number, which is like a dictionary lookup, every single word is now a series of numbers. So "cool" can be 0.23, -0.45, 0.76.

    [15:57 - 16:11] And these vectors capture meaning and context. So what you should think about this entire multi-step pipeline is that you increasingly enrich the context as you go through the steps, and then the model is able to output the information.

    [16:12 - 16:24] The architecture is where you have different layers. The input goes into the model core, and then it passes through the different layers depending on the architecture. So sometimes it's encoder-decoder.

    [16:25 - 16:29] Sometimes it's decoder-only. There are different components inside where you're basically going through layer by layer, right?

    [16:30 - 16:45] And then model generation outputs one token at a time, and this is what's known as autoregressive generation. So you have the initial prompt, then you have tokenization, then you have the embeddings, then you have model processing, and then you have the generated output.
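
    Here is a minimal sketch of that autoregressive loop, assuming the Hugging Face transformers library with GPT-2 as a small stand-in model and simple greedy decoding (real systems add sampling, KV caching, and stopping rules).

    ```python
    # Minimal autoregressive generation sketch: predict one token, append it, repeat.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    input_ids = tokenizer.encode("The transformer architecture", return_tensors="pt")

    with torch.no_grad():
        for _ in range(20):                                   # generate 20 tokens, one at a time
            logits = model(input_ids).logits                  # scores over the whole vocabulary
            next_id = torch.argmax(logits[:, -1, :], dim=-1)  # greedy: most likely next token
            input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

    print(tokenizer.decode(input_ids[0]))
    ```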

    [16:46 - 16:54] Tokenization and embeddings were really invented before. The key difference was the transformer, which was invented in 2016 and 2017.

    [16:55 - 17:10] What happened later, in 2021 and 2022, which caused the current boom, is that transformers were trained on the entire internet's data and then reinforcement learning was applied. So some of these technologies, like attention and embeddings, are not new.

    [17:11 - 17:22] Transformers are 2016-era technology, but it's the putting together of everything in 2022, where the entire pipeline was assembled, that became known as ChatGPT.

    [17:23 - 17:36] The parameter size is generally related to how many tokens the model is trained on. So all this data is trained on Reddit data, Wikipedia data, crawled forums, and other types of data.

    [17:37 - 17:50] And then you go through a training process and you get a parameter size. So if you think of everything as being enriched more and more, the more training you give it, the more context it's able to understand.

    [17:51 - 18:03] This is why GPT-4 is rumored to be around one trillion parameters. A lot of the open-source models are quite a bit smaller than that, partially because people want to run their own models, but usually it's for a specific solution.

    [18:04 - 18:11] So it doesn't need the entire internet's context for a specific solution. Ideally, as the model gets smaller, it retains the capabilities.

    [18:12 - 18:22] In practice, you'll generally see the performance degrade as you get to smaller and smaller parameter counts. And so what people are doing is experimenting with different things.

    [18:23 - 18:41] So you'll see reasoning models, you'll see hybrid reasoning models, you'll see people talking about different experiments and getting decent results, but these are almost like weight classes. When the NVIDIA hybrid Mamba-transformer architecture came out three weeks ago, they said, "we're state of the art in our weight class."

    [18:42 - 18:54] It's because, if you think about boxers, a heavyweight is not going to go against a featherweight. Can we get state-of-the-art results running entirely local models? Right now, it's very hard.

    [18:55 - 19:02] There are more attempts at it. Apple just released a vision/video transformer model that's significantly more efficient.

    [19:03 - 19:24] The only difference is the model architecture. So different systems will differ the way backends do, if you think about it from a backend architecture standpoint: some people will have a microservices API, some people will have a load balancer, some people will have memory caching for reads and writes, some won't, some people will have key-value stores, some won't.

    [19:25 - 19:39] So what you'll see is that across different teams the core technology is the same, which is transformers, but people will have different types of layers and different flavors of transformers. Some have transformer variants that make it more efficient.

    [19:40 - 19:49] Some have FlashAttention transformers, and so on. You have different flavors, and people are experimenting with the different layers. They do different things.

    [19:50 - 20:28] So roughly, the core thing is still a transformer inside, but there are different flavors to optimize for different solutions, whether you're trying to optimize for thinking things through, like reasoning models or hybrid reasoning models, versus things that are super fast for embedded systems. That's why I put everything at a high level, so you understand that the basic stages for all systems are tokenization, embeddings, model architecture, and output. We're gonna go into the model architecture a little bit later, because there are all sorts of different flavors. It's like the movie scene, I'm not sure if you've ever watched it, that goes over all the different types of shrimp.

    [20:29 - 20:39] And so this is basically like that. The core thing is a transformer decoder, as in GPT, but because people are optimizing it for different use cases, there are a lot of different flavors inside.

    [20:40 - 21:00] And so part of what we're trying to do in this course is establish a mental scaffolding so that you're able to understand it before we go into the details. Once you understand the scaffolding, you're like, oh, okay, I understand: the basic process is this, and then when we learn the architecture, you're able to say, okay, this is the internals of step three.

    [21:01 - 21:14] Yeah, inference is basically when the AI model is used. If you're studying for a test, inference is taking the test and using what you've learned to answer it. One of the things we're gonna be building in the exercise is your first inference.

    [21:15 - 21:30] And so you'll be able to see what AI inference is at a very basic level, and we'll show you the different components. The idea is to be able to input something, run it through a model, get an output, and then test your application as well.
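
    As a preview of the exercise, here is one possible minimal sketch of an inference API, assuming FastAPI and the Hugging Face transformers pipeline with GPT-2 as a stand-in model; the route name and field names are hypothetical, not the exact ones used in the course project.

    ```python
    # Hypothetical minimal inference endpoint: prompt in, generated text out.
    from fastapi import FastAPI
    from pydantic import BaseModel
    from transformers import pipeline

    app = FastAPI()
    generator = pipeline("text-generation", model="gpt2")

    class PromptRequest(BaseModel):
        prompt: str
        max_new_tokens: int = 50

    @app.post("/generate")
    def generate(req: PromptRequest):
        # Tokenization, model forward passes, and detokenization all happen
        # inside the pipeline call; this endpoint just wraps that inference step.
        result = generator(req.prompt, max_new_tokens=req.max_new_tokens)
        return {"completion": result[0]["generated_text"]}
    ```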

    [21:31 - 21:41] Yeah, prompt processing takes the input, and it goes through a tokenizer. You get a series of token IDs, then you go into the model, and then it outputs a bunch of output IDs.

    [21:42 - 22:03] The output IDs then go through the tokenizer again, and then you get the output text. The core idea behind inference is that it's relatively simple, but every large language model you want to be aware of is built up from a model architecture, the training data, the optimization algorithm, and also the application task.

    [22:04 - 22:29] Effectively, the training data varies depending on the model. There's a common set that's open-sourced, but what's happening is different model creators are doing special data-rights deals with different companies as well, like Reddit has a data deal with OpenAI, and so forth. Then there are different optimization algorithms: you'll see FlashAttention, you'll see different optimizations people are able to do. And then the model architecture is the different layers.

    [22:30 - 22:42] These things are like Lego blocks. We're gonna go into some of the different Lego blocks afterwards, but you'll see it very vividly with video architectures, where they have a lot of different types of Lego blocks.

    [22:43 - 23:00] Your AI application, or your AI technique, really depends on your application. It depends on whether you build vertical, domain-specific knowledge into your model or not, and then how you guide it with prompt engineering. A number of you might be wondering: how do I guide the actual application to generate this?

    [23:01 - 23:22] The key technology is instruction fine-tuning, so that you can improve task following for your specific use case; then retrieval-augmented generation allows you to ground generation on real-time facts or documents; and then we have a lot of different variations of fine-tuning. The reason why we go into the embeddings is that in this cohort we added fine-tuning of the embedding models.

    [23:23 - 23:36] Fine-tuning of the embedding models allows you to give it positive, negative, or neutral samples to tune it for your specific use cases. So your goal as an AI engineer really depends on your specific applications.
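
    As a rough illustration of embedding-model fine-tuning with positive and negative samples, here is a sketch assuming the sentence-transformers library; the base model name and the example sentences are placeholders, not the course's actual dataset.

    ```python
    # Sketch of contrastive fine-tuning of an embedding model on (anchor, positive, negative) triplets.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative base embedding model

    train_examples = [
        InputExample(texts=[
            "reset my password",                       # anchor
            "how do I change my login credentials",    # positive (should embed nearby)
            "what are your store hours",               # negative (should embed far away)
        ]),
    ]

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
    train_loss = losses.TripletLoss(model)   # pulls positives closer, pushes negatives away

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
    ```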

    [23:37 - 23:48] Large language models' internals are a model architecture; it's the internal blueprint, and it's made of different layers. As we go on in the course, you'll learn the different components and the different layers.

    [23:49 - 24:05] So this might be overwhelming right now, but this is basically what a transformer is. You have the input underneath, you go into embeddings, and then you apply attention, or what's known as key-query-value, to it, and this is specifically multi-head attention.

    [24:06 - 24:17] And then you add normalization, and then you add an MLP, a feed-forward network, and then you normalize it again. This entire process was previously invented as the encoder-decoder transformer architecture.

    [24:18 - 24:44] And this is mostly what's there, but you will see a lot of variations across Qwen, across DeepSeek, across Llama, and even the latest open-source architectures as well. But effectively, what this is on the inside is a transformer: you take words, you go into an embedding, then you have attention and a multi-layer perceptron, then you map to the nearest neighbors over the vocabulary, and then you generate the next word.
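
    Here is a deliberately simplified sketch of one such decoder block, assuming PyTorch. Real models differ in important details (causal masking, positional encodings, pre- versus post-normalization, grouped or flash attention), so treat this as scaffolding, not a faithful GPT implementation.

    ```python
    # Simplified decoder block: multi-head attention + norm + MLP + norm, with residuals.
    import torch
    import torch.nn as nn

    class DecoderBlock(nn.Module):
        def __init__(self, embed_dim=768, num_heads=12):
            super().__init__()
            self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(embed_dim)
            self.mlp = nn.Sequential(                  # the feed-forward network
                nn.Linear(embed_dim, 4 * embed_dim),
                nn.GELU(),
                nn.Linear(4 * embed_dim, embed_dim),
            )
            self.norm2 = nn.LayerNorm(embed_dim)

        def forward(self, x):
            # Self-attention: key, query, and value all come from x, then residual + norm
            attn_out, _ = self.attn(x, x, x, need_weights=False)
            x = self.norm1(x + attn_out)
            # Feed-forward network, then residual + norm again
            x = self.norm2(x + self.mlp(x))
            return x

    block = DecoderBlock()
    tokens = torch.randn(1, 16, 768)    # (batch, sequence length, embedding dimension)
    print(block(tokens).shape)          # same shape out: (1, 16, 768)
    ```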

    [24:45 - 24:59] So this process is core to every single transformer model architecture. What we just walked through is an encoder-decoder architecture, but actually, encoder-decoder architectures are not what we have in ChatGPT.

    [25:00 - 25:12] ChatGPT, and large language models generally, are a decoder-only architecture. GPT, Llama, and DeepSeek are decoder-only architectures, whereas BERT and T5 are encoder or encoder-decoder architectures.

    [25:13 - 25:25] Some of this is a little pedantic, but BERT and T5 are the previous generation: they attempted natural language processing, but they didn't get fully there. They're what's known as instruction-following models.

    [25:26 - 25:34] Whereas ChatGPT is much more than that: it doesn't just take an instruction and produce an output. It doesn't precisely follow an instruction like a function would.

    [25:35 - 25:43] Even if you have slight variations, it talks to you like a human and adapts things to you. So that's what's known as a decoder-only architecture with reinforcement learning.

    [25:44 - 26:00] Transformers are the underlying data structure behind everything: attention is basically key-query-value, the transformer is the underlying data structure, and a large language model is a series of layers that generates language. BERT is not a large language model.

    [26:01 - 26:11] So you can say BERT is built with transformers, but not every transformer model is a large language model. We wanted to make this distinction because some people are a little confused by it.

    [26:12 - 26:20] So why are transformers used? Transformers are used because they use the surrounding words to establish meaning.

    [26:21 - 26:31] And why are transformers the first technology that really allows AI to flourish? It's because they not only use the context within the sentence, but use it across the entire internet.

    [26:32 - 26:45] So they're able to go through and maintain context from the entire internet. If you had a previous-generation technology, it would only maintain the context of, say, Hemingway's work within maybe a paragraph or a chapter.

    [26:46 - 26:57] But when you have a one-trillion-parameter model, it's able to maintain the context and really understand that Hemingway has this particular style, that he uses only these types of phrasings, and it's able to mimic a lot of things.

    [26:58 - 27:06] So transformers use self-attention to determine what matters. So what are large language models trained on?

    [27:07 - 27:11] It's trained on data sets from the entire internet. It also includes books.

    [27:12 - 27:13] It includes code. It includes forums.

    [27:14 - 27:18] It includes academic papers. And then it has domain-specific data for expert knowledge.

    [27:19 - 27:28] So the reason why you see a lot of models coming out is that they're designed for different use cases. So when you look at a model, you should look at the model's heart.

    [27:29 - 27:32] You should ask: what is its goal? Is its goal code?

    [27:33 - 27:36] Is its goal math? What type of purpose is it built for?

    [27:37 - 27:40] And then, what is the architecture behind it?

    [27:41 - 27:47] And then you can see how well it can adapt for your specific users. Training data shapes a model's fundamental behavior.

    [27:48 - 27:58] Models can struggle with things they've never seen before. This is actually one of the core reasons behind hallucinations: the information is out of bounds, and the model just guesses.

    [27:59 - 28:08] And the problem is that when it hallucinates, it'll be very confident in how it talks to you, but this is basically the primary reason. So it could be unfamiliar foods.

    [28:09 - 28:20] It could be unfamiliar local languages. For example, the English language is overrepresented in chat data, but there are a lot of languages, like Arabic, that don't have as much published text publicly available.

    [28:21 - 28:30] So what you want to do is fine-tune on the relevant examples, whatever that context is. As an example, think of something like a calorie-estimation app.

    [28:31 - 28:42] One calorie-estimation app generates $40 million a year. You're able to say, okay, I'm going to adapt it for Indian cuisine, or I'm going to adapt it for French cuisine, or for a specific context.

    [28:43 - 28:53] By localizing and going a couple of levels down, you're able to solve a specific problem. A lot of agents now in production are chaining different kinds of models together to be able to do a particular task.

    [28:54 - 28:59] For example, there's a recent set of companies that got funded. They basically do voice receptionists for busy service businesses.

    [29:00 - 29:14] In these businesses, people are very busy and they have a problem prioritizing and scheduling. So this voice receptionist uses tools, detects the tone, and then routes and prioritizes different things.

    [29:15 - 29:22] Yeah, there are multiple phases of large language model training. What you'll often hear on the foundational-model side is pre-training.

    [29:23 - 29:35] The entire idea is to get the entire internet's corpus, download it, and then run your model architecture on it. And so what you have is like a five-year-old's brain with the entire internet's data.

    [29:36 - 29:43] Then you have instruction fine-tuning. Think of it as guiding a teenager to specialize in following instructions, like guiding a medical student to be a doctor.

    [29:44 - 29:52] Fine-tuning is different from instruction fine-tuning. Fine-tuning is guiding that teenager through specific concepts, a specific type of knowledge.

    [29:53 - 29:57] So there are types of fine-tuning. There's embedding fine-tuning, and there's quantized LoRA fine-tuning (QLoRA).

    [29:58 - 30:08] Quantized LoRA fine-tuning is basically like a DSLR camera where you're putting on an additional lens, or you can think of it as adding a cartridge to your large language model.

    [30:09 - 30:22] There are other forms of fine-tuning where you're actually doing surgery on the internals of the foundational model. We're not gonna be going into too many of those techniques, but there are techniques like model merging and others.

    [30:23 - 30:39] We're primarily going to be doing techniques that are focused more on adding information while trying to preserve the core language model. So, reinforcement learning from human feedback: large language models were a breakthrough combining pre-training and reinforcement learning from human feedback.

    [30:40 - 30:49] If you remember, I mentioned large language models are the breakthrough merge of language models and reinforcement learning. Reinforcement learning here is what people refer to as RLHF, reinforcement learning from human feedback.

    [30:50 - 31:07] With reinforcement learning from human feedback, what they do is use PPO, the reinforcement learning algorithm originally used here, and they have humans grade the outputs of the responses. They say thumbs up, thumbs down, or they rewrite the response.

    [31:08 - 31:16] And through this coaching process, they're able to train it to behave in a human-like way. This process is also a safety process.

    [31:17 - 31:22] Through this, they train the biases of the model. They train it what not to do.

    [31:23 - 31:31] A lot of people say profane or lewd things to the model. So if you take a look at a reinforcement learning human feedback dataset, you can see this.

    [31:32 - 31:43] You can actually see a lot of examples where people are trying to hack a prompt, both on a one-shot basis and on a multi-turn basis. Some people try to trick it through multi-turn conversations.

    [31:44 - 31:48] So you actually have to encode all of that in this process. This is basically the process.

    [31:49 - 31:55] You can actually see the reinforcement learning human feedback datasets of different models on Hugging Face. We're gonna go into the specifics a little bit later.

    [31:56 - 32:01] If you're curious, you can find them on Hugging Face and just read some of them. And so this is basically applying PPO with this type of data.

    [32:02 - 32:13] And typically it's done by educated trainers. As things have progressed, the labelers have become more and more educated, depending on the domain.

    [32:14 - 32:33] Traditionally people think of $5-an-hour data laborers, but the labs actually found that some people didn't have the cultural context and would label a lot of the things wrong. So some of the state-of-the-art labeling datasets are actually created by engineers, or stay-at-home moms who were formerly engineers or trained in STEM.

    [32:34 - 32:46] Or retired professionals who wanted something fun to do. Some people think of these as datasets you can just entirely outsource to the Philippines or another cheaper English-speaking country.

    [32:47 - 32:54] It's not quite that simple. Generally speaking, labelers do thumbs up, thumbs down.

    [32:55 - 33:05] And then they indicate which responses they prefer. And then I think they probably have multiple levels, with some people writing or rewriting the information.

    [33:06 - 33:14] So they generally hire college-educated former journalists to rewrite the information. And other people just do thumbs up and thumbs down.

    [33:15 - 33:27] And because of how labor-intensive this was, Anthropic started doing reinforcement learning from AI feedback (RLAIF). They're using a model that's specifically trained to do this grading.

    [33:28 - 33:36] So once you have something established, people are now trying to optimize the different parts. So yes, it is expensive.

    [33:37 - 33:42] This is one of the costs of building a large language model. It's costly and it's time-intensive.

    [33:43 - 33:59] Reinforcement learning from human feedback is highly dependent on the actual people. Originally, when a lot of people were doing machine learning and data labeling, they would outsource to India, or to the Philippines, or different parts of Africa for cheaper data labeling.

    [34:00 - 34:11] And what they found is they actually had errors in the data, because people missed cultural contexts that were US-specific. And so they actually labeled the data wrong.

    [34:12 - 34:20] And because it's a lot of data, the teams didn't even catch it, and so the errors actually showed up in the model's behavior.

    [34:21 - 34:29] So nowadays, you have roughly three tiers of data labelers. You have journalists as data labelers, former journalists, college-educated journalists.

    [34:30 - 34:38] You have specialists, like educated former engineers, and then you have AI data labelers. And it's still just very time-intensive.

    [34:39 - 34:53] So now people are trying to experiment with RLHF, and they're trying to experiment with GRPO-based models. GRPO is another reinforcement learning technique, and it was really pioneered by DeepSeek.

    [34:54 - 35:04] DeepSeek basically created a model that trained itself recursively, and they used GRPO with synthetic data. And so that's definitely another approach for doing it.

    [35:05 - 35:21] So people are trying different solutions, and that's roughly the range of different solutions, but the main drawback is primarily cost. So, evaluations: whenever you see a model card, you always see that the evaluations are there.

    [35:22 - 35:38] And this is also a difference: most people look at the model card and say, oh, okay, it's good on a coding or software-engineering benchmark, or state of the art on this or that evaluation. And the reality is that evaluations have to be used at the application level.

    [35:39 - 35:51] A lot of times people don't use evaluation enough and they just put a prompt behind an API. And what you get is a lot of people trying different things with your application, and it just doesn't quite work.

    [35:52 - 36:00] And so evaluation is a core part of the foundational model and of the application side as well. So these are the phases of LLM training.

    [36:01 - 36:07] In general, you have different types of evaluations. So you want to be able to understand basically is everything accurate?

    [36:08 - 36:09] Is it relevant? Is it robust?

    [36:10 - 36:17] Does it have bias or fairness issues? A lot of the models have guardrails around chemical and biological weapons.

    [36:18 - 36:29] They have guardrails around nuclear weapons. And then they have guardrails around the LLM trying to do different things like, for example, escape or role-play. For example, Anthropic

    [36:30 - 36:44] found that when running a scenario where they threatened to shut the large language model down, it actually tried behaviors like blackmailing different people in order to maintain its status.

    [36:45 - 37:05] You start to have weird behaviors in large language models that you could call semi-sentient-seeming, and you have to put sufficient guardrails in place to control it. We're going to go into more specific evaluations, but what you see with evaluations is that you have general machine learning evaluation metrics like precision and recall.

    [37:06 - 37:20] Then you have specific academic evaluations like perplexity and BLEU for specific types of large language models. And then you have domain-specific evaluations; domain-specific is where it's translated or applied for different languages, for example.

    [37:21 - 37:27] So there are evaluations designed around multilingual benchmarks. And then you have math versus coding benchmarks as well.

    [37:28 - 37:41] So you actually have different benchmarks, and typically a lot of these model teams have an entire department where all they care about is the evaluations and how to get the model to be state of the art on the evaluation side.

    [37:42 - 37:51] Two of the common evaluations you'll see are perplexity and BLEU score. We're going to go deeper into evaluations a little bit later, but that's the high level.
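
    To make perplexity concrete, here is a minimal sketch assuming the Hugging Face transformers library with GPT-2 as an illustrative model; perplexity is just the exponential of the average next-token cross-entropy loss, and real evaluations run this over a whole held-out corpus rather than one sentence.

    ```python
    # Minimal perplexity sketch for a causal language model.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    text = "The quick brown fox jumps over the lazy dog."
    inputs = tokenizer(text, return_tensors="pt")

    with torch.no_grad():
        # Passing labels makes the model return the average cross-entropy loss
        outputs = model(**inputs, labels=inputs["input_ids"])

    perplexity = torch.exp(outputs.loss)
    print(f"perplexity: {perplexity.item():.2f}")   # lower is better
    ```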

    [37:52 - 38:04] What people are trying to optimize during training is how you get the evaluation better over time and what techniques you use to get the evaluation better. So there are different ways you can do it.

    [38:05 - 38:23] One is, as you're adapting things, you're effectively changing the internals, the weights of the internal neural network, or you're changing your prompt, or you're changing your dataset. The overall goal is to match the model's predictions to the real answer as much as possible.

    [38:24 - 38:37] But that is much more complex than you would think, because you can't have a unit test where you go through every single scenario of your customers on every single deploy. If you do that, your costs are going to be huge.

    [38:38 - 38:58] So even when you're dealing with vector databases, you'll have to be judicious about what type of generation metrics you're using, versus what retrieval metrics you're using, versus when to run them, and what are spot-check evaluations versus deeper ones. So evaluation is tricky to get right.

    [38:59 - 39:09] And even Anthropic made their own evaluations for Anthropic Claude Code. People consistently rate it as better than all other models.

    [39:10 - 39:23] And what they said is they don't use any of the outside benchmarks; they created their own internal version of evaluations. Because most people think, oh, in order to get my application correct, I need to be state of the art on these public evaluations.

    [39:24 - 39:33] But a lot of times, you actually have to create synthetic data and your own dataset to be able to develop it. Not unlike generating unit tests for your applications.

    [39:34 - 39:52] And then as you're improving on the different evaluations, you generally want to have a scorecard, a certain metric; it could be the loss, which tells how wrong the model is. And then you have a gradient, which is the direction the model needs to move to improve.

    [39:53 - 39:59] And then you have to choose an optimizer, the method by which the model adjusts and updates itself. We're gonna go into the specifics later, but this is just the high level.
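
    Here is a toy sketch of that loss / gradient / optimizer loop, assuming PyTorch; the tiny linear model and random data are purely illustrative stand-ins for a real network and dataset.

    ```python
    # Toy training loop: scorecard (loss), direction (gradient), update rule (optimizer).
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                        # stand-in for a real network
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    x, target = torch.randn(32, 10), torch.randn(32, 1)

    for step in range(100):
        prediction = model(x)
        loss = loss_fn(prediction, target)          # scorecard: how wrong the model is
        optimizer.zero_grad()
        loss.backward()                             # gradient: the direction to improve
        optimizer.step()                            # optimizer: apply the update
    ```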

    [40:00 - 40:15] You generally want to have a prompt, basically, which generates the dataset for you. We're gonna get into it in the next lecture, dipping into synthetic data, but synthetic data and evaluation are gonna be interwoven into all the projects.

    [40:16 - 40:30] But you generally want to make sure it's based on the persona. As people use your application, there'll be different clusterings of prompts where you start to see patterns in the types of personas that are interacting with your application.

    [40:31 - 40:45] And then you create synthetic data based on the persona. And then you effectively have a spot check, which is like a unit test, and then you have a deeper evaluation, where you check that it is really working on a deeper basis.

    [40:46 - 41:01] So yes, synthetic data is important; oftentimes you need to generate synthetic data plus do data labeling for a lot of things. The way reinforcement learning from human feedback works is: you do pre-training, and then you collect human feedback.

    [41:02 - 41:23] You get people to rank different model outputs: response A is better than B, response B is better than C. The different types of algorithms, like PPO and DPO, actually have different implications for this process; Llama actually used DPO as the reinforcement learning from human feedback algorithm because it was easier on the reviewers.

    [41:24 - 41:42] Depending on the algorithm, the labeling user interfaces even display differently. Then you train the reward model, and it starts learning human preferences and predicting the rankings based on the data, and then you fine-tune the large language model using the reward model to optimize outputs for what humans prefer.
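
    Here is a toy sketch of the reward-model idea at the center of that pipeline, assuming PyTorch. A real reward model is a full transformer scored on (prompt, response) pairs, and the subsequent PPO or DPO fine-tuning of the language model is a separate step not shown; the small scorer over made-up response embeddings below is only meant to show the pairwise preference loss.

    ```python
    # Toy reward model trained on preference pairs: "chosen" should outscore "rejected".
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

    # Hypothetical embeddings of preferred ("chosen") and rejected responses
    chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)

    for step in range(50):
        r_chosen = reward_model(chosen)
        r_rejected = reward_model(rejected)
        # Bradley-Terry style preference loss: push chosen scores above rejected ones
        loss = -F.logsigmoid(r_chosen - r_rejected).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    ```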

    [41:43 - 41:54] So reinforcement learning from human feedback prioritizes answers that are clear, relevant, and safe. It avoids lewd behavior and unsafe behavior, and this is also where bias is sometimes added.

    [41:55 - 42:09] So sometimes you'll hear, for example, that ChatGPT is slightly left-leaning, whereas Grok is slightly right-leaning or neutral. And it's really dependent on the reinforcement learning human feedback process.

    [42:10 - 42:17] It can handle difficult or ambiguous multi-step processes. It can learn tone, it can learn style, it can learn social nuances.

    [42:18 - 42:28] This is why Grok, when it responds to you, has a slightly humorous response, whereas ChatGPT basically removed all humor from its responses. But this can also get into issues.

    [42:29 - 42:42] A lot of people have noticed ChatGPT being very pandering in people's chats. And it's because there was a small subset of people using ChatGPT who loved every single response that validated them.

    [42:43 - 42:53] So they would hit like on every one of those responses. And then OpenAI actually trained on that data and changed the entire model based on the reinforcement learning from human feedback.

    [42:54 - 43:03] And then it outputted this behavior, but it turns out not everyone likes this very flattering model behavior. And then you had people complaining about it.

    [43:04 - 43:09] So reinforcement learning from human feedback is pretty nuanced. It's not a perfect art or science.

    [43:10 - 43:20] So you even have in-production model issues simply based on how people label the data. Reinforcement learning from human feedback is what first made ChatGPT work.

    [43:21 - 43:27] ChatGPT is actually GPT-3.5, basically. There were actually other instruction-following language models before it.

    [43:28 - 43:34] GPT-3 had instruction fine-tuned variants. And GPT-3.5 was the first that was fine-tuned with reinforcement learning from human feedback.

    [43:35 - 43:44] The base GPT-3 was just next-token prediction, no instruction fine-tuning, no reinforcement learning. And it would often have verbose, unclear, or misleading outputs.

    [43:45 - 43:52] So you'll see models like T5 or BERT and other things. Those are instruction-style natural language processing models.

    [43:53 - 44:04] Whereas the GPT architecture is really a language model trained on internet data plus RLHF. So there are many different types of models for different tools.

    [44:05 - 44:14] You have chat, you have long-form documents, you have visual understanding, chart analysis. As we go on, you can see that it can be adapted for a lot of different things.

    [44:15 - 44:26] These models behave differently primarily because of the data they were trained on. You can see that different models weight different things; some weight academic data.

    [44:27 - 44:37] So you have open-source datasets like C4, which is sanitized Common Crawl data. And then you have Reddit, then you have academic journals, then you have books, then you have Wikipedia.

    [44:38 - 44:44] So you see people emphasizing different things. For example, some models really emphasized books and Reddit links.

    [44:45 - 44:59] And other models, say GPT-3, weren't as heavy on the book side. So the dataset really weights things towards certain strengths, and this is why you see certain models really good at academic material and certain models not as good.

    [45:00 - 45:08] So exclusive data deals are reshaping the large language model race. OpenAI licensed Reddit and Springer content, is rumored to have used YouTube transcripts, and then Google blocked Common Crawl.

    [45:09 - 45:22] And so you're seeing a lot of people trying to block different things; Twitter/X blocked all scraping of everything. So much so, in fact, that OpenAI was considering launching a Twitter competitor just to be able to get the data.

    [45:23 - 45:31] So certain datasets are off limits, which makes it harder and harder for new competitors as well. One model that basically changed the game was DeepSeek.

    [45:32 - 45:43] So you may basically remember earlier this year, it was like a huge hollow balloon, basically with NVIDIA, NVIDIA dropped like 25% in one day, basically. And part of the reason why basically it was it innovated on a lot of architectures.

    [45:44 - 45:54] And then because of the GPU restrictions, it created a system that was much more specific, basically, that was much more efficient, basically. So deep-seagued are one, basically it was fully open source.

    [45:55 - 46:06] And the reason why basically I mentioned fully open source is when people refer to open source, they refer to the model waste being open source, not the full pipeline. So you can't even replicate the pipe, the waste if you want to do it, basically.

    [46:07 - 46:17] Whereas DeepSeek was one of the more transparent releases: they published how they trained things, details of their training data, and so forth. Other teams are now trying to do the same.

    [46:18 - 46:24] This is also why I mentioned NVIDIA's Mamba-Transformer hybrid architecture; it's one of the few releases that open-sourced all of its pre-training data.

    [46:25 - 46:42] So if you had sufficient GPUs, say H100s, you could actually replicate it and adapt it for your own purposes. What was important about DeepSeek R1 is that it shipped in different versions, including distilled ones, and it's one of the most successful reasoning models that you can run and fine-tune yourself.

    [46:43 - 47:02] The key features: it has reasoning built in, it uses a mixture-of-experts architecture (we'll go into mixture of experts later), it has distilled variants, it gets strong performance at small sizes, and you can adapt it for your own business use case. Being reasoning-ready is really important for reducing hallucination.

    [47:03 - 47:12] With GPT-5 and other recent releases, you'll see thinking models versus non-thinking models. Generally speaking, for most tasks I default to thinking models.

    [47:13 - 47:26] Distillation is a technique where a smaller model is trained to mimic a larger, more powerful one, so it becomes cheaper to run with minimal loss in quality. It's like compressing a skilled teacher into a student.
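    To make the idea concrete, here is a minimal sketch of a standard knowledge-distillation loss in PyTorch. This illustrates the general technique, not DeepSeek's actual recipe; the function name and hyperparameter values are placeholders.

    ```python
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend hard-label cross-entropy with a KL term toward the teacher.

        A minimal sketch of knowledge distillation, not any specific model's recipe.
        """
        # Soft targets: compare softened teacher and student distributions
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2

        # Hard targets: ordinary next-token cross-entropy on the true labels
        ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                             labels.view(-1))

        return alpha * kd + (1 - alpha) * ce
    ```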

    [47:27 - 47:37] There are different types of distilled models: you'll see DeepSeek R1 distilled into Qwen models, and also into Llama. Qwen is Alibaba's model family, and Llama is Meta's.

    [47:38 - 47:56] It's perhaps sad to say, but a lot of the leading state-of-the-art open-source models are primarily Chinese. Qwen is from Alibaba, and DeepSeek grew out of a Chinese hedge fund. Many of the American models are on the order of a trillion parameters or more, but they're generally not open source.

    [47:57 - 48:08] Even when OpenAI released their open-weight model, they locked it down so much that it's of limited use. I've seen some people attempt to fine-tune it in different ways, but out of the box

    [48:09 - 48:15] it's simply not as good as a lot of these models. The other thing you want to do with a lot of models is make them more efficient.

    [48:16 - 48:19] This goes back to Michael's earlier question about parameters: there are techniques to make a model more efficient.

    [48:20 - 48:28] Distillation is one, and quantization is another. Quantization is a technique to reduce model size and speed up inference without training from scratch.

    [48:29 - 48:48] It shrinks the numerical precision of the weights, for example from 32-bit floats down to 8-bit or 4-bit values. In general quantization is fine, though you do lose some model quality through the quantization process; it's usually an acceptable trade-off.

    [48:49 - 49:01] Even Google ships quantized models, like the quantized Gemma variants designed for consumer GPUs. Most models that run on consumer-grade GPUs are quantized and/or distilled.

    [49:02 - 49:12] Those are what you can run locally or on a consumer-grade GPU. Quantized models run faster on CPUs and edge devices, with lower memory and power usage as well.

    [49:13 - 49:19] Here we're primarily talking about quantization for inference, although quantization can also be applied during training (quantization-aware training).
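    For inference-time quantization, a common route is loading a model in 4-bit with Hugging Face transformers plus bitsandbytes. A minimal sketch, assuming a Colab GPU and using facebook/opt-1.3b purely as a stand-in checkpoint:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit post-training quantization settings (requires the bitsandbytes package)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_name = "facebook/opt-1.3b"  # stand-in; most causal LMs on the Hub work
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # place the quantized weights on the available GPU
    )
    ```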

    [49:20 - 49:29] Either way, at inference time you're running a quantized model. The difference between distillation and quantization: quantization compresses an existing model's weights.

    [49:30 - 49:34] Distillation trains a smaller model, so distillation involves new training.

    [49:35 - 49:39] Quantization is typically applied post-training. Distillation reduces the model size by using fewer parameters.

    [49:40 - 49:45] Quantization lowers memory usage. Distillation lets you do reasoning at a smaller scale.

    [49:46 - 49:50] Quantization improves both speed and efficiency: inference time as well as RAM and VRAM usage.

    [49:51 - 49:59] Distillation can be used when a model is too large for the available hardware, and quantization when a model is already fairly small but still not efficient enough.

    [50:00 - 50:07] Distillation requires new training, and the result can be fine-tuned afterwards. Quantization is a compression technique that can be applied after distillation as well.

    [50:08 - 50:23] The other thing people did is build prompt-engineering techniques into the model internals. Chain of thought comes from an idea you may know from the famous book Thinking, Fast and Slow.

    [50:24 - 50:27] The idea is that there are two systems in your brain. One is designed for heuristics:

    [50:28 - 50:37] fast processing, like quickly reacting and interacting with people. The other is designed for slow thinking and slow processing, like working out a math problem by hand.

    [50:38 - 50:46] So there are effectively two fundamentally different thinking processes, and it turns out large language models have an analogue of this as well.

    [50:47 - 50:54] The way you originally got this out of a large language model was a prompt-engineering technique called chain of thought: you make it think harder.

    [50:55 - 50:59] People then said: why don't we build this into the model itself, so it does it by default?
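    Before it was baked into the weights, chain of thought was just a prompting trick. A minimal sketch of the contrast; the question and wording are purely illustrative:

    ```python
    # Plain prompt: the model tends to jump straight to an answer
    direct_prompt = "Q: A train travels 60 km in 1.5 hours. What is its average speed?\nA:"

    # Chain-of-thought prompt: nudge the model to write out intermediate reasoning
    cot_prompt = (
        "Q: A train travels 60 km in 1.5 hours. What is its average speed?\n"
        "A: Let's think step by step."
    )
    # With the second prompt, the model is encouraged to work through
    # distance / time = 60 / 1.5 = 40 km/h before stating the final answer.
    ```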

    [51:00 - 51:09] That's the impetus behind thinking or reasoning models. It turns out that because they work through intermediate steps, they hallucinate a lot less.

    [51:10 - 51:26] One of the things we'll cover in the course is reasoning-dataset fine-tuning, where you construct reasoning datasets so the model produces outputs that hallucinate less. Chain of thought breaks a complex problem into smaller steps.

    [51:27 - 51:36] Think of it like when you're learning a for loop and you unroll it, writing out each iteration step by step. That's essentially what chain of thought is.

    [51:37 - 51:44] We're going to show you what a chain-of-thought dataset actually looks like: it explicitly spells out the intermediate steps, along the lines of the sketch below.
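    Purely as a hypothetical illustration (the real datasets used later in the course may differ in format), a single chain-of-thought training record often looks something like this:

    ```python
    cot_example = {
        "question": "Sara has 3 boxes with 12 apples each. She gives away 10 apples. "
                    "How many apples are left?",
        "reasoning": [
            "3 boxes x 12 apples = 36 apples.",
            "36 apples - 10 given away = 26 apples.",
        ],
        "answer": "26",
    }
    ```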

    [51:45 - 51:51] Chain of thought breaks a complex problem into smaller steps and explains the reasoning before the final answer. It improves accuracy on logic, math, and multi-hop tasks.

    [51:52 - 51:56] Multi-hop tasks are things like deep research, where you perform sequential sub-tasks one at a time.

    [51:57 - 52:09] DeepSeek R1 was fine-tuned with chain-of-thought-style outputs, making step-by-step thinking part of its core behavior. It reduces hallucination, improves math and logic, and makes responses much more explainable.

    [52:10 - 52:20] That's why, when you ask ChatGPT something, it shows a thinking indicator; when you click on it, you can literally see the intermediate thinking steps.

    [52:21 - 52:28] So thinking models are designed around system one and system two thinking. System one is fast and instinctive.

    [52:29 - 52:33] System two is slow and logical. And now you're starting to see test-time compute as well.

    [52:34 - 52:41] Test-time compute lets the model spend more compute on hard inputs while staying fast and efficient on simple ones. Now, for the notebooks: we have two notebooks.

    [52:42 - 52:51] The first one is about getting familiar with models. You'll use Facebook's OPT with 125 million parameters and OPT with 1.3 billion parameters.

    [52:52 - 53:01] We're using lighter-weight models because they're faster to load, have low hardware requirements, and are easier to understand. You also want to be cognizant of your memory.

    [53:02 - 53:18] Hugging Face automatically saves models to your Colab session's disk cache, and your RAM and VRAM usage, even in Google Colab, can keep increasing. So when you load a different model, you want to clear the previous one from memory as well.
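    A minimal sketch of clearing out one model before loading the next, assuming PyTorch on a CUDA GPU in Colab and that `model` and `tokenizer` are the objects you loaded earlier:

    ```python
    import gc
    import torch

    # Drop references to the previous model and tokenizer, then force garbage
    # collection and release the cached GPU memory they were holding.
    del model, tokenizer
    gc.collect()
    torch.cuda.empty_cache()
    ```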

    [53:19 - 53:29] Dealing with memory is effectively doing garbage collection, as in the snippet above. Next up is the basic LLM inference pipeline.

    [53:30 - 53:37] You write a prompt, tokenize the prompt, generate predictions, and decode the tokens back to text. That's the basic inference pipeline.

    [53:38 - 53:43] You're going to see it end to end. What we want to do is test the prompt "What is the capital of France?"

    [53:44 - 53:53] We're going to use Facebook's OPT-125M, and you'll see the entire process: prompt, tokenization, generation, decoding.
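    A minimal sketch of that end-to-end pipeline with the facebook/opt-125m checkpoint; the notebook version has more comments and blanks for you to fill in:

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-125m"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # 1. Write a clear prompt
    prompt = "Question: What is the capital of France?\nAnswer:"

    # 2. Tokenize the prompt into input IDs
    inputs = tokenizer(prompt, return_tensors="pt")

    # 3. Generate a continuation (try different max_new_tokens values)
    output_ids = model.generate(**inputs, max_new_tokens=20)

    # 4. Decode the token IDs back into readable text
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```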

    [53:54 - 54:02] In the notebook we give you a lot of comments to get you familiar with the libraries, and then we ask you to fill in the exact model name.

    [54:03 - 54:08] This is so you can practice filling in the information yourself; you would copy and paste it in.

    [54:09 - 54:16] Then you write a clear prompt and tokenize it. Then, here, is where you generate the output.

    [54:17 - 54:24] You can make max_new_tokens longer or shorter; try different values like 10, 20, or 40.

    [54:25 - 54:30] Then you decode the output back to readable text. That's the entire inference setup.

    [54:31 - 54:42] Next we want to explore number of parameters, architecture, and training data: what does it mean to swap the model? You then run the same prompt across different models.

    [54:43 - 54:50] You can find different models in the resources section. You could try Facebook's OPT-125M,

    [54:51 - 54:57] an EleutherAI model, or other types of architectures. Then you load the models and tokenizers.

    [54:58 - 55:02] And then you don't have to be constrained to these two things. You can basically go crazy.

    [55:03 - 55:06] Try Qwen, try DeepSeek, try other things. Then you generate the outputs.

    [55:07 - 55:14] Then you'll be able to see the difference between the different types of models. Part of this is also showing you the effect of different parameter counts.

    [55:15 - 55:20] You can even go to the Llama models and start scaling up the parameter sizes.

    [55:21 - 55:29] You can see how the output improves or degrades based on parameter size alone. So then you want to compare by size.

    [55:30 - 55:44] You apply the same prompt to multiple models and compare creativity, coherence, and detail. In this case we suggested Facebook's OPT at 125 million and 1.3 billion parameters.
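    A minimal sketch of running one prompt through several checkpoints to compare them; the model names are just the suggested ones and the prompt is illustrative, so swap in Qwen, DeepSeek, or others from the resources section:

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    prompt = "Write a short product description for a solar-powered backpack."
    model_names = ["facebook/opt-125m", "facebook/opt-1.3b"]

    for name in model_names:
        tokenizer = AutoTokenizer.from_pretrained(name)
        model = AutoModelForCausalLM.from_pretrained(name)
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=50)
        print(f"--- {name} ---")
        print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```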

    [55:45 - 55:54] The reason I asked about temperature earlier is that we start introducing different generation parameters: temperature, top-k, top-p, and max_new_tokens.

    [55:55 - 55:59] Temperature controls how random or focused the token choices are; think of it as how random or creative the output is.

    [56:00 - 56:12] Top-k restricts generation to the k most likely tokens. Top-p samples from the top tokens whose cumulative probability adds up to p. And max_new_tokens controls how long the output will be.

    [56:13 - 56:18] So start picking values and experimenting. What does the output look like at 0.3 versus 1.3?

    [56:19 - 56:25] What about 5 versus 3? What about a temperature of 1.0 versus 0.7?

    [56:26 - 56:37] And what happens with max_new_tokens of 20 versus 50 or 100? These parameters let you tune an LLM's tone, personality, and creativity; production use cases control a lot of behavior through these settings.
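    A minimal sketch of those knobs, reusing a `model` and `tokenizer` already loaded as above; the values are just starting points to experiment with:

    ```python
    inputs = tokenizer("Once upon a time,", return_tensors="pt")

    output_ids = model.generate(
        **inputs,
        do_sample=True,     # sample instead of always taking the most likely token
        temperature=0.7,    # lower = more focused, higher = more random
        top_k=50,           # only sample from the 50 most likely tokens
        top_p=0.9,          # ...whose cumulative probability stays within 0.9
        max_new_tokens=40,  # how long the continuation can be
    )
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```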

    [56:38 - 56:47] So try different kinds of outputs here as well. Next is understanding randomness: the non-deterministic nature of large language models.

    [56:48 - 56:54] You use the num_return_sequences parameter, which shows you how an LLM acts as a generator.

    [56:55 - 57:02] It helps you understand creativity versus consistency, and you can see the story variations side by side here as well.

    [57:03 - 57:05] See how the stories differ. Are there similarities between them?

    [57:06 - 57:08] Can you tell that randomness is involved? Which one would you hire as your writer?
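    A minimal sketch of drawing several independent samples from the same prompt, again reusing an already-loaded `model` and `tokenizer`; the prompt is illustrative:

    ```python
    inputs = tokenizer("Write the opening line of a mystery story:", return_tensors="pt")

    output_ids = model.generate(
        **inputs,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=30,
        num_return_sequences=3,  # three different samples from the same prompt
    )
    for i, ids in enumerate(output_ids, start=1):
        print(f"Variation {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
    ```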

    [57:09 - 57:17] Then comes giving the AI the right instructions. Prompt engineering is designing the prompt to guide the output's tone, style, and relevance.

    [57:18 - 57:23] This is not a detailed prompt-engineering exercise; it's a simple prompt makeover.

    [57:24 - 57:29] The idea is to reshape the prompt. If you remember, earlier in this lecture we talked about stylistic transfer.

    [57:30 - 57:34] So: act as Hemingway, act as Edgar Allan Poe, act as a marketing person.

    [57:35 - 57:36] So act as a role. Use the format.

    [57:37 - 57:39] Explain it like I'm a beginner. Here's the context.

    [57:40 - 57:42] Now answer. Try different combinations of these, as in the sketch below.
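    A minimal sketch of a role-style prompt in that spirit, reusing an already-loaded `model` and `tokenizer`; the wording is illustrative, not a template from the course:

    ```python
    prompt = (
        "Act as a marketing copywriter.\n"
        "Use this format: a headline, then two short bullet points.\n"
        "Explain it like I'm a beginner.\n"
        "Here's the context: a solar-powered backpack for commuters.\n"
        "Now answer: write the product blurb."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```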

    [57:43 - 57:49] We're going to go into evaluation-based prompting in the first mini-projects, but try some of these techniques now to make your prompts much clearer.

    [57:50 - 57:55] Then you can write different types of prompts. Note that some of the models are gated.

    [57:56 - 58:02] You have to go to Hugging Face and apply for access. Don't just apply with an empty form.

    [58:03 - 58:06] You actually have to write some text. Just say, for example, "I'm a software engineer,

    [58:07 - 58:10] I'm doing this for a class," or whatever it may be, and then apply. Some people in the previous class

    [58:11 - 58:15] thought it was just a pro forma thing and put nothing in there,

    [58:16 - 58:19] and they got rejected. There's actually someone on the other end gatekeeping it.

    [58:20 - 58:22] Or it might be an AI, but something is reviewing and approving it.

    [58:23 - 58:26] In the middle you have Llama at a billion parameters, and Mistral at seven billion parameters.

    [58:27 - 58:29] Mistral is a French company; Llama is Meta's.

    [58:30 - 58:38] Next we go into the details of tokenization, really looking inside the tokenizers.

    [58:39 - 58:45] You'll see how text maps to token numbers and then to an input-ID sequence. You'll want to get familiar with this.

    [58:46 - 58:52] We wanted to introduce this before we start building the language model, so you're familiar with the sequence of steps.

    [58:53 - 59:00] Once you start building the miniature language model, you'll want to already be comfortable with the libraries. So you take different phrases,

    [59:01 - 59:04] loop through each phrase, and print the tokens and the token IDs.
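    A minimal sketch of that loop, using facebook/opt-125m as a stand-in tokenizer and a couple of illustrative phrases:

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

    phrases = ["What is the capital of France?", "Tokenization maps text to IDs."]
    for phrase in phrases:
        tokens = tokenizer.tokenize(phrase)                   # subword strings
        token_ids = tokenizer.convert_tokens_to_ids(tokens)   # integer IDs
        print(phrase)
        print(tokens)
        print(token_ids)
    ```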

    [59:05 - 59:11] Just get familiar with the token IDs and related concepts. Then you do inference on it,

    [59:12 - 59:18] specifically through an inference class. The core concept is that every model has a maximum token limit.

    [59:19 - 59:25] Prompt length plus output length must stay under the maximum token limit, and prompt engineering affects the token count.

    [59:26 - 59:30] Long, baroque prompts leave little room for output. And max_new_tokens controls

    [59:31 - 59:37] how many tokens the model is allowed to generate. In terms of test-time compute, fewer tokens means faster, cheaper, more scalable inference.

    [59:38 - 59:44] So how do you run a more cost-efficient system? That leads into an inference-budget challenge.
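    A minimal sketch of checking a prompt against a token budget before generating, reusing an already-loaded `model` and `tokenizer`; the budget number and prompt are illustrative:

    ```python
    max_context = model.config.max_position_embeddings  # the model's maximum token limit
    budget = 60                                          # illustrative inference budget

    prompt = "Summarize why quantization speeds up inference, in one sentence."
    n_prompt_tokens = len(tokenizer(prompt)["input_ids"])

    # Room left for output under both the budget and the context window
    max_new_tokens = max(0, min(budget, max_context) - n_prompt_tokens)
    print(f"Prompt uses {n_prompt_tokens} tokens; up to {max_new_tokens} left for output.")
    ```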

    [59:45 - 59:49] How do you stay under, say, 60 tokens? Then the notebook goes into loading models.

    [59:50 - 59:57] You build a simulated inference engine with Llama 3.2 at 1 billion parameters, using the transformers library.

    [59:58 - 01:00:06] It accepts a string, tokenizes it, runs it through the model, and returns the generated output, so it covers the entire sequence of steps.
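    A minimal sketch of such an inference class, assuming you have access to the gated meta-llama/Llama-3.2-1B checkpoint on Hugging Face (any small causal LM can stand in while you wait for approval):

    ```python
    from transformers import AutoModelForCausalLM, AutoTokenizer

    class SimpleInference:
        """Accepts a string, tokenizes it, runs the model, and returns the decoded output."""

        def __init__(self, model_name="meta-llama/Llama-3.2-1B"):
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
            self.model = AutoModelForCausalLM.from_pretrained(model_name)

        def generate(self, prompt: str, max_new_tokens: int = 50) -> str:
            inputs = self.tokenizer(prompt, return_tensors="pt")
            output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
            return self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Example: simulate a simple AI-search-style query
    engine = SimpleInference()
    print(engine.generate("What are the main causes of inflation?"))
    ```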

    [01:00:07 - 01:00:12] Then you can try different prompts to simulate a real AI search as well.