A Glimpse into our Group Coaching Calls 4

  • [00:00 - 00:22] So yes, how are you doing with your exercises and mini projects? Yeah, I'm a bit behind. Basically, the time when lectures happen is night time for me, so most of the time I'm exhausted. But generally, the way I do it is: in my daytime I go through the recordings again, and then start to go through the exercises.

    [00:23 - 01:32] I think I started on mini project one, but I didn't get much time to actually complete it. Hopefully this weekend I'll finish it. Yes, that will help give you a general sense of integrating things into your own personal project. For example, today we discussed the customer sentiment analysis use case. Before, you would have to aggregate voice data or something; what you can do here is, without worrying about the voice modality, just create synthetic text data for the customer base and get the complete workflow working. Then you can integrate the voice modality on top of it, see how it works, and then tune the parameters with the real data set. Okay. Hey, Michael.

    [01:33 - 02:06] David, how are you? I'm doing great. Busy. I was working a lot on my project the last couple of days. I had a meeting with my customer today and shared some output with them, and it was good feedback. I can talk to you about it and show you some of the stuff I ran into, but I've got to get back to the homework and the exercises now. Yes, you can talk about it right now; that's the purpose of a group project coaching call. And you can also talk about it in our one-on-ones if you prefer. Anything is fine.

    [02:07 - 02:19] Well, I'm happy to share. I was just going to show it — Cindy and Barry can weigh in too — I'm working on the product enrichment, and I wound up building the graph in LangGraph.

    [02:20 - 02:45] So I've got a whole set of steps, and it actually prints out the steps and stages it goes through. The way I've set it up — this is the subgraph for doing the product enrichment so it can run in parallel — for any product that comes in, I copy the input to what I call a scraped product. I do a search with Firecrawl to find the brand's product page.

    [02:46 - 03:41] Then I scrape that page, generate the product description and features, and then I have another step where I standardize the specifications. I did use LanceDB and loaded in all their 6,000 allowed values. Then I add that back into the state — I guess they have you use a reducer, since all these subgraphs can run at once. And then the full graph, let me show you, because I think that's interesting. This is what it looks like. It starts out and creates a run ID, I summarize the spreadsheet, and then I ask the LLM to analyze the columns. Then it creates the mapping — I was trying to avoid a manual mapping exercise. It takes the input and copies it to the internal structure, and once that's done, it uses the Send API to call the subgraph, so this can spawn off however many it needs. Yes. And that's the subgraph.
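
A minimal sketch of the fan-out being described, with hypothetical node and state names (`enrich_product`, `products`, `enriched`) rather than the actual project code. In LangGraph, a conditional edge that returns `Send` objects runs the enrichment node once per product in parallel, and an `operator.add` reducer collects the results back into shared state:

```python
# Hypothetical sketch of the Send-API fan-out (node/state names are illustrative).
import operator
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.constants import Send  # in newer versions: langgraph.types


class OverallState(TypedDict):
    products: list[dict]                            # rows from the spreadsheet
    enriched: Annotated[list[dict], operator.add]   # reducer: parallel branches append


class ProductState(TypedDict):
    product: dict


def enrich_product(state: ProductState) -> dict:
    # Stand-in for the subgraph: scrape -> describe -> standardize specs.
    enriched = {**state["product"], "description": "..."}
    return {"enriched": [enriched]}                 # merged via the operator.add reducer


def fan_out(state: OverallState):
    # One Send per product: the enrichment runs once per item, in parallel.
    return [Send("enrich_product", {"product": p}) for p in state["products"]]


builder = StateGraph(OverallState)
builder.add_node("enrich_product", enrich_product)
builder.add_conditional_edges(START, fan_out, ["enrich_product"])
builder.add_edge("enrich_product", END)
graph = builder.compile()

result = graph.invoke({"products": [{"sku": "A1"}, {"sku": "B2"}], "enriched": []})
```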

    [03:42 - 04:07] Then it prints a summary and writes everything to CSV. I had a job I ran last night that took almost an hour. I think the longest step right now is the specification standardization, because I have to ask the LLM to help determine, out of the set of allowed values, whether the one we scraped matches any of them. And that's where I'll show you where I was.

    [04:08 - 06:10] I was trying to experiment with how to do the search in LanceDB. I guess this gives a visualization of what it looks like. There's an attribute type called weight, and the allowed values appear in the column, like 0.3 pounds — you can imagine this could go on almost to infinity. And what I was trying to work out is what I should embed as the text. The first time, I was just embedding the raw value and doing a vector search, and it wasn't very accurate, because it just sees the numbers and thinks they're all the same. The reason is that you never search for numbers with embeddings. Numbers have dependencies: when the text is converted into embeddings, the model isn't really representing the number itself, it's representing it through the mathematical transformation it uses to produce the vectors, so there are dependencies among all these numbers. If you search embeddings for numbers, you will not get accurate results. So what you do here is keep additional metadata and search that metadata for the numbers. If you want an exact match on a specific value, you need that extra metadata, while embeddings are for searching contextual information. For example, say you have a plethora of tags in a product description, and that description mentions the specifics of the product — it's a shoe in white and blue, and so on. If you then ask, "what's that shoe with the white color and an average weight of 0.3?", it will return that specific chunk, because all of that information is embedded together in the vector.
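
One way to act on this advice is to keep the parsed number and attribute type as plain metadata columns so exact matching is a filter, and reserve the embedding for the contextual text. A sketch with assumed column names and a public embedding model, not necessarily the setup used in the project:

```python
# Sketch (assumed schema and model): numbers live in plain columns for exact
# matching; the vector only carries the contextual text.
import lancedb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

rows = [
    {
        "attribute_type": "Average Weight",
        "attribute_value": "0.3 lbs",
        "value_number": 0.3,   # parsed number kept as metadata for exact matching
        "text": "attribute type: Average Weight, attribute value: 0.3 lbs",
    },
    # ... one row per allowed value (the ~6,000 mentioned above)
]
for r in rows:
    r["vector"] = model.encode(r["text"]).tolist()

db = lancedb.connect("./specs_db")
table = db.create_table("allowed_values", data=rows, mode="overwrite")

# Exact numeric match: an ordinary filter, no embedding involved.
df = table.to_pandas()
exact = df[(df.attribute_type == "Average Weight") & (df.value_number == 0.3)]

# Contextual match: vector search, pre-filtered to the attribute type.
query_vec = model.encode("attribute type: Average Weight, attribute value: 0.3 lbs").tolist()
semantic = table.search(query_vec).where("attribute_type = 'Average Weight'").limit(5).to_list()
```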

    [06:11 - 08:01] I'll show you what I wound up doing. When I did the embedding — this was ChatGPT's suggestion — it said to do something like "attribute type: average weight, attribute value:" and then the number. So when I run a semantic search — in this case I searched for two pounds, 0.2 ounces, which was an exact value — you'll see it did match that one; it actually got it. The fuzzy search was just using the two numbers, and that one didn't hit the record. So I was playing around at the end of the day to see which one works best. And you can see, in this example, I tried a re-ranker. It gave me a whole bunch of stuff, but it didn't work, because it rated two pounds 0.8 ounces ahead of two pounds 0.2 ounces, and that wasn't correct. So the re-ranker — a re-ranker is a module you use once you already have a good result. Before you go further, you have to work out the MRR and the other metrics that go with it; there were around three or four metrics in the re-ranking lecture. You would have to find the hit rate at 5, at 3, and at 1, and then look at those hit rates. For example, right now you are searching for an average weight of two — what's another query we can use here?
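
For reference, the hit-rate and MRR numbers being discussed can be computed in a few lines once you know, for each evaluation query, which chunk is the golden one. A minimal sketch with made-up ids:

```python
# Minimal sketch: hit rate @k and MRR from ranked retrieval results.
# `results` maps each evaluation query to its ranked chunk ids; `golden` maps it
# to the id of the chunk that actually answers it (made-up data below).

def hit_rate_at_k(results: dict, golden: dict, k: int) -> float:
    hits = sum(1 for q, ranked in results.items() if golden[q] in ranked[:k])
    return hits / len(results)

def mrr(results: dict, golden: dict) -> float:
    total = 0.0
    for q, ranked in results.items():
        if golden[q] in ranked:
            total += 1.0 / (ranked.index(golden[q]) + 1)   # reciprocal rank, 1-based
    return total / len(results)

results = {"q1": ["c7", "c2", "c9"], "q2": ["c4", "c1", "c3"]}
golden = {"q1": "c2", "q2": "c3"}

for k in (1, 3, 5):
    print(f"hit rate@{k}: {hit_rate_at_k(results, golden, k):.2f}")
print(f"MRR: {mrr(results, golden):.2f}")
```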

    [08:02 - 08:13] You mean, can we change the query to something different? Yeah. For example, if we use some other query, it will give us the results in a different order.

    [08:14 - 08:57] But then we will be able to see the MRR and the other scores — recall at 5, recall at 3, recall at 1. So say recall at 3 is hitting the highest. Now you re-rank with that recall-at-3 result in mind, so that you can push it up to recall at 1 or 2. Okay. Usually, what I see in a traditional cycle of AI development is that it's always hitting recall at 5 — the highest is always recall at 5.

    [08:58 - 12:18] That means the golden chunk is within the first five chunks. Yeah. And then we use a re-ranker for a specific query and a specific chunk in order to push that passage from five to one or two. Yeah. I'll probably book a time if I need to talk through this in more detail, because a lot of this was really complex. Like I said, the re-ranker code literally came straight out of ChatGPT — it said try this, and I have to say I didn't understand it. But it was an interesting exercise for me, because I got to see why this wasn't working with the numbers and how to approach it differently. I know if I could solve this, it would make the other stuff so much faster, because right now it just takes a long time to pass some of these values in. I can show you my prompt. Let me see where this is. That's my prompt for creating the description and features, and then I separated out the one for the specifications. I have the guidelines, but then I inject pretty much all of the allowed values; I pass the description and give it the features so it can decide, and I do that over and over again. Sometimes a product has 22 different specifications, and it takes a while. I got good feedback today — the customer saw the descriptions and the output was good; I was really happy with that, and it's an improvement over what they had. Yes — we'll be doing lots of iterations; right now your concern is time, right? Yeah, and they had given me some input and I didn't want to take more than a week to get it done, so today was our meeting and I wanted to get it to them. So it was interesting. The other thing I thought about: I think it's been a good experiment for me to try LangGraph. Yes. But that mechanism where it calls subgraphs using the Send API — it fans out almost immediately, and from what I can tell you can't control the rate at which it does that. So the other hurdle I'm running into is that sometimes Firecrawl gives me a 500 error, and it was really hard to figure out what's going on. I had to use their stealth mode, because I think some of these sites, like Patagonia, are doing bot prevention. Yes. So I think I was losing some of my scraping ability during the run. So you might need a validation agent, or an LLM-as-judge in the pipeline. Yeah, I think I will. I definitely need something that makes it more resilient, because it seems like I'm going to get errors and I have to be able to account for them. So here is what we did for the SEO bot: sometimes it spits out four headlines, and the program expects five headlines.

    [12:19 - 13:03] So at the initial stage, what we did was add a dummy value: if there were only four headlines, add a dummy value as the fifth and just go on with it. The user would basically reject that fifth headline, because no one would accept a dummy value, right? But it did not look good on the user's side. So what we did was add a validation agent. The main reason for the validation agent was checking for duplicates from the past — whether that same headline had been generated before — and checking whether the headlines match our defined criteria.

    [13:04 - 13:29] And the third check is whether there are five headlines; if not, it runs the cycle again. Once you add the validation step it is going to add time, that's for sure, but at the same time you will have a less error-prone multi-agent process. When you got a headline back, did you store it in a vector database — did you vectorize it, or did you just keep the exact text?
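
A minimal sketch of the deterministic part of that validation agent — the count check, the exact-duplicate check, and a stand-in for the defined criteria (names and criteria are illustrative):

```python
# Minimal sketch of the deterministic checks (criteria are illustrative).
def validate_headlines(headlines: list[str], past_headlines: set[str],
                       max_length: int = 60) -> list[str]:
    problems = []
    if len(headlines) != 5:                       # the program expects exactly five
        problems.append(f"expected 5 headlines, got {len(headlines)}")
    for h in headlines:
        if h in past_headlines:                   # exact duplicate of a past run
            problems.append(f"duplicate of past headline: {h!r}")
        if len(h) > max_length:                   # stand-in for the defined criteria
            problems.append(f"headline too long: {h!r}")
    return problems                               # empty list means the batch passes

issues = validate_headlines(["New sale today"] * 4, past_headlines={"New sale today"})
if issues:
    print("re-run generation:", issues)
```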

    [13:30 - 13:47] I just store it in a JSON file plus a simple database. That JSON file is used for the headline generation again when a user clicks on generate headline, and the database is used for the front end of the presentation.

    [13:48 - 14:17] But, using your example, maybe you get another headline that looks exactly like a previous headline — that would have to be an exact match; you're not doing it semantically. So that's where I use the LLM as judge. I have a prompt which defines different ranges for the match, and if the value falls in a particular range, it calls back the whole inference.

    [14:18 - 14:30] If the match is above — I'm having the LLM itself do the matching — if the match is above 70%, then basically there's no need to redo the headline generation.

    [14:31 - 15:10] But if it's below 70%, then it has to rerun the inference until it generates a unique headline. Every time it runs the inference, I'm passing the JSON file as context so that it does not produce duplicates, and the prompt itself also says to avoid duplicating anything from the JSON. I'm curious — are you using a consistent model, or do you use different models for the judge versus generation, that kind of thing?
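
A sketch of the generate / judge / regenerate loop described here. Prompts, model names, and the threshold direction are assumptions (this version regenerates while the judge scores the candidate as too similar to a past headline); the past headlines are read from the JSON file mentioned above:

```python
# Sketch of generate -> judge -> maybe regenerate (prompts, models, and the
# threshold direction are assumptions, not the exact production code).
import json
from openai import OpenAI

client = OpenAI()

def generate_headline(past: list[str]) -> str:
    prompt = ("Write one new marketing headline. Avoid duplicating any of these "
              f"past headlines: {json.dumps(past)}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",                  # cheap model for generation
        messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip()

def judge_similarity(candidate: str, past: list[str]) -> float:
    prompt = ("On a scale of 0 to 100, how similar is this headline to the most "
              "similar past headline? Reply with a number only.\n"
              f"Candidate: {candidate}\nPast: {json.dumps(past)}")
    resp = client.chat.completions.create(
        model="gpt-4o",                         # bigger model as the judge
        messages=[{"role": "user", "content": prompt}])
    return float(resp.choices[0].message.content.strip())

past = json.load(open("past_headlines.json"))   # the JSON context mentioned above
for _ in range(5):                              # cap the retries
    headline = generate_headline(past)
    if judge_similarity(headline, past) < 70:   # assumed: below 70% similarity = unique
        break
print(headline)
```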

    [15:11 - 15:50] For generation, I use 3.5 Turbo because it's cheap, and for the judge I go for a bigger model because I need the validation to be more accurate. But for the text generation I use the cheaper model. The reason is that we might need to call the text generation multiple times, and with a bigger model that would cost more. For the LLM as judge, the bigger model isn't a problem, because you do not call the LLM as judge multiple times. Yeah, makes sense. Thank you.

    [15:51 - 17:16] I'm curious, Michael, about your use case, because I'm doing something similar with real estate, where I'm scraping Zillow or realtor.com. But then I actually push everything — fan it out — into a relational database, and then I do those category searches and enable that through LLM function calls. So I have this cached layer of all the things I'm scraping, instead of directly embedding and doing a vector search off of that. So I was just curious what your use case and architecture were and why you chose LanceDB. I think where I was going with it originally was just thinking: okay, I have a bunch of specifications, and — I can show you what they look like; again, I welcome feedback, because this is an experiment for me. Let me find it, yeah, I'll show you. So this is the input data they gave me, and you can see they're just attribute types — ship weight, for example — and these are very numeric values that all look very similar. But then you have stuff like skill level, which is everything from adult, advanced, beginner, college, competitive, and other attribute types like that. So they just vary dramatically.

    [17:17 - 18:46] And so my thought was that if I put this into a database where I could vectorize the text, then I could do a hybrid search, which is what I felt LanceDB could do: I could still do the traditional relational sort of match, like with a percent wildcard, or I could try to get the semantic meaning. So my idea was to run two queries — do the vector search, do the regular relational search for a keyword match — and then evaluate the output of those two. And that's honestly probably my next step; I may need to talk with you, Dipen, more about this. Right now, all I've been doing is giving the LLM the decision: I give it the input of what exists and the input of what was scraped, and ask whether any of those match. If they do, I'll use it; if not, I create a new one based on the rules. And they have rules — for example, the one that had ounces has to be "OZ", no period, space after the number. So they're very particular about that kind of stuff. I don't know — Barry, does that make sense? Yeah, totally. You have a much more diverse set of attributes and attribute values than I do. Yeah, it seems like a good place to start experimenting, and, for me, it's not premature optimization — it makes sense for your use case.
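
A sketch of that two-query idea, reusing the `allowed_values` table and embedding model from the earlier LanceDB snippet: one keyword-style filter on the raw value, one vector search on the formatted text, and a simple merge that an LLM or rule set could then arbitrate:

```python
# Sketch of the two-query lookup (reuses `table` and `model` from the earlier snippet).
def hybrid_lookup(table, model, attribute_type: str, scraped_value: str, k: int = 5):
    # 1) Keyword-style match on the raw allowed value (simple substring filter).
    df = table.to_pandas()
    keyword_hits = df[
        (df.attribute_type == attribute_type)
        & df.attribute_value.str.contains(scraped_value, case=False, regex=False)
    ].attribute_value.tolist()

    # 2) Semantic match on the formatted text.
    query = f"attribute type: {attribute_type}, attribute value: {scraped_value}"
    vector_hits = [
        r["attribute_value"]
        for r in table.search(model.encode(query).tolist())
                      .where(f"attribute_type = '{attribute_type}'")
                      .limit(k)
                      .to_list()
    ]

    # 3) Merge, keyword matches first; an LLM (or rules) can pick the final winner.
    seen, merged = set(), []
    for v in keyword_hits + vector_hits:
        if v not in seen:
            seen.add(v)
            merged.append(v)
    return merged[:k]
```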

    [18:47 - 19:51] Yeah, the other thing I realized I'll end up doing — since you collect all this state as you enrich it — is that I'm going to end up with a product database for the company at the end of the day. It's going to store all this information. I was either going to put the final information in MongoDB or a relational database, because it needs to live somewhere. But the other thing I want to do, more as an experiment, is load the products into LanceDB, because at that point I could offer them an alternative search that's not just the traditional Solr or Elasticsearch that most e-commerce sites use, but one that could actually embed the image for the product. So you could ask for products that look more like this one — that was one of the examples I saw on the LanceDB site. In your case, I think you're doing more with the documents on the real estate side versus the actual property search, but if you wanted to say, "I like this house, find more houses like this," you could show the image, right? And I know LanceDB would do that as well.

    [19:52 - 20:29] Yeah, that's like my next step. I'm not as far along — I'm not getting personal preferences from people yet; I'm just categorizing and collecting the existing stuff. So I'm curious, Dipen: seeing that data Michael just shared, where you have all of those different categories, is this approach — using LanceDB and vectorizing all of it — the best way to search through all of that, or are there some in-between steps that might make it clearer or easier to work with?

    [20:30 - 21:13] It basically depends on the specific functionality you are implementing and what you have to search. What Michael mentioned earlier was that he was searching for ranges of numbers as the query against the embeddings, and it did not work, because embeddings have this numerical dependency. It's a common problem in machine learning too — that's the reason data with time in it is usually harder to train a machine-learning model on: there are numerical dependencies when you train the model, and the neurons basically hallucinate on them.

    [21:14 - 22:06] So what you do here is first try a term-based search, then try a hybrid search with the values, and check with the evaluation whether performance is higher on either side. Once you have that data, you can go to re-ranking and push the retrieval even further. Also, Michael has the constraint of time. This is what I mentioned in the systematic RAG lab: what people usually do in evaluation is ignore the time consumption. So let me show you the set of evaluations that I did.

    [22:07 - 23:36] Give me one second. So, over here, if we look at the evaluation with different sets of parameters, the first thing most people would assume is that this is the best set of parameters. But if you check the time, it's taking 0.470. And if you look over here, or over here, the time is much lower on the retrieval side. So what I would do is: if time is my priority, I would choose this combination of parameters, embedding model, and retrieval method, and then use re-ranking to push this 70% score even higher, maybe to around 80 or 85. Then you have a working system with decent accuracy for production. But if precision is the priority, then you have to give up on time — you can choose this one and push accuracy even higher. Oh, that makes sense. Yeah. When you calculated all those columns, like recall and precision and MRR, were you using any libraries to do that?

    [23:37 - 23:59] Yes, there are specific libraries for that. You can go to the RAG lecture — I posted the RAG exercises — and you can just use that specific set of functions, and it will give you these different scores. And once you have this, the main thing here is that we also have to calculate the time.

    [24:00 - 24:46] And I don't think there is a library for calculating the time, so what I did was record the start time and the end time around the retrieval step and calculate the time for each of the configurations. I have also added the systematic way of implementing RAG in the RAG lecture, so maybe look into that as well; there I discuss how you can start implementing RAG without just settling for the first RAG that happens to work. A systematic way of implementing RAG, from my experience, is where you have certain sets of parameter combinations and certain sets of embedding models.
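
A minimal sketch of that timing-plus-quality measurement, reusing the `hit_rate_at_k` and `mrr` helpers from the earlier snippet; the `retrieve` callable stands in for whichever configuration is being tested:

```python
# Sketch: wall-clock retrieval time per configuration alongside the quality
# scores (reuses hit_rate_at_k / mrr from the earlier snippet; `retrieve` stands
# in for whichever parser + chunking + embedding + DB combination is under test).
import time

def evaluate_config(retrieve, queries: list[str], golden: dict) -> dict:
    start = time.perf_counter()
    results = {q: retrieve(q) for q in queries}        # the retrieval step being timed
    elapsed = time.perf_counter() - start
    return {
        "avg_retrieval_seconds": elapsed / len(queries),
        "hit_rate@3": hit_rate_at_k(results, golden, 3),
        "mrr": mrr(results, golden),
    }

# rows = [{"config": name, **evaluate_config(fn, queries, golden)}
#         for name, fn in configurations.items()]
```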

    [24:47 - 25:44] And then you have certain retrieval methods, and you use the different combinations with different DBs. Then you check the scores for each of these sets, and once you have the score for each set, you will realize that, okay, this DB has the highest accuracy but also the highest time, while this one has decent accuracy but is the fastest among them. So in these cases you choose one of those — you will have something like this, and then it's up to you to choose which one, weighing time against the score. You know, I've got to go back and look at that lecture, I think. I know I did the exercise, but I don't remember all these steps to do all this stuff.

    [25:45 - 26:17] Yes, this is basically for mini project two. So if you look at project two — okay, yes — I added an additional lecture in the lab. If you see, the highest one that I'm getting is the combination of the small embedding model, a chunk size of 256, and an overlap of 50, and I'm using pdfplumber. You can also choose different parsers if you are working with PDFs.

    [26:18 - 27:52] But what I did here, instead of trying everything, was to separate things from this point onwards. For example, I give this to ChatGPT and ask: what are some of the best parameters considering my project? It spits out some good parameters for the project, and then you use those sets of parameters instead of doing the full experimentation. The max characters — the total chunk size — depend on the model you are using and the context window of that specific model, so they're interdependent. Imagine you use these with GPT-3.5: it will work if I have a smaller prompt, because the context window is smaller on 3.5 Turbo. But if I have a larger prompt, then in addition to that larger prompt I have to account for this chunk size, because I'm passing it further down the pipeline as context, so I would have to choose GPT-4 in that case. So there are lots of dependencies between what you choose and which model you use. Okay. But this is all stuff we do in the mini projects — I haven't looked at the next one.

    [27:53 - 28:31] Yes — this is just the parsing stage. I parse the PDF with different parsers and different combinations of chunk size and overlap, and then I calculate the other parameters from the chunk size and the overlap. Then I asked ChatGPT: I want to use 3.5 Turbo, but I will have this prompt in the answering pipeline. And it answered not to use the 1547 one, because I will also have the prompt and I might lose context.

    [28:32 - 29:42] So you switch your model at that point, pick something else? Yes, if I want to use a bigger model. But if I'm persistent on using 3.5 Turbo, then I use the rest — basically these ones, the sentence-based one, the 512 one, and the 256 one — and I use three embedding models from OpenAI, which are these three. And then I have these different sets, each paired with those. So now I have a total of 24 experiments, because each of these is combined with each of the different attributes. And once I decide, at this stage, that I have a set of things that is working well, I go on to re-ranking, where I try to push the score even higher, and then you have a complete RAG pipeline working. So this is how the re-ranking works. Yes.
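
A sketch of how such an experiment grid can be enumerated — the parser, chunking, and embedding-model names are illustrative, and the build/score calls are stand-ins for your own pipeline:

```python
# Sketch of the experiment grid (names are illustrative; build/score calls are
# stand-ins for your own pipeline pieces).
from itertools import product

chunk_settings = [(256, 50), (512, 50), (1024, 100)]    # (chunk_size, overlap)
embedding_models = ["text-embedding-3-small", "text-embedding-3-large", "all-MiniLM-L6-v2"]
parsers = ["pdfplumber", "pypdf"]

experiments = []
for (chunk_size, overlap), emb, parser in product(chunk_settings, embedding_models, parsers):
    # retriever = make_retriever(parser, chunk_size, overlap, emb)   # your pipeline
    # scores = evaluate_config(retriever, queries, golden)           # from the sketch above
    scores = {}                                                      # placeholder
    experiments.append({"parser": parser, "chunk_size": chunk_size,
                        "overlap": overlap, "embedding_model": emb, **scores})

# Sort by whichever metric matters most (accuracy vs. retrieval-time trade-off), e.g.:
# best = sorted(experiments, key=lambda e: (-e["hit_rate@3"], e["avg_retrieval_seconds"]))
```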

    [29:43 - 30:33] Imagine we are in a situation where we want to use this specific set of parameters, and when we look at the results, the scores are better at rank 10, and these are the queries used to get the score. What you do here is find a cluster of queries that hit a specific chunk position most often. For example, seven might be the chunk position with a big cluster — where most of the queries are hitting position seven.

    [30:34 - 30:52] So what you do now is train a re-ranker on that position-seven case to push it to one or two. Then, whenever you have those kinds of queries, where the relevant chunk sits at position seven, the re-ranker will automatically move it to position one or two.
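
At inference time, one common way to get that behaviour is a cross-encoder re-ranker that re-scores the retrieved chunks so a relevant chunk sitting at, say, position seven moves into the top one or two. A sketch using a public checkpoint, not necessarily the one used in the course:

```python
# Sketch: re-score the retrieved chunks with a cross-encoder so a relevant chunk
# at position ~7 can move into the top one or two (public checkpoint shown).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 2) -> list[str]:
    scores = reranker.predict([(query, c) for c in chunks])   # relevance per (query, chunk)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

chunks = ["The racket weighs 0.3 lbs.", "Shipping takes 3 days.", "Comes in blue and white."]
print(rerank("average weight of the product", chunks))
```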

    [30:53 - 31:25] So that you can use it as context and answer from it. So to be clear, for that re-ranking you wanted to push that one value up so it would be more differentiated from the value just above it — it was like 0.72 versus the ideal value, which was only at 0.74. So basically, the reason we push the position to the top one or two depends on two things.

    [31:26 - 32:27] Say we have an answer-generation pipeline which takes the chunks and then answers the user query. There we pass only the first two or three chunks retrieved from the vector DB. So our job is to push the right chunks into those first three, so that the LLM can capture the context when answering and does not have to rely on its own knowledge at generation time. Any other questions? I want to talk about numbers — having numbers all over your documents. For me, my documents have numbers all over the place. And I think you mentioned to me last week, at some point, that I should define ranges. Yes. So I'm just wondering if there are other things I should keep in mind when dealing with numbers — good best practices or anything like that — as I'm getting into this. Yes. For more context, can you go over it again?

    [32:28 - 33:48] Oh, just in general, because Michael had issues with numbers, and my data is basically defining colors, defining positions on the screen, defining how much blur there is going to be on the screen, defining the path of a particle — it's numbers all over the place, right? And okay, we define ranges, but how can I nudge the model or make sure there is an understanding of what the numbers mean? I just want to be sure I go down a path that makes sense. Yes — maybe you should book a meeting once you are done, like a one-on-one with me. The reason is that you are fine-tuning the model, and in fine-tuning, as I mentioned earlier, numerical values have dependencies, so the neurons will always hallucinate on these numbers. That's why I suggested ranges of numbers, so that the model understands the range as text itself, and not the specific number down to the decimal. Once you have that, here is what the fine-tuned model will do.

    [33:49 - 34:10] The models are good at throwing out random numbers, so it will capture the randomness within the ranges and basically return the range. Now, the question here is: will it return the range all the time? I don't know that.
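
A minimal sketch of the ranges idea for preparing fine-tuning data: replace the raw numeric field with a range label so the model learns ordinary tokens instead of exact decimals (bucket edges, labels, and the field name are illustrative):

```python
# Sketch of the ranges idea: replace raw numbers with range labels before
# fine-tuning (bucket edges, labels, and the field name are illustrative).
import bisect

EDGES = [0.1, 0.25, 0.5, 1.0]
LABELS = ["tiny", "small", "medium", "large", "huge"]

def to_range_label(value: float) -> str:
    return LABELS[bisect.bisect_right(EDGES, value)]

def bucketize_example(example: dict) -> dict:
    # Swap the exact numeric field for its range label in the training record.
    return {**example, "blur": to_range_label(example["blur"])}

print(bucketize_example({"prompt": "particle burst", "blur": 0.3}))   # blur -> "medium"
```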

    [34:11 - 34:41] We basically have to experiment and check, and if it doesn't, then we have to find a way to handle it when the output falls outside that range. That's where the validation or LLM-as-judge comes in, and then maybe we can do prompt engineering on the LLM-as-judge so that it gives you the number for that specific parameter. Okay. Okay, interesting. Yes. Thanks.

    [34:42 - 35:02] Yeah, I think I'm doing something like that where I get the square footage of a lot, but it can also be returned as acres. So then I have to define ranges that tell me whether it's likely square footage or acres, and then I can do my conversions from there. Yes. And I just wanted to clarify something with you.

    [35:03 - 35:39] The stuff you published recently about RAG — the additional lecture on RAG — is that what's in this one here, document processing and AI pipeline? Yes, the one that I mentioned today. Well — document processing is solely focused on the PDF parser, PDF processing, where you have a PDF and you want to extract data from it. That one is specifically for that, while the one I mentioned is in the RAG lecture. So go back to this one. Yes.

    [35:40 - 35:52] And then you can scroll down, and it will be in here somewhere in the RAG section. So this was the RAG lecture. Just a minute — why is it not showing on yours?

    [35:53 - 36:06] Or was it in the LoRA fine-tuning session? Yeah, it's not showing on your screen for some reason. Let me see. I see — so that specific item was not visible. Can you refresh and see now?

    [36:07 - 36:24] Ah, there it is. Yeah. That's the additional info on the systematic approach that people often ask about — hang on. And then there is also a RAG exercise in the exercises. Where's the exercise? In the exercises. Oh, here. Yes. Oh, new homework.

    [36:25 - 36:30] This one. Yes. Okay. Thank you. There's also another mini project, right?

    [36:31 - 36:59] Yes. The mini project is basically — if you click on community, and then it's in the threads, you have to scroll down. Yeah, mini project two. Okay. All right. Then you can click on "see more." If it's hard for you to read there, you can just scroll down; I have a comment where I've posted a Notion link.

    [37:00 - 37:21] If it's hard for you to read, that Notion link will take you to a properly formatted version. You don't have to use all of the libraries that I mentioned, but the reason I mention them is to familiarize yourself with them, because all of these libraries will be useful once you start implementing advanced RAG.

    [37:22 - 37:53] Yeah. That's helpful. I haven't used that other tool yet, but the one thing I liked about LangGraph was that you get these graphs that basically show you the steps that occur. Yes. Nice. Because I could see exactly how long things took, whether something failed or not, and what was happening. So. Yes. It was useful. And you see the inputs and the outputs as it moves from each one.

    [37:54 - 38:21] Yes. One tool that is similar is trigger.dev. But before I show you that — there is also another course that I published, which is the AI agent patterns course. Maybe you can also look into that. It's over here, under the courses — yes, the agent design patterns.

    [38:22 - 38:48] These are basically useful design patterns for AI agents. It's more on the theory side, and then you can just pick the AI agent thinking pattern that is applicable to your workflow. We are going to update this with cohort two, because AI agents are evolving every year. Yeah. Okay. That's helpful.

    [38:49 - 39:25] Good that you showed where all of this is — it's a lot of info and it's sometimes hard to find. Yes, and that's the reason we are having this call: it's not just you, it's everyone, because we have a limited amount of time and the information here is an aggregation of two years. It's very common. Yeah. Thanks for asking, Michael, because there were new things there that I hadn't seen on my original pass. Maybe one other question for you, because I'm trying to balance homework versus mini projects versus the project itself.

    [39:26 - 40:25] Some of the more recent lectures on attention and layers — I'll be honest, they're kind of warping my brain, and it's a strange thing to say, but I honestly can't figure out if I really need to go to that depth right now. I'm looking at the n-gram and bigram assignment and thinking: should I just focus on the RAG stuff, because that's going to be more relevant for me? It seems like it would be a better use of my time. Yes, I totally get it, because many people — for example, many people in the previous cohort were only interested in the foundational material, while many others were interested in fine-tuning and just building AI applications without going too deep into the foundations. So I totally understand what you are saying. You should only focus on those things; it's not essential to go through the foundational concepts. Okay. Thank you.

    [40:26 - 42:08] Unless you are going to do full fine-tuning — LoRA fine-tuning is a lightweight fine-tuning, and you do not need the foundational concepts for LoRA fine-tuning. Yeah, I do want to do the fine-tuning stuff, because I like the way you laid it out: start with prompt engineering, then RAG, and then figure out if you need to fine-tune anything. And I could see, based on my last run, that I used a lot of credits — I think I hit six and a half million tokens or something. So I was like, okay, I've got to be careful with what I'm doing. Yeah. Another way you can solve that specific problem is to use a smaller model and host your own inference on an EC2 instance on AWS. That way you do not have the limitation of token limits, and you can experiment as much as you want while just paying on an hourly basis. Maybe just to clarify on that, because I was thinking about it: you had recommended the DeepSeek model, I think, for me — the one that does the reasoning. Yes. And I'm trying to remember, it was like 7 or 70 billion parameters, something like that size. So I need to figure out, one, how much memory I need — that seems to be the deciding factor — and then the software to run that model, whatever you call it, the engine: is it provided on Hugging Face, so literally I just execute it and it loads a server and I can call it, or do I have to do something different?

    [42:09 - 42:53] You would have to use something like vLLM — the vLLM framework — and then you would basically create your own server. There are also frameworks — I forget the name — that give you a GUI-based hosting server, but in the end you still have to configure it, set up port forwarding and so on, in order to use it. Otherwise, I could use something like — what's the one called, Lambda-something? That was one of the ones where you said you could run it at night and get cheaper compute. OpenRouter? No, no, I thought it was — I'd have to look at my notes.

    [42:54 - 43:17] Lambda Labs — that was the thing. Yes, Lambda Labs. That's one of my most-used platforms; if you look, most of my work is running on Lambda Labs. But that would essentially host a model that I could train, is that right?

    [43:18 - 44:15] Yes, you can basically host your own inference there — if you see my usage, I'm using it a lot, but at the same time I'm not using any API. For example, the prompt tuning exercise — this is what I did for prompt tuning, because everyone was interested in prompting, so I created a prompt tuning exercise; it's still in progress, and you guys will have a notebook on prompt tuning. What I did was host a Mistral 7B model on one of these — the A6000, I think — and it costs about 0.80 per hour. I can host a Mistral 7B model on that GPU, run local LLM inference, and build a complete project out of that local inference without using any API.

    [44:16 - 44:51] But to connect to this, do you go through OpenRouter, or do they just provide an API of their own to hit the server — like, how do you connect to the machine? There are two ways of doing that. Basically, I use an SSH connection in Visual Studio Code, and then I create the project from Visual Studio itself. Okay. So here is what I do once I start the machine — let's start from zero. You have Lambda Labs, you have set up the credit card and so on.

    [44:52 - 45:06] Usually, the free machines show up at night, so this is usually my go-to — the A6000, because the smaller models can be hosted on this one, and then you quantize it.

    [45:07 - 46:17] You click on this, then next, next, next, and you get a key — you have to download that key. Then you use Visual Studio to connect to that machine, using that key (these are my keys) and the IP address of that specific machine. Once you click on connect, you will basically have access, and you'll be working on that computer through Visual Studio. Then, once you host your own inference LLM model, you don't need any API. The reason is that you have your own model — you can check unit one or two, where we host our own inference. It's basically something where you just write a Python script that loads the model into memory, and then there's a port available and you just invoke it at that point.

    [46:18 - 46:30] Yes, exactly. And you would use one of the Hugging Face LLM connections? Yes, you have to download the Hugging Face LLM model — the model itself — from Hugging Face.

    [46:31 - 46:53] You will need the Hugging Face API key, and also access to the model — these are gated repos. So you have to go to each model you want, click to request access to the repo, and provide some information, and they authorize you for the repo; once they authorize you, you can download that model there.
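
A sketch of the end state being described: weights pulled from Hugging Face (after the gated-repo access is granted and an HF token is available) and served locally with vLLM on the rented GPU, so no external API is involved. The model name and settings are examples only:

```python
# Sketch: local inference with vLLM on a rented GPU, no external API.
# Assumes `export HF_TOKEN=<your token>` and that the gated repo has approved you.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", dtype="half")  # fits a 48 GB A6000
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Write one headline for a winter sale."], params)
print(outputs[0].outputs[0].text)

# For an HTTP endpoint instead, vLLM ships an OpenAI-compatible server:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2
```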

    [46:54 - 47:18] That's right, we did all this back in the first unit. Yes. That was where I was blowing up my machine locally, because I was trying to download these. Yes. I covered a lot of information — Dipen, thank you. Yeah. Yeah. Right now, one of the simplest ways people are earning, like a side-project income, is basically this.

    [47:19 - 47:51] They go to Vietnam or other places where the weather is a little cooler and, at the same time, the electricity cost is lower. They host their own LLM inference servers in those countries and sell that capacity to companies abroad, like in the USA or especially Saudi Arabia, because the Saudis are putting a lot of funding into AI. This is what they do, and they are earning huge profits.

    [47:52 - 48:09] That's funny — like coin mining, in a way. Yes, exactly, similar to Bitcoin mining. It's exactly the same thing we did in unit two. I was actually thinking the other day, because I was having these Firecrawl issues, that the bot-scraping preventers are getting stronger.

    [48:10 - 49:13] Cloudflare just sent me another email saying, "we've got a new thing that everybody can use to stop bot traffic." But I was thinking: if someone created an app that lets people run an agent on their machine, it could be like the old Napster client, right? When someone wants to scrape a site, your computer acts as the scraper, and you get paid in some kind of token at the end of the day. Yeah, that's a good idea — basically a scraper on P2P networks, set up so that no one's privacy is harmed. That's a good idea. Maybe that would be my project: a tokenized incentive model for being a scraper. I could see so many ways people could abuse that, though — I can just imagine. Yes. I mean, at some point they will have to remove the bot protection, because no people will be surfing the internet — everyone will be using LLMs to surf. Yeah, I agree. It's going to have to change.