AI Agent Patterns
[00:00 - 01:35] All right, Sunnis mentioned that AI agents are programs that use LLMs to interact with users and perform tasks for the user. Yeah, that's good. Anyone else who wants to guess what AI agents are? Anyone else? Joe Percy: agents are LLM-based tools that perform predefined tasks for the user, executed by another LLM orchestrator; specific tasks; software that does things for you programmatically; autonomous. Yeah, these are all good definitions. You actually see a lot of different definitions out there, so there's no consensus definition. Some people joke that if you're using an LLM to just call an API, it's not an agent, you're just calling an API. So people in the AI community argue about this, and you see different definitions. But the general definition is that an agent calls outside tools and keeps looping to make decisions.
[01:36 - 02:23] So you can loop and execute tools, and there's more on top of that: along with the loop, you can have memory and reasoning. Think about a person versus a robot executing a task like opening a door. The robot is just a dumb loop, but as a person you can reason about the door: you decide whether to go to the door or not; if the door is not open, you try to open it; if the door is locked, you try something else. So there's a consecutive loop there that's a little different from just acting on an API. Jeff mentioned that it's autonomous.
[02:24 - 03:10] Autonomy is one form of agent, but agents are not always autonomous. You can have a human in the loop, and the agent can simply react to things through an API; it's not always autonomous in the sense of running completely without an operator. But in general, an agent is something that uses other tools. You could argue we already saw a mini agent in the previous week when we did the training with the reasoning model. And if you Google "what are AI agents," you actually come up with three or four different definitions. Beyond the fact that it's a loop that does certain things, and that this loop can involve reasoning or memory,
[03:11 - 03:19] there's not a huge consensus on the exact definition. Anthropic has a different definition; different people have different definitions. The one thing they all agree on is tool use.
[03:20 - 06:04] So what we're exploring is agents, their design patterns, and how they enable advanced LLM applications: they can use tools, they can plan, and they can collaborate. For example, if you're building something and you ask the LLM, "help me turn this into product requirements, help me think through the engineering trade-offs," those are forms of planning. Planning for agents means you're effectively telling the model to construct what it's going to do first, before it actually does it. So this lecture covers both agent architecture and evaluation. I would say last year, or even mid last year, people basically could not get agents to fully work. But now you have agents that do work for very specific domains. Agents are autonomous entities that perceive their environment, process information, and act on it toward specific goals. Autonomous agents use large language models to act based on goals and inputs. Generally speaking, an agent integrates with software and can do different things: data analysis, automated responses, and so on. This is what people mean by vertical agents. For example, a CRM agent that automatically monitors email, automatically moves deals through the pipeline, and automatically creates research and follow-ups for people. You can now build a lot of vertical solutions for work people used to do manually with SaaS software, which wasn't possible before simply because the decision boundary is fuzzy. When you call an API, it's very clear how you call it; there's a clean boundary. With agents, it's not clear at all. In the previous generation of Salesforce, for example, you manually moved a deal from prospect to a later stage in the pipeline, and that move is actually a specific reasoning process. Previously, that just wasn't possible for large language models.
[06:05 - 07:08] So, generally, the principles for AI agents: autonomy, meaning it can operate independently; perception, meaning it can sense and interpret its surroundings, so it can monitor your email and apply things to the CRM; it's goal-driven, working toward predefined objectives; and it can learn and adapt over time. The analogy is a GPS navigator as an AI agent. It detects the vehicle's location, for example home, work, or the grocery store, then calculates the most efficient route to the destination, adjusts when you deviate from the route, and provides actions: turn-by-turn directions to reach the destination. Okay, James asked: what is the best way to give tools to an agent, looking through this GitHub repository? There's no single right way to give it tools, but there are certain patterns that get followed. We're going to talk about that later in the lecture.
[07:09 - 08:05] And there are different approaches that apply to multiple uses. Joe mentioned MCP; you can use MCP. For example, Deep Research from OpenAI is tool use where they use a web browser and browse through every small thing. I was talking about this with a couple of friends recently: we were remarking on how much time it used to take to research something on Google Maps and go through the reviews. Now, with deep research in both Gemini and OpenAI, you can tell it to find the very best options around you as part of a research report. So you could argue that anything that can be done with web crawling or other tools, where you can then make decisions on the results, is very valuable.
[08:06 - 09:04] Even things like research for a family event, for a date, whatever it is. You have examples of AI agents everywhere: chatbots that receive input, process it with language understanding, generate human-like responses based on previous turns, and then act on it. You have robotics that sense things in the environment and execute actions. You have game AIs that make decisions to challenge human players in virtual environments. If you look at all of these, they process inputs in some way, recognize them in some way, and execute in some way. In a game, you execute game moves; in robotics, you execute physical actions; and with chatbots, you can execute against a CRM, email, or other systems.
[09:05 - 10:10] So, design patterns for building intelligent, efficient agents that generate high-quality outputs. You can use these patterns for agentic workflows. Number one is reflection: ongoing self-assessment, for example asking the model to generate an outline and then refine it using Wikipedia metadata. Then there's tool use: you can use a tool to search for news and research. Then you can plan, for example creating a debate flow between experts on the topic, and you can also coordinate between different agents. Right now people are also talking about agent swarms, and you can see the early indications of agent swarms through MCP. There's an MCP server for this; let me try to find it and show you a deep research demo using an MCP server.
[10:11 - 10:44] Sorry, give me one second. Let me try to find it. Yes, agent swarms. Jeff, I did say agent swarms.
[10:45 - 11:41] Agent swarms. So, this is an example of a deep research implementation from ByteDance, the parent company of TikTok. What it gives you is an agent that lets you add an MCP server. So here you add an MCP server that exposes GitHub trending items, and then you say, "give me a brief summary of the top trending things on GitHub," and it starts the research. This is using another MCP server as an API.
[11:42 - 12:26] And it's completely abstracted away from you. In that sense, it's like a microservice, except the hard part of implementing everything is already done for you. What you see now is some people using MCP servers that compose multiple other MCP servers to get things done. So here it's calling the MCP server, getting the trending data, and then calling the LLM to do a lot of different things on top. And yes, Joe put the repository in the chat. So, agent swarms: this is not exactly an agent swarm, but you can see the early indications of one.
[12:27 - 13:24] You have these APIs that can do things on your behalf, and you're executing things over here. In the future, it could be that one MCP server calls another MCP server, which calls another, and they all work back and forth in swarms, paying each other credits to make sure the job gets done. That's what an MCP-based service could become. Right now, with a more complicated backend, you have microservices and orchestrators that handle a lot of the job management and orchestration; in the future, that could look very different. That's what we mean by multi-agent communication or coordination. Agent swarms are a step further; I'd say we're probably going to get into agent swarms sometime next year. And here is where you can generate code and then check it for correctness.
[13:25 - 13:44] In some sense, we have been doing this already, where we use the compiler, or the underlying tool, to check correctness. Use a tool to execute the code and identify the errors; this is the same kind of thing.
[13:45 - 16:07] Adopt agile methodologies by defining the associated roles. This is planning, and then coordinating between the requirement generator and the troubleshooting generator. So you can build AI agent workflows for enhanced progress. The way you start is: plan an outline, then conduct web searches for the necessary information, write the first draft, review the draft to identify and correct flaws, and then revise the draft based on the identified issues. What happens with this flow is that the improvement is substantial. You all know that GPT-4 is much better than GPT-3.5, but you can see the performance gain here: by iterating through these five steps, the accuracy is 95 percent, versus 67 percent for GPT-4 straight up with a single prompt, or about 48 percent for GPT-3.5 in ChatGPT. What you want is for the agent to reflect on its own output to improve subsequent iterations. Then you can integrate external tools, web search and code execution, to gather and process information, and you can develop and execute a multi-step plan tailored to specific goals. There are different examples of this. There's also a prompt engineering technique called chain of density, where you force the model to iterate. If you've ever noticed that an LLM's summaries and rankings feel unsatisfying, you can use chain of density to force it to iterate and keep adding interesting, relevant items to the summary over and over again. I'm sure you've done this manually already, but codifying it so that it's applied at every stage makes the output much better. A minimal sketch of this kind of iterative, self-reflective workflow is shown below.
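Here is a minimal sketch of the plan, search, draft, review, revise loop described above. The functions call_llm and web_search are hypothetical placeholders for whatever LLM client and search tool you actually use; this is an illustration of the pattern, not a specific framework.

```python
# Sketch of an iterative "reflect and revise" agentic workflow.
# `call_llm` and `web_search` are hypothetical stand-ins for your own clients.

def call_llm(prompt: str) -> str:
    """Placeholder for your LLM client (e.g., a chat-completion call)."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Placeholder for a search tool; returns concatenated snippets."""
    raise NotImplementedError

def agentic_write(topic: str, max_rounds: int = 3) -> str:
    outline = call_llm(f"Plan an outline for an essay on: {topic}")
    notes = web_search(f"background facts on {topic}")
    draft = call_llm(f"Write a first draft.\nOutline:\n{outline}\nNotes:\n{notes}")

    for _ in range(max_rounds):
        # Reflection step: the model critiques its own output.
        critique = call_llm(f"Review this draft. List factual errors, gaps, and style issues:\n{draft}")
        if "no issues" in critique.lower():
            break
        # Revision step: incorporate the critique.
        draft = call_llm(f"Revise the draft to fix these issues:\n{critique}\n\nDraft:\n{draft}")
    return draft
```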
[16:08 - 16:52] Some of you have asked, what are GPT wrappers? In some sense, GPT wrappers, whether pure wrappers or tools like Windsurf, Cursor, and these others, often implement AI agent workflows. The reason they can get better results than a human alone is that a human using ChatGPT or Google just types in whatever is in their head; they don't follow a specific flow, and they iterate with the system ad hoc. In an AI agent flow, even just specifying the flow and having the model reflect on its own work improves the entire output. As a result, your SaaS tool, or whatever it is, produces a better result, and that is actually the reason people pay for it.
[16:53 - 17:49] Reflection is just one of the design patterns for agentic workflows. The LLM generates an initial output, it self-critiques based on correctness, style, and efficiency, it incorporates the feedback, and it iterates multiple times. You can use it for coding, for writing, for many different things. This is based on published research, so you can go back to the paper if you want; it's called "Self-Refine: Iterative Refinement with Self-Feedback." Tool use gives LLMs the capability to call external functions. It allows LLMs to perform complex tasks such as web search, code execution, and interfacing with productivity tools. So, "what is the best coffee maker?" For example, with deep research: what is the best coffee maker according to reviews?
[17:50 - 18:29] It answers that by generating a query string and processing the returned data. One of the things that came out recently, I think last week, is that OpenAI released deep research over GitHub accounts. So one of the things you can do is say, "give me the top agent repos on GitHub that use tools," and it gives you a list of 20. Then you can use deep research, or a fast indexing tool like Augment, to describe the entire repo for you: how they do prompt engineering, how they do RAG, and other things, and create a report for you.
[18:30 - 19:00] I've been doing that for a couple of reports. But generally speaking, to answer James's question about the best way to give tools to an agent: the basic way is to give all the information about the tool to the LLM. There are two ways to come at it. You can do it in context, which effectively means you're teaching it the entire tool within the prompt, and then you're saying, "there's this tool,"
it has a function, it has these parameters, and here's some documentation for it. There's obviously more to tool use than that: you can also put the tool information in RAG, and then do additional web searches to give it extra context,
so that it can call the tool better. The problem with tool use is that it's much more nuanced. For example, you might have a tool where the model just knows the function and the parameters.
That might be okay. Maybe you also give it the documentation, and it improves. But as we all know as programmers, just knowing the function signature and the documentation doesn't necessarily mean, even for humans, that you can use that API properly. So advanced agents not only have those two things, they also incorporate error conditions and what has happened in the past with the tool, so the model becomes much more proficient at calling the API. That's why you'll see different implementations even in the backend, and that's why AI vertical agents can go very deep on tool use. A sketch of an in-context tool definition follows below.
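Here is a minimal sketch of the "in context" approach described above: the tool's schema, documentation, and known error conditions are all placed inside the prompt. The tool name, parameters, and error notes are hypothetical examples, not a specific vendor's API.

```python
import json

# Hypothetical tool spec: schema plus docs plus error conditions, given to the model in context.
weather_tool = {
    "name": "get_weather",
    "description": "Fetch current weather for a city via GET /weather?city=<name>.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name, e.g. Boston"}},
        "required": ["city"],
    },
    "error_notes": [
        "Returns HTTP 429 if called more than 60 times per minute; back off and retry.",
        "Unknown cities return HTTP 404; ask the user to clarify instead of retrying.",
    ],
}

system_prompt = (
    "You can call the following tool. Reply ONLY with a JSON object "
    '{"tool": <name>, "arguments": {...}} when a tool call is needed.\n\n'
    f"TOOL SPEC:\n{json.dumps(weather_tool, indent=2)}"
)

# `system_prompt` would then be sent as the system message of your chat call,
# alongside the user's request.
print(system_prompt)
```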
[20:15 - 21:32] Vertical agents go deep because if you naively give the model just a prompt, maybe it fails 40 to 60 percent of the time. As you improve the documentation, give it error codes, and tell it what to avoid, it improves further, with the ultimate step being fine-tuning: you fine-tune it against different instructions. You can effectively do instruction fine-tuning for tools, where you take the API and make sure the model calls it in the appropriate way. We did an instruction fine-tuning example in the past, and you can use the same approach for tool use. So in general, prompt engineering, RAG, and fine-tuning are all used for tool use, depending on how deeply you need to execute the tool. This is also why we're talking about evaluation in this lecture. Evaluation is very important because, for example, you call a tool and it may seem okay, but it may fail 30 percent of the time simply because people are using it for different use cases, outside of your existing boundaries.
[21:33 - 23:51] On evaluation: there was a talk online by one of the people on the GitHub Copilot team, and he said that one of the key reasons GitHub Copilot succeeded was their comprehensive evaluations; they were able to constantly improve it. For advanced tool use, you can do web search, code execution (execute a Python command and compute the right answer), multi-source search across diverse sources like Wikipedia and research articles, email and calendar management, complex function calling, summarization with index-based search, and math problem solving. Some of you are using deep research tools inside your projects. One thing worth knowing: I mentioned you can use prompts, RAG, and then fine-tuning. OpenAI's Deep Research product lead has been giving interviews, and one of the things that came out is how they built it: they fine-tuned o3 into a specific model and used reinforcement learning, some of the techniques we talked about last week, to train it to use the web browser to do things. So agents range from "you just give it a function" to very advanced setups that use reinforcement learning for the model to learn the tool and apply it. This is why, when you go online and see a repository that says "we implemented OpenAI deep research," you want to know what's under the hood, because it can be anything: RAG, just a prompt, a fine-tuned model, or reinforcement learning on a fine-tuned reasoning model. The techniques we talked about for reducing hallucination, where you train intermediate reasoning steps, are what they used to create deep research.
[23:52 - 24:16] Okay, James asked, not to get out of scope, but when making tools for agents, do you have to account for adversarial cases, when someone tries to jailbreak the agent and make it do something it wasn't intended to do? The answer is yes, you always have to account for these security scenarios, always.
[24:17 - 27:24] This is actually one of the main considerations in designing LLM applications: you always have to account for it. It's also why system prompts leak. There's a set of cybersecurity LLM people whose whole job is to go to different LLM applications and try to break them to get the system prompt out. Maybe to some extent it's unavoidable, but you still want to at least try to prevent the more egregious attacks. If you remember the data lecture, we had an example of an RLHF conduct dataset, where you could see all the negative behavior: people cursing at the LLM, trying to make it act in inappropriate ways. You don't necessarily have to cover all of that, but you probably do have to cover some cases. For example, if you don't do it properly, the LLM can run a SQL injection attack against you: because the user has a free-form prompt, they can write a SQL injection attack against the API the agent is calling, or use the LLM to do recursive execution and use tools to brute-force an attack. So you do have to guard against at least some of this. Early LLMs relied on predefined responses; modern architectures also have things like GPT-4's function calling feature, which enhances their utility, and there are different ways for LLMs to interact with and manipulate data. We have some links here if you want to take a look. Then there's ReAct, which stands for Reason and Act. It's an inference strategy for LLMs. The reasoning part is the model thinking about what to do next; then it takes an explicit action, either a tool or an API call; it observes and ingests that action's result; and it loops the cycle until it can produce a final answer. This lets the model both think, like chain of thought, and do, like an agent, closing the gap between pure text generation and procedural workflows. It's a hybrid inference strategy that interleaves reasoning steps with explicit action calls, observing the result before continuing. So the model reasons about the next logical step ("which API should I call to get the weather data?"), executes the chosen action (for example, a GET on /weather?city=Boston), observes and parses the action's output, and repeats reason, act, observe until the final answer is ready. A minimal sketch of this loop is below.
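Below is a minimal sketch of a ReAct-style loop. It assumes a hypothetical call_llm client and a tiny tool registry; real implementations add output parsing robustness, retries, and guardrails. It only illustrates the reason, act, observe cycle described above.

```python
import json

def call_llm(messages: list[dict]) -> str:
    """Placeholder for your chat-completion call; returns the model's text."""
    raise NotImplementedError

def get_weather(city: str) -> str:
    """Hypothetical tool: would call GET /weather?city=<city>."""
    return f"72F and sunny in {city}"

TOOLS = {"get_weather": get_weather}

SYSTEM = (
    "Answer the user's question. At each step output JSON:\n"
    '{"thought": "...", "action": "<tool name or final_answer>", "input": "..."}'
)

def react(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": question}]
    for _ in range(max_steps):
        step = json.loads(call_llm(messages))               # Reason: model picks the next action
        if step["action"] == "final_answer":
            return step["input"]
        observation = TOOLS[step["action"]](step["input"])  # Act: run the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {observation}"})  # Observe
    return "Stopped: step limit reached."
```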
[27:25 - 28:02] Someone said they just had to call it ReAct, and yes, ReAct is everywhere: you had React in front-end programming and now you get ReAct agents. Next, mastering complex tasks with planning and design in agentic workflows. Large language models can drive powerful agents that execute complex tasks if you ask them to plan the steps before they act. Planning in AI agent flows means the large language model autonomously decides the sequence of steps required to accomplish a larger task.
[28:03 - 28:41] For example, an LLM used for online research automatically switched from a live web search to a Wikipedia search when faced with a rate-limiting error during a live demo, demonstrating adaptive problem solving. That is dynamic decision making: it allows the agent to handle new challenges and changing environments, and here it adapted to the limitation. It can also break a high-level task down into smaller tasks, for example converting an image into similar poses by doing pose detection on the image and then rearranging the steps.
[28:42 - 32:43] And of course, you know LLMs can be a little unpredictable, so sometimes they take unforeseen yet successful approaches to solve problems, and there's a surprising and often delightful moment when an agent autonomously achieves something the user didn't anticipate. Then there's multi-agent collaboration, which comes in for more complex tasks. Before you think this is science fiction, people are already chaining multiple MCP servers, for example for something like trip planning: there are MCP servers for flights, there are Yelp MCP servers, and some people are executing three, four, five, or six MCP servers to get a whole task done. I would say last year this was more hypothetical and only a couple of people were doing it, but nowadays you see real implementations where you use LLMs to research the requirements, use them to implement the engineering, and then execute and deliver something. Sunnis asks if this is the same as agent swarms. It's similar, but the vision is slightly different. When people talk about agent swarms, they're talking about an economy between agents: every agent has a token and credit system, and they go back and forth, taking turns executing. When people run these multi-agent patterns today, it's mostly one user executing them, and all the agents are operated by that one user. Agent swarms refer to using someone else's agent, where you pay for their GPU and agent resources, and that agent in turn pays other agents. So with an agent swarm, you execute a task, it goes out and starts paying other agents to do different pieces, and they execute and report back. At the end, you may pay ten dollars and the task will have been executed through 30 or 40 intermediate agents. That's the vision of agent swarms; it's a similar but slightly different definition of AI agents. AI agents are autonomous software entities that use AI to perform tasks, and next we're going to explore AI agents inside the enterprise. One thing about building AI agents is that they sometimes require external memory, which is noted here. Reasoning can refer both to reasoning models and to the LLM's ordinary reasoning. The other piece is external memory: the agent needs external memory to store and recall domain-specific knowledge within the bounded context of the problem. When you see "memory," it's mostly a fancy word for a vector database. OpenAI now has memory of all your chats, and what that means is they basically store everything in a vector database. Agents then use tools to perform tasks that enhance their problem-solving capability. So now you're starting to see agent tools that do enterprise things: connect to a CRM or ERP, and also perform UI actions.
So not just calling APIs, but actually performing UI actions. One of the things that surprised me when we were researching deep research is that they actually trained it to use a web browser. Deep research really does drive a web browser.
[32:44 - 33:58] So sometimes it's going in and using reinforcement learning to do UI actions; it's not just a simple wget or a Puppeteer instance. Planning, rather than attempting to solve complex problems with a single-threaded agent, reflects a more human-like process: breaking the problem down, reflecting on progress, and readjusting as needed. So what you see here is that different people describe agent design in terms of reasoning versus external memory versus execution versus planning, and there are different ways to think about it: does it make a decision, does it automate a task, does it automate a process? Some people make fun of agents online and say that an agent without external memory and planning is just a function call wrapped around an LLM. In some sense that's true, but this is why you have different definitions. You can also use different tools to make agents better. OpenAI's structured outputs enable better tool use,
[33:59 - 35:04] and there's an open source library called Pydantic that helps you structure the outputs and inputs passed between different agents, which makes working through the workflow much more reliable; a minimal sketch is below. One definition: agents emerge when you place the LLM in the control flow of your application and let it dynamically decide which actions to take, which tools to use, and how to interpret and respond to input. Under that definition, some agents don't even need to interact with external tools or take actions at all. That's a different kind of definition, and it leads into the five levels of agents. You could argue that at the very extreme, an agent is actually AGI, acting purely from its own values and capabilities. The most advanced agent use we're considering here is a reinforcement-learning-based reasoning model with tools.
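Going back to the structured-output point just above, here is a minimal sketch of using Pydantic to validate the structured payload one agent passes to another. The schema and the raw JSON string are hypothetical examples, not from the course codebase.

```python
# Validate an agent's structured output with Pydantic before acting on it.
from pydantic import BaseModel, Field, ValidationError

class ToolCall(BaseModel):
    tool: str = Field(description="Name of the tool to invoke")
    arguments: dict = Field(default_factory=dict)
    reasoning: str = ""

raw_llm_output = '{"tool": "get_weather", "arguments": {"city": "Boston"}, "reasoning": "User asked about weather."}'

try:
    call = ToolCall.model_validate_json(raw_llm_output)  # parse and validate in one step
    print(call.tool, call.arguments)
except ValidationError as err:
    # In an agent loop you would feed this error back to the LLM and ask it to retry.
    print("Malformed tool call:", err)
```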
[35:05 - 36:00] That most advanced setup is basically what we did in the previous week, or a week and a half ago, where we trained intermediate reasoning models; that's roughly the state of the art. So you could ask, what are the three types of agents? First, decision agents: you have a fixed decision tree or graph that encodes business rules, and the language model simply routes each case through that tree. It evaluates which branch to follow at each node; it doesn't invent new steps or alter the overall flow. This is good for well-scoped, rules-heavy tasks, for example claims processing, compliance checks, and KYC workflows, because the first question people ask when they hear "I'm going to let the LLM decide things" is: how do I control it? The way you control it is to map out the actual flow,
[36:01 - 37:12] and it's a tree or a graph or whatever it is. Then you have agents on rails. You have a high-level objective: reconcile this invoice, refactor this code. And you have a prescribed operating procedure with a curated set of tools. At runtime, the agent runs within that standard operating procedure: it accesses the current state, the memory or whatever it is, then acts based on the available toolset or sub-agents, then applies checks and filters or invokes human review, then carries out the chosen action, and then repeats the cycle. This balances safety and flexibility, making it a popular pattern for business processes, but it still requires oversight. At the far end you have truly unbounded agents: a continuous loop where the model handles every single step, and it can explore, backtrack, and execute in a self-directed way. But these are more experimental.
[37:13 - 38:07] The majority of the agents people are deploying in the enterprise today are decision agents or agents on rails. Then obviously you have retrieval-augmented generation, and RAG is often used as a form of memory. To execute and facilitate this, you can do what's known as prompt chaining. Eve, for example, is a legal research copilot. It can decompose a high-level query, say "research Title VII claims for Company X," into parallel subflows. A subflow could be employer background; it could be employment history and facts; it could be checking the Title VII statutory framework, relevant case law, and supporting evidence for the plaintiff. Each of these has its own prompt chain.
[38:08 - 39:09] Each chain then issues a focused prompt, retrieves only the chunks relevant to its subtopic, and generates an intermediate write-up. Once all the chains finish, you synthesize the outputs into a coherent final memo, much like merging branches in git for a published release. This pattern shows how RAG pipelines go well beyond a single retrieve-and-generate step, using orchestrated prompt chains to tackle complex, multifaceted questions and stitch the results into a complete answer. What you see here is decomposition, explicitly mapping the problem to subtasks, and then merging the results; a minimal sketch follows.
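Here is a minimal sketch of the decompose, run subflows, merge pattern just described. The functions call_llm and retrieve are hypothetical placeholders for your LLM client and RAG retriever.

```python
# Prompt chaining: decompose a high-level query, run each subflow over its own retrieved
# chunks, then merge the intermediate write-ups into one final memo.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder: return the top-k chunks from your vector store."""
    raise NotImplementedError

def research_memo(high_level_query: str) -> str:
    # 1. Decompose the query into parallel subflows.
    subtopics = call_llm(
        f"Break this research task into 3-5 independent subtopics, one per line:\n{high_level_query}"
    ).splitlines()

    # 2. Each subflow retrieves only its own chunks and writes an intermediate section.
    sections = []
    for topic in subtopics:
        chunks = "\n".join(retrieve(topic))
        sections.append(call_llm(f"Write a short write-up on '{topic}' using only:\n{chunks}"))

    # 3. Merge the branches into one coherent final memo.
    joined = "\n\n".join(sections)
    return call_llm(f"Synthesize these sections into a single coherent memo:\n{joined}")
```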
[39:10 - 40:17] You can also use tools to bridge the gap between RAG systems and fully agentic systems. For example, you have a tool registry: the application defines a set of callable tools, say Browserbase for web search, TinyFish for web scraping, E2B for code interpretation, Anon for authentication. Each tool exposes a JSON schema interface that the LLM can target. The LLM sees the available tools and chooses one based on its understanding of the user's request, then emits a structured JSON payload matching the tool's schema, for example "use the browser tool" with the query "latest Title VII cases." The system executes the API call and returns the raw results; the LLM ingests the result and continues reasoning, or invokes another tool. Like RAG, the final output is synthesized from the tool output plus the model's own generation. A spreadsheet calculation AI is a concrete example: the LLM writes the cell formulas, and the result flows back into the conversation, with no human-written formula needed. A sketch of a tool registry follows below.
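Below is a minimal sketch of the tool-registry pattern: each tool carries a JSON schema, the LLM picks one and emits a matching payload, and the host executes it. The tool functions and call_llm are hypothetical placeholders, not a specific vendor API.

```python
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def search_web(query: str) -> str:
    return f"(search results for: {query})"

def run_code(code: str) -> str:
    return "(code output)"

# Registry: each callable tool plus the JSON schema the LLM must target.
REGISTRY = {
    "search_web": {
        "fn": search_web,
        "schema": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    },
    "run_code": {
        "fn": run_code,
        "schema": {"type": "object", "properties": {"code": {"type": "string"}}, "required": ["code"]},
    },
}

def dispatch(user_request: str) -> str:
    tool_specs = {name: spec["schema"] for name, spec in REGISTRY.items()}
    choice = json.loads(call_llm(
        'Pick one tool and emit {"tool": ..., "payload": {...}} matching its schema.\n'
        f"Tools: {json.dumps(tool_specs)}\nRequest: {user_request}"
    ))
    result = REGISTRY[choice["tool"]]["fn"](**choice["payload"])  # execute the chosen tool
    return call_llm(f"Answer the request using this tool result:\n{result}\nRequest: {user_request}")
```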
[40:18 - 41:03] Decision agents go beyond static RAG and tool use. What you do is create a directed acyclic graph (DAG), which is used to parse a rule book. Maybe you have a rule book with a bunch of if/then scenarios related to compliance or something else, and you parse it into a DAG. Then the agent, at each node, inspects the relevant documentation: for straightforward checks it may invoke a quick RAG lookup, and for complex branching it may spin up a short sub-chain of prompts. After each decision, the agent updates its in-memory state with the intermediate outcome.
[41:04 - 43:14] Once a terminal node is reached, it can automate what a person once did manually. You can use this for regulatory compliance rule books, for example automated audits, and for KYC workflows like dynamic identity validation. The key pieces here are the directed acyclic graph, using RAG, and spinning up sub-chains of prompts. For agents on rails, the rails are defined by your org: your playbook and your toolset. The high-level goal might be "reconcile this invoice with the general ledger," and then you have a natural-language rule book (the rails), a registry of approved tools it can use (CRMs, ERPs), and a defined way it can act. So you have a planning loop: identify where you are in the process within your DAG, enumerate the possible actions, whether in code or via helper agents, then select and execute one. Before committing any action, you can also run guardrails: a consistency check, a hallucination filter, a human-in-the-loop evaluator to double-check everything. Then you repeat: re-evaluate the runbook and move to the next step. This pattern requires multi-agent coordination and orchestration; it requires memory management, episodic, working, and long-term, which are just different ways of using vector databases for different things; and it requires robust guardrails that double-check the work. Oftentimes with agent frameworks you'll see a 700-prompt framework where maybe 80 percent of the prompts are designed for double-checking sources, hallucinations, and other things. And these agent jobs can run for quite a while. A minimal sketch of this kind of planning loop with a guardrail check follows.
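Here is a minimal sketch of the agents-on-rails planning loop described above: pick the next step from a runbook, run it, and pass a guardrail check before committing. All function names and the runbook contents are hypothetical placeholders.

```python
# Agents on rails: a fixed operating procedure plus a guardrail before committing each action.

def call_llm(prompt: str) -> str:
    raise NotImplementedError

RUNBOOK = [
    "Fetch the invoice and the matching general-ledger entries.",
    "Compare line items and flag any mismatches.",
    "Draft a reconciliation note for the mismatches.",
]

def guardrail_ok(step: str, output: str) -> bool:
    """Consistency / hallucination check; could also route to a human reviewer."""
    verdict = call_llm(f"Does this output faithfully complete the step, with no invented data? "
                       f"Answer YES or NO.\nStep: {step}\nOutput: {output}")
    return verdict.strip().upper().startswith("YES")

def run_on_rails(goal: str, state: dict) -> dict:
    for step in RUNBOOK:                      # the rails: a prescribed operating procedure
        output = call_llm(f"Goal: {goal}\nCurrent state: {state}\nPerform this step: {step}")
        if not guardrail_ok(step, output):    # guardrail before committing the action
            output = call_llm(f"Redo the step more carefully and cite your sources.\nStep: {step}")
        state[step] = output                  # commit the intermediate outcome to memory
    return state
```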
[43:15 - 43:39] There are different people you can follow for AI updates. We had this list in Notion, but you can use it as a starting point; some of these people tweet out useful AI techniques. You want to split them into lists. The reason AI news seems overwhelming is that everyone is talking about a different layer of AI.
[43:40 - 44:24] If you remember the very beginning of the webinar, level one is AI coding and AI automation, level two is RAG, level three is agents, level four is fine-tuning, and level five is foundation models. So if you're keeping track of AI, look at what people are posting and classify them into lists: are they tweeting mostly about RAG or fine-tuning, mostly about foundation model and research updates, mostly about agent activity, or something else entirely? For example, Jason talks about RAG specifically.
[44:25 - 45:08] Some of these people are RAG and fine-tuning people. DSPy is a good tool for prompt optimization, and they just added a module for reinforcement learning. DSPy lets you define a data pipeline; it's different from a plain prompt-based pipeline because it allows automatic execution and does the prompt optimization for you. It's been shown to perform better than manual prompt engineering, and, as mentioned, they recently added a reinforcement learning module. Some people are using it in production.
[45:09 - 45:25] I haven't personally tried it. Then there are other people working on distributed training, where people take advantage of decentralized compute, meaning, within crypto, people providing GPUs in exchange for tokens.
[45:26 - 46:45] That's obviously harder to use than just going to AWS. And last Sunday, people published a technical report on how to do async distributed training for the first time. Then you have people who talk about LLMs at level five, generally researcher internals and updates, and then conventional AI updates. These are all more about the internals, so make sure to put these people in different lists, because if you don't really care about the latest foundation model updates and you follow some of them, all you'll see is research papers, while others mostly post RAG and fine-tuning techniques. So it's always different layers. Now, evaluation: evaluation is how well your fine-tuned model performs on a specific task. Why are we talking about evaluation? People treat it like a testing chore, something obligatory, but evaluation is actually fundamental: you don't know how well your system performs without it. Evaluation is centered around a few things.
[46:46 - 48:16] One is metrics around your model, another is metrics around your dataset, and then whether the model actually learned what you wanted it to learn. What you see is that models are coming out all the time, for example Qwen models and other new foundation models, and you might wonder why: it's because they're better at certain benchmarks. And when you take a foundation model and fine-tune it, you're going to need evaluation to improve it and to answer "did the model actually learn what I wanted it to do?" There are different evaluations for classification, different evaluations for regression, common evaluations like perplexity, and a lot of custom evaluation metrics. A classification report is a detailed summary of how a machine learning model performs at classification: it calculates things like precision, recall, F1, and support for each class. In simple terms, accuracy is the percentage of predictions the model got right overall; precision asks, out of all the times the model said "this is class X," how often it was right.
[48:17 - 48:36] You use accuracy when you have balanced classes and you want a general sense of performance. People tend to think accuracy is all they need, because everyone has an intuitive sense of it: I took a test and got 80 out of 100, or 90 out of 100.
[48:37 - 49:43] But when you're using classification metrics, you actually have to look at the dynamics of the underlying data. For example, how important is it that your spam filter lets the important emails through, even if that means one or two spam emails slip in? It really depends on the dynamics of your system, so you have to understand which metric matters most; you can't just blindly pick an evaluation metric. Precision, for example, says: out of the times the model said "this is class X," how often was it right? Use it when false positives are costly, for example spam filtering. In the spam case, you don't want important emails going into spam; you'd rather have one or two spam emails land in the inbox than have the filter throw out real mail.
[49:44 - 50:14] Recall: out of the actual items in the class, how many did the model find? Use it where false negatives are costly, for example detecting disease or fraud. If you're doing a cancer scan or tumor detection and you fail to catch a real case, that's a big problem. The F1 score is the balance between precision and recall; use it when you want a single number to summarize model quality when classes are imbalanced.
[50:15 - 51:07] Imbalanced data means class A has much more data than class B. Support is the number of real examples of each class in the dataset; it's best used when you want to see which classes are under-represented, which affects the other metrics. The confusion matrix is a grid that shows how often the model confuses one class for another; use it when you need to diagnose exactly which classes the model is getting wrong. Then there are the core regression metrics. Many of you have projects that are really regression problems: calorie counting is a regression problem, predicting insurance damage values is a regression problem. Classification is more like predicting which category something belongs to, a discrete label rather than a number. A short scikit-learn example of the classification metrics follows.
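Here is a short scikit-learn sketch of the classification metrics discussed above, run on made-up spam labels purely for illustration. It assumes scikit-learn is installed.

```python
# Classification report (precision, recall, F1, support) plus a confusion matrix on toy labels.
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "spam", "spam", "ham", "ham", "ham", "ham"]

# Per-class precision, recall, F1, and support, plus overall accuracy.
print(classification_report(y_true, y_pred, digits=3))

# Rows = true class, columns = predicted class: shows exactly which classes get confused.
print(confusion_matrix(y_true, y_pred, labels=["spam", "ham"]))
```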
[51:08 - 52:06] The core regression metrics: mean absolute error (MAE) is, on average, how far your predictions are from the true values; it's best when you want an easy-to-interpret error, for example "off by three units." Mean squared error (MSE) is like MAE but it punishes big mistakes by squaring the error, so use it when you want to heavily penalize large errors; it's good for safety-critical systems. Root mean squared error (RMSE) is similar to MAE but balances interpretability with sensitivity to big errors. R-squared tells you how well your model explains the variation in the data; use it when you want an overall sense of fit or model quality. And mean absolute percentage error (MAPE) is for when you want a scale-independent error, for example predicting sales across different currencies. A short example follows.
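Here is a short scikit-learn sketch of these regression metrics on made-up calorie predictions, again assuming scikit-learn is installed.

```python
# MAE, MSE, RMSE, R-squared, and MAPE on a toy calorie-counting example.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score, mean_absolute_percentage_error)

y_true = np.array([520, 310, 640, 150, 480])   # actual calories
y_pred = np.array([500, 350, 600, 180, 470])   # model predictions

mae = mean_absolute_error(y_true, y_pred)             # average size of the error
mse = mean_squared_error(y_true, y_pred)              # squares errors, punishes big misses
rmse = np.sqrt(mse)                                   # back in the original units
r2 = r2_score(y_true, y_pred)                         # fraction of variance explained
mape = mean_absolute_percentage_error(y_true, y_pred) # scale-independent percentage error

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  R2={r2:.3f}  MAPE={mape:.2%}")
```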
[52:07 - 52:14] These apply when you add a classification head or a regression head. So what is perplexity?
[52:15 - 53:08] Perplexity measures how well a language model predicts a sample. It's the inverse probability of the correct words, normalized by the number of words. Generally speaking, lower perplexity indicates a better-performing model. It's like testing how well a GPS predicts the next turn on an unfamiliar road. It's very fast and effective for evaluating language models without needing labeled data, and it helps determine how well the model generalizes to unseen text. You can use a tool like lm-evaluation-harness, a Python library that makes it easy to run unsupervised evaluation on a wide variety of language models. Perplexity is a very common evaluation metric; it's why the search engine Perplexity is called that: they named it after this evaluation metric. A small sketch of the computation is below.
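Here is a minimal sketch of computing perplexity from per-token log-probabilities. The log-prob values are made up; in practice they come from your model's output.

```python
# Perplexity = exp of the average negative log-likelihood over the evaluated tokens.
import math

# log P(token_i | previous tokens) for each token in the evaluated text (toy values)
token_logprobs = [-1.2, -0.4, -2.3, -0.9, -1.7]

avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)
perplexity = math.exp(avg_neg_logprob)

print(f"perplexity = {perplexity:.2f}")  # lower means the model found the text less "surprising"
```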
[53:09 - 54:39] The challenge with perplexity is that it's sensitive to tokenization, so different tokenizations can produce different perplexity scores. Perplexity only measures the probability of the next token in a sequence; it doesn't capture real understanding of the text. It can also be gamed by models that simply memorize the training data rather than learning to generalize to new data. And perplexity is a questionable metric for evaluating models on out-of-domain data, since it can be biased toward the training data. The BLEU score is a precision-based metric for evaluating the quality of machine-generated text by comparing it against existing human-written reference text. So you're checking the output against human-written references, like checking essays against the teacher's version. It doesn't care whether the model chose a better synonym or a nicer phrasing. BLEU stands for bilingual evaluation understudy; it was built around existing reference texts in different languages, and it just checks how well the output overlaps with those references. So BLEU does have some limitations.
[54:40 - 54:50] It's surface-level matching. It was really designed for machine translation. It only checks n-gram overlaps, and it ignores semantic meaning.
[54:51 - 55:18] For example, "the dog chased the cat" versus "the feline was pursued by the dog": BLEU sees this as a low match. That's just active versus passive voice, and an LLM can recognize that passive versus active voice carries the same meaning, but BLEU can't. It's also precision-oriented, so it measures what you said, not what you missed, and it penalizes diversity.
[55:19 - 55:48] If there are multiple correct outputs, BLEU unfairly penalizes anything that doesn't match the reference. It's insensitive to word order for short n-grams, so BLEU-1 and BLEU-2 may ignore important structure like subject-verb-object inversion. There are additional problems too: it penalizes outputs that are shorter than the reference even when they're correct, and it's not very interpretable for humans; it doesn't provide a clear intuition about what counts as good enough. A small example is below.
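Here is a small example, using NLTK (assuming it's installed), showing BLEU penalizing a paraphrase a human would accept: same meaning, low n-gram overlap.

```python
# BLEU scores an exact match highly but a valid paraphrase poorly.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = [["the", "dog", "chased", "the", "cat"]]
exact      = ["the", "dog", "chased", "the", "cat"]
paraphrase = ["the", "feline", "was", "pursued", "by", "the", "dog"]

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
print(sentence_bleu(reference, exact, smoothing_function=smooth))       # close to 1.0
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # much lower, despite same meaning
```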
[55:49 - 57:45] ROUGE is a recall-focused metric that measures the overlap between model-generated text and reference text. Unlike BLEU, it emphasizes what the model missed, not just what it got right. ROUGE is good for coverage and recall, but it fails to measure meaning, logic, or style. Use ROUGE when evaluating abstractive summaries, and avoid it when outputs require semantic interpretation. What you see here is that a lot of the limitations of these evaluation metrics come from the fact that they have no semantic understanding; they were designed at a time when verification had to be computationally cheap, so you could track how well a model was doing over time. Why do you want to know the pros and cons of each evaluation? Because when you're using these metrics, you want to choose the right evaluation for your use case, to know whether your model is actually performing. And when you're debugging, you might think "my model is doing well," then look at the limitations of the evaluation and realize that it's actually the evaluation that's over-penalizing your scenario. So note that ROUGE is a recall-focused metric and BLEU is a precision-based metric; if you remember what we discussed about precision versus recall, when you look at an evaluation you really need to look at what it's measuring.
[57:46 - 58:11] ROUGE has some limitations too, slightly different from BLEU's: it penalizes rephrasing since it has no semantic understanding, it's sensitive to tuning, it actually rewards verbosity, it's insensitive to grammar and fluency, and it's also a little hard to interpret. A small example is below.
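Here is a small example using Google's rouge-score package (assuming it's installed via pip install rouge-score) to score a summary against a reference.

```python
# ROUGE-1 and ROUGE-L between a reference and a generated summary.
from rouge_score import rouge_scorer

reference = "The dog chased the cat around the yard."
summary   = "A cat was chased around the yard by the dog."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, summary)

# Each entry has precision, recall, and F-measure; ROUGE emphasizes the recall side.
print(scores["rouge1"])
print(scores["rougeL"])
```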
[58:12 - 59:06] So how do you actually run evaluation? For example, you can git clone lm-evaluation-harness, install it, and run it with the appropriate flags: lm_eval with --model hf, --model_args pointing at the model (for example a Llama 3.2 1B checkpoint), plus the tasks, the device, and the batch size. You can also call the harness from Python and save the results: you give the model name, the dataset, the CUDA device, and then write the results out. There are different datasets and benchmarks, for example HellaSwag and WikiText, and you can use them to evaluate against whatever you want to measure. A hedged example of invoking the harness is below.
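Here is a hedged sketch of invoking lm-evaluation-harness from Python via its CLI. The exact flags depend on the harness version you install (check lm_eval --help), and the model and task names are just examples.

```python
# Run the lm-evaluation-harness CLI as a subprocess and save results to a directory.
import subprocess

subprocess.run([
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=meta-llama/Llama-3.2-1B",  # example checkpoint
    "--tasks", "hellaswag,wikitext",
    "--device", "cuda:0",
    "--batch_size", "8",
    "--output_path", "results/",
], check=True)
```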
[59:07 - 01:00:03] The evaluation setup in our case study is: you have a dataset, and the sample is dataset.shuffle().select(...), which selects a subset (default 20 examples) to keep the computation down. The prompt format is question and answer: "Let's solve it step by step using reasoning." We use greedy decoding, we extract the answer with a regex, we compare the prediction against the label using string equality, and accuracy is correct over total. The limitations are that it only works on numerical answers unless you extend it to symbolic or multiple-choice answers, it requires consistent formatting, and it's binary correctness only. A sketch of this custom evaluation is below.
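Here is a sketch of that lightweight custom evaluation, assuming a Hugging Face datasets dataset with "question" and "answer" fields and a hypothetical generate function that runs your model with greedy decoding. The dataset fields and helper are assumptions, not the exact course code.

```python
# Custom eval: shuffle/select a small subset, prompt with step-by-step reasoning,
# extract the final number with a regex, and score by string equality.
import re
from datasets import load_dataset

def generate(prompt: str) -> str:
    """Placeholder for your model call (greedy decoding)."""
    raise NotImplementedError

def evaluate(dataset_name: str, n: int = 20) -> float:
    data = load_dataset(dataset_name, split="test").shuffle(seed=0).select(range(n))
    correct = 0
    for row in data:
        prompt = f"Question: {row['question']}\nLet's solve it step by step using reasoning.\nAnswer:"
        output = generate(prompt)
        match = re.search(r"-?\d+(?:\.\d+)?", output.split("Answer:")[-1])  # pull the final number
        pred = match.group(0) if match else ""
        correct += int(pred == str(row["answer"]))   # binary string-equality check
    return correct / n                                # accuracy = correct / total
```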
[01:00:04 - 01:01:02] We're going to go into the code, but this is a lightweight, interpretable, task-aligned custom evaluation that you can use in your projects. Oftentimes you'll see new models come out, and the announcement will say two things. One is the model architecture, for example "we're a mixture of experts with test-time compute of whatever." The other is the benchmarks they perform better on. You might wonder, why does the benchmark matter to me? The reason is that benchmarks are domain-specific: models that are better at math and reasoning are better at things related to data, for example financial work. So some of you are doing things related to code,
[01:01:03 - 01:03:11] many of you are doing things related to data, and some of you are working with text, legal texts or other things. So depending on what you're optimizing for, you want to look at the benchmarks to see how well the model does against the relevant datasets. These are existing datasets. One is from Codeforces: it has a bunch of algorithm and data structure problems, models submit solutions in multiple languages, and the score is based on the percentile of problems solved. Then there's AIME, the American Invitational Mathematics Examination, which is high school mathematics: multi-step, high-school-level math reasoning. Then there's MMLU, massive multi-task language understanding: different subjects across STEM and the humanities, evaluated with zero- to few-shot prompting to gauge broad knowledge. You also see SWE-bench, where LLMs must apply patches to real Python repo issues and pass the unit tests for the fix to count. There's GPQA, graduate-level STEM Q&A, with 198 questions in the diamond set: graduate-level, Google-proof Q&A, meaning you can't easily Google the answers, crafted by domain experts in physics, chemistry, and biology. And there's MATH, mathematical reasoning in algebra, geometry, and probability, designed for multi-step quantitative reasoning. You will also sometimes see a domain-specific fine-tuned model create its own evaluation metric, where they say "our fine-tuned model is state of the art in, say, legal, and we carefully crafted this dataset." You occasionally see that.
[01:03:12 - 01:04:05] Generally speaking, there are different evaluation types. First, static benchmarks: when models come out, you'll see them mention MMLU, AIME, and a few others. These generally measure general knowledge, reasoning, and zero- to few-shot skills. They report accuracy, and a lot of the time they report Elo ratings. Elo comes from chess: it's a rating that captures how well a model is really doing while taking the difficulty of its opponents into account, so sometimes people refer to Elo scores on these leaderboards as well.
[01:04:06 - 01:04:17] And the equivalent, you could argue, is the unit test. Then you have live benchmarks, which look at real-time model behavior on streaming inputs.
[01:04:18 - 01:06:28] You have LiveBench, EQ-Bench, and benchmarks like that, and the metrics there are win rate, throughput, and latency. This is equivalent to production monitoring: logs and metrics during live traffic. Then you have domain-specific suites, like the coding benchmarks and Aider Polyglot; the equivalent of those is integration tests, specific workload validation, and the metrics are pass@k correctness and win rate. Then there are the math and STEM benchmarks, aimed at reasoning and symbolic manipulation, which measure math and logic performance. Then you have human evaluation, where you look at user preferences, things like Chatbot Arena; that's equivalent to feature flags and telemetry, where you compare variants on real live traffic. We put these analogies here so you can map each benchmark type to something familiar. There are also the AGI-style benchmarks, which are complex, multi-step reasoning tasks; people want to see how far we are from generalized intelligence. Human evaluation is centered around manual, crowd-sourced, subjective quality judgments and preferences. Then you have statistical metrics like perplexity, which is how well your model predicts the held-out text, and BLEU and ROUGE, which are overlap metrics comparing generations against references. So these are the different analogies: AGI-style benchmarks are where you push the system to the extreme, human evaluation is crowd-sourced spot checks, and perplexity, BLEU, and ROUGE are quick health checks, like code coverage. So that's the general idea. Now, here we have different examples of multi-agent demonstrations.
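Since pass@k comes up a lot for the coding suites, here is a small sketch of the commonly used unbiased estimator, assuming n sampled completions per problem of which c pass the tests; this is the generic formula, not tied to any specific benchmark harness:

```python
# Unbiased pass@k estimator: probability that at least one of k sampled
# completions passes, given n samples per problem with c passing.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, 3 of them passing.
print(pass_at_k(10, 3, 1))  # ~0.30
print(pass_at_k(10, 3, 5))  # ~0.92, since any of 5 tries may pass
```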
[01:06:29 - 01:07:36] So this is a demonstration of the design patterns that we talked about. The first is the reflection pattern, which shows how an AI agent can use reflection to improve its response. You import the relevant libraries here, and here is where you run the reflection demo: you get the model name from the environment or use the default model, and then you create a reflection agent. The reflection agent is defined here; it implements the reflection pattern for self-critique and iterative refinement. So we're just going to go through how to use it as an SDK.
[01:07:37 - 01:08:07] You have the evaluation criteria here, and the criteria are more like a rubric: accuracy, completeness, clarity, depth, length, and engagement. It's like an English rubric, where it not only names each item but also defines what the rubric means and how it's applied.
[01:08:08 - 01:09:32] And here are the prompts: explain how photosynthesis works to a 10-year-old, what is machine learning, describe the water cycle. It generates the response, reflects on it, and goes through the reflection process using the ReflectionAgent class here. The ReflectionAgent does self-critique and iterative refinement: it takes a function that accepts a prompt and returns a response from an LLM, a maximum number of iterations, and a template for the reflection process, which is what we added here. This is the default reflection template: you are an expert evaluator analyzing responses for quality; here is the response to evaluate; provide a detailed critique of the response based on these criteria; then rate the response. It evaluates against those criteria, generates the reflection, and then produces the refined response, using the reflection at each round of refinement.
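For reference, here is a minimal sketch of that reflection loop, assuming only a `generate_fn` callable that maps a prompt string to model text; the class name, template wording, and defaults are illustrative rather than the course's exact implementation:

```python
# Sketch of the reflection pattern: generate, critique against a rubric,
# then regenerate using the critique, for a fixed number of iterations.

REFLECTION_TEMPLATE = """You are an expert evaluator analyzing responses for quality.
Response to evaluate:
{response}

Critique the response against these criteria: {criteria}.
Then suggest concrete improvements."""

class SimpleReflectionAgent:
    def __init__(self, generate_fn, criteria, max_iterations=2):
        self.generate = generate_fn          # prompt -> response text
        self.criteria = criteria             # rubric items, e.g. ["accuracy", "clarity"]
        self.max_iterations = max_iterations

    def run(self, prompt: str) -> str:
        response = self.generate(prompt)
        for _ in range(self.max_iterations):
            critique = self.generate(
                REFLECTION_TEMPLATE.format(
                    response=response, criteria=", ".join(self.criteria))
            )
            # Feed the critique back in and ask for a refined answer.
            response = self.generate(
                f"Original prompt: {prompt}\n"
                f"Previous response: {response}\n"
                f"Critique: {critique}\n"
                f"Rewrite the response, addressing the critique."
            )
        return response
```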
[01:09:33 - 01:09:49] And then, let's see. Here is an example where you're doing tool use.
[01:09:50 - 01:11:04] You define real tools. For example, you want to convert a location name to geographic coordinates using the Open-Meteo geocoding API, and you have the parameters: the name, the count, the language, and the format. Here is the API call; you try to read the results, return early if there are none, and otherwise return the items. So that's really a helper function for fetching the results. Then there's a weather helper function that gets the current weather and forecast from the API. You already know most of this, how to call an API; if you look at it, it goes into detail and specifies the parameters, the weather codes, and so on. There's also a helper function for evaluating a mathematical expression, and one that extracts the location from a weather query using prompt engineering.
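As a rough sketch of what those two helpers might look like, here is a version written against the publicly documented Open-Meteo endpoints; the exact parameters and error handling in the course code may differ:

```python
# Sketch of geocoding + weather helpers using the public Open-Meteo APIs.
# Parameter names follow Open-Meteo's documentation; details are illustrative.
import requests

def geocode(location: str) -> dict | None:
    """Convert a place name to latitude/longitude via Open-Meteo geocoding."""
    resp = requests.get(
        "https://geocoding-api.open-meteo.com/v1/search",
        params={"name": location, "count": 1, "language": "en", "format": "json"},
        timeout=10,
    )
    results = resp.json().get("results")
    return results[0] if results else None

def current_weather(location: str) -> dict | None:
    """Fetch current conditions for a location name."""
    place = geocode(location)
    if place is None:
        return None
    resp = requests.get(
        "https://api.open-meteo.com/v1/forecast",
        params={"latitude": place["latitude"], "longitude": place["longitude"],
                "current_weather": True},
        timeout=10,
    )
    return resp.json().get("current_weather")

print(current_weather("New York"))
```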
[01:11:05 - 01:14:10] But here is where we actually use the tool. You define the tool here and wrap it in a Tool class. Then you have a tool-using agent with a custom template: you are an AI assistant with access to the following tools; given the user query, determine whether tools are needed, and format your response as the tool name, the reasoning for why you selected that tool, and the parameters for the tool in valid JSON format. Then, based on the tool results, provide a comprehensive and helpful response to the original query; for weather queries, include the current temperature and any forecast available. So now you have a tool-use agent: you pass in the generation function as an anonymous lambda, you add the template, you add all the tools, you add the tool-use template, and then you have a response-synthesis template. So this is an example of how to use a tool. The simple way, obviously, is to just add the tool description to a prompt, but this is a more rigorous way of dealing with it. Then you have sample queries here, like "What is the weather like in New York?", and you process the query: it determines the query type, selects the appropriate tool, extracts the location using the location-extraction tool, gets the weather data with the weather tool, and then generates and synthesizes the response. It returns everything: the response and the tools that were used. If it detects a calculation query, it goes and uses the calculation tool instead. So it's just different kinds of tool use, and it's using the classes here: the Tool wrapper represents an external tool that can be used by the AI agent, and the tool-using agent wraps a function to do this. Then there's a planning agent. A planning agent is one that has a plan and executes the plan. Let me just go to the bottom first. Here is where you run the planning demo: you run goals against the model, and in this example the goal is to write a blog post about the benefits of meditation for stress reduction.
[01:14:11 - 01:15:00] So you generate the plan: create a step-by-step plan to accomplish this task, format it as "Step X:" where X is the step number, and include five to seven specific steps that result in the blog post. Then it generates the plan text, parses the steps, executes each step, and combines the results into a final output. This is the part where it generates the actual plan and prints it in a readable format. So now you have the plan, and each step has a status: is it pending, in progress, completed, failed, or skipped? And then you have the dependencies, the current status, and whether it failed or not.
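Here is a minimal sketch of how a plan like that might be represented and parsed; the dataclass, the enum values, and the "Step N:" format mirror what was just described, but the names themselves are illustrative:

```python
# Sketch of a plan representation: each step has a status and optional
# dependencies, and a parser turns "Step N: ..." lines into step objects.
from dataclasses import dataclass, field
from enum import Enum

class StepStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"

@dataclass
class PlanStep:
    number: int
    description: str
    status: StepStatus = StepStatus.PENDING
    dependencies: list[int] = field(default_factory=list)
    result: str | None = None

def parse_plan(plan_text: str) -> list[PlanStep]:
    """Parse lines of the form 'Step N: ...' into PlanStep objects."""
    steps = []
    for line in plan_text.splitlines():
        line = line.strip()
        if line.lower().startswith("step"):
            head, _, desc = line.partition(":")
            digits = "".join(ch for ch in head if ch.isdigit())
            number = int(digits) if digits else 0
            steps.append(PlanStep(number=number, description=desc.strip()))
    return steps
```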
[01:15:01 - 01:15:48] Then you go and execute the plan, which is creating the blog post or whatever it is you want to do. It executes the plan sequentially and then produces a final combined output. Here is a research specialist. Obviously this is not the deep-research implementation that you have in OpenAI, but it's an okay research specialist: a research agent designed to gather information. I define a few topics that need research, and the instructions are to provide comprehensive information and cite sources when available. Then there's a writing agent, specialized in writing and content creation.
[01:15:49 - 01:17:27] There's a critic agent, specialized in providing feedback and critique, with criteria for the quality of the critique. And there's a coordinator agent, specialized in coordinating the work of the other agents. Here is where you execute all of this in a sequential pattern: you get the model name, define an anonymous function to generate text with the model, create the coordinator agent, the researcher agent, the writing agent, and the critic agent, and then use the MultiAgentSystem class. The example task is the benefits of exercise on mental health. It goes through and runs the multi-agent collaboration: first it sends a message to the coordinator, then it runs the conversation with a maximum of 10 turns, extracts the final content, and prints the final conversation summary. You can use this to generate blog posts; you can use it for different things. Yeah, I guess that's it. Did you guys have any questions?
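For the notes, here is a minimal sketch of that sequential multi-agent collaboration, assuming only a prompt-to-text callable; the class names and role prompts are illustrative stand-ins, and the coordinator is simplified to a round-robin over the roles:

```python
# Sketch of sequential multi-agent collaboration: role agents take turns
# transforming the running message until the turn budget is exhausted.
class RoleAgent:
    def __init__(self, generate_fn, system_prompt):
        self.generate = generate_fn
        self.system_prompt = system_prompt

    def respond(self, message: str) -> str:
        return self.generate(f"{self.system_prompt}\n\n{message}")

class MultiAgentSystem:
    def __init__(self, agents, max_turns=10):
        self.agents = agents          # ordered roles, e.g. researcher, writer, critic
        self.max_turns = max_turns

    def run(self, task: str) -> str:
        message = task
        for turn in range(self.max_turns):
            agent = self.agents[turn % len(self.agents)]
            message = agent.respond(message)
        return message                 # final content after the last turn

# Usage (assuming `llm` is a prompt -> text callable):
# researcher = RoleAgent(llm, "You are a research specialist. Cite sources.")
# writer = RoleAgent(llm, "You are a writing specialist.")
# critic = RoleAgent(llm, "You are a critic. Give concrete feedback.")
# system = MultiAgentSystem([researcher, writer, critic], max_turns=6)
# print(system.run("The benefits of exercise on mental health"))
```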
[01:17:28 - 01:18:11] In the tool-use example, is it just checking for keywords, like "calculate," to select the right tool? Yeah, it's selecting the right tool. The LLM decides to use a particular tool, and it knows what the tool is by effectively matching against the registry, so to speak. There's a very basic implementation of the registry here, where it just matches things at that level. You can implement a more complex, semantic understanding of the parameters, the documentation, and other details by putting all of those things into the registry. What we use here is a somewhat simpler example.
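To make that concrete, here is a rough sketch of a registry-based version where the model itself picks the tool; the prompt format, class names, and the assumption that the model returns valid JSON are all illustrative:

```python
# Sketch of the tool-use pattern: tools are registered with a name,
# a description, and a callable, and the LLM is prompted to pick one
# and emit JSON parameters.
import json

class Tool:
    def __init__(self, name, description, fn):
        self.name, self.description, self.fn = name, description, fn

class ToolUseAgent:
    def __init__(self, generate_fn, tools):
        self.generate = generate_fn                      # prompt -> text
        self.registry = {t.name: t for t in tools}       # simple tool registry

    def run(self, query: str) -> str:
        tool_list = "\n".join(f"- {t.name}: {t.description}"
                              for t in self.registry.values())
        selection = self.generate(
            f"You have access to these tools:\n{tool_list}\n"
            f"User query: {query}\n"
            'Reply with JSON only: {"tool": "<name>", "parameters": {...}, "reasoning": "<why>"}'
        )
        choice = json.loads(selection)                   # assumes valid JSON back
        tool = self.registry[choice["tool"]]
        result = tool.fn(**choice["parameters"])
        # Synthesize a final answer from the raw tool output.
        return self.generate(
            f"Query: {query}\nTool result: {result}\n"
            f"Write a helpful answer using the tool result."
        )
```

A production version would validate the JSON, retry on parse failures, and put richer documentation for each tool into the registry.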
[01:18:12 - 01:19:29] What about libraries for the agents? I tried using LangChain and LangGraph, but this doesn't seem to be using those; is this custom? Oh, yeah, it's a custom implementation. Hopefully Marianne has uploaded the code; it should be available to you. The idea is that some of these middleware frameworks, like LangChain and others, have become so bloated that you don't know what's happening under the hood. And, as I mentioned, the field is so new that there isn't necessarily a best practice for any of this. So when James asks, "What's the right approach for tool use?", my answer is: it depends on your evaluation. If your evaluation says your tool use works, that the agent can use the tool accurately, say, 90% of the time, you don't have to go into RAG or feed in the full documentation or add anything more. If it's not working and has issues, then you will have to reach for additional techniques.
[01:19:30 - 01:20:12] OpenAI's implementation, for example, uses a web browser, and it uses a reinforcement-learning-trained agent to browse within that browser. That's a much more advanced form of tool use. So you can go as far as you want with this. That's why it's important to understand what's inside, because people will say, "This is an agent framework," and you should ask: what is it really doing inside? How does the registry work? What information is the LLM aware of? Is it just aware of the API call and the parameters, or is it more than that?
[01:20:13 - 01:20:32] So, yeah. Yeah, there are a lot of libraries out there: A2A from Google, and then MCP. I find MCP much easier to use than LangChain and LangGraph. It's like Betamax versus VHS; we'll see which one wins out.
[01:20:33 - 01:21:22] Yeah. Yeah, MCP is gaining traction right now. People like it because other people are creating MCP servers and agents, so you don't have to create every single integration yourself. That's what people really like about it. And so, yeah. Percy asks: how do you change the main model with the agent, in a query that may or may not require the agent? Wait, how do you change the main model with the agent in a query that may or may not require it? I'm a little confused by the question, Percy. Any clarification, Percy?
[01:21:23 - 01:22:32] "You change the main model with the agent in a query that may or may not require it" — I'm a little confused, Percy. I can't speak for Percy, but I had a similar question, which is: how do you define a graph or a flow, or even a DAG, as was mentioned? You mentioned multiple agents chained together; how would you even define that, at least with the code you showed, since you said it's an internal framework or custom code? Yeah, yes. So you would have to define your own business process. You would define your own directed graph. For example, if you're parsing out a compliance flow, you would turn that into a business process where you clearly know what the steps are. In the planning example we have a very basic version of this, where it goes through a status, here. Let me just re-show you.
[01:22:33 - 01:22:56] While you're pulling that up: are you running this with the lambda? Yeah, you can. This is not as sophisticated as the fine-tuning setup; you can run it as a lightweight script. You can also convert it into a Jupyter notebook.
[01:22:57 - 01:24:26] So this is an example of a status, and it's a very simple one; in an enterprise, the flow would be much more complex, which is why I mentioned a directed graph. You would have the flow represented in a graph database, and the agent would follow it. At each node you would specify whether that node is important or not, and so forth, and then, depending on the decision made at that node, you would progress to the next node. So, for example, if we were to run this locally, we could just run the planning_demo file. Yeah, because it has the results from the run. It has a main guard here, so any time you do "python" followed by the file, it will execute everything. So, yeah.
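To make the directed-graph idea concrete, here is a rough sketch of a flow represented as nodes with decision-labeled edges; in practice this might live in a graph database, and the decision at each node could come from an LLM rather than a simple rule. All names here are illustrative:

```python
# Sketch of a business process as a directed graph: each node runs an action
# that returns a decision, and the decision selects the outgoing edge.
class FlowNode:
    def __init__(self, name, action_fn, edges=None):
        self.name = name
        self.action = action_fn            # state -> decision label (or None)
        self.edges = edges or {}           # decision label -> next node name

def run_flow(nodes: dict, start: str, state: dict):
    current = start
    while current is not None:
        node = nodes[current]
        decision = node.action(state)      # e.g. ask an LLM, or check a field
        current = node.edges.get(decision) # missing edge terminates the flow
    return state

# Toy compliance flow that routes to review or approval based on a risk score.
nodes = {
    "check": FlowNode("check",
                      lambda s: "flagged" if s.get("risk", 0) > 5 else "clean",
                      {"flagged": "review", "clean": "approve"}),
    "review": FlowNode("review", lambda s: s.update(status="needs review"), {}),
    "approve": FlowNode("approve", lambda s: s.update(status="approved"), {}),
}
print(run_flow(nodes, "check", {"risk": 2}))
```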
[01:24:27 - 01:25:40] Percy says: let's say the query does not match any tool, and I want to use the fine-tuned model's capability, but for other tasks the question may require the AI agent and its tools. So how do we route it? Okay, so Percy is saying: you have an industry foundational model, fine-tuned for domain-specific capability, and sometimes you want it to do deep research or go out to the browser instead. How do we route the query given by the user, to either the fine-tuned model or the AI agent? It's based on a decision. You put a router earlier in the pipeline, effectively a planning or decision step. A lot of times, if you look at the sequence of things we've been talking about, it's planning, then decision, then execution; you would put in an additional step that is aware of both your AI agent and your fine-tuned model, and then you define the criteria for when to route to which. Any other questions?
[01:25:41 - 01:28:21] But that sounds like you're going to build an orchestrator, right? You would tell the AI: if you see this, execute that tool; if it's anything else, do the last one; and if it's none of them, come back and say, "I don't understand what you're asking me to do." Yeah, but in this case you would do more of a rubric-based thing rather than a simple if-then. You could argue that for a router you could even use something like this rubric, right? Accuracy: is the information correct and factual? So if you have, in the back end, a fine-tuned model versus a tool-use agent versus a fine-tuned model using a tool, you can define whatever the criteria are, let it reason about the query, and then decide to route it either to the fine-tuned model or to the agent. You could argue that the fine-tuned model is almost like a tool itself. If you notice what we have here: we define the tools, like get-weather-by-location, using these classes, and you can define them fairly arbitrarily. You can have a tool be an abstraction over your fine-tuned model, or over, I don't know, a web browser, wget, Puppeteer, whatever, or over another system entirely. So, to answer your question, Percy: you would effectively modify the reflection agent to act more like a router, and then use the fine-tuned model as almost another tool — a composite setup that decides between your fine-tuned model and the other tools. Yeah.
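Here is a minimal sketch of that routing idea, treating the fine-tuned model as one of several handlers; the route names, prompt wording, and fallback behavior are assumptions for illustration only:

```python
# Sketch of an LLM-based router: the model picks a route name, and the query
# is dispatched to the matching handler (fine-tuned model, agent, or tool).
ROUTER_PROMPT = """You are a router. Given the user query, pick one route:
- "fine_tuned": domain questions the fine-tuned model answers well
- "web_agent": questions that need browsing or fresh external data
- "calculator": arithmetic or quantitative computations
Reply with only the route name.

Query: {query}"""

def route_query(query: str, generate_fn, routes: dict):
    """Ask the LLM which backend should handle the query, then dispatch."""
    choice = generate_fn(ROUTER_PROMPT.format(query=query)).strip().lower()
    handler = routes.get(choice, routes["fine_tuned"])   # default fallback
    return handler(query)

# Usage (assuming these callables exist in your project):
# routes = {"fine_tuned": fine_tuned_model, "web_agent": web_agent.run,
#           "calculator": calculator_tool}
# answer = route_query("What is the weather in New York today?", llm, routes)
```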
[01:28:22 - 01:29:28] And remember, one purpose of the course is that you're able to keep track of AI news, and a lot of the time the news refers to model architecture and model evaluation. For example, one reasoning model came out recently, I think last week or so, the latest one from Alibaba, and it's a very good reasoning model; if you go on the website, it talks about the benchmarks it's good at and how it compares to other models on those metrics. So now you know why people talk about evaluation: models are designed for different use cases, and depending on your specific use case, anti-hallucination, data computation, math, legal text, whatever it is, you can pick different models for fine-tuning. That's also the reason we talk about the architecture, the internals of the foundation models: every time these labs come out with something, they talk about their model architecture and its capabilities. Any other questions? Yeah, and part of the reason we wrote all this custom code is so you can see the internals and how to build it yourself, because LangChain and LlamaIndex abstract it all away, and some of those codebases can get quite large once you pull in their SDKs and metrics. Anyway, that's it. This lecture was a bit long. Any other questions?
[01:29:29 - 01:30:16] All right, I do have one last question, maybe a dumb one. How would you define the entry point for using multiple agents? Sorry, I didn't catch that; the entry point for what? How do I decide that my problem statement requires a multi-agent approach? From my perspective I don't fully get it. I hear about multi-agents all the time, and it sounds really cool and all that. So, for example, I'll give you a concrete example.
[01:30:17 - 01:30:58] Like my project: I realized that if I convert any given audio to MIDI, all of a sudden the problem becomes considerably smaller, and if you convert it to a guitar tab, that covers a lot of the transcription task. So the question is, if I had to fit an agent into that, it doesn't really seem necessary; it might be more like a rule-based graph. So that's all. Yeah. For example, we built a blog-post-generation system internally at Newline, and one of the problems we ran into was that the LLM would artificially limit the word count.
[01:30:59 - 01:31:52] You also have the problem that, when you're trying to get information, the model is fine-tuned by default in a way that doesn't give you the most interesting information. So in order to create a longer blog post that's full of interesting content, you have to take the longer task and decompose it into smaller tasks, and then summarize the more interesting findings back upward. That's just one example. Another example of a multi-agent task is trip planning. Trip planning requires an agent to go to Yelp, go to Google Flights, go to hotel sites, and go research fun things to do at the given location.
[01:31:53 - 01:32:36] So think about what a travel agent actually does. There are examples online where people are doing their trip planning with these MCP agents, with multiple MCP servers being used and executed together. Those are all real-life examples of multi-agent orchestration. Thanks. Thank you. Any other questions?
[01:32:37 - 01:32:58] Okay. All right. Thank you, guys. I'll see you next time.
