How to Build Agentic AI 2 (with frameworks) [Agents]

A discussion on the best practices required to Build Agentic Systems

Sep 12, 2025

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction. We run on a “pay what you can” model—so if you believe in the mission, there’s likely a plan that fits (over here).

Every subscription helps me stay independent, avoid clickbait, and focus on depth over noise, and I deeply appreciate everyone who chooses to support our cult.


PS – Supporting this work doesn’t have to come out of your pocket. If you read this as part of your professional development, you can use this email template to request reimbursement for your subscription.

Every month, the Chocolate Milk Cult reaches over a million Builders, Investors, Policy Makers, Leaders, and more. If you’d like to meet other members of our community, please fill out this contact form here (I will never sell your data nor will I make intros w/o your explicit permission)- https://forms.gle/Pi1pGLuS1FmzXoLr6

Recently, I’ve seen a lot of noise around Nvidia’s paper on Small Language Models being the future of agents. While the premise isn’t incorrect (we’ve been on the SLM train for a while), I think many people have misunderstood the implications of the research. This misunderstanding comes from a deeper misinterpretation of Agentic Architectures and how they operate/what it takes to actually build them.

While I will be doing a deeper dive into the above soon, I’ve decided to reshare our previously published guide to building Agentic AI. We shared this a while back, and I think the ideas discussed there will serve as an essential foundation to our discussion about SLMs and their role in the Agentic Economy. For this piece, I’ve taken the older guide and updated it with some images, decision making frameworks, and specific case studies that would be useful to account for when building Agents. Happy reading <3

Agentic AI has become one of the most discussed — and most misunderstood — topics in the AI space. Influencers pushing content and startups selling complexity have flooded the term with noise. As a result, most people talking about agents are either guessing or bluffing. This article is a practical guide for founders and investors who want to cut through that noise.

We’ll cover some technical ideas, but no prior knowledge is assumed. The focus isn’t implementation or algorithms- it’s clear decision-making. Whether you’re designing workflows, funding teams, or planning a product roadmap, the goal is to give you a clearer lens on how Agents are built and what matters when building reliable, scalable agentic systems.

To do so, I will build on Anthropic’s publication, “Building effective agents”, by augmenting their findings with my own, drawn from building these systems at scale and from hundreds of conversations with real builders across the community. If you want to talk more about them, shoot me a message at devansh@svam.com.

A preview of some of the ideas explored today.

As always, before we get into the technical details of this piece, here is my analysis of Agentic AI, from various perspectives. I would especially direct your attention to the Investor and Policy Maker sections (a reminder- please use a big screen for the best reading experience)-

Let me know how you think we should address the possible inequality gap.

Executive Highlights (TL;DR of the article)

The folks at Anthropic were nice enough to give us a summary of their article, so I’m going to quote it as is.

“Success in the LLM space isn’t about building the most sophisticated system. It’s about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

When implementing agents, we try to follow three core principles:

Maintain simplicity in your agent’s design.
Prioritize transparency by explicitly showing the agent’s planning steps.
Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.

Frameworks can help you get started quickly, but don’t hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.”

Here are some of my extended notes on these principles-

Simplicity First: The Power of “Just Enough”

When it comes to ML Engineering, just as with love, the most effective advice I can probably give you is to lower your standards (if you were really worth your high standards, you wouldn’t need dating advice from internet strangers).

It’s tempting to start exploring multi-agent systems, given the marketing of how powerful these systems are expected to be. However, as with most complex software, this can lead to a mess of undebuggable code that does not work.

Instead of overengineering a system, keep your first LLM Agent extremely minimal by augmenting your base LLM with the following:

● Retrieval: The ability to access external knowledge bases.

● Tools: The capacity to interact with external services (APIs, databases, etc). A lot of people want to use LLMs for very simple tasks (such as arithmetic, chaining instructions, simpler ML/AI-based scoring, etc), which would be better left to external functions.

● Memory: The skill to retain information across interactions.

This will speed up the velocity of your iterations (always easier to build up than refit), drop your costs, and make debugging much easier. It’s also a good way to build more trust with your users (since they’ll be exposed to fewer errors). For a practical demonstration of this concept, take a look at this comparison of Devin and Cursor. Cursor does fewer things (it’s closer to the above architecture) but is still preferred since it has fewer errors, higher speed, and better user experience than Devin (based on a more powerful multi-agentic, A2-style architecture to enable more functionalities). Simplicity has many benefits, both external (with users) and internal (for your own teams)-

Within this research setting, we found that differences in architectural complexity could account for 50% drops in productivity, three-fold increases in defect density, and order-of-magnitude increases in staff turnover.
- How bad is Architectural Complexity

Don’t be scared to put out a simple Agent system that handles a narrower scope. If you design the interactions well, it will still help your users solve their problems, which is ultimately why Tech is built.
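To make the "just enough" setup concrete, here is a minimal sketch of a base LLM augmented with retrieval, tools, and memory. Everything in it is an assumption for illustration: `fake_llm` stands in for a real model call, the keyword-overlap retrieval stands in for a real search index, and the `calc:` prefix is a made-up convention for sending deterministic work to a plain function.

```python
from dataclasses import dataclass, field
from typing import Callable

def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a model call; swap in your provider's client.
    return f"[model answer based on {len(prompt)} chars of prompt]"

def add_numbers(expr: str) -> str:
    # Tiny tool: sums whitespace-separated numbers so the model never has to.
    return str(sum(float(x) for x in expr.split()))

@dataclass
class AugmentedLLM:
    llm: Callable[[str], str]
    documents: list[str]                             # retrieval: external knowledge
    tools: dict[str, Callable[[str], str]]           # tools: plain functions
    memory: list[str] = field(default_factory=list)  # memory: past turns

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Toy keyword-overlap ranking; a real system would use a search index.
        words = set(query.lower().split())
        ranked = sorted(self.documents,
                        key=lambda d: len(words & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def ask(self, query: str) -> str:
        if query.startswith("calc:"):
            # Deterministic work goes to a tool, not the model.
            answer = self.tools["calc"](query[len("calc:"):])
        else:
            context = "\n".join(self.retrieve(query))
            history = "\n".join(self.memory[-4:])
            answer = self.llm(f"{history}\n{context}\n{query}")
        self.memory.append(f"Q: {query} | A: {answer}")
        return answer

agent = AugmentedLLM(fake_llm, ["Form 1003 is a loan application."], {"calc": add_numbers})
print(agent.ask("calc: 2 3.5"))
```

The whole thing is one class and two functions, which is the point: every piece can be tested and swapped on its own before you reach for anything multi-agent.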

Simplicity also helps us attain the next principle-

Transparency: Shedding Light on the AI’s Reasoning

Somewhat ironic that Anthropic put this as a principle, given how little LLM companies actually do for this, but I’m not going to throw shade here.

Trust is a critical factor in any AI application. If a system makes a decision without any explanation, users are unlikely to trust it, which means less usage, harder future improvements, etc. That is why transparency is a critical component of an effective agentic system.

● Debuggability: When something goes wrong, a transparent system allows for quick diagnosis and course correction (a luxury not afforded to more black-box systems).

● User Trust: Openly demonstrating the reasoning of an agent encourages adoption and reduces user skepticism.

● Ethical Considerations: A transparent agent is also an ethical agent since it can be easily scrutinized and modified as necessary.

The trust aspect is one of the underappreciated reasons why RAG systems have picked up so much steam and toppled more black box techniques like Fine Tuning and Prompt Engineering as the primary approach (another prediction that aged extremely well)-

Source. I’ve been advocating for RAG since 2022, ever since I saw Meta’s Sphere. I’m shocked Meta never tried to monetize it, even after the early ChatGPT hype. If someone can tell me why, I’d love to understand it.

For a practical example of transparency being a good sell- the users of IQIDIS have loved the document citing functionality, which allows them to easily verify any claims/arguments-

Image Caption- Market Estimates are in the table (confirm for yourself here). I’m using a non-legal Document + example for confidentiality reasons, but if you’d like a demo for your Legal Documents, contact IQIDIS.

Certain users also get access to the underlying embeddings and representations to allow them to see how the AI views the relationships between various entities in a given set of case documents. Below is an example for the Fannie Mae Regulations, showing how our AI understands the relationship between Desktop Underwriter and Form 1003 (the users are able to edit the connections/relationships, allowing their AI to customize its outputs based on their unique knowledge).

By letting our users both verify and edit each step of the AI process, we let them make the AI adjust to their knowledge and insight, instead of asking them to change for the tool. It also reduces the chances of an AI making a sneaky mistake that’s hard to spot because every step of the process is laid bare.

Both these features have been huge sellers for IQIDIS, since this allows our users to understand why the AI came up with a certain argument/claim, letting them trust the product more (a lot of the productivity gains from Gen AI are wiped out because the users spend more time verifying/debugging outputs).

Another huge sell for our system has been the thinking logs that users can use to understand why our system comes up with its answers and what we can do about it.

All of this is tied together by the final principle: good design.

The Agent-Computer Interface (ACI): The Unsung Hero

Human-computer interfaces (HCIs) are an essential component of software. It’s important to apply the same degree of attention to how the AI agent interacts with its tools. It’s not enough for an agent to be capable; the agent needs to use those capabilities reliably. We call the design of systems that ensure that AI Agents can effectively interact with their environments ACI.

● Tools are Critical: Tools are not a “nice to have” but a core aspect of effective systems.

● Well-Defined Tools: Tool definitions should be as comprehensive as possible, clearly specifying input and output formats, edge cases, and boundaries, and have clear examples of use. This guides the AI to better neighborhoods, leading to fewer errors.

● Thorough Documentation: Same point as above, documentation helps your Agents pick tools better. It also provides great feedback for future improvements.

● Iterative Testing: Agents can get caught up in huge loops of tool calls. It’s important to test each tool both to ensure it gets used when it should and doesn’t when it shouldn’t.

● Poka-Yoke your tools: The article’s suggestion to “poka-yoke” (mistake-proof by adding constraints) your tools is fantastic. Tools should be designed so that the agent cannot easily misuse them.
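As a rough illustration of the bullets above, here is a sketch of a tool whose definition documents its own boundaries and whose inputs are validated before anything runs. The tool name `read_file`, the `ToolSpec` structure, and the specific constraints (absolute paths only, 1 to 100 lines) are all hypothetical choices for this example, not anyone's actual API.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str   # what the tool does, including edge cases and limits
    example: str       # one concrete, valid call for the model to imitate

READ_FILE = ToolSpec(
    name="read_file",
    description="Return the text of a file. `path` must be absolute; "
                "relative paths are rejected. `max_lines` must be 1-100.",
    example='read_file(path="/repo/src/app.py", max_lines=50)',
)

def validate_read_file(path: str, max_lines: int) -> list[str]:
    """Return a list of problems; an empty list means the call is allowed."""
    problems = []
    if not PurePosixPath(path).is_absolute():
        problems.append(f"path must be absolute, got {path!r}")
    if not 1 <= max_lines <= 100:
        problems.append(f"max_lines must be between 1 and 100, got {max_lines}")
    return problems

# A bad call is rejected with messages the agent can act on.
print(validate_read_file("src/app.py", 500))
```

Returning every problem at once (instead of failing on the first) gives the agent enough feedback to fix the call in a single retry.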

Since the authors mentioned SWE-Bench, I think it’s good to use that as an example of this principle. The main section of ACI will cover how ACI enables high performance on SWE-Bench.

“ACI has been achieved internally: How to solve complex software engineering tasks with AI-Agents”

Besides this, another very important thing I found was the decomposition of Agentic AI Systems into different kinds of systems. Firstly, we have the workflow-based Agentic Systems, which send LLMs, their tools, and the other components of their systems through well-defined processes. Then we have Agent-Based Agentic AI Systems, which give LLMs a lot more freedom in when/how to call tools. These are good for different things-

“When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.”

Personally, I thought this was the most important idea they discussed. The article only gives a few sentences of attention to this concept, so we will start by exploring it in a lot more detail, with my thoughts on how these differing subtypes are better for different kinds of development. This is why the main section of our deep-dive will start here.

The main section will cover-

What is Agentic AI.
Building on workflow-based Agentic AI vs. agent-based Agentic AI.

A deeper dive on how to build better Agent Computer Interfaces + the case study mentioned. A very special shoutout to Mradul Kanugo, Chief AI officer at Softude for sharing his insights on this (Bro is an absolute AI Fiend, so if you want to learn more about the field- reach out to him on LinkedIn over here or follow his Twitter @MradulKanugo).
Practical Strategies for composing Agentic Systems (different kinds of architectures you can apply).

Keep reading if this interests you.


I provide various consulting and advisory services. If you‘d like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.

So… what is Agentic AI

Agentic AI isn’t about personality, sentience, or any of the usual sci-fi fluff. It’s a systems design pattern. An agentic system is one that breaks down a complex task into structured, auditable steps and executes those steps with some degree of autonomy.

To quote our earlier exploration on my favorite model for AI Orchestration- “The way I see it- Agentic AI Systems decompose a user query into a bunch of mini-steps, which are handled by different components. Instead of relying on a main LLM (or another AI) to answer a complex user query in one shot, Agentic Systems would break down the query into simpler sub-routines that tackle problems better…

Agentic AI is a way to ensure your stuff works by making the AI take auditable steps that we can control and correct. This gives us better accuracy, fewer uncontrollable errors, and the ability to tack on more features.”

Caption- To reiterate Agents can have amazing ROI only if you know how to build them.

Anthropic further differentiates Agentic AI Systems into two subtypes of architectures-

● Workflows are systems where LLMs and tools are orchestrated through predefined code paths.

● Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks. I’m going to call this A2, to prevent confusion with Agentic AI the overarching design philosophy (which contains the workflow subtype as well).
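The difference between the two subtypes is easiest to see side by side. In this sketch (all names are made up for illustration), the workflow hard-codes the order of tool calls, while the A2 loop asks a policy what to do next. `scripted_policy` is a hypothetical stand-in for an LLM choosing the next action.

```python
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: f"summary of ({text})",
}

def run_workflow(query: str) -> str:
    # Workflow: the code path is fixed by the developer.
    return TOOLS["summarize"](TOOLS["search"](query))

def scripted_policy(state: str, step: int) -> str:
    # Hypothetical stand-in for an LLM picking the next action.
    return ["search", "summarize", "finish"][min(step, 2)]

def run_agent(query: str, policy: Callable[[str, int], str], max_steps: int = 5) -> str:
    # Agent (A2): the policy decides which tool to call and when to stop.
    state = query
    for step in range(max_steps):   # hard cap so a bad policy cannot loop forever
        action = policy(state, step)
        if action == "finish":
            break
        state = TOOLS[action](state)
    return state

print(run_workflow("agentic ai"))
print(run_agent("agentic ai", scripted_policy))
```

With a well-behaved policy both paths give the same answer. The interesting case is a policy that never says "finish": the workflow cannot hit it at all, and the agent only survives it because of `max_steps`.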

Workflows are system teams like prime Barca (or any of Pep’s teams)- where the vision and plays are laid out meticulously. Players do get creativity on the field, but it’s creativity to implement the vision already laid out for the team. At their best, they operate with extreme dominance and efficiency, since their pre-defined vision allows them to optimize every detail of execution. However, they run the risk of becoming stale and robotic, and often fall apart if a key component collapses (see Man City and Rodri).

A2, on the other hand, is like Don Carlo’s Real Madrid. We add in a lot of high quality components, define the loose overarching goal, and then let the components figure it out through the magical power of Friendship and Vibes. At its best- you get an extremely flexible system which can handle a wide set of challenges, and will often surpass your expectations (how many times does Peak Madrid pull off comebacks when nothing seems to be working). However, these systems also tend to collapse spectacularly, at which point diagnosing issues and fixing them can be much harder w/o rebuilding significant components (see again Madrid and their difficulties on the Wing).

To me, most good agentic AI systems lie on a spectrum that borrows principles from both these subtypes. Personally, I have a strong bias towards defining most of the processing workflow as rigorously as possible (80–90% of the processing is done through defined protocols), with the dynamic tool calling and creativity only being applied for last-mile delivery. This allows me to build systems that are more reliable and precise, which comes from my background in building performant ML Systems in low-resource, high-noise, and high-scale environments. The additional reliability and precision allow me to control the outputs and account for errors, ensuring-

Users get the outputs they want in the way they want (traditional LLMs and A2 are horrible for precise tasks like Precise Formatting, building case argumentation by following an elaborate strategy, etc.).

Points 2 and 3 are acknowledgments of how hard these simpler (but high precision) tasks can be for LLMs and why specialized setups are key for them.

Costs are kept very low.
The cost of failure can be minimized since checks are easier to build. Costs of A2 systems can spiral out of control if they get stuck in non-helpful loops or trigger a random error- a no-no in a consumer-facing SaaS platform.

Of course, this has its own drawbacks- mainly that you need a lot of domain knowledge to set up a workflow-oriented Agentic AI system effectively. IQIDIS pulls this off by being aggressively lawyer-led, where the CEO and advisors map out every possible decision/step taken by a lawyer in a case to identify the highest ROI steps. Some of our (SVAM’s) other clients do so by limiting their scope through constraining input, disclaimers, and other such practices.

From a solution development perspective, workflow-heavy agentic systems are best for vertical plays since they can pass on cost savings to their users while maintaining high performance. A2-based systems are better for horizontal plays since they are inherently more flexible. It’s not a coincidence that most frontier AI Research is biased towards A2 since it’s done by tech people, not domain experts, and thus focuses more on general applicability. By itself, this is fine, but too many AI teams building Vertical AI Products (AI for X) blindly copy the process of frontier AI Labs, not appreciating that they are playing a very different game.

Going back to Legal AI (a field I’m studying a lot now due to my work with IQIDIS)- this is why we haven’t had a great new Legal AI product despite the billions of dollars raised by startups (and incumbents). Based on my testing, the players seem to be following the wrong playbook for improving results, which leads to mediocre outputs and much higher cloud computing bills-

Source. With that amount of money, Legal should have been completely transformed. It hasn’t because the big players love setting money on fire by investing into the wrong approach.

This is likely also true for other fields like Finance and Medicine, but would love to hear your experiences with this.

We have multiple deep-dives and discussions on simplicity and the importance of transparency. So I’m going to skip those sections here to not bore you too much. Let’s discuss ACI, which is not something I touch on much.

Most people obsess over model quality, prompt tricks, or which framework to use. But in production systems, those aren’t the limiting factors. The biggest performance delta often comes from how well the agent interacts with its tools, and that comes down to the Agent-Computer Interface (ACI).

Think of ACI as the operating layer between the LLM and the real world:

● It defines what actions are available.

● It shapes how those actions are called and interpreted.

● And it governs what feedback the model gets back.

Get this wrong, and even the best model will fail in unpredictable, inefficient, or costly ways.

How to Build Effective ACI?

When it comes to building effective interfaces for agents, here are some key properties to consider:

Simplicity and clarity in actions: ACIs should prioritize actions that are straightforward and easy to understand. Rather than overwhelming agents with a plethora of options and complex documentation, commands should be concise and intuitive. This approach minimizes the need for extensive demonstrations or fine-tuning, enabling agents to utilize the interface effectively with ease.
Efficiency in operations: ACIs should aim to consolidate essential operations, such as file navigation and editing, into as few actions as possible. By designing efficient actions, agents can make significant progress toward their goals in a single step. It is crucial to avoid a design that requires composing multiple simple actions across several turns, as this can hinder the streamlining of higher-order operations. If you see a long composition consistently, write a routine that runs it in one command.
Informative environment feedback: High-quality feedback is vital for ACIs to provide agents with meaningful information about the current state of the environment and the effects of their recent actions. The feedback should be relevant and concise, avoiding unnecessary details. For instance, when an agent edits a file, updating them on the revised contents is beneficial for understanding the impact of their changes.
Guardrails to mitigate error propagation: Just like humans, language models can make mistakes when editing or searching. However, they often struggle to recover from these errors (going into unproductive loops is surprisingly common for LLM based systems). Implementing guardrails, such as a code syntax checker that automatically detects mistakes, can help prevent error propagation and assist agents in identifying and correcting issues promptly.

Let’s look at a case study to see how these principles can be implemented IRL-

How SWE-Agent Uses Effective ACIs to Get Things Done

SWE-Agent provides an intuitive interface for language models to act as software engineering agents, enabling them to efficiently search, navigate, edit, and execute code commands. The system is built on top of the Linux shell, granting access to common Linux commands and utilities. Let’s take a closer look at the components of the SWE-Agent interface.

Search and Navigation

In the typical Shell-only environment, language models often face challenges in finding the information they need. They may resort to using a series of “cd,” “ls,” and “cat” commands to explore the codebase, which can be highly inefficient and time-consuming. Even when they employ commands like “grep” or “find” to search for specific terms, they sometimes encounter an overwhelming amount of irrelevant results, making it difficult to locate the desired information.

SWE-Agent addresses this issue by introducing special commands such as “find_file,” “search_file,” and “search_dir.” These commands are designed to provide concise summaries of search results, greatly simplifying the process of locating the necessary files and content. The “find_file” command assists in searching for filenames within the repository, while “search_file” and “search_dir” allow for searching specific strings within a file or a subdirectory. To keep the search results manageable, SWE-Agent limits them to a maximum of 50 per query. If a search yields more than 50 results, the agent receives a friendly prompt to refine their query and be more specific. This approach prevents the language model from being overwhelmed with excessive information and enables it to identify the relevant content quickly.

The above is particularly important when we look at research like Diff Transformer, which explores how often Transformers overemphasize useless information.

Transformer often over-attends to irrelevant context (i.e., attention noise). DIFF Transformer amplifies attention to answer spans and cancels noise, enhancing the capability of context modeling.

This is one of the reasons I don’t really put too much weight on an LLM’s context window since they often fall apart when we start to hit higher token counts (context corruption is a real problem). The best way to reduce inaccuracy and increase generation quality for LLMs is to reduce the amount of information they have to look at.

Claude Sonnet 4, GPT-4.1, Qwen3-32B, and Gemini 2.5 Flash on Repeated Words Task — This phenomenon is called Context Rot, and it hits models for reasoning much harder than Needle in a Haystack tests show.
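To tie the capped, summarized search back to something you can run, here is a minimal sketch in the spirit of the search commands described in this section. It is not SWE-Agent’s actual implementation: the function body, the message wording, and the choice to count matches per file are my assumptions. Only the idea (summaries instead of raw dumps, plus a cap with a "please refine" message) comes from the description above.

```python
import os
import tempfile

MAX_RESULTS = 50  # the cap described above

def search_dir(term: str, root: str) -> str:
    """Summarise matches per file instead of dumping every matching line."""
    hits: dict[str, int] = {}
    for folder, _, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    count = sum(term in line for line in fh)
            except OSError:
                continue  # unreadable file: skip rather than crash the agent
            if count:
                hits[path] = count
    if len(hits) > MAX_RESULTS:
        # Too much output is worse than none: ask for a narrower query.
        return (f"More than {MAX_RESULTS} files matched '{term}'. "
                "Please narrow your search.")
    lines = [f"{path} ({n} matches)" for path, n in sorted(hits.items())]
    return f"Found {len(hits)} file(s) matching '{term}':\n" + "\n".join(lines)

# Small demo on a throwaway directory.
with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "app.py"), "w", encoding="utf-8") as fh:
        fh.write("import os\nprint(os.getcwd())\n")
    print(search_dir("os", repo))
```

The agent gets one line per file, which is usually enough to decide what to open next without flooding its context.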

File Viewer

Once the models have located the desired file, they can view its contents using the interactive file viewer by invoking the “open” command with the appropriate file path. The file viewer displays a window of at most 100 lines of the file at a time. The agent can navigate this window using the “scroll_down” and “scroll_up” commands or jump to a specific line using the “goto” command. To facilitate in-file navigation and code localization, the full path of the open file, the total number of lines, the number of lines omitted before and after the current window, and the line numbers are displayed.

The File Viewer plays a crucial role in a language agent’s ability to comprehend file content and make appropriate edits. In a Terminal-only setting, commands like “cat” and “printf” can easily inundate a language agent’s context window with an excessive amount of file content, most of which is typically irrelevant to the issue at hand. SWE-Agent’s File Viewer allows the agent to filter out distractions and focus on pertinent code snippets, which is essential for generating effective edits.
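A windowed viewer like the one described above takes very little code. The sketch below is my own approximation, not SWE-Agent’s source: the class name, the header wording, and the decision to center `goto` on the requested line are assumptions. The 100-line window, the scroll/goto commands, and the displayed metadata follow the description.

```python
class FileViewer:
    WINDOW = 100  # lines shown at a time, as described above

    def __init__(self, path: str, lines: list[str]):
        self.path, self.lines, self.top = path, lines, 0

    def _clamp(self, top: int) -> int:
        # Keep the window inside the file.
        return max(0, min(top, max(0, len(self.lines) - self.WINDOW)))

    def goto(self, line_no: int) -> str:
        # Centre the window on a 1-indexed line number.
        self.top = self._clamp(line_no - 1 - self.WINDOW // 2)
        return self.render()

    def scroll_down(self) -> str:
        self.top = self._clamp(self.top + self.WINDOW)
        return self.render()

    def scroll_up(self) -> str:
        self.top = self._clamp(self.top - self.WINDOW)
        return self.render()

    def render(self) -> str:
        end = min(self.top + self.WINDOW, len(self.lines))
        header = f"[File: {self.path} ({len(self.lines)} lines total)]"
        above = f"({self.top} more lines above)"
        below = f"({len(self.lines) - end} more lines below)"
        body = [f"{i + 1}:{self.lines[i]}" for i in range(self.top, end)]
        return "\n".join([header, above, *body, below])

viewer = FileViewer("/repo/app.py", [f"line {i}" for i in range(1, 251)])
print(viewer.goto(200).splitlines()[0])
```

Every render carries the path, total length, and how much is hidden above and below, so the agent always knows where it is without spending a turn asking.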

File Editor

SWE-Agent offers commands that enable models to create and edit files. The “edit” command works in conjunction with the file viewer, allowing agents to replace a specific range of lines in the open file. The “edit” command requires three arguments: the start line, end line, and replacement text. In a single step, agents can replace all lines between the start and end lines with the replacement text. After edits are applied, the file viewer automatically displays the updated content, enabling the agent to observe the effects of their edit immediately without the need to invoke additional commands.

SWE-Agent’s file editor is designed to streamline the editing process into a single command that facilitates easy multi-line edits with consistent feedback. In the Shell-only setting, editing options are restrictive and prone to errors, such as replacing entire files through redirection and overwriting or using utilities like “sed” for single-line or search-and-replace edits. These methods have significant drawbacks, including inefficiency, error-proneness, and lack of immediate feedback. Without SWE-Agent’s file editor interface, performance drops significantly.

To assist models in identifying format errors when editing files, a code linter is integrated into the edit function, alerting the model of any mistakes introduced during the editing process. Invalid edits are discarded, and the model is prompted to attempt editing the file again. This intervention significantly improves performance compared to the Shell-only and no-linting alternatives.
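Here is a minimal sketch of the edit-plus-guardrail idea for Python files. I’m using `ast.parse` as a cheap stand-in for a real linter (that substitution is mine, as are the function signature and the messages). The behavior mirrors the description: one command replaces a line range, and an edit that breaks the syntax is discarded with an explanation.

```python
import ast

def edit(lines: list[str], start: int, end: int, replacement: str) -> tuple[list[str], str]:
    """Replace 1-indexed lines start..end (inclusive); reject syntax-breaking edits."""
    if not 1 <= start <= end <= len(lines):
        return lines, f"Edit rejected: invalid range {start}-{end}."
    candidate = lines[:start - 1] + replacement.splitlines() + lines[end:]
    try:
        ast.parse("\n".join(candidate))  # stand-in for a real linter
    except SyntaxError as err:
        # Discard the edit and say why, so the agent can retry.
        return lines, f"Edit rejected: {err.msg} (line {err.lineno}). File unchanged."
    return candidate, "Edit applied."

source = ["def f(x):", "    return x"]
source, message = edit(source, 2, 2, "    return x + 1")
print(message, source)
unchanged, message = edit(source, 2, 2, "    return (x")
print(message)
```

Because the bad edit never lands, the agent’s next step starts from a valid file instead of stacking new mistakes on top of a broken one.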

Context Management

The SWE-Agent system employs informative prompts, error messages, and history processors to keep the agent’s context concise and informative. Agents receive instructions, documentation, and demonstrations on the correct use of bash and ACI commands. At each step, agents are instructed to generate both a thought and an action. Malformed generations trigger an error response, prompting the model to try again until a valid generation is received. Once a valid generation is received, past error messages are omitted except for the first. The agent’s environment responses display computer output using a specific template, but if no output is generated, a message stating “Your command ran successfully and did not produce any output” is included to enhance clarity. To further improve context relevance, observations preceding the last five are each collapsed into a single line, preserving essential information about the plan and action history while reducing unnecessary content. This allows for more interaction cycles and avoids outdated file content.
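Two of the history tricks above are simple enough to sketch directly. The function names and the wording of the collapsed line are assumptions on my part. The "keep the last five in full" rule and the empty-output message come from the description.

```python
def collapse_history(observations: list[str], keep_last: int = 5) -> list[str]:
    """Keep the newest observations in full; shrink older ones to one line."""
    cutoff = max(0, len(observations) - keep_last)
    collapsed = []
    for i, obs in enumerate(observations):
        if i < cutoff:
            n = len(obs.splitlines())
            collapsed.append(f"Old output omitted ({n} lines)")
        else:
            collapsed.append(obs)
    return collapsed

def format_observation(output: str) -> str:
    # An explicit message for empty output removes ambiguity for the model.
    if output.strip():
        return output
    return "Your command ran successfully and did not produce any output"

history = [f"step {i}\nlots of\noutput" for i in range(7)]
for line in collapse_history(history)[:3]:
    print(repr(line))
print(format_observation(""))
```

The older steps still appear in order, so the agent keeps the shape of its plan while the stale file contents drop out of the context.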

To end this piece, let’s talk about some techniques that can allow your agents to handle more complex tasks.

Techniques for Building More Powerful AI Agents

Once you have the core building blocks (LLM, tools, retrieval, memory), you can explore patterns for greater sophistication. Anthropic’s article has very snappy summaries, so I’m going to leave them as is-

Prompt chaining

Decomposing a complex task into a sequence of prompts. This makes each step easier to handle and makes it easier to identify where things are collapsing (separation of concerns, but with prompts).

This is ideal for situations where the task can be cleanly decomposed into fixed subtasks.

Examples where prompt chaining is useful:

● Generating Marketing copy, then translating it into a different language.

● Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
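The outline example above maps to a few lines of code: one call produces the outline, a plain programmatic check gates it, and a second call writes the document. This is a sketch under assumptions: `stub_llm` is a hypothetical stand-in for a real model, and the "at least three bullets" rule is an arbitrary criterion chosen for the demo.

```python
from typing import Callable

LLM = Callable[[str], str]

def outline_ok(outline: str) -> bool:
    # Programmatic gate between steps: require at least three bullet points.
    return sum(line.startswith("-") for line in outline.splitlines()) >= 3

def write_document(topic: str, llm: LLM, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        outline = llm(f"Write a bullet outline for: {topic}")
        if outline_ok(outline):
            return llm(f"Write the document from this outline:\n{outline}")
    raise ValueError("outline never passed the check")

def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    if prompt.startswith("Write a bullet outline"):
        return "- intro\n- body\n- conclusion"
    return "DOCUMENT based on: " + prompt.splitlines()[-1]

print(write_document("agentic AI", stub_llm))
```

If the final document is bad, you can tell immediately whether the outline step or the writing step failed, which is the debugging win of chaining.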

Routing

Routing classifies an input and directs it to a specialized follow-up task. This workflow allows for the separation of concerns and the building of more specialized prompts or relying on other components like Classifiers, Rule Engines, etc. Without this workflow, optimizing for one kind of input can hurt performance on other inputs (as is often seen in AI safety settings).

Below is an example of a more complex router that one often sees in RevOps circles (my first real project in LLMs was something similar- taking English queries from non-technical users and creating multi-dataset SQL queries based on them).

The Different Agentic Patterns is an excellent deepdive into this topic
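A much simpler router than the one pictured can still show the mechanics. In this sketch the categories, keywords, and handler strings are all invented for illustration. The classifier is a keyword rule here, but a small model or rules engine could sit in the same slot.

```python
from typing import Callable

def classify(query: str) -> str:
    # Toy keyword classifier; a small model or rules engine could sit here.
    q = query.lower()
    if "refund" in q or "invoice" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"

# Each route gets its own specialised prompt (or its own non-LLM component).
HANDLERS: dict[str, Callable[[str], str]] = {
    "billing": lambda q: f"[billing prompt] {q}",
    "technical": lambda q: f"[technical prompt] {q}",
    "general": lambda q: f"[general prompt] {q}",
}

def route(query: str) -> str:
    return HANDLERS[classify(query)](query)

print(route("I need a refund for last month"))
```

Since each handler is isolated, tuning the billing prompt cannot quietly degrade the technical one.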

Parallelization

“LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:

● Sectioning: Breaking a task into independent subtasks run in parallel.

● Voting: Running the same task multiple times to get diverse outputs.

When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

Examples where parallelization is useful:

Sectioning:

● Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.

● Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.

Voting:

● Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.

● Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.”

I’m guessing a lot of inference-time compute scaling (spending more of the compute budget on inference tokens rather than on training) will lean heavily on this. I’m a bit skeptical, given how expensive it is likely to get.

Orchestrator-workers

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

I didn’t love Anthropic’s diagram for this, so here’s a better one.

As mentioned earlier, this is my favorite way of building more complex systems. I like the control this gives me. RAG systems are a good example of orchestration (although they might not seem like it), especially if you involve Intent Classification and more complex kinds of retrieval. For example, at IQIDIS we use specialized agents for user profiling, personalized output generation, context-aware retrieval, and much more. Some details on how we do the latter are covered here.
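A minimal sketch of the orchestrator-workers shape, with loud caveats: this is not how IQIDIS does it. The `plan` function is a hard-coded stand-in for the central LLM that would decide the subtasks dynamically, and the `retrieve` and `summarize` workers are hypothetical placeholders.

```python
from typing import Callable, Dict, List, Tuple


# Workers: each one stands in for a specialized LLM call or component.
def retrieve(subtask: str) -> str:
    return f"[retrieved passages for: {subtask}]"


def summarize(subtask: str) -> str:
    return f"[summary for: {subtask}]"


WORKERS: Dict[str, Callable[[str], str]] = {
    "retrieve": retrieve,
    "summarize": summarize,
}


def plan(request: str) -> List[Tuple[str, str]]:
    """Hypothetical planner. In a real system the orchestrator LLM would
    produce this list of (worker_name, subtask) pairs per request."""
    steps = [("retrieve", request)]
    if "summar" in request.lower():
        steps.append(("summarize", request))
    return steps


def orchestrate(request: str) -> str:
    """Break the request down, delegate each subtask, then synthesize."""
    results = []
    for worker_name, subtask in plan(request):
        worker = WORKERS[worker_name]  # raises KeyError on an unknown worker
        results.append(worker(subtask))
    return "\n".join(results)  # toy synthesis step: just join the outputs
```

The difference from parallelization is that the subtasks are not fixed in advance: the plan depends on the request. Keeping the set of workers in an explicit registry is also where the control comes from, since the orchestrator can only delegate to workers you have defined.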

Evaluator-optimizer

“In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

Examples where evaluator-optimizer is useful:

● Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.

● Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.”

I like LLMs as judges, but it’s important to set things up well to avoid a lot of the problems mentioned earlier. It’s a delicate balancing act to pull off.
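Here is a minimal sketch of the generate-evaluate loop. Both sides are hypothetical stubs: `evaluate` checks one explicit criterion (the draft must include a "Sources:" line) and `generate` only knows how to act on that one piece of feedback. In a real setup each would be an LLM call, and the round cap is an assumption you would tune.

```python
from typing import Optional, Tuple


def evaluate(draft: str) -> Tuple[bool, str]:
    """Hypothetical evaluator with one clear criterion."""
    if "Sources:" not in draft:
        return False, "add a Sources: line"
    return True, ""


def generate(task: str, previous: Optional[str] = None, feedback: str = "") -> str:
    """Hypothetical generator. First call drafts, later calls revise."""
    if previous is None:
        return f"Answer to '{task}'."
    if "Sources:" in feedback:
        return previous + " Sources: (to be filled in)"
    return previous


def refine(task: str, max_rounds: int = 3) -> Tuple[str, bool]:
    """Loop generator and evaluator until the draft passes or rounds run out.
    Returns the final draft and whether it passed evaluation."""
    draft = generate(task)
    for _ in range(max_rounds):
        accepted, feedback = evaluate(draft)
        if accepted:
            return draft, True
        draft = generate(task, previous=draft, feedback=feedback)
    # Out of rounds: report whether the last revision happens to pass.
    return draft, evaluate(draft)[0]
```

The hard cap on rounds matters more than it looks. Without it, a generator that can't satisfy the evaluator will loop forever, and with a noisy LLM judge that is a real possibility.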

There’s also some interesting writing on Agents for more autonomous work, but personally I don’t buy it as a useful pattern. I think there is too much variance for me to recommend it to anyone. If you disagree, I’d love to hear your thoughts on it.

Final Thoughts/Conclusion: Architecture as Strategy

Agentic AI isn’t a model upgrade. It’s a shift in how LLM-based software gets built.

It forces teams to move from prompting to programming; from black-box speculation to system-level clarity. When done well, it replaces guesswork with structure and forces stochastic models into deterministic pipelines.

But that only happens if you treat architecture as strategy.

If you're building:

● Design for control before creativity.

● Build interfaces your agents can’t misuse.

● Make every step observable, testable, and interruptible. The last point is particularly valuable if you’re working on

If you're investing:

● Don’t just ask what the product does. Ask how it works.

● Look for system thinkers, not just prompt tinkerers.

● Ignore novelty — fund reliability, constraint, and composition.

This is where the next moat gets built. Not in having access to better models — everyone has that. But in building agents that don’t break, don’t drift, and don’t get in their own way.

When that happens, AI becomes more than a feature.
It becomes a new layer of product thinking. A new surface for strategic advantage.

Thank you for reading, and I hope you have a wonderful day.

Dev <3

If you liked this article and wish to share it, please refer to the following guidelines.

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. The best way to share testimonials is to share articles and tag me in your post so I can see/share it.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

My (imaginary) sister’s favorite MLOps Podcast-

Check out my other articles on Medium:

https://machine-learning-made-simple.medium.com/

My YouTube: https://www.youtube.com/@ChocolateMilkCultLeader/

Reach out to me on LinkedIn. Let’s connect: https://www.linkedin.com/in/devansh-devansh-516004168/

My Instagram: https://www.instagram.com/iseethings404/

My Twitter: https://twitter.com/Machine01776819

Leonidas Raghav

Great article! I work on enterprise agents, and this piece lays out really well the issues and strategies my team has run into. We’ve found there’s a spectrum between generalisability and reliability: workflows are precise and more robust but rigid and domain-specific (almost like traditional software), while agent-based setups are more flexible but prone to errors and reliability issues.

What’s worked well for us is actually to start agent-first: give the model a set of tools and observe how it behaves. That exploration phase surfaces the paths that really matter, which can then be hardened into workflows for better reliability and cost savings, while still keeping agent flexibility for edge cases. In practice, this can strike a good balance between generalisability and reliability.
