Agentic AI: Recipes for reliable generative workflows
Posted: 2025-10-17
This is part of a series I’m writing on generative AI.
Introduction
This text describes working recipes that I use to apply generative AI for various tasks, including writing software. These recipes have enabled me to make my AI-assisted workflows more robust.
Problem decomposition
Constrain the scope
Only give relatively simple problems to an AI conversation. You’ll calibrate over time, building an intuition of where the threshold is, as you see AI either struggle with or gracefully sail through the problems you feed it.
For example, I initially tried to have a “review this code” prompt, with the hope that I could maintain all my style requirements in a single file and have AI flag deviations. That was madness. Instead, I break this down into many separate tasks, each focused on a very specific style concern. Here are two examples of how I generate single-concern prompts:
- Duende code-style prompts: Guidelines for independent code-style concerns.
- Recipe principles: Contains principles that I expect my cooking recipes to adhere to.
Avoid information overload
As conversations grow, performance degrades incredibly quickly. This is linked to LLMs’ recency bias and the “lost in the middle” effect, where information in the middle of the context window is often ignored. I start getting uneasy around 50 to 100 messages per conversation (though, obviously, the actual number depends on a lot of details).
There are various strategies you can use.
Break down units of work. For example, when generating unit tests, spawn separate workflows for each individual test (rather than a single large “write me unit tests for this class”).
I actually have conversations whose only goal is to prepare the prompt for another conversation whose only goal is… to prepare the prompt for another conversation. It looks something like this (with a lot more detail):
- Initial request, find relevant files: “A programmer is going to write code for {{specification}}. What files should they read to know everything they need?”
- Summarize relevant files, one conversation per file: “A programmer is going to write code for {{specification}}. Can you read {{file}} and provide a summary including everything that the programmer should know (in order to implement the block of code)?”
- Additional APIs: “What APIs (functions, including class methods) would be relevant for a programmer implementing {{specification}}?”
- An example for each API, one conversation per API: “For this particular API {{function}}, can you give a brief example of usage that would be relevant for a programmer implementing {{specification}}?”
All of this is just to prepare the context for the final step: “Given all this {{context}}, implement {{specification}}.”
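Here is a minimal sketch of that chain in Python. The helper run_conversation() is hypothetical, a stand-in for whatever LLM client wrapper you use (it starts a fresh conversation and returns the model’s final answer as text); the real prompts carry a lot more detail.

    # Sketch of the prompt-preparation chain. run_conversation() is a
    # hypothetical stand-in for your LLM client.
    def run_conversation(prompt: str) -> str:
        raise NotImplementedError("Wrap your LLM client here.")

    def prepare_context(specification: str) -> str:
        # Step 1: a conversation whose only job is to list relevant files.
        files = run_conversation(
            f"A programmer is going to write code for {specification}. "
            "What files should they read to know everything they need? "
            "Answer with one path per line.").splitlines()

        # Step 2: one conversation per file, each producing a focused summary.
        summaries = [
            run_conversation(
                f"A programmer is going to write code for {specification}. "
                f"Can you read {path} and summarize everything the "
                "programmer should know?")
            for path in files]

        # Step 3: list relevant APIs, then ask for a short usage example
        # of each one (one conversation per API).
        apis = run_conversation(
            "What APIs (functions, including class methods) would be "
            f"relevant for a programmer implementing {specification}? "
            "Answer with one API per line.").splitlines()
        examples = [
            run_conversation(
                f"For this particular API {api}, give a brief example of "
                f"usage relevant for a programmer implementing {specification}.")
            for api in apis]

        return "\n\n".join(summaries + examples)

    def implement(specification: str) -> str:
        # Final step: the implementation conversation starts with all
        # the prepared context already in place.
        context = prepare_context(specification)
        return run_conversation(
            f"Given all this context:\n\n{context}\n\nImplement {specification}.")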
Simple multi-agent strategies
Build multi-agent strategies, but keep them simple.
Most of the time I implement a strictly linear flow: a sequence of conversations, where the output of one can only be fed to future ones.
The few exceptions to this are:
A simple “review” phase, where an older conversation is resumed with a message like “your solution has these issues, please fix: {{feedback from other agent}}.” However, I’ve also had good results by simply starting a new conversation focused only on addressing the style concerns (throwing away all the previous context and investigation that was required to implement the solution, to manage token count and avoid recency bias); see the sketch below.
An “ask” command where a conversation can fork a separate conversation focused on answering a specific question. I have only had mixed results with this; this mechanism is only very sporadically used. But I still suspect this has potential.
My general principle: when going beyond purely linear flows, keep the workflow as simple as possible.
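As a sketch of the fresh-conversation review variant (run_conversation is the same hypothetical helper as in the earlier sketch; the wording of the prompt is illustrative):

    def review_pass(run_conversation, solution: str, style_guidelines: str) -> str:
        # A new conversation sees only the solution and the style guidelines,
        # not the investigation and context that produced the solution.
        return run_conversation(
            "Review the following code against these style guidelines and "
            "return a corrected version.\n\n"
            f"Guidelines:\n{style_guidelines}\n\n"
            f"Code:\n{solution}")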
Robust tools
Validation is critical
Validation is critical: for any step where it’s feasible, always have validation.
Preferably deterministic validation (e.g., “does the code compile”, “did the agent actually write the file it was supposed to write”). Feed the validation errors back and force the AI to keep going, or start a new conversation (see “Avoid information overload”); a minimal sketch follows this list.
Failing that, run another set of agents to confirm the outputs of the first.
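Here is a minimal sketch of the deterministic case, using “does the code compile” as the check. The conversation object is hypothetical (run_until_done() lets the agent work, send() appends a message and resumes it); only the subprocess call is real.

    import subprocess

    def compiles(path: str) -> tuple[bool, str]:
        # Deterministic check: does the file at least byte-compile?
        result = subprocess.run(
            ["python", "-m", "py_compile", path],
            capture_output=True, text=True)
        return result.returncode == 0, result.stderr

    def generate_until_valid(conversation, path: str, max_attempts: int = 3) -> bool:
        # `conversation` is a hypothetical wrapper around your LLM client.
        for _ in range(max_attempts):
            conversation.run_until_done()        # the agent writes `path`
            ok, errors = compiles(path)
            if ok:
                return True
            # Feed the errors back and force it to keep going.
            conversation.send(
                f"The code in {path} does not compile:\n{errors}\nPlease fix it.")
        return False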
Silent validation
When I started, I made validation explicit: the AI is given a validate function and told to run it to confirm progress. The validation gives a “success!” message or detailed error information.
I now prefer to make validation an implicit property of the system, somewhat transparent to the AI (until the moment when it should surface). This prevents the agent from being distracted by the validation process itself, making success an invisible default state.
I run validation when the AI calls the done function that signals that it thinks it has accomplished the goal. If the validation passes, done never returns (and the AI may never learn that I validated its work); otherwise, the AI gets an error report with the failures.
Whenever the AI changes the environment (e.g., writes a file), I also run the validation without telling the AI. Only if the validation state changes (previously passing to now failing, or vice versa) do I include a message about this, either:
“The validation command is now passing.”
“The validation command is currently reporting failures (normal if you are in the middle of applying changes). To see the failures, use: validate”
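A sketch of how the silent side of this can be wired up. The validate callable and the conversation plumbing are hypothetical; the point is that the AI only hears about validation when the passing/failing state flips, or when done fails.

    class SilentValidation:
        """Runs validation quietly; only surfaces changes in state."""

        def __init__(self, validate):
            self.validate = validate      # returns a list of failure strings
            self.was_passing = None       # unknown until the first check

        def after_environment_change(self):
            # Called whenever the AI writes a file or otherwise changes the
            # environment. Returns a message for the AI, or None (silence).
            failures = self.validate()
            passing = not failures
            message = None
            if self.was_passing is not None and passing != self.was_passing:
                if passing:
                    message = "The validation command is now passing."
                else:
                    message = (
                        "The validation command is currently reporting "
                        "failures (normal if you are in the middle of "
                        "applying changes). To see the failures, use: validate")
            self.was_passing = passing
            return message

        def on_done(self):
            # Called when the AI invokes `done`. If validation passes, the
            # conversation simply ends (done never returns to the AI);
            # otherwise the failures are reported back.
            failures = self.validate()
            if not failures:
                return None
            return "Validation failed:\n" + "\n".join(failures)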
Constraints can increase focus
Exposing fewer functions helps the AI avoid getting lost trying irrelevant things.
For some tasks, an AI given “read”, “write”, and “search” functions outperforms one that is also given “run arbitrary shell commands”.
There’s an API-granularity trade-off: a balance between general APIs (e.g., read and write a file) and APIs customized for the task at hand (e.g., read a specific function within a Python file, given its identifier). Specific APIs sometimes help (they let the AI focus directly on the relevant parts of a file rather than reading and writing its entire contents), but they can also confuse the AI: they have more parameters than, say, a single path to read, and each new parameter is a potential source of confusion.
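As a toy illustration of both points (a hypothetical tool registry, not my actual API): for some tasks I expose only narrow, few-parameter tools and deliberately leave out the general-purpose ones.

    # Hypothetical tool registry for a code-editing task.
    TOOLS = {
        "read":   {"parameters": ["path"],
                   "description": "Return the full contents of a file."},
        "write":  {"parameters": ["path", "contents"],
                   "description": "Replace the full contents of a file."},
        "search": {"parameters": ["pattern"],
                   "description": "Return paths and lines matching a pattern."},
        # Deliberately omitted: a general "shell" tool, and a more specific
        # "read_python_function(path, identifier)" tool whose extra
        # parameter is one more thing the AI can get wrong.
    }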
Justification requirement
I explicitly require a brief justification (as an argument to API functions) for each invocation. I believe this increases the success rate.
    Argument(
        name=VariableName("reason"),
        arg_type=ArgumentContentType.STRING,
        description=(
            "Brief (one or two sentences) explanation "
            "of why you are running this command "
            "(what you want to accomplish)."))
This helps make the intention more explicit. As a bonus, it leaves a trace of intent that serves as a nice overview of what the conversation did.
Better prompts
Pre-bake context
Force-feed initial information before you ask for action. This has drastically reduced hallucinations for me:
I used to start my prompts with direct “before you do anything else, read these files: {{list}}” instructions.
These days I include the files’ contents directly in the initial prompt. This not only saves cost and latency by avoiding AI steps (function calls to read the files) but, more importantly, ensures that all relevant information is present for the first token generation, reducing initial-step hallucination.
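A sketch of what pre-baking looks like in practice, with illustrative wording (only pathlib is real here):

    from pathlib import Path

    def initial_prompt(specification: str, paths: list[str]) -> str:
        # Inline the contents of every relevant file so that all the
        # information is present before the first token is generated.
        sections = [f"=== {p} ===\n{Path(p).read_text()}" for p in paths]
        return ("You are going to implement the following specification.\n\n"
                + "\n\n".join(sections)
                + f"\n\nSpecification:\n{specification}")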
Focused task framing to mitigate hallucination
For my “evaluator” conversations, I initially formulated the task as “identify reasons why {{input}} should be rejected according to {{very detailed guidelines}}. If the input meets the guidelines, don’t output any reasons.” This led to many hallucinated reasons (AI tried very hard to follow my instructions and come up with reasons).
I got much better results when I switched to formulating the task as: evaluate whether or not the input meets the guidelines, then either accept it (and explain why) or reject it (and explain why).
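For concreteness, a possible template for the second formulation (paraphrased, not my exact prompt); forcing an explicit decision either way removes the pressure to invent rejection reasons:

    EVALUATOR_PROMPT = """\
    Evaluate whether the following input meets the guidelines.

    Guidelines:
    {guidelines}

    Input:
    {input}

    Decide and answer in exactly this format:
    DECISION: ACCEPT or REJECT
    REASON: One or two sentences explaining your decision.
    """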
Require plans before execution
For non-trivial tasks, ask the AI to describe how the task can be achieved, at a high level, before asking it to follow the plan and achieve the task. Most of the time I don’t intend to read those explanations (they are part of my automation). This reflective prompting technique encourages the AI to focus on formulating a high-level plan before haphazardly committing to action.
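A sketch of the plan-then-execute exchange inside a single conversation (the conversation object and its send() method are hypothetical):

    def plan_then_execute(conversation, task: str) -> str:
        # Ask for a high-level plan first. The plan is rarely read by a
        # human; producing it simply makes the next step more focused.
        conversation.send(
            f"Describe, at a high level, how you would accomplish this task: "
            f"{task}. Do not make any changes yet.")
        # Only then ask the AI to follow its own plan.
        return conversation.send(
            "Now follow that plan and accomplish the task.")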
When generated tests fail to compile (probably because the AI got details of the API wrong), a common failure scenario is that the AI starts blindly guessing, hallucinating APIs.
I got better results simply by adding messages like the following to my automation: “Please tell me which file contains details that would help a programmer understand this failure?” (which should be easy to answer given the context). I then include that file’s contents in my next message.
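A sketch of that recovery step (conversation.send() is hypothetical and assumed to return the model’s reply as text):

    from pathlib import Path

    def recover_from_compile_failure(conversation, error_output: str) -> None:
        # Instead of letting the AI guess at APIs, ask it which file would
        # explain the failure, then hand it that file's contents.
        answer = conversation.send(
            "The generated test does not compile:\n"
            f"{error_output}\n"
            "Which file contains details that would help a programmer "
            "understand this failure? Answer with a single path.")
        path = answer.strip()
        conversation.send(
            f"Here are the contents of {path}:\n{Path(path).read_text()}\n"
            "Please fix the test.")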
Running workflows
Agent failures are system design flaws
When a conversation goes wrong, view it as feedback on the system design, not on the AI’s execution. Having to guide a conversation manually (“don’t do it that way, you are missing the fact that…”) is a sign that I’m doing something wrong. Workflows should be robust enough to not require manual intervention.
Instead, I try to find the root cause of problems. Often I find issues in one of these areas:
- Problem decomposition: I am giving too large a problem to a single conversation.
- Prompts: my instructions are ambiguous or even conflicting, or my intent is left implicit.
- Tools and system design: occasionally, I improve my APIs to match the “natural” expectations embodied by the AI’s hallucinations. When I’ve seen the AI struggle to write good unit tests, for example, those struggles have been feedback that actually helped me improve my interfaces.
Restart it
Sometimes you’re just unlucky and conversations go wrong. As they start going sideways, the probability that they’ll recover dwindles.
Sometimes I didn’t do anything wrong (that I can see).
In these cases, I just restart the conversation (ideally with a single click on my UI, starting it again from the original prompt).
I also sometimes launch conversations with a pre-configured message limit; when the limit is reached, the conversation is asked to dump a brief summary of its state, and then it is restarted from scratch (from the original prompt plus the summary).
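A sketch of the message-limit restart (start_conversation and the conversation interface are hypothetical):

    def run_with_restarts(start_conversation, original_prompt: str,
                          message_limit: int = 60, max_restarts: int = 3):
        summary = ""
        for _ in range(max_restarts):
            prompt = original_prompt
            if summary:
                prompt += f"\n\nSummary of a previous attempt:\n{summary}"
            conversation = start_conversation(prompt)
            result = conversation.run(max_messages=message_limit)
            if result.completed:
                return result
            # Limit reached: ask for a brief state dump, then start over
            # from the original prompt plus that summary.
            summary = conversation.send(
                "Dump a brief summary of the current state of your work.")
        return None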
Choosing the right models
I tend to work with cheaper, lower-latency “flash” models; for most of my workflows they give me an adequate balance of cost and latency versus quality.
I suppose I could use expensive models for critical and final steps; this is an area where I need to experiment more.
Gather data
I’ve started gathering data whenever I’m starting a programming task for which I think AI could be a good fit.
I’m populating a table with these entries:
- Description of an implementation task
- Date of execution
- Estimated time required (decided before execution)
- Implementation method (an enum: either “human” or “ai”)
- Actual duration
The majority of tracked tasks are estimated to take anywhere from five minutes to one hour.
This lets me track the estimated-to-actual ratios for both approaches.
I choose the implementation method randomly. I started by flipping a virtual coin (50% odds for each), but I’m switching to a multi-armed bandit approach (probably Upper Confidence Bound).
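For reference, a minimal UCB1 sketch for choosing between the two arms; the reward definition (estimated time divided by actual time) is just one illustrative option, not a settled choice:

    import math

    def ucb1_choice(stats: dict[str, tuple[int, float]], total: int) -> str:
        # stats maps each arm ("human", "ai") to (times chosen, total reward),
        # where a reward could be e.g. estimated time divided by actual time.
        best_arm, best_score = None, float("-inf")
        for arm, (count, reward_sum) in stats.items():
            if count == 0:
                return arm                  # try every arm at least once
            score = reward_sum / count + math.sqrt(2 * math.log(total) / count)
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    # Example: 10 tasks implemented by a human, 12 by AI so far.
    print(ucb1_choice({"human": (10, 7.5), "ai": (12, 9.0)}, total=22))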
This is a whole topic I’ll explore another time, but I figured I’d mention it here: start gathering performance data! It costs you very little.
Don’t anthropomorphize
Sometimes we treat the AI as if it were a fellow human. It is not. It operates in fundamentally different ways. It has failure modes that are completely alien to most humans.
AI is just a tool, like a calculator. When it fails at a given task and we feel frustrated at it, it is worth reminding ourselves that it is really we who are failing (e.g., to use it correctly, or to apply it to a problem for which it is a good match), nobody else.
Losing sight of this makes us less efficient.
Related
Dumb AI and the software revolution: Generative AI models are “frustratingly dumb” yet “astonishingly powerful”, and poised to impact fields like software engineering. The key is to stop waiting for a “genius” and instead harness these fast, flawed collaborators with proper structure. This approach will lead to an explosion of custom software.
Up: Essays on AI