Module specifications in Duende
Posted: 2025-11-05
Introduction
This document explains what I've been able to achieve generating Python code from template specifications, implementing the philosophy I described in Embrace the specification.
As of 2025-11-05, I generate 8821 lines of code (2940 of implementation, 5881 of tests) from 1553 lines in specifications.
All code is available
Overview
For a Python module foo, I create a foo.dm.py template file (the specification): a valid but under-specified (incomplete) Python program with code like this:
def reindent_code(code: str, desired_spaces: int) -> str:
    """Returns a copy of `code` with the desired leading spaces.

    First finds the longest whitespace prefix that all non-empty `code`
    lines contain and removes it (from all lines). Then prepends to all
    lines a prefix of the desired length.

    {{๐ฆ If an input line (from `code`) is empty or only contains
    whitespace characters, the corresponding line in the output is
    empty.}}

    … more properties …

    {{๐ฆ The output must contain at least one line where, if
    `desired_spaces` spaces are removed (from the start), the line
    starts with a non-space character.}}
    """
    raise NotImplementedError()  # {{๐ reindent code}}
As you can see, there are two kinds of markers:
- Implementation markers, e.g. {{๐ reindent code}}: These are placeholders meaning "delete this line and replace it with a block". The AI should come up with the best implementation of the intent, as implied by the context around the marker.
- Property markers, e.g. {{๐ฆ The output must…}}: These specify a desired property of the implementation. They help guide the implementation and are used to generate tests.
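To make the expansion step concrete, here is one plausible way the {{๐ reindent code}} marker above could be filled in. This is only an illustrative sketch that satisfies the stated properties, not the code Duende actually generated:

```python
def reindent_code(code: str, desired_spaces: int) -> str:
    """Returns a copy of `code` with the desired leading spaces."""
    lines = code.split("\n")
    non_empty = [line for line in lines if line.strip()]
    # Length of the longest whitespace prefix shared by all non-empty lines.
    common = min(
        (len(line) - len(line.lstrip()) for line in non_empty), default=0
    )
    prefix = " " * desired_spaces
    return "\n".join(
        # Empty (or whitespace-only) input lines stay empty in the output.
        prefix + line[common:] if line.strip() else ""
        for line in lines
    )
```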
Files generated
From a file foo.dm.py, I generate the following
files:
╭───────────╮          ╭────────╮
│ master:   │ ──(1)──> │ impl:  │
│ foo.dm.py │          │ foo.py │
╰─────┬─────╯          ╰────────╯
      │
     (2)      ╭────────────────╮          ╭─────────────╮
      ╰─────> │ test tmpl:     │ ──(1)──> │ test:       │
              │ test_foo.dm.py │          │ test_foo.py │
              ╰────────────────╯          ╰─────────────╯
- Implementation (foo.py): All the {{๐ …}} markers are expanded into actual code.
- Test template / skeleton (test_foo.dm.py): A template for unit tests. It ignores the {{๐ …}} markers in the master and instead outputs a unit test with one {{๐ …}} marker for each {{๐ฆ …}} property marker.
- Test (test_foo.py): The full implementation of the test template (expands its {{๐ …}} markers into actual code).
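For example, the first property marker of reindent_code above could plausibly become an entry like the following in test_foo.dm.py (hypothetical name and layout; the exact shape of the generated skeleton may differ):

```python
def test_reindent_code_whitespace_only_lines_become_empty() -> None:
    """If an input line (from `code`) is empty or only contains whitespace
    characters, the corresponding line in the output is empty."""
    raise NotImplementedError()  # {{๐ test: whitespace-only lines become empty}}
```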
Generation process
The generation process is supported by two multi-agent workflows:
- CodeSpecsWorkflow: Expands each implementation marker into a full block. This is used to produce both foo.py and test_foo.py. The implementation (specification) is code_specs_workflow.dm.py. It runs a separate agent loop for each marker:
  - When expanding normal code (e.g., foo.py), it relies on static type checks (with mypy).
  - When expanding test code (e.g., test_foo.py), it validates that the block passes static type checks and that the test passes (python3 -m pytest test_foo.py).
- CodeSpecsTestsSkeletonWorkflow: Generates the test_foo.dm.py test skeleton, turning the {{๐ฆ …}} markers into unimplemented unit tests (each with a {{๐ …}} marker). The implementation is code_specs_tests_skeleton.dm.py.
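As a rough sketch (assumed function name and structure; not the actual code_specs_workflow code), the acceptance gate described above amounts to:

```python
import subprocess

def block_is_accepted(path: str, is_test: bool) -> bool:
    # Every expanded block must pass static type checks.
    if subprocess.run(["mypy", path]).returncode != 0:
        return False
    # Test blocks must additionally pass when run under pytest.
    if is_test:
        return subprocess.run(["python3", "-m", "pytest", path]).returncode == 0
    return True
```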
I have a third workflow to gradually enable tests one by one and, when there are failures, either fix the code or fix the test. However, I haven't really used this much. I only accept passing tests during the generation; I keep an eye out for signs that the AI is struggling, which can indicate bugs (though, unfortunately, sometimes the AI is overly eager to produce a passing test…).
Current state
As part of Duende, I'm generating a few modules this way. This is quite meta: the implementation of these workflows is generated using the workflows themselves (like a compiler that compiles its own source code).
As of 2025-11-04, I have this:
| Module | in | ๐ | ๐ฆ | out | test | total | ratio |
|---|---|---|---|---|---|---|---|
| code_specs | 191 | 6 | 33 | 534 | 604 | 1138 | 17% |
| code_specs_agent | 75 | 3 | 9 | 111 | 632 | 743 | 10% |
| code_specs_commands | 282 | 16 | 7 | 415 | 291 | 706 | 40% |
| code_specs_marker_implementation | 57 | 3 | 14 | 146 | 314 | 460 | 12% |
| code_specs_path_and_validator | 71 | 5 | 11 | 156 | 370 | 526 | 13% |
| code_specs_tests_enable | 184 | 6 | 23 | 336 | 0 | 336 | 55% |
| code_specs_tests_skeleton | 182 | 4 | 13 | 392 | 1114 | 1506 | 12% |
| code_specs_validator | 66 | 3 | 14 | 112 | 514 | 626 | 11% |
| code_specs_workflow | 293 | 9 | 24 | 600 | 1940 | 2540 | 12% |
| done_validator_diagnostics | 46 | 3 | 4 | 0 | 0 | 0 | |
| output_cache | 106 | 5 | 4 | 138 | 102 | 240 | 44% |
| total | 1553 | 63 | 156 | 2940 | 5881 | 8821 | 18% |
Legend:
- in: Lines in the specification (foo.dm.py)
- ๐, ๐ฆ: Number of markers of the corresponding type in the specification
- out: Lines of code in the generated implementation (e.g., foo.py)
- test: Lines of code in the generated tests (e.g., test_foo.py)
- total: Total generated lines of code (out + test)
- ratio: in / total (e.g., code_specs: 191 / 1138 ≈ 17%)
Limitations
- I still need to manage import statements manually, which is annoying. That's because I constrain each agent loop to only edit the single block it is focused on.
- I still need to provide significant steering, especially in the first generation. Smaller modules help (the quality of AI responses seems to degrade drastically when working with files above 1K LOC).
- The generation process is still somewhat manual. I want to automate it further. It is also still somewhat expensive.
- In my editor, I'm frequently jumping from the foo.dm.py file to the foo.py file and back. I want to invest a bit in improving my editor support, to make it easier to navigate from a marker to its implementation and back.
- The generation process still lacks many basic features. I expect they will go a long way toward making it more robust:
  - Incorporate review agents (already implemented in Duende, just not hooked up here).
  - When expanding markers, add a few more agent loops:
    - Produce a list of questions that would help the implementer agent.
    - Produce answers to those questions.
    - Produce a list of APIs that the implementer agent may need, and give one example call for each.
- The test generation is a bit limited. I would like the {{๐ฆ …}} markers to describe a "property" (e.g., "the output is a sum of the inputs") and have the AI generate various ways to test it. However, right now the properties are closer to specific example calls (e.g., "given 2 and 3, should output 5"), which leaves room for improvement.
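For instance, a property phrased as "the output is a sum of the inputs" could in principle become a property-based test (e.g., with Hypothesis) rather than one hard-coded example. A hypothetical sketch, where `add` stands in for the function under test:

```python
from hypothesis import given, strategies as st

from foo import add  # Hypothetical function under test.

@given(st.integers(), st.integers())
def test_add_output_is_sum_of_inputs(a: int, b: int) -> None:
    # The property holds for arbitrary inputs, not just "given 2 and 3, output 5".
    assert add(a, b) == a + b
```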
Observations
Because the generation process is still somewhat slow (requiring significant guidance), I'm not yet ready to proclaim this a huge success. However, I do anticipate that working directly on the specifications (and regenerating only the affected parts) will make my software more maintainable.
- The dm files tend to leave out many low-level implementation details. They feel refreshingly sparse: a lot of signal with little noise.
- However, some dm files do include a big part of the implementation. This is a good thing: it shows that the approach is very flexible, allowing you to decide how far down into the code you need to go on a case-by-case basis.
- Caching is critical. I have two levels of caching: avoid redundantly recomputing previous information (e.g., "what are the relevant files to implement each of these ~100 blocks"), and try to reuse the previous implementation (as described in Embrace the specification, in the section Preserving observable unspecified behaviors). A sketch of the first level is shown after this list.
- A large portion of the generated code is tests. That's… okay. This project has good unit test coverage at a relatively low cost (just insert the {{๐ฆ …}} markers).
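A minimal sketch of that first caching level, assuming answers are keyed by a hash of the question plus the relevant context (hypothetical names; not output_cache's actual interface):

```python
import hashlib
import json
import os

def cached_answer(question: str, context: str, cache_dir: str) -> str | None:
    """Returns a previously computed agent answer, if any, so it is not recomputed."""
    key = hashlib.sha256(f"{question}\0{context}".encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["answer"]
    return None
```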
Related
Embrace the specification: The advantages of spec-driven development (based on agentic workflows) over requirement-oriented contexts.
Agentic AI: Recipes for reliable generative workflows: My top lessons for how to apply agentic workflows successfully.
Dumb AI and the software revolution: Generative AI models are "frustratingly dumb," yet "astonishingly powerful," and poised to impact fields like software engineering. The key is to stop waiting for a "genius" and instead harness these fast, flawed collaborators with proper structure. This approach will lead to an explosion of custom software.