Module specifications in Duende
Posted: 2025-11-05
Introduction
This document explains what I've been able to achieve generating Python code from template specifications, implementing the philosophy I described in Embrace the specification.
As of 2025-11-05, I generate 8821 lines of code (2940 of implementation, 5881 of tests) from 1553 lines in specifications.
All code is available
Overview
For a Python module foo, I create a foo.dm.py template file (the specification): a valid but under-specified (incomplete) Python program with code like this:
def reindent_code(code: str, desired_spaces: int) -> str:
    """Returns a copy of `code` with the desired leading spaces.

    First finds the longest whitespace prefix that all non-empty `code`
    lines contain and removes it (from all lines). Then prepends to all
    lines a prefix of the desired length.

    {{๐ฆ If an input line (from `code`) is empty or only contains
    whitespace characters, the corresponding line in the output is
    empty.}}

    … more properties …

    {{๐ฆ The output must contain at least one line where, if
    `desired_spaces` spaces are removed (from the start), the line
    starts with a non-space character.}}
    """
    raise NotImplementedError()  # {{๐ reindent code}}
As you can see, there are two kinds of markers:
- Implementation markers, e.g. {{๐ reindent code}}: These are placeholders meaning "delete this line and replace it with a block". The AI should come up with the best implementation of the intent, as implied by the context around the marker.
- Property markers, e.g. {{๐ฆ The output must…}}: These specify a desired property of the implementation. They help guide the implementation and are used to generate tests.
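To make the expansion step concrete, here is one plausible way the {{๐ reindent code}} marker above could be filled in. This is only an illustrative sketch that satisfies the stated properties, not the code Duende actually generated:

```python
def reindent_code(code: str, desired_spaces: int) -> str:
    """Returns a copy of `code` with the desired leading spaces."""
    lines = code.split("\n")
    non_empty = [line for line in lines if line.strip()]
    # Length of the longest whitespace prefix shared by all non-empty lines.
    common = min(
        (len(line) - len(line.lstrip()) for line in non_empty), default=0
    )
    prefix = " " * desired_spaces
    return "\n".join(
        # Empty (or whitespace-only) input lines stay empty in the output.
        prefix + line[common:] if line.strip() else ""
        for line in lines
    )
```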
Files generated
From a file foo.dm.py, I generate the following
files:
╭───────────╮          ╭────────╮
│ master:   │ ──(1)──> │ impl:  │
│ foo.dm.py │          │ foo.py │
╰─────┬─────╯          ╰────────╯
      │
     (2)      ╭────────────────╮          ╭─────────────╮
      ╰─────> │ test tmpl:     │ ──(1)──> │ test:       │
              │ test_foo.dm.py │          │ test_foo.py │
              ╰────────────────╯          ╰─────────────╯
- Implementation (foo.py): All the {{๐ …}} markers are expanded into actual code.
- Test template / skeleton (test_foo.dm.py): A template for unit tests. It ignores the {{๐ …}} markers in the master and instead outputs a unit test with one {{๐ …}} marker for each {{๐ฆ …}} property marker.
- Test (test_foo.py): The full implementation of the test template (expands its {{๐ …}} markers into actual code).
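For example, the first property marker of reindent_code above could plausibly become an entry like the following in test_foo.dm.py (hypothetical name and layout; the exact shape of the generated skeleton may differ):

```python
def test_reindent_code_whitespace_only_lines_become_empty() -> None:
    """If an input line (from `code`) is empty or only contains whitespace
    characters, the corresponding line in the output is empty."""
    raise NotImplementedError()  # {{๐ test: whitespace-only lines become empty}}
```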
Generation process
The generation process is supported by two multi-agent workflows:
- CodeSpecsWorkflow: Expands each implementation marker into a full block. This is used to produce both foo.py and test_foo.py. The implementation (specification) is code_specs_workflow.dm.py. It runs a separate agent loop for each marker:
  - When expanding normal code (e.g., foo.py), it relies on static type checks (with mypy).
  - When expanding test code (e.g., test_foo.py), it validates that the block passes static type checks and that the test passes (python3 -m pytest test_foo.py).
- CodeSpecsTestsSkeletonWorkflow: Generates the test_foo.dm.py test skeleton, turning the {{๐ฆ …}} markers into unimplemented unit tests (each with a {{๐ …}} marker). The implementation is code_specs_tests_skeleton.dm.py.
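As a rough sketch (assumed function name and structure; not the actual code_specs_workflow code), the acceptance gate described above amounts to:

```python
import subprocess

def block_is_accepted(path: str, is_test: bool) -> bool:
    # Every expanded block must pass static type checks.
    if subprocess.run(["mypy", path]).returncode != 0:
        return False
    # Test blocks must additionally pass when run under pytest.
    if is_test:
        return subprocess.run(["python3", "-m", "pytest", path]).returncode == 0
    return True
```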
I have a third workflow to gradually enable tests one by one and, when there are failures, either fix the code or fix the test. However, I haven't really used this much. I only accept passing tests during the generation; I keep an eye out for signs that the AI is struggling, which can indicate bugs (though, unfortunately, sometimes the AI is overly eager to produce a passing test…).
Current state
As part of Duende, I'm generating a few modules this way. This is quite meta: the implementation of these workflows is generated using the workflows themselves (like a compiler that compiles its own source code).
As of 2025-11-04, I have this:
| Module | in | ๐ | ๐ฆ | out | test | total | ratio |
|---|---|---|---|---|---|---|---|
| code_specs | 191 | 6 | 33 | 534 | 604 | 1138 | 17% |
| code_specs_agent | 75 | 3 | 9 | 111 | 632 | 743 | 10% |
| code_specs_commands | 282 | 16 | 7 | 415 | 291 | 706 | 40% |
| code_specs_marker_implementation | 57 | 3 | 14 | 146 | 314 | 460 | 12% |
| code_specs_path_and_validator | 71 | 5 | 11 | 156 | 370 | 526 | 13% |
| code_specs_tests_enable | 184 | 6 | 23 | 336 | 0 | 336 | 55% |
| code_specs_tests_skeleton | 182 | 4 | 13 | 392 | 1114 | 1506 | 12% |
| code_specs_validator | 66 | 3 | 14 | 112 | 514 | 626 | 11% |
| code_specs_workflow | 293 | 9 | 24 | 600 | 1940 | 2540 | 12% |
| done_validator_diagnostics | 46 | 3 | 4 | 0 | 0 | 0 | |
| output_cache | 106 | 5 | 4 | 138 | 102 | 240 | 44% |
| total | 1553 | 63 | 156 | 2940 | 5881 | 8821 | 18% |
Legend:
- in: Lines in the specification (foo.dm.py)
- ๐, ๐ฆ: Number of markers of the corresponding type in the specification
- out: Lines of code in the generated implementation (e.g., foo.py)
- test: Lines of code in the generated tests (e.g., test_foo.py)
- total: Total generated lines of code (out + test)
- ratio: in / total (e.g., code_specs: 191 / 1138 ≈ 17%)
Limitations
- I still need to manage import statements manually, which is annoying. That's because I constrain each agent loop to only edit the single block it is focused on.
- I still need to provide significant steering, especially in the first generation. Smaller modules help (the quality of AI responses seems to degrade drastically when working with files above 1K LOC).
- The generation process is still somewhat manual. I want to automate it further. It is also still somewhat expensive.
- In my editor, I'm frequently jumping from the foo.dm.py file to the foo.py file and back. I want to invest a bit in improving my editor support, to make it easier to navigate from a marker to its implementation and back.
- The generation process still lacks many basic features. I expect they will go a long way toward making it more robust:
  - Incorporate review agents (already implemented in Duende, just not hooked up here).
  - When expanding markers, add a few more agent loops:
    - Produce a list of questions that would help the implementer agent.
    - Produce answers to those questions.
    - Produce a list of APIs that the implementer agent may need, and give one example call for each.
- The test generation is a bit limited. I would like the {{๐ฆ …}} markers to describe a "property" (e.g., "the output is a sum of the inputs") and have the AI generate various ways to test it. However, right now the properties are closer to specific example calls (e.g., "given 2 and 3, should output 5"), which leaves room for improvement.
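For instance, a property phrased as "the output is a sum of the inputs" could in principle become a property-based test (e.g., with Hypothesis) rather than one hard-coded example. A hypothetical sketch, where `add` stands in for the function under test:

```python
from hypothesis import given, strategies as st

from foo import add  # Hypothetical function under test.

@given(st.integers(), st.integers())
def test_add_output_is_sum_of_inputs(a: int, b: int) -> None:
    # The property holds for arbitrary inputs, not just "given 2 and 3, output 5".
    assert add(a, b) == a + b
```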
Observations
Because the generation process is still somewhat slow (requiring significant guidance), I'm not yet ready to proclaim this a huge success. However, I do anticipate that working directly on the specifications (and regenerating only the affected parts) will make my software more maintainable.
- The dm files tend to leave out many low-level implementation details. They feel refreshingly sparse: a lot of signal with little noise.
- However, some dm files do include a big part of the implementation. This is a good thing: it shows that the approach is very flexible, allowing you to decide how far down into the code you need to go on a case-by-case basis.
- Caching is critical. I have two levels of caching: avoid redundantly recomputing previous information (e.g., "what are the relevant files to implement each of these ~100 blocks"), and try to reuse the previous implementation (as described in Embrace the specification, in the section Preserving observable unspecified behaviors). A sketch of the first level is shown after this list.
- A large portion of the generated code is tests. That's… okay. This project has good unit test coverage at a relatively low cost (just insert the {{๐ฆ …}} markers).
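A minimal sketch of that first caching level, assuming answers are keyed by a hash of the question plus the relevant context (hypothetical names; not output_cache's actual interface):

```python
import hashlib
import json
import os

def cached_answer(question: str, context: str, cache_dir: str) -> str | None:
    """Returns a previously computed agent answer, if any, so it is not recomputed."""
    key = hashlib.sha256(f"{question}\0{context}".encode()).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["answer"]
    return None
```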
Related
Embrace the specification: The advantages of spec-driven development (based on agentic workflows) over requirement-oriented contexts.
Agentic AI: Recipes for reliable generative workflows: My top lessons for how to apply agentic workflows successfully.
Dumb AI and the software revolution: Generative AI models are "frustratingly dumb," yet "astonishingly powerful," and poised to impact fields like software engineering. The key is to stop waiting for a "genius" and instead harness these fast, flawed collaborators with proper structure. This approach will lead to an explosion of custom software.