Module specifications in Duende

Posted: 2025-11-05

Introduction

This document explains what I’ve been able to achieve by generating Python code from template specifications, implementing the philosophy I described in Embrace the specification.

As of 2026-03-09, I generate 11299 lines of code (5037 of implementation, 6262 of tests) from 2618 lines in specifications.

All code is available

Overview

For a Python module foo, I create a template file foo.dm.py (the specification): a valid but under-specified (incomplete) Python program, with code like this:

def reindent_code(code: str, desired_spaces: int) -> str:
  """Returns a copy of `code` with the desired leading spaces.

  First finds the longest whitespace prefix that all non-empty `code`
  lines contain and removes it (from all lines). Then prepends to all
  lines a prefix of the desired length.

  {{🦔 If an input line (from `code`) is empty or only contains
       whitespace characters, the corresponding line in the output is
       empty.}}

  … more properties …

  {{🦔 The output must contain at least one line where, if
       `desired_spaces` spaces are removed (from the start), the line
       starts with a non-space character.}}
  """
  raise NotImplementedError()  # {{🍄 reindent code}}

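As you can see, there are two kinds of markers:

  • {{🍄 …}} is an implementation marker: a placeholder that gets expanded into a full block of code.

  • {{🦔 …}} is a property marker: a statement about the expected behavior that gets turned into a unit test.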
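
To make the marker syntax concrete, here is a minimal sketch of how the markers in a template could be located. The regular expression, the `Marker` type and the kind names `implementation` and `property` are assumptions made for this illustration; they are not taken from Duende's code.

```python
import re
from typing import NamedTuple

# Assumption: a marker is `{{`, one of the two emoji, whitespace, free text and
# `}}`, and may span several lines (as the property markers above do).
MARKER_RE = re.compile(r"\{\{(🍄|🦔)\s+(.*?)\}\}", re.DOTALL)

# Hypothetical names for the two kinds of markers.
KINDS = {"🍄": "implementation", "🦔": "property"}


class Marker(NamedTuple):
    kind: str
    text: str


def find_markers(template: str) -> list[Marker]:
    """Returns the markers found in `template`, in order of appearance."""
    return [
        # Collapse line breaks and indentation inside the marker text.
        Marker(KINDS[match.group(1)], " ".join(match.group(2).split()))
        for match in MARKER_RE.finditer(template)
    ]


SPEC = '''
def reindent_code(code: str, desired_spaces: int) -> str:
  """Returns a copy of `code` with the desired leading spaces.

  {{🦔 If an input line (from `code`) is empty or only contains
       whitespace characters, the corresponding line in the output is
       empty.}}
  """
  raise NotImplementedError()  # {{🍄 reindent code}}
'''

for marker in find_markers(SPEC):
    print(marker.kind, "->", marker.text)
```

The non-greedy `.*?` keeps two markers on nearby lines from being merged into one match.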

Files generated

From a file foo.dm.py, I generate the following files:

╭───────────╮           ╭────────╮
│  master:  ├───(1)─────>  impl: │
│ foo.dm.py │           │ foo.py │
╰──┬────────╯           ╰────────╯
   │
  (2)    ╭──────────────╮       ╭───────────╮
   ╰─────>  test tmpl:  ├──(1)──>   test:   │
         │test_foo.dm.py│       │test_foo.py│
         ╰──────────────╯       ╰───────────╯
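
The naming scheme in the diagram can be expressed as a small helper. This is only a sketch: `generated_paths` and `GeneratedPaths` are hypothetical names, and it assumes that all generated files live next to the master file, which the diagram does not state.

```python
from pathlib import Path
from typing import NamedTuple


class GeneratedPaths(NamedTuple):
    impl: Path           # foo.py
    test_template: Path  # test_foo.dm.py
    test: Path           # test_foo.py


def generated_paths(master: Path) -> GeneratedPaths:
    """Derives the three generated paths from a master `foo.dm.py` path."""
    suffix = ".dm.py"
    if not master.name.endswith(suffix):
        raise ValueError(f"Not a specification file: {master}")
    module = master.name[: -len(suffix)]
    # Assumption: generated files are siblings of the master file.
    return GeneratedPaths(
        impl=master.with_name(f"{module}.py"),
        test_template=master.with_name(f"test_{module}{suffix}"),
        test=master.with_name(f"test_{module}.py"),
    )


paths = generated_paths(Path("src/foo.dm.py"))
print(paths.impl, paths.test_template, paths.test)
```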

Generation process

The generation process is supported by two multi-agent workflows:

  1. CodeSpecsWorkflow: Expands each implementation marker into a full block. This is used to produce both foo.py and test_foo.py. The implementation (specification) is code_specs_workflow.dm.py. It runs a separate agent loop for each marker:

    • When expanding normal code (e.g., foo.py), relies on static type checks (with mypy).

    • When expanding test code (e.g., test_foo.py), validates that the block passes static type checks and that the test passes (python3 -m pytest test_foo.py).

  2. CodeSpecsTestsSkeletonWorkflow: Generates the test_foo.dm.py test skeleton, turning the {{🦔 …}} markers into unimplemented unit tests (each with a {{🍄 …}} marker). The implementation is code_specs_tests_skeleton.dm.py.
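
To illustrate what the two workflows aim to produce, here is a hand-written sketch (not actual generated output) of a block that could replace the `raise NotImplementedError()` line of the `reindent_code` specification, together with one test that the first 🦔 property might turn into. The test name is hypothetical, and the sketch only covers the properties quoted in this post, not the elided ones.

```python
import os


def reindent_code(code: str, desired_spaces: int) -> str:
    """Hand-written stand-in for an expanded `{{🍄 reindent code}}` block."""
    lines = code.split("\n")
    # Leading whitespace of every line that has non-whitespace content.
    indents = [
        line[: len(line) - len(line.lstrip())] for line in lines if line.strip()
    ]
    # Longest whitespace prefix shared by all of those lines.
    common = os.path.commonprefix(indents)
    padding = " " * desired_spaces
    # Blank (or whitespace-only) input lines become empty output lines.
    return "\n".join(
        padding + line[len(common):] if line.strip() else "" for line in lines
    )


def test_blank_input_lines_become_empty() -> None:
    """One possible test for the first 🦔 property of the specification."""
    output = reindent_code("    a\n   \n\n      b", 2).split("\n")
    assert output == ["  a", "", "", "    b"]


test_blank_input_lines_become_empty()
```

Comparing whole whitespace prefixes with `os.path.commonprefix`, rather than just counting characters, keeps the result sensible when tabs and spaces are mixed.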

I have a third workflow to gradually enable tests one by one and, when there are failures, either fix the code or fix the test. However, I haven’t really used it much. I only accept passing tests during generation; I keep an eye out for signs that the AI is struggling, which is an indication of bugs (though, unfortunately, sometimes the AI is overly eager to produce a passing test…).
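
The acceptance checks described above (static type checks for implementation blocks, type checks plus a passing test run for test blocks) could be sketched as follows. The function names are hypothetical and the `python3 -m mypy` invocation is an assumption; only the `python3 -m pytest` command comes from the text above.

```python
import subprocess
from pathlib import Path


def validation_commands(path: Path, is_test: bool) -> list[list[str]]:
    """Commands that must all succeed before an expanded block is accepted."""
    # Every block has to pass static type checks.
    commands = [["python3", "-m", "mypy", str(path)]]
    if is_test:
        # Test blocks must additionally pass when run.
        commands.append(["python3", "-m", "pytest", str(path)])
    return commands


def validate(path: Path, is_test: bool) -> bool:
    """Runs each command in order and stops at the first failure."""
    for command in validation_commands(path, is_test):
        if subprocess.run(command, capture_output=True).returncode != 0:
            return False
    return True


print(validation_commands(Path("foo.py"), is_test=False))
print(validation_commands(Path("test_foo.py"), is_test=True))
```

In an agent loop, the captured output of a failing command would presumably be fed back to the agent so that it can retry; that part is left out here.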

Current state

As part of Duende, I’m generating a few modules this way. This is quite meta: the implementation of these workflows is generated using the workflows themselves (like a compiler that compiles its own source code).

As of 2026-03-09, I have this:

Module                              in   🍄   🦔   out  test  total  ratio
change_directory_command            47    0    0     0     0      0
code_specs                         191    6   33   534   604   1138    17%
code_specs_agent                    75    3    9   111   632    743    10%
code_specs_commands                282   16    7   415   291    706    40%
code_specs_marker_implementation    57    3   14   146   314    460    12%
code_specs_path_and_validator       71    5   11   156   370    526    13%
code_specs_tests_enable            184    6   23   336     0    336    55%
code_specs_tests_skeleton          182    4   13   392  1113   1505    12%
code_specs_validator                66    3   14   112   514    626    11%
code_specs_workflow                298    9   25   614  2054   2668    11%
command_registry_factory           155    2    0   247     0    247    63%
done_validator_diagnostics          46    3    4     0     0      0
file_access_policy                 110    6    0   198     0    198    56%
message_bus                        129   11    0   497     0    497    26%
message_queue                       25    2    0    61     0     61    41%
output_cache                       106    5    4   138   102    240    44%
search_file_command                127    1    0   134   268    402    32%
swarm_commands                     129    8    5   254     0    254    51%
swarm_config                        85    2    0   258     0    258    33%
swarm_workflow                     167    4    5   276     0    276    61%
telegram_adapter                    86    2    1   158     0    158    54%
total                             2618  101  168  5037  6262  11299    23%
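
The figures in the table can be cross-checked mechanically. The sketch below assumes that total is out plus test and that ratio is in divided by total, rounded to a whole percentage; this is an inference from the numbers, and `check_row` is a hypothetical helper. Only a few rows are copied here.

```python
# (module, in, out, test, total, ratio in %) for a few rows of the table.
ROWS = [
    ("code_specs", 191, 534, 604, 1138, 17),
    ("code_specs_workflow", 298, 614, 2054, 2668, 11),
    ("command_registry_factory", 155, 247, 0, 247, 63),
    ("total", 2618, 5037, 6262, 11299, 23),
]


def check_row(spec: int, out: int, test: int, total: int, ratio: int) -> bool:
    """Checks total == out + test and ratio == round(100 * in / total)."""
    return out + test == total and round(100 * spec / total) == ratio


for name, *numbers in ROWS:
    print(name, "consistent" if check_row(*numbers) else "inconsistent")

# Generated lines per specification line, over all modules.
_, spec_lines, _, _, total_lines, _ = ROWS[-1]
print(f"expansion factor: {total_lines / spec_lines:.1f}x")
```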

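Legend:

  • in: lines in the specification (the .dm.py file).

  • 🍄: number of {{🍄 …}} markers.

  • 🦔: number of {{🦔 …}} markers.

  • out: lines of generated implementation.

  • test: lines of generated tests.

  • total: out + test.

  • ratio: in / total.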

Limitations

Observations