Module specifications in Duende

Posted: 2025-11-05

Introduction

This document explains what I've been able to achieve by generating Python code from template specifications, implementing the philosophy I described in Embrace the specification.

As of 2025-11-05, I generate 8821 lines of code (2940 of implementation, 5881 of tests) from 1553 lines in specifications.

All code is available.

Overview

For a Python module foo, I create a foo.dm.py template file (the specification): a valid but under-specified (incomplete) Python program containing code like this:

def reindent_code(code: str, desired_spaces: int) -> str:
  """Returns a copy of `code` with the desired leading spaces.

  First finds the longest whitespace prefix that all non-empty `code`
  lines contain and removes it (from all lines). Then prepends to all
  lines a prefix of the desired length.

  {{🦔 If an input line (from `code`) is empty or only contains
       whitespace characters, the corresponding line in the output is
       empty.}}

  … more properties …

  {{🦔 The output must contain at least one line where, if
       `desired_spaces` spaces are removed (from the start), the line
       starts with a non-space character.}}
  """
  raise NotImplementedError()  # {{🍄 reindent code}}

As you can see, there are two kinds of markers:

  • {{🦔 …}}: a property that the implementation must satisfy; each of these becomes a unit test.

  • {{🍄 …}}: an implementation marker; each of these is expanded into a full block of code.
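For illustration, here is one way the {{🍄 reindent code}} marker above could be expanded. This is a sketch only, written against the properties in the docstring; the code actually generated in foo.py may differ.

import os


def reindent_code(code: str, desired_spaces: int) -> str:
  """Returns a copy of `code` with the desired leading spaces."""
  lines = code.split("\n")
  non_empty = [line for line in lines if line.strip()]
  if non_empty:
    # Longest whitespace prefix shared by all non-empty lines: take their
    # common prefix and trim it to its leading whitespace.
    common = os.path.commonprefix(non_empty)
    shared = common[: len(common) - len(common.lstrip())]
  else:
    shared = ""
  prefix = " " * desired_spaces
  output = []
  for line in lines:
    if not line.strip():
      output.append("")  # Empty or whitespace-only lines stay empty.
    else:
      output.append(prefix + line[len(shared):])
  return "\n".join(output)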

Files generated

From a file foo.dm.py, I generate the following files:

╭───────────╮           ╭────────╮
│  master:  ├───(1)─────>  impl: │
│ foo.dm.py │           │ foo.py │
╰──┬────────╯           ╰────────╯
   │
  (2)    ╭──────────────╮       ╭───────────╮
   ╰─────>  test tmpl:  ├──(1)──>   test:   │
         │test_foo.dm.py│       │test_foo.py│
         ╰──────────────╯       ╰───────────╯

Generation process

The generation process is supported by two multi-agent workflows:

  1. CodeSpecsWorkflow: Expands each implementation marker into a full block of code. This is used to produce both foo.py and test_foo.py. The implementation (specification) is code_specs_workflow.dm.py. It runs a separate agent loop for each marker:

    • When expanding normal code (e.g., foo.py), it validates that the block passes static type checks (with mypy).

    • When expanding test code (e.g., test_foo.py), it validates that the block passes static type checks and that the test passes (python3 -m pytest test_foo.py).

  2. CodeSpecsTestsSkeletonWorkflow: Generates the test_foo.dm.py test skeleton, turning the {{🦔 …}} markers into unimplemented unit tests, each with a {{🍄 …}} marker (see the sketch after this list). The implementation is code_specs_tests_skeleton.dm.py.
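For instance, the skeleton entry generated for the first {{🦔 …}} marker of reindent_code might look roughly like this (the test name is hypothetical, and the exact shape of the generated skeleton may differ):

def test_reindent_code_blank_lines_become_empty() -> None:
  """An input line that is empty or only contains whitespace characters
  yields an empty line in the output."""
  raise NotImplementedError()  # {{🍄 test blank lines become empty}}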

I have a third workflow that gradually enables tests one by one and, when there are failures, either fixes the code or fixes the test. However, I haven't really used it much. I only accept passing tests during generation; I keep an eye on whether the AI struggles, which is an indication of bugs (though, unfortunately, sometimes the AI is overly eager to produce a passing test…).
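To make the per-marker expansion concrete, here is a rough sketch of the validate-and-retry cycle that the agent loops perform; expand_marker and propose_block are hypothetical names for illustration, not the actual Duende API.

import subprocess
from typing import Callable


def expand_marker(
    path: str,
    marker: str,
    propose_block: Callable[[str, str], None],
    is_test: bool,
    max_attempts: int = 3,
) -> bool:
  """Asks an agent (propose_block) to expand one {{🍄 …}} marker in `path`,
  then validates the file; retries up to `max_attempts` times."""
  for _ in range(max_attempts):
    propose_block(path, marker)  # Hypothetical: agent rewrites the marker in place.
    checks = [["mypy", path]]
    if is_test:
      # Generated tests must also pass under pytest.
      checks.append(["python3", "-m", "pytest", path])
    if all(subprocess.run(cmd).returncode == 0 for cmd in checks):
      return True
  return False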

Current state

As part of Duende, I'm generating a few modules this way. This is quite meta: the implementation of these workflows is generated using the workflows themselves (like a compiler that compiles its own source code).

As of 2025-11-04, I have this:

Module                              in    🍄    🦔   out  test  total  ratio
code_specs                         191     6    33   534   604   1138    17%
code_specs_agent                    75     3     9   111   632    743    10%
code_specs_commands                282    16     7   415   291    706    40%
code_specs_marker_implementation    57     3    14   146   314    460    12%
code_specs_path_and_validator       71     5    11   156   370    526    13%
code_specs_tests_enable            184     6    23   336     0    336    55%
code_specs_tests_skeleton          182     4    13   392  1114   1506    12%
code_specs_validator                66     3    14   112   514    626    11%
code_specs_workflow                293     9    24   600  1940   2540    12%
done_validator_diagnostics          46     3     4     0     0      0
output_cache                       106     5     4   138   102    240    44%
total                             1553    63   156  2940  5881   8821    18%

Legend:

  • in: lines in the specification (foo.dm.py).
  • 🍄: number of implementation markers in the specification.
  • 🦔: number of property markers in the specification.
  • out: lines in the generated implementation (foo.py).
  • test: lines in the generated tests (test_foo.py).
  • total: out + test.
  • ratio: in / total.

Limitations

Observations