Agent Governance for Software Development
Governing a small fleet of AI coding agents: lessons on specification, tooling, orchestration, and audit trails with RiskNodes.
A month ago we wrote about pivoting RiskNodes from third-party risk management to agent governance. This post is the practical follow-up: what it actually looks like to govern a small fleet of AI agents — the agents building our own product.
If you advise clients on how to introduce AI agents into regulated workflows, you have probably noticed the same gap we did. The vendors selling agents are very keen to talk about capability. They are considerably less keen to talk about evidence, attribution and the audit trail your client’s compliance team will want before any of this is allowed near a production system.
That gap is the reason RiskNodes exists. To validate our assumptions, we run our own development through RiskNodes itself.[1]
The setup
Four small agents, each with one job, wired together by labels on our git host. No bespoke orchestration framework, no streaming planner, no vector database. The schedule and the labels do most of the work.
flowchart LR
A[New issue]:::issue --> T(Triage agent):::agent
T -->|accept + bug| I
T -->|accept + enhancement| S(Spec agent):::agent
T -->|needs more info| H1[Human]:::human
S -->|Gherkin spec| H2[Human review]:::human
H2 --> I(Implementer agent):::agent
I -->|pull request| R(Reviewer agent):::agent
R -->|pass| H3[Human approves merge]:::human
R -->|fail| I
classDef issue fill:#e8e8e8,stroke:#555,color:#000
classDef agent fill:#cfe8ff,stroke:#1f6feb,color:#000
classDef human fill:#fde2a8,stroke:#b8860b,color:#000
The Triage agent reads incoming issues and decides whether they are clear enough to act on. The Spec agent drafts a BDD-style feature file for anything that changes user-facing behaviour. The Implementer cuts a branch and writes the code. The Reviewer runs the linters and tests, then a second model critiques the diff against the spec.
A human writes the original issue and approves the merge. Everything in between is automated, but every step is recorded.
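As an illustration only (this is not the RiskNodes code), label-driven routing can be as small as a lookup table. The label names below are hypothetical; the flowchart above only tells us the decisions, not what the labels are called.

```python
from typing import Optional

# Hypothetical label names. Order matters: more specific rules come first.
ROUTES = [
    # (labels that must all be present, agent that should run next)
    ({"accepted", "bug"}, "implementer"),
    ({"accepted", "enhancement", "spec-approved"}, "implementer"),
    ({"accepted", "enhancement"}, "spec"),
    ({"needs-info"}, None),  # parked until a human responds
]


def next_agent(labels: set[str]) -> Optional[str]:
    """Return the agent that should pick up an issue, or None if a human must act."""
    if not labels:
        return "triage"  # unlabelled issues go to triage first
    for required, agent in ROUTES:
        if required <= labels:  # every required label is present
            return agent
    return None  # unknown combination: do nothing rather than guess
```

A scheduled job would call `next_agent` for each open issue and start the matching agent. Returning `None` for anything unrecognised keeps the default behaviour safe: no agent runs unless a rule says so.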
Lessons from Local Agentic Coding Orchestration
The bottleneck is the specification, not the code
The novelty of watching an agent produce a working pull request wears off quickly. After that, the limiting factor is no longer code generation; it is writing down what you actually wanted, clearly enough that an agent can produce the right thing.
This is not a new problem. It is the old problem of software estimation, freshly exposed. Human developers used to paper over under-specified requirements by discussing the gaps over coffee. Agents do not bother to ask for clarifications. They will cheerfully implement exactly the wrong thing, with tests, and ask for it to be merged.
The Spec agent in our pipeline exists for this reason. It reads a short prose issue and drafts a structured, example-driven feature file. The human edits that file before any code is written. We treat the spec as the contract; the code is downstream.
The implication for governance work: the artefact your client needs to control is the specification, not the prompt and not the model. Models change every quarter. A well-written specification outlives all of them.
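To make "the spec is the contract" concrete, here is a minimal sketch of two checks a pipeline could run on a feature file before any code is written: a content hash to pin the spec version, and a structural check that every scenario has Given, When and Then steps. It is deliberately simplified (it ignores `Scenario Outline`, `Background`, and `And`/`But` steps), and the function names are our own illustration rather than part of any Gherkin library.

```python
import hashlib


def spec_version(feature_text: str) -> str:
    """Short content hash, so code can be pinned to the exact spec it was written against."""
    return hashlib.sha256(feature_text.encode("utf-8")).hexdigest()[:12]


def spec_problems(feature_text: str) -> list[str]:
    """Return structural problems; an empty list means the spec is ready for human review."""
    lines = [ln.strip() for ln in feature_text.splitlines() if ln.strip()]
    problems = []
    if not any(ln.startswith("Feature:") for ln in lines):
        problems.append("missing Feature: header")
    scenarios = []  # list of (title, set of step keywords seen)
    for ln in lines:
        if ln.startswith("Scenario:"):
            scenarios.append((ln[len("Scenario:"):].strip(), set()))
        elif scenarios:
            keyword = ln.split(" ", 1)[0]
            if keyword in {"Given", "When", "Then"}:
                scenarios[-1][1].add(keyword)
    if not scenarios:
        problems.append("no Scenario: blocks")
    for title, seen in scenarios:
        for keyword in ("Given", "When", "Then"):
            if keyword not in seen:
                problems.append(f"scenario '{title}' has no {keyword} step")
    return problems


EXAMPLE = """Feature: Export audit log
  Scenario: Admin exports the log
    Given an admin user
    When they request an export
    Then a CSV file is returned
"""
```

Recording `spec_version(...)` alongside each pull request is one simple way to answer "which version of the specification was this built against?" later on.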
“Agentic” turns out to be mostly plumbing
There is an assumption in the air that adopting agents requires a sophisticated orchestration framework. We have written approximately none of that. Instead, our runner is Dagu — an excellent, single-binary, YAML-driven task scheduler.
Rather than wrestling with a heavy data-engineering orchestrator or a black-box “agentic” framework, Dagu provides exactly the primitives an agent pipeline actually needs: declarative step dependencies, crisp concurrency control (max_active_runs: 1 prevents agents from stomping on each other’s git branches), and trivial cross-pipeline triggers via its native REST API.
Crucially, it makes observability and state management effortless. If an agent times out or a linter fails a quality gate, Dagu’s failure handlers seamlessly catch the exit code, run a snippet to label the underlying issue “Implementation Blocked”, and point the reviewer directly to Dagu’s built-in web UI for the run logs.
Each agent is just a standalone Python script. They communicate through git, through Dagu’s lightweight parameter passing, and through labels on the issue tracker.
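A scheduler can only react to what a script reports, so the useful discipline is to turn every outcome into an exit code. The sketch below shows one way a standalone agent script might wrap its quality gates. The exit-code values and function names are hypothetical conventions of this example, not something Dagu defines.

```python
import subprocess
import sys

# Hypothetical exit-code convention for this example.
GATE_FAILED = 2
TIMED_OUT = 3


def run_gate(command: list[str], timeout_s: float) -> int:
    """Run one quality gate and translate the outcome into an exit code."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        print(f"timed out after {timeout_s}s: {' '.join(command)}", file=sys.stderr)
        return TIMED_OUT
    if result.returncode != 0:
        # Keep the tool's own output so the run log explains the failure.
        print(result.stdout + result.stderr, file=sys.stderr)
        return GATE_FAILED
    return 0


def main(gates: list[list[str]], timeout_s: float = 300.0) -> int:
    """Run gates in order; the first failure decides the exit code."""
    for command in gates:
        code = run_gate(command, timeout_s)
        if code != 0:
            return code
    return 0
```

A real script would finish with `sys.exit(main([...]))`, passing its linter and test commands. Any non-zero exit is then visible to the scheduler, whose failure handler can label the issue and link to the logs.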
What this buys you, that a fashionable orchestration framework would not, is the ability to read the entire system in one sitting and explain it to a regulator. That turns out to be the property compliance teams actually care about. “Auditable” is a more durable adjective than “clever”.
Three nested state machines, all of them explicit
An agent pipeline is not one workflow but several layered on top of each other: the project workflow, the per-issue lifecycle, and the per-question state of every governance check inside the review.
If any of these layers is informal — held in someone’s head, in a chat channel, or in a label called To Do / Doing / Done — the loop breaks. Agents need explicit state because they cannot infer it from corridor conversations.
This, by the way, is why we are quietly preparing to move our own governance state out of issue labels and into RiskNodes’ own workflow engine. Labels-as-state-machine works at the smallest scale and stops working very quickly thereafter. Anyone who has tried to scale a Kanban board past one team will recognise the symptom.
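A sketch of what "explicit" can mean for the two inner layers: each governance question has its own state, those states roll up into the issue-level state, and issue-level transitions are checked against a table. The state names and the transition table are assumptions made for this example, not the RiskNodes workflow definition.

```python
from enum import Enum


class CheckState(Enum):
    UNANSWERED = "unanswered"
    PASSED = "passed"
    FAILED = "failed"


class IssueState(Enum):
    IN_REVIEW = "in_review"
    CHANGES_REQUESTED = "changes_requested"
    AWAITING_HUMAN_APPROVAL = "awaiting_human_approval"
    MERGED = "merged"


# Explicit transition table: anything not listed is illegal.
ALLOWED = {
    IssueState.IN_REVIEW: {
        IssueState.CHANGES_REQUESTED,
        IssueState.AWAITING_HUMAN_APPROVAL,
    },
    IssueState.CHANGES_REQUESTED: {IssueState.IN_REVIEW},
    IssueState.AWAITING_HUMAN_APPROVAL: {
        IssueState.MERGED,
        IssueState.CHANGES_REQUESTED,
    },
    IssueState.MERGED: set(),
}


def review_outcome(checks: dict[str, CheckState]) -> IssueState:
    """Roll the per-question states up into the issue-level state."""
    states = set(checks.values())
    if CheckState.FAILED in states:
        return IssueState.CHANGES_REQUESTED
    if not checks or CheckState.UNANSWERED in states:
        return IssueState.IN_REVIEW  # review is not finished: stay put
    return IssueState.AWAITING_HUMAN_APPROVAL


def transition(current: IssueState, target: IssueState) -> IssueState:
    """Move to `target`, refusing any step the table does not allow."""
    if target != current and target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Note that no path leads from `IN_REVIEW` straight to `MERGED`: the human approval step cannot be skipped by an agent applying the wrong label.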
Local inference is now genuinely usable
A material part of the case for an agent governance product is that it can run inside the client’s perimeter, with no data leaving for a third-party API. That used to be aspirational. It is now real, on hardware that fits under a desk, with open-weights models that are good enough to draft a specification, write a small refactor, and critique a diff.
For consultants working with financial services, public sector, and defence clients, this is not a minor detail. “We sent your customer records to a Californian inference endpoint” is a sentence that ends projects. Being able to demonstrate the entire loop running on a machine the client owns changes the conversation.
The tools you give the agent matter more than the model
A great deal of public discussion about agent quality is really discussion about model choice. In our experience, the choice of tools the agent can call often matters more. A modest local model with the right tools comfortably outperforms a frontier model armed only with grep and cat.
Three pieces of tooling are doing most of the work for us:
-
Hektik: a small semantic and full-text code search engine we built for our own use. It indexes the codebase into SQLite (using sqlite-vec for vectors and FTS5 for lexical search) and exposes the results via a CLI and an MCP server, so the agents can ask natural-language questions about the code without us having to ship large swathes of source into the prompt. The point is not novelty; the point is token economy. A targeted answer of a few hundred lines beats fifty thousand tokens of speculatively included context, both for cost and for the quality of the resulting reasoning. Hektik is currently tailored to the patterns we use (SQLAlchemy ORM conventions, Vue components, our own HTTP and permission decorators), but the language support (tree-sitter) and the inference backend (Ollama) are the obvious extension points if anyone else finds it useful (Codeberg repository).
-
import-linter: a Python tool that enforces architectural rules about which modules are allowed to import which others. It sounds pedestrian; it is the single most effective guard-rail we have against an agent quietly entangling layers that we deliberately separated. The rules live in a config file the agent can read; if it proposes an import that violates them, the build fails before the pull request reaches review.
-
pyscn: a static-analysis tool that flags duplicated code, excessive cyclomatic complexity, and architectural drift. We expose it to the Reviewer agent as another MCP tool. It is particularly good at catching the failure mode where an agent, asked to add a feature, copy-pastes an existing function rather than refactoring it. Humans do this too; the difference is that pyscn will tell you about it before the diff is merged.
The pattern across all three is the same: deterministic tools that return small, structured answers. The agent is not trusted to remember architectural rules, count function complexity, or carry the shape of a 200,000-line codebase in its head. It is trusted to ask good questions of tools that can. For a consultant designing an engagement, this is probably the most directly transferable lesson: budget for tooling, not just for inference.
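To show what "deterministic tools that return small, structured answers" looks like in miniature, here is a toy import check built on Python's standard `ast` module. It is not how import-linter works and does not use its API; the layer names and the rule are hypothetical.

```python
import ast

# Hypothetical layering rule: code in the `domain` layer must not import `web`.
FORBIDDEN = {"domain": {"web"}}


def imported_packages(source: str) -> set[str]:
    """Top-level package names imported (absolutely) by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def violations(layer: str, source: str) -> list[str]:
    """Forbidden packages that `source`, which belongs to `layer`, imports."""
    return sorted(imported_packages(source) & FORBIDDEN.get(layer, set()))
```

The answer an agent gets back is a short list such as `["web"]`, not a page of prose. That is cheap to put in a prompt and impossible to argue with, which is the property that matters.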
Different agents, different models — and a useful old distinction
A second lesson, related to the first, is that there is no reason to use the same model for every agent in the pipeline. The Triage agent is doing short, structured classification; a small, fast model is fine. The Spec agent is doing creative drafting against a constrained format; something stronger pays for itself. The Implementer benefits from the largest, most capable model we can sensibly run; the Reviewer, slightly counter-intuitively, often does better with a different model from the Implementer, because we want a second opinion rather than the same model nodding along to its own work.
Treating each role as an independent procurement decision — model, context window, tools, cost — is closer to how you would staff a real team than the prevailing “one model to rule them all” assumption.
There is a quieter benefit, too. Models trained by different labs on different data tend to fail in different ways. Run the same artefact past three models from three providers and the chance that all three miss the same defect is materially lower than running it past one model three times. This is not a new idea: it is the same logic that puts more than one auditor on a contentious set of accounts, and the same logic that underpins jury trials.[2] It travels surprisingly well into agent design. For high-consequence checks (a security review, a regulatory interpretation, a change to a financial calculation) the cost of running two or three independent models in parallel is negligible compared with the cost of an oversight reaching production.
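The arithmetic behind that claim is short enough to write down. The sketch below assumes each reviewer catches a given defect with the same probability and that their errors are fully independent, which real models will only approximate; the figures used are illustrative, not measurements.

```python
from math import comb


def p_all_miss(p_catch: float, n: int) -> float:
    """Chance that n independent reviewers all miss a defect each catches with p_catch."""
    return (1 - p_catch) ** n


def p_majority_correct(p: float, n: int) -> float:
    """Chance that a majority of n independent voters is right (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))
```

With an illustrative catch rate of 0.7, one reviewer misses the defect 30% of the time, while `p_all_miss(0.7, 3)` is about 2.7%. The same function shows why competence matters: `p_majority_correct(0.4, 3)` is roughly 0.35, worse than a single voter at 0.4. And if the three models share blind spots, the independence assumption fails and the real benefit is smaller than these numbers suggest.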
This lesson also surfaces an older distinction that has gone a bit out of fashion: probabilistic versus deterministic computing. Language models are probabilistic; they will give you a slightly different answer each time, and a wrong answer some of the time. The tools they call (the linter, the type checker, the test suite, the import graph) are deterministic; they give you the same answer every time, and within the rules they encode they are right by construction.
A well-designed agent pipeline is mostly an exercise in deciding which parts of the work belong on which side of that line. Anything that can be deterministic — schema validation, architectural rules, test execution, audit recording — should be. The probabilistic part is reserved for the work that genuinely requires judgement, and is then immediately checked by something deterministic. For governance purposes this is the only honest way to talk about reliability: not “the model is 95% accurate” (a number that means nothing on its own) but “here is the deterministic envelope inside which the probabilistic component is allowed to operate”.
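A small example of such an envelope, assuming a Triage agent that is asked to reply in JSON: the model's text is accepted only if it parses and every field is drawn from a fixed set. The field names and allowed values are hypothetical.

```python
import json

# Hypothetical vocabulary for this example.
ALLOWED_DECISIONS = {"accept", "needs_more_info"}
ALLOWED_KINDS = {"bug", "enhancement"}


def parse_triage(raw_model_output: str) -> dict:
    """Accept the model's answer only if it fits a fixed, checkable shape."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    kind = data.get("kind")
    if decision == "accept" and (not isinstance(kind, str) or kind not in ALLOWED_KINDS):
        raise ValueError(f"accepted issues need a known kind, got: {kind!r}")
    # Return only the validated fields; anything else the model said is dropped.
    return {"decision": decision, "kind": kind if decision == "accept" else None}
```

Whatever the model writes, the rest of the pipeline only ever sees one of a handful of known values or an error. The model's judgement decides which value; the deterministic code decides what counts as an answer at all.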
If your client cannot answer:
- Which agent made this change?
- Against which version of the specification?
- Who approved it, and on what evidence?
- Has the same proposal been rejected before?
then they have not deployed an agent. They have deployed a liability with a chat interface. Converting the second into the first is most of the engagement.
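The four questions map quite directly onto a record structure. The following is a minimal sketch with hypothetical field names, not the RiskNodes schema; it uses an exact hash of the diff, so it only recognises a byte-identical repeat of a rejected proposal.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    agent: str            # which agent made this change?
    spec_version: str     # against which version of the specification?
    approved_by: str      # who approved it...
    evidence: tuple       # ...and on what evidence (e.g. test run identifiers)?
    proposal_digest: str  # fingerprint of the proposed diff


def digest(diff_text: str) -> str:
    """Stable fingerprint of a proposed change."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def previously_rejected(diff_text: str, rejected_digests: set[str]) -> bool:
    """Has exactly this proposal been rejected before?"""
    return digest(diff_text) in rejected_digests


def to_log_line(record: AuditRecord) -> str:
    """One JSON object per line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

If a record like this cannot be filled in for a given change, that is usually the finding: some step in the pipeline is not recording who did what.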
Trust (a little) but Verify (more)
In the spirit of not over-selling: agents are still not good at architecture. They are good at writing code that fits the pattern of the code already there. Ask one to introduce a new abstraction, or notice that two existing abstractions should be unified, and you will get a polite, well-tested implementation of the wrong thing. Our Implementer is deliberately scoped to small, well-specified changes for this reason, and any client pilot that ignores this constraint will tell you something painful within a fortnight.
Local inference, similarly, works — and also crashes, runs out of memory, and produces silently degraded output when a graphics driver has a bad day. Anyone telling you that on-premise LLM deployment is a solved problem has not deployed one. This is solvable, but it is operational work that needs to be in the project plan.
What this means if you are scoping an agent governance engagement
A few suggestions, generalised from our own bruises:
-
Define the unit of work first. “Help with coding” is unmeasurable. “Take a written specification and produce a pull request that can be diffed against the spec” is measurable. Start narrow; broaden later.
-
Make the specification a first-class artefact. Versioned, reviewed, signed off. This is where your client’s domain expertise actually lives, and it is what they are buying governance over. The prompt is an implementation detail.
-
Build the audit trail before you build the agent. It is much easier to add agents to a system that already records every decision than to retrofit auditing onto a fleet of opaque processes.
-
Keep the orchestration boring. Cron, YAML, labels, git. The exotic framework can come later if it is genuinely needed; it usually isn’t, and “boring” is what your client’s auditor will thank you for.
-
Insist on a kill switch a non-engineer can find. “Pause all agent activity” should be a single command anyone in the building can run. If it isn’t, the project is not ready for production.
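The kill switch in the last point does not need to be sophisticated. One common pattern, sketched here with a hypothetical file location, is a pause file that every agent checks before it starts work.

```python
from pathlib import Path

# Hypothetical location; any path that every agent can read would do.
PAUSE_FILE = Path("/var/run/agents/PAUSED")


def agents_paused(pause_file: Path = PAUSE_FILE) -> bool:
    """True if someone has asked for all agent activity to stop."""
    return pause_file.exists()


def pause(reason: str, pause_file: Path = PAUSE_FILE) -> None:
    """Create the pause file, recording why, so the reason is visible later."""
    pause_file.parent.mkdir(parents=True, exist_ok=True)
    pause_file.write_text(reason + "\n", encoding="utf-8")


def resume(pause_file: Path = PAUSE_FILE) -> None:
    """Remove the pause file; does nothing if agents were not paused."""
    pause_file.unlink(missing_ok=True)
```

Each agent script would call `agents_paused()` first and exit cleanly if it returns `True`. Wrapping `pause` in a one-word command gives the non-engineer something they can actually run, and the file's contents and timestamp become part of the audit trail.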
What’s next
The current loop is enough to ship real changes to RiskNodes itself. The next steps are to replace the issue-label state machine with RiskNodes’ own workflow engine, and to publish the governance questionnaire we use to review agent pull requests as a worked example. We expect that questionnaire to be the most directly re-usable artefact for consultants designing their own engagements.
If you are scoping agent governance work for a client and any of the above sounds useful, we would like to hear from you. RiskNodes is open source under the EUPL; the agents, the pipeline definitions and the specifications are all in the public repository, available for inspection, criticism and reuse.
Previously: From TPRM to Agent Governance — why we pivoted.
Footnotes
1. “Eating your own dog food” (or dogfooding): the software industry’s daft term for acting as the first victim of your own product before inflicting it on paying customers.
2. The formal version of this argument is Condorcet’s Jury Theorem. Roughly: if each juror is more likely than chance to be right, and their judgements are independent, the probability that the majority verdict is correct rises with the size of the jury. The two assumptions are critical: competence above chance, and independence. The same two assumptions determine whether running three AI models in parallel is genuinely useful or merely expensive. The Stanford Encyclopedia of Philosophy entry on Voting Methods has a readable modern treatment.