agents-govern in this project

Language: Finnish (Suomeksi) → governance-study.md

This page describes what the agents-govern framework is, what problems it tries to solve, what "learning" means inside the framework, and how this project (blue-marlin) uses it.

The structural side (agents, communication channels, gates) lives in its own document: Agents map.

What is agents-govern?

Source, lightly adapted: agents-govern README (CC-BY-SA-4.0).

agents-govern is a governance framework for multi-agent AI systems in software development. When multiple AI agents work together on a codebase — planning, coding, testing, reviewing, deploying — they need boundaries, quality gates, and accountability. Without these, you get authority conflicts, capability drift, accountability gaps, and knowledge decay. The framework defines the structure to prevent those failures.

What this is: Governance of multi-agent collaboration in software development workflows — the boundaries, gates, and accountability needed when AI agents (and humans) collaborate to plan, code, review, test, and deploy software.

What this is not:

Not a general AI governance platform. The framework does not govern model behavior, prompt content, or AI products as artifacts.
Not an ML governance framework. It does not handle experiment tracking, model cards, dataset lineage, evaluation sign-off, or drift monitoring.
Not an agent runtime or orchestrator. It defines roles, gates, and artifacts; it does not execute agents.

The framework is open source (CC-BY-SA-4.0). This project uses v0.34.0 (installed from the release tarball on 2026-04-26).

Five problems the framework addresses

Source, lightly adapted: framework.md §1 (CC-BY-SA-4.0).

Before designing agents, understand what goes wrong without governance. These problems were identified empirically in production multi-agent systems:

Authority without boundaries. Two agents both believe they own a technical decision. The Planner scopes a feature one way; the Architect redesigns it. Neither knows the other acted — the result is incoherent oscillation between competing visions.
Capability drift. An agent asked to "improve the documentation" decides that means refactoring the codebase. A "review this PR" agent starts making its own commits. Without constraints, agents expand their scope to match their capabilities, not their mandate.
The accountability gap. Agent A delegates to Agent B, which calls Agent C, which modifies a shared resource. When something breaks, there is no trace of the delegation chain. You see the symptom but not the cause.
Local optimization, global misalignment. Each agent optimizes its local objective. The coder writes elegant code, the tester achieves high coverage, the deployer ships fast. Each is right within its own scope, but the system-level outcome can still be wrong.
Knowledge decay. What the framework has previously learned evaporates. The same bug is rediscovered repeatedly because no prior solution (or attempted solution) is recorded in any searchable form.

"Learning" in this context

agents-govern is an evidence-driven framework. That phrase has a concrete, structural meaning, spelled out below.

What "learning" is NOT

It is not AI model training — the framework does not touch model weights or fine-tuning datasets.
It is not prompt tuning for a single task.
It is not code refactoring.

What "learning" IS

A learning record in the framework is a YAML structure that captures one concrete observation from running the project's governed pipeline. Each entry contains, at minimum:

Category — gap (missing check), validation (a model confirmed to work), adaptation (a project-specific tweak), tension (a conflict between rules)
Severity — informational, minor, significant, critical
What happened — prose description of the situation
Which agents / gates / rules were involved
Which framework section the observation touches (when relevant)
Business impact (e.g. prevented_loss, escaped_to_production, etc.)

Records live in learnings/<codename>.yaml — for this project, blue-marlin.yaml.
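
As a rough illustration, the fields listed above could be modelled and validated as follows. This is a sketch only: the class name `LearningEntry`, the field names, and the sample values are assumptions made for illustration, not the framework's actual schema. Only the category and severity vocabularies are taken from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Vocabularies taken from the field list above.
CATEGORIES = {"gap", "validation", "adaptation", "tension"}
SEVERITIES = {"informational", "minor", "significant", "critical"}


@dataclass
class LearningEntry:
    """Hypothetical in-memory shape of one learning record (not the real schema)."""

    category: str
    severity: str
    what_happened: str
    involved: List[str] = field(default_factory=list)  # agents, gates, rules
    framework_section: Optional[str] = None            # only when relevant
    business_impact: Optional[str] = None              # e.g. "prevented_loss"

    def __post_init__(self) -> None:
        # Reject values outside the documented vocabularies.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not self.what_happened.strip():
            raise ValueError("what_happened must not be empty")


# Illustrative values only, loosely modelled on the Iter 13 incident.
entry = LearningEntry(
    category="gap",
    severity="critical",
    what_happened="Beam placed on top of the deck; existing invariants passed.",
    involved=["Gate 2", "Human Governor"],
)
```

Validating at construction time means a mistyped category or severity fails loudly when the record is loaded, rather than silently skewing later counts.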

Where learning leads

Learning is a feedback loop into the framework's own evolution:

An adopter project hits a gap, validates an assumption, or adapts a rule → records a learning entry
The entry is submitted upstream (issue / MR)
The InfoSec Sentinel and Contribution Auditor agents review the entry (does it leak information? is it manipulative?)
Once an observation is corroborated by multiple projects, the framework is revised: the observation becomes a rule, a new gate check, or a tier promotion
Single-adopter evidence stays provisional until a second adopter hits the same thing

This is why even individual entries are valuable: they are the raw evidence from which the framework evolves, and they do not need a "solution" at submission time.
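
The corroboration rule in the loop above can be sketched as a small function. This is a hedged illustration, not the framework's implementation: the function name `evidence_status`, the status labels, the `min_projects` parameter, and the codename `red-heron` are all assumptions. The text only states that single-adopter evidence stays provisional until a second adopter reports the same thing.

```python
from typing import List


def evidence_status(reporting_projects: List[str], min_projects: int = 2) -> str:
    """Classify an observation by how many distinct adopter projects reported it.

    Hypothetical helper: the names and status labels are illustrative only.
    """
    # Normalise codenames so the same project is not counted twice.
    distinct = {name.strip().lower() for name in reporting_projects if name.strip()}
    if not distinct:
        return "unreported"
    if len(distinct) < min_projects:
        return "provisional"   # single-adopter evidence
    return "corroborated"      # candidate for a rule, gate check, or tier promotion


print(evidence_status(["blue-marlin"]))                 # provisional
print(evidence_status(["blue-marlin", "Blue-Marlin"]))  # provisional
print(evidence_status(["blue-marlin", "red-heron"]))    # corroborated
```

Counting distinct projects rather than entries matters here: one project submitting the same observation twice does not amount to corroboration.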

This project's adoption

Setting	Value
Adoption layout	Layout B — framework vendored under `agents-govern/`
Codename	`blue-marlin` (anonymous identifier in upstream learnings)
Framework version	v0.34.0
Adoption started	2026-04-26
Active agents	6 (Agents map)
Active gates	2 (Gate 1 + Gate 2)
Human Governor	Jani Päijänen
LLM driver	Claude AI (via Claude Code)

What this project has surfaced so far

The project has captured 17 learning entries in blue-marlin.yaml. Distribution:

Category	Count	Severity	Count
`gap`	6	`critical`	1
`adaptation`	6	`significant`	5
`validation`	5	`minor`	8
		`informational`	3
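
A distribution like the one above can be recomputed from the record rather than maintained by hand. The sketch below assumes the entries have already been loaded into a list of dictionaries with `category` and `severity` keys. Those key names and the three sample entries are assumptions, not the contents of blue-marlin.yaml.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally(entries: List[Dict[str, str]]) -> Tuple[Counter, Counter]:
    """Return (category counts, severity counts) for a list of entry dicts."""
    by_category = Counter(e["category"] for e in entries)
    by_severity = Counter(e["severity"] for e in entries)
    return by_category, by_severity


# Illustrative sample only; the real record has 17 entries.
sample = [
    {"category": "gap", "severity": "critical"},
    {"category": "gap", "severity": "significant"},
    {"category": "validation", "severity": "minor"},
]

by_category, by_severity = tally(sample)

# Every entry has exactly one category and one severity,
# so both tallies must add up to the number of entries.
assert sum(by_category.values()) == sum(by_severity.values()) == len(sample)
print(dict(by_category))
print(dict(by_severity))
```

The same consistency check holds for the table above: the category column and the severity column each sum to 17.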

From the framework's perspective, the most valuable entries are the gap-class ones, where the framework did not cover the situation (three of these became upstream issues and one became a feature proposal), and the single critical-severity entry, which is a meaningful demonstration on its own:

Iter 13: A top-waling beam was placed on top of the deck (148 mm trip hazard at the deck edge). All 11 pytest invariants in place at the time passed, so the test suite did not flag the change. The Human Governor caught it in the Gate 2 visual review. → Iter 15 relocated the beam below the deck and added test_top_waling_below_deck_SAFETY as a new invariant.
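
For illustration, an invariant of the kind added in Iter 15 might look like the sketch below. The test name comes from the text above. Everything else (the `Box` type, the `build_dock` stand-in, and the coordinates) is hypothetical and does not reflect the project's actual geometry code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned part; only the vertical extent matters here."""

    name: str
    z_bottom: float  # metres
    z_top: float     # metres


def build_dock() -> Dict[str, Box]:
    """Stand-in for the real model builder; the values are made up."""
    return {
        "deck": Box("deck", z_bottom=0.50, z_top=0.53),
        "top_waling": Box("top_waling", z_bottom=0.30, z_top=0.45),
    }


def test_top_waling_below_deck_SAFETY() -> None:
    parts = build_dock()
    deck, waling = parts["deck"], parts["top_waling"]
    # The waling must sit entirely below the deck: its top may not be
    # higher than the deck underside, otherwise it can become a trip hazard.
    assert waling.z_top <= deck.z_bottom, (
        f"{waling.name} top ({waling.z_top} m) is above the deck underside "
        f"({deck.z_bottom} m)"
    )


# pytest would discover the test by name; calling it directly also works.
test_top_waling_below_deck_SAFETY()
```

An invariant like this encodes one specific safety expectation. It does not replace the visual review, which is what caught the original problem.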

Upstream proposals

ID	Topic	Status
C1	Output-level invariants (Iter 7 gap)	Submitted (issue #39)
C2	Explicit visual acceptance gate (Iter 13 gap)	Submitted (issue #40)
C3	Lowest-common-denominator output (Iter 9–10 gap)	Submitted (issue #41)
D1–D4	Documentary batch (4 minor)	Draft ready
E1	`agov-render-agents-map` (new framework command + prototype)	Draft ready

What the gates have caught

Concrete examples where Gate 2 review produced value (Gate 1 has mostly been fast-tracked in this project for small tasks):

Iter	What the gate caught	Severity
13→15	Beam placed above deck (trip hazard)	Critical
7	X-cross brace rendered horizontal due to rotation bug	Significant
7	Lower-waling z-formula placed it ABOVE the "upper" waling	Significant
6	Pytest invariants didn't catch the visual bug	Significant
9–10	DXF $INSUNITS missing — CAD tools mis-interpreted scale	Minor
14a	DXFs missing unit suffix on dimension labels	Minor
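
The `$INSUNITS` finding lends itself to an output-level check on the exported file. The sketch below scans an ASCII DXF for the `$INSUNITS` header variable using only the standard library. The function name `read_insunits` and the sample text are assumptions for illustration; a real project might prefer a dedicated DXF library.

```python
from typing import Optional


def read_insunits(dxf_text: str) -> Optional[int]:
    """Return the $INSUNITS value from an ASCII DXF string, or None if absent.

    ASCII DXF is a sequence of (group code, value) line pairs. A header
    variable name is carried by group code 9; its value is in the next pair.
    """
    lines = [line.strip() for line in dxf_text.splitlines()]
    pairs = list(zip(lines[0::2], lines[1::2]))
    for (code, value), (_next_code, next_value) in zip(pairs, pairs[1:]):
        if code == "9" and value == "$INSUNITS":
            return int(next_value)
    return None


# Minimal hand-written header fragment (illustrative, not a project file).
SAMPLE = """0
SECTION
2
HEADER
9
$INSUNITS
70
6
0
ENDSEC
0
EOF
"""

print(read_insunits(SAMPLE))        # 6 (metres in the DXF units table)
print(read_insunits("0\nEOF\n"))    # None
```

A check of this kind could run against every exported DXF and fail when the function returns `None`, or 0 (unitless), so that a missing unit declaration is caught before a CAD tool guesses the scale.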

What the study shows so far

Observation. The governance process surfaces defects that would otherwise ship unnoticed:

The Iter 13 trip hazard would have shipped without the Gate 2 visual review.
The pre-existing geometric bugs (Iter 7) would have stayed in laituri_3d.py indefinitely without review.
The discrepancy between the framework's own MANIFEST file and the project's learning record only surfaced when the Agents map prototype tried to consume both as if they were the same.

Caveats. This is illustrative, not statistical:

Sample size = 1 project. Upstream acceptance requires corroboration from more than one project (N>1).
No formal control group. REPLICATION-BRIEF.md defines a baseline task that could serve as a comparison point if anyone runs it.
The Human Governor (Jani) is also the person scheduling the work, so the process and the person cannot be separated here. "Does the governance process catch more than a careful human alone would?" remains an open question.

Deeper links

Upstream agents-govern repo
Agents map — agents, gates, communication channels
Learning record source
Upstream submissions
Replication brief (control group)

Disclaimer

The concept sections (What is agents-govern, Five problems) are adapted from the framework's own README and framework.md, both licensed CC-BY-SA-4.0. The remaining sections are this project's own content.