Our Architectural Guardrails for AI-Generated Code
Published by Manuel Rivero on 26/04/2026
Introduction.
The idea for this post came about almost half a year ago, but a flood of events, in the industry and in my personal life, kept me from writing about it until now.
During a deliberate practice session[1] where the pair I was assisting was programming using Cursor, I watched in amazement as Cursor’s autocomplete suggested using test builders to instantiate an object in production code, and nobody noticed.
I pointed out the problem to the pair so they could fix it, and advised them to be more careful with what the AI was suggesting. But I was left with the feeling that my advice was much easier said than done: constant vigilance is hard, because it’s difficult to spot problems in code that’s mostly fine. This difficulty is exacerbated when the code is generated by agents, due to the sheer volume of code they produce.
I believe we need all the help we can get in this task of evaluating AI-generated code.
What we’d like to avoid.
As we’ve already discussed, the speed and size of the increments produced by coding agents make it harder for developers to detect these problems. When evaluating code that appears mostly correct for extended periods, phenomena like code review fatigue or cognitive surrender can occur, making us more likely to accept AI-generated code (even small amounts suggested by autocomplete) without careful evaluation.
Evaluating AI-generated code gets even harder when developers have not yet built enough knowledge to have criteria for judging the design implicit in the solution the AI proposes.
Since one of the main objectives of deliberate practice sessions is precisely to develop critical skills to evaluate design options[2], we decided to support the participants with some minimal design rules, automatically enforced, that catch the basic problems we’d like to avoid[3].
1. Domain should not depend on infrastructure.
Business rules should guide software system design because they represent the core logic and purpose of the organization, which should remain independent from changing technologies or external systems. In addition, by isolating the domain from infrastructure, the business logic remains easier to test, maintain, and evolve independently of external technologies. This approach also reduces accidental complexity, avoiding the leakage of infrastructure details into the business language and preserving a cleaner, more adaptable architecture.
This rule helps us strictly follow the dependency rule described in the ports and adapters pattern.
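As a concrete illustration, the sketch below shows the dependency direction this rule enforces: the domain owns a port that an infrastructure adapter implements, never the other way around (all names are hypothetical, and two files are shown in one listing):

```typescript
// src/domain/OrderRepository.ts -- a port owned by the domain
// (illustrative names, not our actual code):
export interface OrderRepository {
  save(orderId: string): Promise<void>;
}

// src/infrastructure/PostgresOrderRepository.ts -- an adapter that
// depends inward on the domain; the domain never imports it back:
import { OrderRepository } from "../domain/OrderRepository";

export class PostgresOrderRepository implements OrderRepository {
  async save(orderId: string): Promise<void> {
    // ...database access lives here, behind the port...
  }
}
```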
2. Production code should not have circular dependencies.
Circular dependencies should be avoided because they blur module boundaries and make systems harder to understand, maintain, and safely evolve. Changes in one component tend to propagate unpredictably into the other, increasing coupling and making refactoring riskier. Cycles also create fragile initialization behavior, often causing import-order issues or partially initialized objects at runtime. Most importantly, they hurt testability: components cannot be instantiated or mocked independently, which makes unit testing and dependency injection harder and pushes teams toward slower, more brittle integration tests.
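To make the problem concrete, here is a minimal sketch of the kind of cycle this rule rejects (hypothetical modules, shown as two files in one listing):

```typescript
// invoice.ts -- needs Customer to record who gets billed:
import { Customer } from "./customer";

export class Invoice {
  constructor(readonly customer: Customer, readonly amount: number) {}
}

// customer.ts -- imports Invoice right back to keep a history,
// closing the cycle:
import { Invoice } from "./invoice";

export class Customer {
  readonly invoices: Invoice[] = [];
}
```

Keeping the association on one side only, or extracting a shared abstraction, restores a one-way dependency.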
3. Production code should not depend on tests.
Production code should not depend on tests because tests are meant to verify behavior, not define or support runtime structure. When production code relies on tests, it creates tight coupling between validation code and application logic, making the system fragile and harder to maintain. Changes to tests can unintentionally affect production behavior, undermining the reliability and independence that tests are supposed to provide, and introducing deployment risks. It also blurs responsibilities by making tests both validators and runtime collaborators, which undermines separation of concerns. Such coupling encourages the introduction of test hooks, fake abstractions, and conditional behaviors into production code, reducing cohesion and maintainability.
This may seem obvious, but it was precisely this rule that the code suggested by Cursor was breaking.
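That suggestion looked roughly like the following sketch (the names are hypothetical; what matters is the direction of the import):

```typescript
// Production code importing a builder from the test tree -- the smell
// Cursor's autocomplete suggested (hypothetical names):
import { OrderBuilder } from "../../test/builders/OrderBuilder"; // violation!

export function createDefaultOrder() {
  // Test builders exist to set up test scenarios; using one here couples
  // production behavior to test code and its "convenient" defaults.
  return new OrderBuilder().withQuantity(1).build();
}
```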
How to enforce these design rules.
We evaluated three primary enforcement strategies: architecture tests, reviews with agents, and documentation/skills. In the end, we decided to use architecture tests.
Why not skills, docs or agentic reviews: guidelines vs guardrails, determinism and costs.
Architecture tests are automated tests that can be used to verify the structure and design of our code to a certain extent, and as such, can help us enforce some design rules, acting as automated constraints that make it deterministically “difficult to do the wrong thing.”
Documentation (like AGENTS.md) and skills for agents are used to enhance the probability of generating code according to what we expect. While they are essential when working with coding agents, they function merely as guidelines, and, as such, we think they are too weak to enforce design rules like the ones mentioned above. We think that, in the case of design rules like ours, we need to move from “best effort” to “guaranteed compliance,” shifting our focus from guidelines toward guardrails.
Reviews with agents, while being a form of guardrail, introduce a layer of probabilistic uncertainty and recurring costs that make them less reliable and more expensive, respectively, than architecture tests.
The following analysis explains in more detail why we prioritize deterministic architecture tests over the other options, evaluating them across two critical dimensions: likelihood of adherence, and token cost and efficiency.
1. Architecture Tests (a reliable guardrail).
Architecture tests turn objective design rules into executable code[4].
a. Likelihood of adherence.
Architecture tests are fully deterministic. If we integrate them into our CI/CD pipeline, we achieve a 100% adherence rate, because a rule violation physically prevents the build from passing: they act as binary gates for the rules. They are automated fitness functions[5] that ensure the enforcement of the design rules.
b. Token Cost and Efficiency.
These tests execute locally on the CPU with zero token cost. They provide instant feedback during the inner development loop without hitting any external API, making them the most cost-effective and lowest-latency option.
2. Docs/Skills (guidelines).
Documentation and skills rely on human memory and/or the LLM’s ability to follow instructions.
a. Likelihood of adherence.
Documentation and skills increase the probability that generated code follows our design rules.
Under pressure or high cognitive load, humans tend to prioritize task completion over structural constraints. In a similar way, an agent, due to the non-deterministic nature of LLMs, may ignore instructions or take shortcuts in its eagerness to complete tasks.
Agents can also suffer from their own version of high cognitive load: instruction-following degrades as their context window gets too full, a phenomenon known as context rot.
The amount of documentation we feed an agent can worsen context rot, because every rule added to a loaded “skill” library or prepended to a prompt consumes space in the context window. We need to be careful when deciding which skills and documentation are worth the space they occupy there.
Even if we manage the context window effectively, the important point is that design rules defined as guidelines are susceptible to “drift”: agents may bypass or forget guidelines.
b. Token Cost and Efficiency.
Every loaded “skill” or piece of documentation prepended to a prompt leads to higher per-request costs and increased latency.
We also saw how context rot produces a performance degradation that may require more attempts to complete a given task, increasing the number of requests.
Unlike “free” local architecture tests, enforcing genuinely objective design rules like ours through documentation or skills increases operating costs.
3. Reviews with Agents (an unreliable guardrail).
This method uses a second LLM pass to “audit” code against a set of standards (passed as documentation or skills).
a. Likelihood of adherence.
Because agentic reasoning is non-deterministic, this kind of enforcement will be “best-effort” rather than guaranteed. It is probabilistic and inconsistent by nature. An agent may overlook subtle violations if the primary logic looks correct. This creates an unreliable feedback loop where the same violation might pass one review and fail the next.
b. Token Cost and Efficiency.
This approach introduces a massive “token tax”: every line of code must be sent and processed twice, once for generation and once for verification. This leads to escalating API costs and higher latency, making constant evaluation of design rules a costly recurring expense, and a source of happiness for your AI-inference providers 😅.
Conclusion.
We hope this analysis has communicated the reasons why we prefer architecture tests over the other two options for enforcing design rules like ours: they are an automatic, cost-effective, unambiguous, deterministic, and reliable way to enforce objective rules.
Linters vs. architecture tests for design rules.
We also experimented with linters as guardrails to enforce our design rules. From the point of view of likelihood of adherence, as well as token cost and efficiency, linters are as valid as architecture tests. Both approaches can automatically reject changes that violate predefined constraints, making them equally viable enforcement mechanisms.
However, we prefer architecture tests over linters because we find that architecture tests make the intent of design rules easier to read. With architecture tests, design rules are usually expressed directly in code, and described by the name or description of the test itself. In linters, the same rules tend to live in JSON or configuration files, in a more cryptic format that is harder to interpret at a glance.
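For contrast, here is a sketch of how the first two rules might be expressed as linter configuration, assuming eslint-plugin-import’s no-restricted-paths and no-cycle rules (the paths are illustrative):

```typescript
// eslint.config.ts (sketch): the same kind of rules as linter config,
// assuming eslint-plugin-import; paths are illustrative.
import importPlugin from "eslint-plugin-import";

export default [
  {
    files: ["src/**/*.ts"],
    plugins: { import: importPlugin },
    rules: {
      // Rule 1: domain must not import from infrastructure.
      "import/no-restricted-paths": [
        "error",
        { zones: [{ target: "./src/domain", from: "./src/infrastructure" }] },
      ],
      // Rule 2: forbid circular imports.
      "import/no-cycle": "error",
    },
  },
];
```

The constraints are equivalent, but the intent hides behind rule identifiers and option objects, whereas a failing architecture test announces the violated rule by its name.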
Our minimal architecture tests.
We wrote these architecture tests with the help of a coding agent and then did “agentic mutation testing” on the code to see if they really detected violations of the rules (we’ll talk more about agentic mutation testing, and how to offload it to determinism, in a future post).
For the deliberate practice sessions.
In the deliberate practice sessions we are using architecture tests written in TypeScript, built on basic functionality from the ArchUnitTS architecture testing library, to enforce our minimal design rules.
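A minimal sketch of the shape these tests take, assuming ArchUnitTS’s fluent projectFiles() API and its Jest-style toPassAsync matcher (the folder globs are illustrative, not our exact layout):

```typescript
// Minimal architecture tests (sketch) with ArchUnitTS; folder globs
// are illustrative and would need adapting to the actual project layout.
import { projectFiles } from "archunit";

describe("minimal design rules", () => {
  it("domain should not depend on infrastructure", async () => {
    const rule = projectFiles()
      .inFolder("src/domain/**")
      .shouldNot()
      .dependOnFiles()
      .inFolder("src/infrastructure/**");
    await expect(rule).toPassAsync();
  });

  it("production code should have no circular dependencies", async () => {
    const rule = projectFiles().inFolder("src/**").should().haveNoCycles();
    await expect(rule).toPassAsync();
  });

  it("production code should not depend on tests", async () => {
    const rule = projectFiles()
      .inFolder("src/**")
      .shouldNot()
      .dependOnFiles()
      .inFolder("test/**");
    await expect(rule).toPassAsync();
  });
});
```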
For a client’s project.
We have also added architecture tests to a project my colleague Fran Reyes and I are developing for a client.
We hadn’t felt the need to automatically enforce design rules in this project until we started using agents to delegate some tasks. When Fran and I were pair programming without agents, we never produced the problems those rules are meant to avoid, probably thanks to keeping a more human pace and sharing a mental model through pair programming. But now that we’re using coding agents, we decided that adding architecture tests to the project would be helpful.
In this project we are using architecture tests written in Java. They contain more rules than the ones we use for the deliberate practice sessions. In this case, we are using the ArchUnit architecture testing library (which I think is a complicated beast) to enforce the design rules.
Conclusions.
The frequency and size of AI-generated code increments often outpace our natural capacity for vigilance. The risk of cognitive surrender or fatigue is real. When an LLM consistently produces code that looks “mostly right,” it becomes increasingly difficult to spot subtle design drifts. We think that relying solely on guidelines is insufficient to maintain high design standards under these conditions.
We believe that to successfully integrate AI agents into our workflow, we must complement guidelines with guardrails. While documentation and skills can help steer an LLM in the right direction, they remain probabilistic and context-heavy. By contrast, architecture tests offer a deterministic, cost-effective way to ensure compliance. They allow us to at least offload the mental burden of structural verification to the CPU, enforcing our core design rules regardless of who, or what, is writing the code.
By implementing minimal architecture tests, we have created a safety net that protects our core design rules. These tests function as automated fitness functions that provide instant, local feedback. This approach continually enforces our design rules without burning any tokens.
We are using these tests in both deliberate practice sessions and client projects, where the amount of AI-generated code is growing. In our experience, they have proven very useful: while we cannot always prevent an agent from generating code that breaks a desired design rule, guardrails like architecture tests can certainly prevent that code from ever becoming a permanent part of our system.
Acknowledgements.
I’d like to thank Fernando Aparicio, Fran Reyes and Antonio de la Torre for giving me feedback about several drafts of this post.
References.
- Harness engineering for coding agent users, Birgitta Boeckeler
- Harness engineering beyond skills: Using sensors to keep your coding agent in check, Birgitta Boeckeler, Chris Ford
- Building Evolutionary Architectures, 2nd edition, Neal Ford, Rebecca Parsons, Patrick Kua, Pramod Sadalage
- Enforcing Software Architecture With Architecture Tests, Milan Jovanović
Notes.
[1] Our deliberate practice sessions with Audiense are designed with two goals in mind:
- Keep fundamental engineering skills strong, avoiding skill atrophy induced by AI.
- Introduce new technical practices and ideas to the team on a continuous basis.
[2] In the deliberate practice sessions, we try to transmit a style of OO design based on the ports and adapters pattern combined with Responsibility-driven design, Domain-Driven Design principles, other OOP principles, and when they apply, some classical OOP design patterns.
[3] We could also have banned AI completely during deliberate practice sessions. Instead, we decided to work without agents but still allow AI autocomplete in the IDEs. The reason is that we believe generating code at a more “human pace” is more instructive for learning and practicing the type of skills we want to develop and maintain.
[4] According to Birgitta Boeckeler “LLMs are great for exploratory and fuzzy rules, but once you have [a rule] that really is objective, converting it to a formal, unambiguous, deterministic format can give you more assurance”.
Our design rules are objective, so they are best implemented in a formal, unambiguous, and deterministic format such as architecture tests or linter rules.
[5] To learn more about fitness functions, have a look at chapter 2 of Building Evolutionary Architectures, 2nd edition.