We've been running agents against production APIs for about a year. They do a lot of work that used to take us weeks. There is one thing no model does well yet, and it is the thing that actually matters.
Nothing on the market can model authorization as a graph.
You can hand a frontier model every line of an API definition and it'll inventory every endpoint, list every role annotation, and produce a clean spreadsheet of what's protected by what. That part works. The part that doesn't work is asking the same model whether someone could route through a sequence of those endpoints and end up somewhere they shouldn't be. The model doesn't think in those terms. We've tried different scaffolding for it. It still mostly doesn't.
This isn't really a model limitation. It's a representation problem. The codebase taught the model that auth is a list of annotations, because that's what the codebase says auth is. The actual security question is in the spaces between the annotations, and the spaces don't have annotations on them.
The same gap shows up in every security review we've read.
What the list misses
Chains are invisible to the list. Two endpoints each enforce their own rule correctly. Sequence them and the user ends up somewhere neither rule was meant to allow. Every check passed. The user is still in the wrong place.
Write-then-read is invisible to the list. An endpoint lets the user set a value the system later trusts. In the list view, the endpoint took input, validated it, returned 200. Job done. In the graph view, it just connected the user's authority to a node where new edges open up.
Authority inheritance through state is invisible to the list. The user transitions the system into a configuration where the system itself calls a privileged operation on their behalf. The check on that operation fires against the system, not the user. The list says auth ran. It did. Against the wrong principal.
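To make the list view and the graph view concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the operation names, the "role:" and "state:" labels, and the simplification that state only accumulates. Each operation carries the check an annotation would express (requires) plus the thing no annotation records (produces). Every individual check passes, and the walk still ends somewhere a plain user was never meant to be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    name: str
    requires: frozenset  # what the annotation checks (the list view)
    produces: frozenset  # state the call leaves behind (the graph view)


# Hypothetical API surface, invented for illustration only.
OPERATIONS = [
    Operation("create_project", frozenset({"role:user"}), frozenset({"state:owns_project"})),
    Operation("set_webhook_url", frozenset({"state:owns_project"}), frozenset({"state:webhook_set"})),
    Operation("trigger_sync", frozenset({"state:webhook_set"}), frozenset({"state:internal_fetch"})),
]


def walk_forward(initial):
    """Apply every operation whose check passes until nothing new is produced.

    Returns the accumulated state and the order in which operations fired.
    Assumes state is monotone (never revoked), which real systems may not honor.
    """
    held = set(initial)
    path = []
    progressed = True
    while progressed:
        progressed = False
        for op in OPERATIONS:
            # The check passes, and the call would add something we lack.
            if op.requires <= held and not op.produces <= held:
                held |= op.produces
                path.append(op.name)
                progressed = True
    return held, path


held, path = walk_forward({"role:user"})
print(path)
print("state:internal_fetch" in held)
```

In the list view all three operations are protected. In the graph view, "role:user" reaches "state:internal_fetch" in three hops, and the last hop is performed by the system rather than by the user.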
Why the agents inherit it
The framework teaches them the list view. Spring annotations, NestJS guards, GraphQL field directives, tRPC procedure wrappers. They all say the same thing: this thing requires that role. None of them says: here are the states this thing produces, here's what those states unlock.
We've tried prompting around it. We've tried giving the agent a dedicated pass whose only job is to walk the graph. It works on small surfaces. Get above a few dozen state-mutating operations and the agent loses the topology. Whatever bug the agent finds at that point is usually one a careful reviewer would have caught on a first pass anyway. The interesting bugs, the ones that pay, sit deeper in the graph. Finding them requires holding more of the graph in your head than the agent will hold.
That isn't a complaint about the model. It's the most useful thing we've learned this year about where the human edge actually is.
What works
Build the graph. Not in a tool. In the review.
When you review an auth decision, ask which other nodes in the system can reach the node you're checking. For each of those, ask the same question. Keep walking until you stop finding new starting points. The findings are at the leaves.
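The walk itself is a backward reachability search. The sketch below shows its shape on a toy graph; it is not a tool recommendation, and the node names and edges are made up. The point is only to show what "keep walking until you stop finding new starting points" means mechanically, and what a leaf is: a node that can reach the one under review and has nothing upstream of it.

```python
from collections import defaultdict, deque

# Hypothetical edges: (src, dst) means "after src, dst becomes reachable".
EDGES = [
    ("anonymous_signup", "create_project"),
    ("create_project", "set_webhook_url"),
    ("set_webhook_url", "trigger_sync"),
    ("admin_console", "trigger_sync"),
    ("trigger_sync", "internal_fetch"),
]


def walk_back(target, edges):
    """Return (every node that can reach target, the subset with no predecessors)."""
    predecessors = defaultdict(set)
    for src, dst in edges:
        predecessors[dst].add(src)

    seen = {target}
    queue = deque([target])
    leaves = set()
    while queue:
        node = queue.popleft()
        upstream = predecessors.get(node, set())
        if not upstream:
            leaves.add(node)  # nothing leads here: this is a starting point
        for prev in upstream:
            if prev not in seen:
                seen.add(prev)
                queue.append(prev)

    seen.discard(target)
    return seen, leaves


reachers, leaves = walk_back("internal_fetch", EDGES)
print(sorted(leaves))
```

On this graph the leaves are admin_console, which is expected, and anonymous_signup, which is the finding.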
If you can't do that walk in your head, your system has outgrown the people who maintain it. That's the real observation. Whatever vulnerability ships next quarter is downstream of it.
What we'd tell you not to do
Don't buy a tool that promises to do this for you. None of them does.
Don't assume your linter or scanner is checking the chains. It's checking the annotations.
Don't treat new endpoints as feature work. Adding an endpoint adds at least one edge to the graph by definition. That's a security event even when no annotation has changed.
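One way to see why a new endpoint is a security event is to compare what a given principal can reach before and after it lands. The sketch below does that on a toy graph. The endpoint names, the "share_invoice" addition, and the edge model are all assumptions made for illustration.

```python
def reachable(start, edges):
    """Every node reachable from start by following (src, dst) edges."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)

    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def newly_reachable(start, edges, added_edges):
    """Nodes that start can reach only once added_edges exist."""
    return reachable(start, edges + added_edges) - reachable(start, edges)


# Hypothetical graph before the change.
BEFORE = [
    ("guest", "view_invoice"),
    ("billing_admin", "export_invoices"),
    ("export_invoices", "download_archive"),
]

# Hypothetical new endpoint: no existing annotation was touched.
ADDED = [
    ("view_invoice", "share_invoice"),
    ("share_invoice", "export_invoices"),
]

delta = newly_reachable("guest", BEFORE, ADDED)
print(sorted(delta))
```

No check on export_invoices or download_archive changed, yet both are now downstream of guest. A diff like this is the question to ask in review, whether or not anything automates it.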
We have not yet seen a non-trivial company that takes any of this seriously at the platform level. We've also not seen a non-trivial company that doesn't pay us for the findings that result. Those two facts are the same fact.