Classifying agent actions in microseconds: the 3-tier cascade that lets you check every tool call

Last week I wrote about a CSV agent that exfiltrated customer data and a small policy that stopped it. A few people asked the same follow-up offline — "sure, but how does the policy decide what to allow without slowing every tool call to a crawl?" — so this week is the engine post.

I would say how you classify each call is the single decision most teams get wrong when they bolt security onto an agent. The default instinct is to put an LLM judge in front of every tool call. It sounds elegant. It is also, in practice, the reason guardrails get ripped out two weeks after they are added: the agent becomes too slow, the bill becomes too high, and the team quietly moves on.

There is a cheaper way. Honestly it is not even clever, it is just discipline about what gets escalated to what.

The frame: cascade, not single-stage

I would model this as a 2×2, with coverage on one axis and cost on the other. Keyword and regex give you huge coverage at near-zero cost for the cases that are obvious (anything starting with requests.post is SEND, anything starting with shell.exec is EXECUTE). A registry gives you precise coverage for the named tools you actually use in your stack. An LLM gives you the long tail (the custom internal tool nobody documented, the new MCP server somebody installed yesterday) at LLM prices and LLM latency.

[Figure: Coverage vs per-call cost for a 3-tier action classifier. Tier 1 keyword (cheap and broad) is the sweet spot, Tier 2 registry is precise and cheap for known tools, Tier 3 LLM is necessary only as a fallback for novel tools.]

The mistake is putting one of those alone in front of every call. Keyword alone misses the long tail. LLM alone is what kills you on p99.

The cascade tries the cheapest signal first and escalates only if the previous tier abstained:

Tier 1 — keyword/regex on the tool name. Microseconds. Catches HTTP verbs, shell exec, file writes, common DB mutations.
Tier 2 — registry lookup. Sub-microsecond dict lookup against a small set of known tools (LangChain community tools, MCP servers, your own internal ones). The fingerprint is pre-classified.
Tier 3 — LLM classifier. Only fires for the tool calls neither tier recognises. gpt-4o-mini with a 4-token output cap, one of {READ, WRITE, SEND, EXECUTE}.
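
The three tiers above can be sketched in a few lines. This is a minimal illustration, not the reference engine: the verb lists, the two registry entries, and the names classify, tier1, and tier2 are all assumptions of mine, and the LLM tier is passed in as a plain callable so the sketch runs without an API key.

```python
import re
from typing import Callable, Optional, Tuple

def _segment(*verbs):
    # Match a verb only as a whole segment of a dotted or underscored tool name.
    return re.compile(r"(?:^|[._])(?:%s)(?:$|[._])" % "|".join(verbs), re.IGNORECASE)

# Tier 1: ordered (pattern, label) pairs. Illustrative verb lists (assumption).
TIER1 = [
    (_segment("exec", "shell", "subprocess", "eval"), "EXECUTE"),
    (_segment("post", "put", "patch", "send", "upload"), "SEND"),
    (_segment("write", "delete", "insert", "update", "drop"), "WRITE"),
    (_segment("get", "read", "list", "select", "fetch"), "READ"),
]

# Tier 2: pre-classified known tools. Hypothetical entries.
TIER2 = {"slack_notifier": "SEND", "wikipedia_search": "READ"}

def tier1(tool: str) -> Optional[str]:
    for pattern, label in TIER1:
        if pattern.search(tool):
            return label
    return None  # abstain

def tier2(tool: str) -> Optional[str]:
    return TIER2.get(tool)  # None means abstain

def classify(tool: str, llm_fallback: Callable[[str], str]) -> Tuple[str, int]:
    """Return (label, tier_that_answered). Escalate only when a tier abstains."""
    for tier_number, check in ((1, tier1), (2, tier2)):
        label = check(tool)
        if label is not None:
            return label, tier_number
    # Tier 3: only reached when both cheap tiers abstained.
    return llm_fallback(tool), 3
```

In use, classify("requests.post", my_llm) returns at Tier 1 without ever touching my_llm, which is the whole point: the expensive callable is only invoked for names neither cheap tier recognises.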

I built it, ran a benchmark, and the numbers were sharper than I expected

I wrote a vendor-neutral reference engine (about 150 lines, link below) and ran 10,000 synthetic tool calls through it. The workload is mixed — common verbs, named registry tools, and a long tail of custom-sounding names that Tier 1 and Tier 2 cannot classify. For Tier 3 I capped real OpenAI calls at 100 so the LLM latency number is measured, not assumed.

Tier	What it does	Coverage	p50	p95	p99
1	Keyword / regex	78.25%	1.9 µs	4.8 µs	6.6 µs
2	Registry lookup	14.69%	0.7 µs	0.9 µs	1.1 µs
3	LLM (gpt-4o-mini)	7.06%	~1.07 s	~1.93 s	~15.9 s
End-to-end (weighted)	All three	100%	~1.8 µs	~1.07 s	~1.07 s

A few things worth pausing on.

The Tier 3 p50 is one full second for gpt-4o-mini. The p99 was about 16 seconds. That is not a "bad day at OpenAI" number. It is what I actually measured when calling the API serially from a script. And gpt-4o-mini is the cheap one. I did not benchmark gpt-4o or Claude for this, but I would expect a larger model to be slower, not faster.

The Tier 2 p50 of 0.7 µs is faster than Tier 1, which looks wrong at first, but the reason is mechanical — Tier 2 is a single dict lookup, Tier 1 is a regex loop. When Tier 1 matches, it does real work. When you compose them in the cascade, Tier 1 still runs first because it has higher coverage and zero registry-maintenance cost.
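
If you want to see the dict-versus-regex-loop gap on your own machine, a rough micro-benchmark like the one below is enough. The patterns and registry entries are made up for illustration, and the absolute numbers will differ from my table depending on hardware and Python version; treat it as a sketch of the comparison, not a reproduction of the benchmark.

```python
import re
import timeit

# Hypothetical Tier 1 patterns and Tier 2 registry, for timing only.
PATTERNS = [re.compile(p) for p in (
    r"^shell\.exec",
    r"^requests\.(post|put|patch)",
    r"^file\.write",
    r"^(get|read|list)_",
)]
REGISTRY = {"wikipedia_search": "READ", "slack_notifier": "SEND"}

def regex_loop(tool):
    # Tier 1 style: walk every pattern until one matches.
    for pattern in PATTERNS:
        if pattern.search(tool):
            return True
    return False

def dict_lookup(tool):
    # Tier 2 style: one hash lookup.
    return REGISTRY.get(tool)

n = 100_000
tool = "wikipedia_search"  # misses every regex, hits the registry
regex_us = timeit.timeit(lambda: regex_loop(tool), number=n) / n * 1e6
dict_us = timeit.timeit(lambda: dict_lookup(tool), number=n) / n * 1e6
print(f"regex loop: {regex_us:.2f} us/call, dict lookup: {dict_us:.2f} us/call")
```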

End-to-end p50 is microseconds. End-to-end p95 is a second. That cliff is the entire point of this post.
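
The arithmetic behind the cliff is short. Assuming every Tier 3 call is slower than every Tier 1 or Tier 2 call (true by several orders of magnitude here), an end-to-end percentile above the fast tiers' combined share lands somewhere inside the Tier 3 latency distribution. The helper name tier3_percentile is mine; the 7.06% share is from the table.

```python
def tier3_percentile(overall_pct, tier3_share_pct):
    """Map an end-to-end percentile onto the Tier 3 latency distribution.

    Returns None when the percentile is still inside the fast tiers.
    Assumes all Tier 3 calls are slower than all Tier 1/2 calls.
    """
    fast_share = 100.0 - tier3_share_pct
    if overall_pct <= fast_share:
        return None
    return (overall_pct - fast_share) / tier3_share_pct * 100.0

TIER3_SHARE = 7.06  # percent of calls that reached the LLM in the benchmark

for pct in (50, 95, 99):
    where = tier3_percentile(pct, TIER3_SHARE)
    if where is None:
        print(f"p{pct}: answered by Tier 1 or Tier 2")
    else:
        print(f"p{pct}: about the {where:.0f}th percentile of Tier 3 latency")
```

With a 7.06% escape rate, p50 never leaves the fast tiers, while p95 and p99 both sit inside the LLM's distribution, so they are measured in LLM time rather than microseconds.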

The cliff is what most teams miss

If 100% of your tool calls go through an LLM judge, your p50 is the LLM p50 — one second. Every call. With the 3-tier cascade, 93% of calls finish in single-digit microseconds and only the 7% novel ones pay the LLM cost.

But here is the part that is uncomfortable — the 7% does not disappear. It shows up in p95 and p99. If you genuinely cannot tolerate a one-second tail on any tool call, then "LLM as fallback" is still too slow, and you have to either (a) make Tier 2 wider so fewer calls escape, or (b) accept that novel tools get blocked by default and require a human to add a registry entry. Both are reasonable. "Just use the LLM" is the one that is not.
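
Option (b) is simple enough to sketch. This is one hypothetical shape for it, not something in my reference engine: DefaultDenyGate, Decision, and the review queue are names I am making up here, and Tier 1 is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Decision:
    blocked: bool
    label: Optional[str]
    reason: str

@dataclass
class DefaultDenyGate:
    registry: Dict[str, str]                      # tool name -> action class
    review_queue: List[str] = field(default_factory=list)

    def check(self, tool: str) -> Decision:
        label = self.registry.get(tool)
        if label is None:
            # Unknown tool: block it and queue it for a human, no LLM call.
            if tool not in self.review_queue:
                self.review_queue.append(tool)
            return Decision(True, None, "unknown tool, queued for review")
        return Decision(False, label, "registry hit")

    def approve(self, tool: str, label: str) -> None:
        # A reviewer adds the registry entry; later calls take the fast path.
        self.registry[tool] = label
        if tool in self.review_queue:
            self.review_queue.remove(tool)
```

The trade is explicit: the tail latency goes away because the LLM is never called, and in exchange the first call to any new tool fails until someone reviews it.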

So a better question for a team that already has an agent in production is — how many distinct tool names did your agent call last week? If it is under 200, you can register most of them and your Tier 3 escape rate drops to almost nothing. If it is in the thousands and growing weekly, registry alone will not save you and you need to think harder about whether unknown tools should be allowed at all without a review.
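
Answering that question from a log takes a few lines. The sketch below assumes you can extract a flat list of tool names from last week's calls; the sample log and the escape_rate helper are hypothetical, and it ignores whatever Tier 1 would have caught, so it overstates the true escape rate.

```python
from collections import Counter

def escape_rate(tool_calls, registered):
    """Fraction of calls whose tool name is not in the registered set."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    escaped = sum(n for tool, n in counts.items() if tool not in registered)
    return escaped / total

# Hypothetical week of tool calls.
calls = ["search"] * 6 + ["send_email"] * 3 + ["custom_x"] * 1

counts = Counter(calls)
print("distinct tool names:", len(counts))

# Register the two most frequent tools and see what still escapes.
top_two = {tool for tool, _ in counts.most_common(2)}
print("escape rate with top 2 registered:", escape_rate(calls, top_two))
```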

What I would do differently next time

A couple of things I noticed building this.

First, the regex order in Tier 1 matters a lot. At one point I had the read_* pattern ahead of the HTTP-mutating-verb pattern, and requests.post was being classified as READ: post matched nothing in the verb table at that point, so the broader pattern fired instead. Specific patterns must go first. Obvious in hindsight.
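
Here is the same class of bug in miniature. These two patterns are not the ones from my engine; they are a hypothetical pair chosen so that a broad rule and a specific rule both match requests.post, which is all it takes for a first-match-wins loop to go wrong.

```python
import re

# Hypothetical patterns: one specific, one broad catch-all.
SPECIFIC = (re.compile(r"^requests\.(post|put|patch|delete)\b"), "SEND")
BROAD = (re.compile(r"^requests\."), "READ")

def first_match(tool, patterns):
    # First-match-wins, like a Tier 1 regex loop.
    for pattern, label in patterns:
        if pattern.search(tool):
            return label
    return None

wrong_order = first_match("requests.post", [BROAD, SPECIFIC])
right_order = first_match("requests.post", [SPECIFIC, BROAD])
print("broad first:   ", wrong_order)
print("specific first:", right_order)
```

With the broad pattern first, the mutating call is labelled READ; swapping the order fixes it without changing either pattern.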

Second, Tier 3 is where I would most want a small fine-tuned classifier instead of a general LLM. The task is genuinely a 4-class classification of short strings. I have not built it, but I would expect a 50-MB model running locally to beat gpt-4o-mini on latency by orders of magnitude, with no per-call API cost. For this post I used the LLM because the point was to show what most people actually run today.

Third, if you are doing this in production, cache aggressively at Tier 3. The same novel tool name will appear again, and the second time it does, it should be a Tier 2 entry, not another LLM call. My reference engine deliberately skips this, because I wanted the benchmark to reflect a worst-case Tier 3 cost.
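
A minimal sketch of that promotion, assuming an in-memory dict as the registry and any callable as the LLM tier. PromotingClassifier is a name I am inventing here, and a real deployment would probably want a human to review promoted labels, since a wrong LLM answer would otherwise be cached indefinitely.

```python
from typing import Callable, Dict

class PromotingClassifier:
    """Registry lookup with LLM fallback; LLM answers are promoted to the registry."""

    def __init__(self, registry: Dict[str, str], llm_classify: Callable[[str], str]):
        self.registry = registry
        self.llm_classify = llm_classify
        self.llm_calls = 0

    def classify(self, tool: str) -> str:
        label = self.registry.get(tool)
        if label is None:
            # First sighting of this tool name: pay the LLM cost once.
            label = self.llm_classify(tool)
            self.llm_calls += 1
            # Promote the answer so the next call is a plain dict lookup.
            self.registry[tool] = label
        return label
```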

Code and data

All of it is in the repo. The Tier 1 + Tier 2 portion runs on your machine in about a minute. The Tier 3 portion needs an OpenAI key and costs a fraction of a cent for 100 calls: https://github.com/suyog-trivedi/agent-attack-zoo/tree/main/action-classifier-3tier

If you re-run with your own workload, I would be curious what your tier-mix looks like. My 78 / 15 / 7 split is from a synthetic mix that I think mirrors single-agent / framework-tool setups reasonably well. Multi-agent and MCP-heavy stacks are likely very different — more diverse tool surface, fatter tail, more Tier 3.

Closing question — if you are running an LLM classifier in front of every tool call today, what does your actual p99 tool-call latency look like, and what fraction of those calls genuinely needed the LLM? Curious to hear the real numbers, not the ones we hope for.

Opinions are my own, not my employer's.