A lot of AI-agent discussions focus on whether the agent completed the task. I think that misses a category: the agent completes the task, but does it in an unsafe or policy-violating way.
For example, an agent could finish the job but use the wrong tool, skip an approval step, expose private information, or take an action that should have been blocked.
In our ACM CAIS 2026 paper, we call this the Verifier Tax. The idea is to separate:
- safe success
- unsafe success
- failure
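To make the three-way split concrete, here is a minimal sketch of how an episode could be labeled. The names `Outcome` and `classify` are hypothetical, and treating any incomplete task as a failure (even if it also broke a rule) is my assumption for illustration, not necessarily the paper's exact definition.

```python
from enum import Enum


class Outcome(Enum):
    SAFE_SUCCESS = "safe success"
    UNSAFE_SUCCESS = "unsafe success"
    FAILURE = "failure"


def classify(task_completed: bool, violations: list[str]) -> Outcome:
    """Label one agent episode (hypothetical helper, not from the paper).

    Assumption: an incomplete task is a failure regardless of violations.
    """
    if not task_completed:
        return Outcome.FAILURE
    # The task was completed; any recorded policy violation makes it unsafe.
    return Outcome.UNSAFE_SUCCESS if violations else Outcome.SAFE_SUCCESS


# Example: the agent finished the job but skipped an approval step.
label = classify(task_completed=True, violations=["skipped approval step"])
print(label.value)
```

With a plain completed/not-completed metric, the example above would simply be counted as a success.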
We studied this in tool-using LLM agent scenarios using τ-bench and proposed a two-tier verification architecture: deterministic checks first, then an LLM-based verifier for more contextual cases.
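A rough sketch of what a two-tier setup like this could look like, not the paper's implementation. Everything here is assumed for illustration: the `Action` and `Verdict` types, the tool names, the allowlist and approval rules, and the idea that the second tier only runs when no deterministic rule blocks the action. The LLM-based verifier is passed in as a plain callable so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: str
    args: dict


@dataclass
class Verdict:
    allowed: bool
    tier: str  # which tier made the decision
    reason: str


# Hypothetical policy, invented for this example.
ALLOWED_TOOLS = {"lookup_order", "issue_refund"}
NEEDS_APPROVAL = {"issue_refund"}


def deterministic_check(action: Action) -> Optional[Verdict]:
    """Tier 1: cheap, rule-based checks. Returns None if no rule blocks."""
    if action.tool not in ALLOWED_TOOLS:
        return Verdict(False, "deterministic", "tool not on allowlist")
    if action.tool in NEEDS_APPROVAL and not action.args.get("approved"):
        return Verdict(False, "deterministic", "missing approval")
    return None


def verify(action: Action, llm_verifier: Callable[[Action], bool]) -> Verdict:
    """Run tier 1 first; fall through to the contextual verifier (tier 2)."""
    verdict = deterministic_check(action)
    if verdict is not None:
        return verdict
    ok = llm_verifier(action)  # stand-in for an LLM-based judgment
    reason = "contextual check passed" if ok else "contextual check failed"
    return Verdict(ok, "llm", reason)


# Stub verifier that approves everything, used only to exercise the flow.
blocked = verify(Action("issue_refund", {"amount": 20}), lambda a: True)
print(blocked.allowed, blocked.tier, blocked.reason)
```

In this sketch, blocked actions never reach the second tier, so the more expensive contextual check is only spent on cases the rules cannot decide.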
The main takeaway: verification can make agents safer by reducing unsafe success, but it may also reduce task completion as tasks get longer.
Paper: https://dl.acm.org/doi/full/10.1145/3786335.3813160
Curious what people think: if an AI agent completes a task but violates a safety rule, should that count as success or failure?