Towards Identifying the Economics and Efficiency of Fuzzers vs. Agents
• Mike Shema

Courtesy British Library (1875.c.19)
Agents and LLMs have gained favor as the method for finding flaws, but how would we measure their economics and efficiency against a decade of successful fuzzing? As methods for bug hunting, they're neither mutually exclusive nor so overlapping as to be redundant. So how would we design a process for deciding which one to run and when?
Fuzzing has had great success! "As of May 2025, OSS-Fuzz has helped identify and fix over 13,000 vulnerabilities and 50,000 bugs across 1,000 projects."1
I've always loved fuzzing as a way to find software quality problems. Some of those problems have security impacts; others are implementation mistakes. All of them are crashes that should be fixed, and they sit at the high-signal end of the quality spectrum.
Megacycles and Megavolume
In the past six months or so, we've seen a big attention shift to LLMs finding flaws across open source projects from the Linux kernel to memory misuse in C-based projects to fun findings in Vim and Emacs. We've shifted from burning CPU cycles for fuzzing to burning GPU cycles for agents.
Clearly, the UX and onboarding steps to run an agent against a codebase are far superior to using a fuzzer -- write a sentence or two and you're done. I'll never diminish the importance of UX for any tool, especially in security.
But it still makes me wonder how to evaluate the economics and efficiency of running a fuzzer vs. an agent (or collection thereof) against a codebase. There's a one-time investment in instrumenting a project with a fuzzer, followed by much lower maintenance while letting it run. And the nature of fuzzing makes it more likely to trigger memory safety issues, although it still has the potential to surface other classes of vulns, like path traversal and weakly enforced security boundaries.
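To make that one-time investment concrete: the core loop of a mutation fuzzer is small. Below is a toy, self-contained sketch in Python -- the target `parse_header` and its planted bug are invented for illustration, and a real harness would instead plug into a coverage-guided engine like libFuzzer or Atheris.

```python
import random

def parse_header(data: bytes) -> int:
    """Toy target with a planted bug: a length field that can index past
    the end of the buffer -- the kind of crash fuzzers excel at finding."""
    if len(data) < 4 or data[:2] != b"HD":
        return 0
    length = data[2]
    return data[3 + length]  # IndexError when the length overruns the buffer

def mutate(seed: bytes) -> bytes:
    """Flip 1-3 random bytes of the seed input."""
    data = bytearray(seed)
    for _ in range(random.randint(1, 3)):
        data[random.randrange(len(data))] = random.randrange(256)
    return bytes(data)

def fuzz(target, seed: bytes, iterations: int = 10_000):
    """Run the mutate-and-execute loop; any uncaught exception is a finding."""
    for _ in range(iterations):
        candidate = mutate(seed)
        try:
            target(candidate)
        except Exception:
            return candidate  # high signal: a concrete crashing input
    return None

random.seed(0)  # deterministic for the sketch
crasher = fuzz(parse_header, b"HD" + b"\x00" * 6)
```

The point isn't the code; it's the cost structure it implies. Once the harness exists, every additional iteration -- and every additional finding -- costs only CPU cycles.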
Megacost or Microexpense
Is there any research on cost comparisons of fuzzing vs. LLMs? Any good papers on token costs related to running agents as code reviewers? I've tracked a few articles about CTF-style research by agents that put token costs at roughly $10 per run per file (with ~3-4 runs to guarantee a finding) and the average AIxCC costs at around $152 per competition task.
A critical step in an evaluation of efficiency would be to normalize that cost between agents and fuzzing. Normalizing per repo is too coarse. Per file or per LOC might be better since it's more granular. But AIxCC's per task might be best in terms of findings, assuming that "task" can be sufficiently defined. It's also important to note that AIxCC had several mixed approaches, from "AI-first with traditional validation" to "systems rooted in fuzzing...and enhanced them with LLMs."2
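As a back-of-envelope check on how those two figures relate (using the article's rough numbers, which are estimates, not measurements):

```python
# Back-of-envelope normalization using the rough figures cited above.
cost_per_run = 10.0          # ~$10 of tokens per agent run per file
runs_to_confirm = 3.5        # midpoint of the ~3-4 runs to guarantee a finding
cost_per_file_finding = cost_per_run * runs_to_confirm  # ~$35 per file

aixcc_cost_per_task = 152.0  # average AIxCC cost per competition task

# How many per-file agent passes does one AIxCC "task" budget buy?
files_per_task_budget = aixcc_cost_per_task / cost_per_file_finding

print(cost_per_file_finding)            # 35.0
print(round(files_per_task_budget, 1))  # 4.3
```

In other words, one AIxCC task costs about the same as agent-reviewing four or five files to a confirmed finding -- which is why settling on the denominator matters so much.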
I'd love to find any updated resource or references on this economic aspect of agents. Let me know where I should be looking!
Discovery vs. Analysis
I noted that fuzzers have high signal. If they cause an app to crash, that's a bug to be fixed. But that bug isn't necessarily one that impacts security (aside from the generic availability problem of crashing). LLMs, on the other hand, have the potential to craft an exploit for a bug. Having an exploit adds context that helps prioritize and better understand the consequences of a bug.
But here's where I'd also distinguish what audience is taking in that context and what action they expect to take. There's a difference between an org trying to figure out how to keep thousands of dependencies up to date and a project owner improving their own code quality. Not that project owners don't have their own priorities and time pressures, but sometimes a bug takes less time to fix than it does to thoroughly analyze.
I'd rather development teams focus on fixing bugs and refactoring their architecture to reduce their attack surface and eliminate classes of vulns. Yet I begrudgingly acknowledge that security teams want some sort of analysis about bugs. Not that they always need such an analysis, but they sure seem to want them from CVEs.
Where I'll most closely watch the discovery vs. analysis distinction is in the Linux kernel. The kernel devs have a very specific attitude towards bugs:
...due to the layer at which the Linux kernel is in a system, almost any bug might be exploitable to compromise the security of the kernel, but the possibility of exploitation is often not evident when the bug is fixed. Because of this, the CVE assignment team is overly cautious and assign CVE numbers to any bugfix that they identify.
Criteria and Considerations
So after all this preamble, which one wins? What does it even mean to win? How do we design a process that's cost-effective and efficient with fuzzers and agents and tokens and processors?
I no longer toy with the hypothesis that fuzzing is cycle-for-cycle more cost-effective than agents at discovering bugs. It feels like the ship has sailed in terms of agents being embraced for security.
Thus, I'll combine discovery and analysis and reframe the question to, "What's a cost-effective method of identifying software quality issues?"
- What are the operator costs to establish and maintain a harness for fuzzing, for agents?
- What's the operator UX for using a fuzzer, an agent?
- What are the compute costs to execute a fuzzer, an agent?
- What is the cost per LOC? Is LOC even a good denominator?
- What constraints detract from either approach's success? e.g., fuzzers need a fully compiling app while agents can work from a single file (or even a PR!?); fuzzing is mostly agnostic to code complexity while agents are bound by context windows; and programming language matters, compiled vs. interpreted.
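One way to operationalize these criteria is a single normalized metric: amortize the setup cost over a campaign, add the per-unit run cost, and divide by confirmed findings. A minimal sketch follows -- every number in it is an invented placeholder to show the shape of the comparison, not data:

```python
def cost_per_finding(setup_cost: float, unit_cost: float,
                     units: float, findings: int) -> float:
    """Total campaign spend divided by confirmed findings.
    `units` is whatever denominator you pick: CPU-hours for a fuzzer,
    per-file agent runs, LOC, or AIxCC-style tasks."""
    if findings == 0:
        return float("inf")  # spent money, found nothing
    return (setup_cost + unit_cost * units) / findings

# Invented placeholders: a fuzzer with a costly harness but cheap
# CPU-hours, vs. an agent with near-zero setup but pricey runs.
fuzzer_cost = cost_per_finding(setup_cost=2000, unit_cost=0.05,
                               units=10_000, findings=25)
agent_cost = cost_per_finding(setup_cost=0, unit_cost=35,
                              units=40, findings=25)
print(fuzzer_cost, agent_cost)  # 100.0 56.0
```

The interesting behavior is the crossover: the fuzzer's setup cost amortizes away the longer the campaign runs, while the agent's per-run cost scales linearly with how much code you point it at.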
Notably, my personal criteria care less about volume (although I do care about variety of vuln classes), because I want to avoid the trap of maximizing a CVE count. Chasing vulns without a strategy to avoid them is just BugOps.
Quality as Consequence
All of the previous criteria are about the economics and efficiency of finding bugs. But my actual success criterion boils down to a simpler motivating question:
"What fosters better software design that improves code quality and reduces the prevalence of vuln classes?"
Finding bugs is important, and exploiting them can be fun, but for me the most rewarding thing is preventing them in the first place.
(Post adapted from my original one on LinkedIn on April 6, 2026.)
From the "Trophies" section at https://github.com/google/oss-fuzz. Sadly, it hasn't been updated for 2026. ↩︎