LLM and Bug Finding: Insights from a $2M Winning Team in the White House's AIxCC

>>garlic+(OP)
AIxCC is an AI Cyber Challenge launched by DARPA and ARPA-H.

Notably, a zero-day vulnerability in SQLite3 was discovered and patched during the AIxCC semifinals, demonstrating the potential of LLM-based approaches in bug finding.

replies(2): >>rfoo+On >>hypeat+vs1

>>garlic+(OP)
I'm part of the team, and we used LLM agents extensively for smart bug finding and patching. I'm happy to discuss some insights, and share all of the approaches after grand final :)

replies(4): >>doctor+Uh >>simonw+7j >>adrago+3k >>wslh+Ht

>>hqzhao+Mb
Everyone thinks bug bounties should be higher. How high should they be? Who should pay for them?

replies(2): >>hqzhao+Bj >>tptace+Wp

>>hqzhao+Mb
What kind of LLM agents did you use?

replies(1): >>hqzhao+ak

>>doctor+Uh
It really depends on the target and the quality of the vulnerability. For example, low-quality software on GitHub might not warrant high bug bounties, and that's understandable. However, critical components like KVM, ESXi, WebKit, etc., need to be taken much more seriously.

For vendor-specific software, the responsibility to pay should fall on the vendor. When it comes to open-source software, a foundation funded by the vendors who rely on it for core productivity would be ideal.

For high-quality vulnerabilities, especially those that can demonstrate exploitability without any prerequisites (e.g., zero-click remote jailbreaks), the bounties should be on par with those offered at competitions like Pwn2Own. :)

replies(4): >>tptace+1q >>doctor+Eq >>logica+dw >>77pt77+1A

>>hqzhao+Mb
Hey, congrats on getting to the finals of AIxCC!

Have you tested your CRS on weekend CTFs? I’m curious how well it’d perform compared to other teams.

replies(1): >>hqzhao+jl

>>simonw+7j
Based on popular pre-trained models like GPT-4, Claude Sonnet, and Gemini 1.5, we've built several agents designed to mimic the behaviors and habits of the experts on our team.

Our idea is straightforward: after a decade of auditing code and writing exploits, we've accumulated a wealth of experience. So, why not teach these agents to replicate what we do during bug hunting and exploit writing? Of course, the LLMs themselves aren't sufficient on their own, so we've integrated various program analysis techniques to augment the models and help the agents understand more complex and esoteric code.

replies(2): >>simonw+dt >>dogma1+DU

>>adrago+3k
Thanks!

We haven't tested it yet. Regarding CTFs, I have some experience. I'm a member of the Tea Deliverers CTF team, and I participated in the DARPA CGC CTF back in 2016 with team b1o0p.

There are a few issues that make it challenging to directly apply our AIxCC approaches to CTF challenges:

1. *Format Compatibility:* This year’s DEFCON CTF finals didn’t follow a uniform format. The challenges were complex and involved formats like a Lua VM running on a custom Verilog simulator. Our system, however, is designed for source code repositories like Git repos.

2. *Binary vs. Source Code:* CTFs are heavily binary-oriented, whereas AIxCC is focused on source code. In CTFs, reverse engineering binaries is often required, but our system isn’t equipped to handle that yet. We are, however, interested in supporting binary analysis in the future!

>>garlic+B
Notably, a previously undiscovered, trivial NULL pointer dereference in SQLite3's SQL parser was discovered and patched. But yeah, it makes very good marketing material.

replies(1): >>hqzhao+5o

>>rfoo+On
It's not a critical issue, but it was surprising since we didn’t know that SQLite3 would be one of the challenges before the competition.

>>doctor+Uh
Who thinks bug bounties should be higher? Why? Everybody definitely does not think this.

replies(1): >>vasco+8L

>>hqzhao+Bj
Google and Apple bounties on zero-click remotes exceed the prize amounts I see from Pwn2Own?

>>hqzhao+Bj
It seems really hard for people to like, name some vulnerabilities, name some prices. I'm glad you are playing along. Which scenario makes more sense:

    The Punchline: Microsoft pays $10m for vulnerabilities like the kind used to exploit SolarWinds and the Azure token audience vulnerability.

    The Status Quo: Thousands of people pay CrowdStrike a total of billions of dollars, in exchange for urgent patching when vulnerabilities become known.

Okay, do you see what I am getting at? On the one hand, if you pay bug bounties, the bugs get fixed, and they sure seem expensive. But if you look into how much money is spent on valueless security theatre, it is a total drop in the bucket. But CrowdStrike hires security researchers!

So what should the prices really be? For which vulnerabilities? The SolarWinds issue is probably worth more than $10m, if people are willing to pay 100x more to CrowdStrike for nothing.

replies(2): >>saagar+XC >>necove+JF

>>garlic+(OP)
The AIxCC booth felt like it was meant for a tradeshow as opposed to being a place where someone could learn something.

replies(1): >>hqzhao+ws

>>rocksk+5s
I heard that the AIxCC booth prepared the same challenges for the audience to solve manually, but I didn’t check the details.

I believe there will be even more cool stuff in next year’s grand final. If you want to get a sense of what to expect, check out the DARPA CGC from 2016. :)

replies(1): >>rocksk+ku

>>garlic+(OP)
BTW, have you seen the new LLM-based offensive tools such as XBOW [1]? They just received a funding round from Sequoia Capital [2].

[1] https://xbow.com/

[2] https://www.sequoiacap.com/article/partnering-with-xbow-the-...

>>hqzhao+ak
When you call these things “agents” what do you mean by that? Is this a system prompt combined with some defined tools, or is it a different definition?

replies(1): >>tinco+nK

>>hqzhao+Mb
Congrats! ELI5: what insights do you have NOW that were not published/researched extensively in academic papers and/or publicly discussed yet?

>>hqzhao+ws
I hope that booth is gone for good. Def Con doesn't need marketers with a blank check putting a booth there. Leave that garbage at Black Hat.

replies(1): >>rocksk+eK

>>hqzhao+Bj
p2o is pathetically low in comparison to other markets. is your experience limited to legitimate bug bounty programs like that?

>>hqzhao+Bj
> KVM, ESXi, WebKit, etc., need to be taken much more seriously.

OpenSSL

>>doctor+Eq
The real question here is who is willing to pay $10 million for such a bug.

replies(1): >>tptace+1M1

>>doctor+Eq
It's not that simple: those billions of dollars are not just for this particular issue, or even just for security support.

It's also a difference between keeping a software engineer on staff and hiring a contractor as needed. One is cheaper for the company even if the hourly rate is higher.

The better question is how we can improve the overall security of the software we write, which this article is more focused on. But we understand that there will be bugs, and security bugs even, no matter how hard we try.

Even DJB (of qmail fame) and Knuth (of TeX and TAOCP fame) pay out bug bounties, and they heavily focus on software correctness over large feature sets.

>>rocksk+ku
To clarify - I hope your "more cool stuff" doesn't mean more fog machines and LED strips. And some of the companies that seemed to ride DARPA's coattails there made my skin crawl. No slight on DARPA themselves.

>>simonw+dt
An agent in this context is software that uses LLM prompt results to determine its next action, often looping to iteratively get to a good result.
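
A minimal sketch of that loop, assuming a stubbed model rather than a real LLM API. Everything here is hypothetical and for illustration only: `fake_model`, the `read_file` and `report` tools, and the file contents are invented, and this is not the team's actual system.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tools the agent may call; names and contents are illustrative.
def read_file(arg: str) -> str:
    files = {"parser.c": "if (p->next == NULL) return p->next->value;"}
    return files.get(arg, "<no such file>")

def report(arg: str) -> str:
    return f"reported: {arg}"

TOOLS: Dict[str, Callable[[str], str]] = {"read_file": read_file, "report": report}

def fake_model(history: List[str]) -> Tuple[str, str]:
    """Stand-in for an LLM call: maps the transcript so far to (action, argument)."""
    if not any(h.startswith("read_file") for h in history):
        return ("read_file", "parser.c")
    if not any(h.startswith("report") for h in history):
        return ("report", "possible NULL dereference in parser.c")
    return ("done", "")

def run_agent(model: Callable[[List[str]], Tuple[str, str]],
              max_steps: int = 10) -> List[str]:
    """Ask the model for the next action, run it, feed the result back, repeat."""
    history: List[str] = []
    for _ in range(max_steps):  # hard cap so a confused model cannot loop forever
        action, arg = model(history)
        if action == "done":
            break
        result = TOOLS[action](arg)
        history.append(f"{action}({arg}) -> {result}")
    return history

if __name__ == "__main__":
    for step in run_agent(fake_model):
        print(step)
```

In a real agent, `fake_model` would be replaced by a call to an actual model, with the history serialized into the prompt. The step cap and the fixed tool table are the parts that keep the loop bounded.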

>>tptace+Wp
There are always two or three people in every thread repeating the same thing without any understanding of marketplace dynamics. If you ask them how much it should be, you also get wild answers that don't reflect reality.

>>garlic+(OP)
this is really impressive work. coverage-guided and especially directed fuzzing can be extremely difficult. it's mentioned that fuzzing is not a dumb technique. I think the classical idea is kind of dumb, in the sense of 'dumb fuzzers', but these days there is tons of intelligence built around it and poured into it, and i've always thought it's now beyond the classic idea of fuzz testing. i had colleagues who poured their souls into trying to use git commit info etc. to help find potentially bad code paths, with coverage-guided fuzzing then trying to get in there. I really like the little note at the bottom about this. adding such layers kind of does make it lean towards machine learning nowadays, and i'd think perhaps fuzzing is not the right term anymore. i don't think many people are actually still simply generating random inputs and trying to crash programs like that.

this is really exciting new progress in this type of field, guys. well done! can't wait to see what new tools and techniques will come out of all of this research.

Will you guys be open to implementing something around libafl++ perhaps? i remember we worked with that extensively. As a lot of shops use that already, it might be cool to look at integration into such tools, or would you think this deviates so far it'll amount to a new kind of tool entirely? Also, the work on datasets might be really valuable to other researchers. there was a mention of wasted work, but labeled sets of data around CVE, bug and patch commits can help a lot of folks if there's new data in there.

this kind of makes me miss having my head in this space :D cool stuff and massive congrats on being finalists. thanks for the extensive writeup!

>>hqzhao+ak
Are you going to publish your RAG strategy?

>>garlic+B
Are there any write-ups or CVE pages on that vulnerability? From a quick search, I can't find anything.

>>saagar+XC
Nobody. That far exceeds the current market prices of the most in-demand bugs.

replies(1): >>doctor+CS1

>>tptace+1M1
What is this market you speak of? Can you link me to it and show me the prices you are talking about? The Microsoft key vulnerability leaked all the State Department emails, and probably a lot more. It could have been used to compromise a lot of Azure. What is comparable?

>>garlic+(OP)
What's the good word!!
