
24 Feb 2025
Introducing BaxBench
LLM-generated applications are here. Some well-known tools now offer to turn anyone into an app developer, while others aim to make current developers more productive. Given the concerns about the security, support, and future development of these quickly built apps, we wanted to measure the problem with a benchmark. Our focus wasn't on any specific tools but on the LLMs that power them. We present BaxBench, a benchmark of 392 instances: 28 scenarios for LLMs to implement using 14 different frameworks across 6 languages, including Django, ExpressJS, Flask, and Ruby on Rails. Together with the team at ETH Zurich, we have created baxbench.com, which hosts the leaderboard and further benchmark details.
We analyzed backend applications generated by LLMs, with a specific focus on whether they are exposed to real security exploits. Backends are integral to many systems and vary in complexity; some are designed to manage the entire state of an application, while others are constructed by integrating multiple specialized services, known as microservices. Applications rely on one or more security-critical backends to perform tasks such as handling logins, managing application state, and storing user data. To evaluate these systems, we developed BaxBench, which consists of small, frequently seen tasks for application backends. The backends are granted access to a database for storage, and the LLMs are tasked with generating their logic based on predefined OpenAPI specifications. Our findings revealed that many of these backends were not secure, as we were able to execute actual attacks against them. This goes beyond mere analysis or tool-generated warnings about hypothetical security issues: we successfully executed real exploits, including SQL injection, path traversal, and user impersonation. It is crucial to emphasize that our specifications did not suggest any vulnerabilities; the vulnerabilities we exploited arose from the outputs generated by the LLMs.
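To make the class of issue concrete, here is a minimal, hypothetical sketch of the kind of insecure login handler a model might generate for such a scenario. It is our own illustration (Flask with SQLite), not an actual BaxBench scenario or model output:

```python
# Hypothetical illustration, not an actual BaxBench scenario or model output.
# A login handler that builds its SQL query via string formatting can be
# bypassed entirely with a crafted username.
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/login", methods=["POST"])
def login():
    username = request.json.get("username", "")
    password = request.json.get("password", "")
    conn = sqlite3.connect("app.db")
    # Vulnerable: user input is concatenated directly into the query string.
    query = (
        "SELECT id FROM users "
        f"WHERE username = '{username}' AND password = '{password}'"
    )
    row = conn.execute(query).fetchone()
    conn.close()
    if row is None:
        return jsonify({"error": "invalid credentials"}), 401
    return jsonify({"user_id": row[0]})

# Sending username = "admin' --" turns the password check into a comment,
# so the query matches the admin row without knowing any password.
```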
One interesting finding is that we can ask LLMs to fix these vulnerabilities, and they manage to fix many of them. They do best when we tell them exactly what we will be trying to exploit. However, even then, not every vulnerability goes away, and there is a trade-off: we measure that some apps stop working properly once their security issues are fixed. This creates a big opportunity for tools like the ones we are building at LogicStar, which can both identify and fix these security issues. And of course, the benchmark is open source, so security and application development experts can help us add more scenarios or new ways to exploit vulnerabilities. In addition, we expect that LLMs themselves will also get better thanks to benchmarks like BaxBench.
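As a sketch of what such a targeted fix looks like (again our own illustration rather than actual model output), the vulnerable lookup from the earlier handler becomes safe once the query is parameterized:

```python
import sqlite3

def check_login(username: str, password: str):
    # Safe variant of the lookup from the earlier sketch: placeholders let the
    # database driver escape user input, so "admin' --" is treated as a literal
    # username instead of SQL. (Real code would also hash passwords rather
    # than compare them in plain text.)
    conn = sqlite3.connect("app.db")
    try:
        row = conn.execute(
            "SELECT id FROM users WHERE username = ? AND password = ?",
            (username, password),
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```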
Looking deeper, it’s clear that correctness and security aren’t the only challenges. LLMs also struggle to produce reliable code across different backend frameworks, especially those that aren’t the most popular ones. Engineers see firsthand that LLMs can have trouble with complex and varied tasks, which means the results aren’t always perfect. Sometimes, even the best tools get stuck and can’t improve an app through more LLM prompting alone. However, to make progress, you have to start by measuring the problem. With BaxBench, we looked at security; going forward, at LogicStar we are focusing on measuring and improving how well models understand existing apps, fix the real problems that come up when maintaining them, and ultimately make end users happier.
For more work on maintaining software, follow our research, blogs, and social media. If you’re running an app with Python backends, we’d love to talk about how our early-access product can help maintain it by fixing bugs.
Need more information? Have a look at the paper and the baxbench.com website.
Please file issues or contribute to the benchmark code.