We are excited to share SWT-Bench, the first benchmark for reproducing bugs and validating their fixes based on (GitHub) issue descriptions. We presented SWT-Bench at two ICML workshops and want to thank everyone who stopped by for their interest, enthusiasm, and the great discussions we had. We now see a community trend toward not only fixing bugs but also generating tests that effectively reproduce them and validate that proposed fixes truly resolve the underlying issues. We believe this is essential for achieving truly autonomous bug fixing, which is what LogicStar delivers.
In our paper, we demonstrate how any code repair benchmark with a known ground-truth solution can be transformed into a test generation and issue reproduction benchmark. There, the goal is to create a “reproducing test” that fails on the original codebase and passes once the ground-truth fix has been applied. Our analysis shows that Code Agents excel at this task and outperform dedicated LLM-based test generation methods. Leveraging these tests during code repair further allows us to significantly improve precision. To learn more, please check out our preprint.
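To make the fail-to-pass criterion concrete, here is a minimal sketch in Python of how one might check whether a generated test reproduces an issue. This is not the SWT-Bench harness itself; `repo_dir`, `test_id`, and `fix_patch` are hypothetical inputs, and the sketch assumes a git checkout with pytest available:

```python
import subprocess


def run_test(repo_dir: str, test_id: str) -> bool:
    """Run a single pytest test in repo_dir; return True if it passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", test_id],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0


def apply_patch(repo_dir: str, patch_file: str, reverse: bool = False) -> None:
    """Apply (or with reverse=True, undo) a unified diff via git."""
    cmd = ["git", "apply"] + (["-R"] if reverse else []) + [patch_file]
    subprocess.run(cmd, cwd=repo_dir, check=True)


def is_reproducing_test(repo_dir: str, test_id: str, fix_patch: str) -> bool:
    """A test reproduces the issue iff it fails on the original codebase
    and passes after the ground-truth fix is applied (fail-to-pass)."""
    fails_before = not run_test(repo_dir, test_id)
    apply_patch(repo_dir, fix_patch)
    try:
        passes_after = run_test(repo_dir, test_id)
    finally:
        # Restore the original codebase regardless of the outcome.
        apply_patch(repo_dir, fix_patch, reverse=True)
    return fails_before and passes_after
```

The same check also explains why reproducing tests improve repair precision: a candidate patch that makes such a test pass is far more likely to actually resolve the reported issue than one that merely compiles or passes the pre-existing test suite.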
LogicStar.AI builds on this research to deliver truly autonomous bug fixing that you can trust the way you trust your top engineers.