Claude Code Can Find Bugs. LogicStar Finds the Ones That Matter.
Code agents are great at investigating and fixing known bugs. But are they good at finding the important ones across a real codebase?
To answer this question, we ran a case study comparing Claude Code to LogicStar’s bug finder on a snapshot of our own codebase.
Key result: LogicStar found 2x more unique high-impact bugs than Claude Code with LogicStar-generated rules, at less than one-third the cost per unique high-impact bug.
Setup
We compared three different bug finders:
- LogicStar’s internal bug finding engine
- Vanilla Claude Code using the standard code-review skill
- Claude Code prompted with codebase knowledge and Bug Finding Rules provided by LogicStar
In all three settings, we used our bug validation engine to confirm whether the found bugs were true positives and assess their severity. This means the results are comparable across all settings. In our product, all bugs go through this validation before they are shown to customers.
To assess which approach works best, we considered six metrics:
- File coverage: what percentage of source files the approach inspected
- Surfaced issues: how many potential issues the approach reported
- Unique validated bugs: how many surfaced issues were confirmed as real bugs after deduplication
- Unique high-impact bugs: how many unique validated bugs were classified as high impact
- Total cost: token spend for one scan
- Cost per unique high-impact bug: token spend divided by unique high-impact bugs found
Results
LogicStar surfaced 40 issues across 43% of source files. After validation, 28 were confirmed as real bugs. Of those, 8 were high impact and 20 were medium impact. The remaining 12 surfaced issues were filtered out as false positives.
The scan used multiple specialized sub-agents and cost $47 in token spend.
Claude Code, in contrast, surfaced only 10 issues, 7 of which were bugs. None of them were high impact and only 3 were medium impact, with the rest being so low impact that we would filter them out instead of showing them to customers. It explored only 7% of the codebase, leading to a low cost of $5.
This is not unexpected. Code agents are not well suited for open, codebase-spanning tasks. They were designed and trained for concrete tasks that require locating the right part of the codebase as context and then executing. Bug finding requires scanning through most of the codebase and looking for anomalies while understanding the relevant context.
To help Claude Code, we prompted it with 20 of our Bug Finding Rules. These describe failure modes specific to the codebase under review, including the areas of the codebase where they are likely to appear. This significantly increased the depth of the scan, covering 29% of files and surfacing 166 issues at a cost of $78.
After validation and deduplication, 78 were unique real bugs, 4 of which were unique high-impact bugs.
Comparison
What the Metrics Do Not Capture
LogicStar learns from your feedback, both explicit and implicit. It observes which bugs you actually end up fixing and focuses on showing you more bugs of these types.
LogicStar not only filters out false positives, but also remembers what slipped through and got designated as false positive by you. It can avoid repeatedly showing the same false positives, while a standalone Claude Code scan does not retain this product-level feedback memory by default.
Conclusion
Vanilla code agents are not enough for production-grade open-ended bug finding. They cover only a small part of the codebase and find few high-impact issues.
To unlock better performance, code agents need to be instructed to look for specific patterns in the right parts of the codebase. However, they still find many false positives and only surface a small fraction of high-impact bugs. This results in a lot of noise to dig through and comes at a higher cost compared to more optimized solutions.
LogicStar’s specialized bug finding system produced the most high-impact bugs with the least noise, covering the largest part of the codebase at moderate cost. In addition, it comes with the validation layer required to filter out false positives and the memory layer to learn from feedback and avoid repeatedly showing the same false positives.
Try the no-login LogicStar demo.
Claude Code Can Find Bugs. LogicStar Finds the Ones That Matter.
LogicStar found 2x more unique high-impact bugs than Claude Code with LogicStar-generated rules, at less than one-third the cost per unique high-impact bug.

An anonymized medical application was preparing for wider hospital rollout.
The product had been built with high velocity. A significant portion of the application had been developed with AI coding agents, similar to how many new applications are now being built in 2026.
The workflows were in place.
The application appeared ready.
But not every line of code had been manually reviewed in depth. Not every trust boundary had been tested against the real operational model. Not every endpoint had been checked against how the application could fail once real staff, patients, schedules, permissions, exports, and clinical workflows were involved.
That is where production readiness risk usually appears.
Not in the obvious places.
It appears in the gaps between authentication and authorization. In service-role database access. In PDF ingestion assumptions. In patient ownership checks. In frontend session state. In timezone handling. In import and overwrite flows.
At a glance
The first 24 hours produced a concrete hardening cycle, not an abstract risk report.
The application owner fixed 23 production-relevant issues across frontend and backend workflows, with one finding reviewed and rejected as not applicable.

The result was not just a cleaner backlog.
It was a stronger release posture before broader exposure to hospital users.
The challenge
AI coding agents are changing how software is built.
They make it possible to generate large amounts of working product code quickly.
That is useful.
But faster software output does not automatically create production readiness.
In sensitive applications, the hard problems are often not visible in the UI. They are hidden in permission boundaries, role models, patient ownership checks, data mutation paths, session transitions, schedule handling, and operational edge cases.
For a medical application preparing for hospital rollout, those gaps matter.
They can affect privacy, auditability, clinical workflow integrity, staff trust, and release confidence.
What LogicStar surfaced
LogicStar identified issues across both backend and frontend workflows.
The findings clustered into six production-risk categories:
- Authorization and role separation, including staff-to-admin privilege escalation risk.
- Patient-linked data boundaries, including missing ownership checks and cross-context access paths.
- Clinical workflow integrity, including protocol and schedule handling issues.
- Data integrity, including partial-update and overwrite paths that could leave inconsistent state.
- Frontend session state, including stale or misleading authentication and user-context behavior.
- Release readiness, including issues that could create support escalations, emergency patches, or delayed rollout if found later.
This grouping matters because production risk is rarely caused by one isolated bug.
It usually appears when many small implementation assumptions meet real users, real data, real permissions, and real operational workflows.
Representative high-risk issue: staff could create admin accounts

One representative high-risk issue was a staff-to-admin privilege escalation bug. The issue was not complex. It was a trust-boundary mistake. A staff-only invitation endpoint checked whether the caller was staff. But it failed to check whether that staff user should be allowed to create administrators.
The endpoint accepted: role = admin
It then used a service-role database client to invite the user and assign the admin role. That meant the normal database permission layer could not block the escalation.
Authentication passed.
Authorization failed.
The practical result was serious: Any staff user who could call the endpoint could create a new full-admin account.
In a standard SaaS application, that is already a high-impact authorization bug. In a medical application, the risk is much larger. A full-admin account can potentially access sensitive operational workflows, patient-linked records, exports, configuration, staff administration, audit-relevant data, and internal system controls.
This kind of issue can quickly move from a software defect into an operational incident.
It can create:
- unauthorized administrative access
- exposure of sensitive medical or patient-linked data
- privacy investigation risk
- privacy, security, or breach-notification assessment
- emergency patching before or after rollout
- support escalations from clinical users
- delayed hospital deployment
- loss of trust in the application
- additional audit and release-governance work
This is the type of issue that should be fixed before wider release, not discovered after real users are already depending on the system.
The broader pattern
The privilege escalation issue was not isolated.
LogicStar also surfaced issues across backend and frontend workflows that reflected the real risk profile of an AI-built medical application moving toward production.
Examples included:
- A missing ownership check that could expose medication intake logs across patients.
- A server-rendered export page that could load patient metadata before the client-side admin gate ran.
- A protocol upload flow that could attach treatment schedules to the wrong patient.
- An overwrite flow that could mutate previous protocol rounds instead of the active one.
- Unchecked database mutation errors that could leave old and new protocol state coexisting.
- Fixed UTC+1 timestamping that could shift Zurich medication reminders during daylight saving time.
- PDF highlight scanning that could miss required blood-test or ultrasound monitoring items when the relevant table was not on the first page.
- Frontend session and refresh-token handling that could put users into incorrect local authentication states.
- Frontend consent and context flows that could bypass expected checks or expose stale state.
Each issue looks like an implementation detail in isolation.
Together, they represent the difference between:
“The application works in a demo.” And “The application is ready for real clinical use.”
Why this mattered in a medical software context
In medical software, production defects can become more than bugs. They can create access-control risk, privacy review risk, auditability gaps, release-governance concerns, and certification-readiness blockers.
These issues may matter for GDPR, HIPAA, MDR, FDA medical-device software readiness, or internal hospital risk, security and ethics review. LogicStar does not determine legal compliance or issue regulatory certification. It helps teams surface, prioritize, and remediate production-relevant issues earlier, creating stronger technical evidence for security reviews, privacy assessments, audit-readiness work, certification-readiness work, and release-governance decisions before wider rollouts or widespread incidets.
The goal is to reduce the chance that preventable software defects are first discovered by clinicians, patients, support teams, auditors, or incident responders.
What changed in the first 24 hours
The result was not an abstract risk report.
It was a concrete hardening cycle.
Within the first 24 hours:
- 23 issues were fixed
- 12 backend issues were fixed
- 11 frontend issues were fixed
- 1 issue was reviewed and rejected as not applicable
- authorization and data-integrity issues were addressed
- frontend session and workflow issues were corrected
- the application moved closer to production readiness before broader hospital rollout
The rejected issue was reviewed and dismissed with a clear explanation:
“There are no manual entries allowed. There is also no way to enter manual entries.”
That is the right outcome.
The point is not to maximize the number of findings.
The point is to identify issues that are real, explain why they matter, and help the application owner decide what deserves engineering attention before the product reaches a wider user base.
What could have happened without this hardening step
If these issues had reached wider hospital usage, the risk would not have been limited to engineering inconvenience.
The application could have faced:
- emergency fixes after launch
- staff confusion from incorrect permissions or stale session state
- patient data exposure investigation
- incorrect or missing clinical schedule items
- wrong-patient protocol linkage
- medication reminder timing errors
- audit trail inconsistency
- support burden during rollout
- delayed adoption by clinical teams
- privacy and security review escalation
The cheapest time to find these issues is before release and before users are impacted.
The lesson
AI coding increases software output.
But software output is not the same as production readiness.
In medical applications, hidden gaps in authorization, patient-linked data boundaries, state handling, and workflow logic can create real privacy, auditability, and clinical operations risk.
LogicStar helps teams identify and fix the issues that matter before wider release.
Faster shipping.
Fewer surprises.
Safer production rollouts.
Preparing an AI-built application for production?
LogicStar helps engineering teams identify and fix release-critical issues before users, customers, or operational teams absorb the risk.
Request a production-readiness review:
Hardening an AI-Built Medical Application Before Hospital Release
An anonymized medical application was preparing for wider hospital rollout after being built with substantial AI assistance. In the first 24 hours after LogicStar was set up, the application owner fixed 23 production-relevant issues across frontend and backend workflows.

We Study Where Agents Fail. Then We Design Around It.
AI coding agents are improving rapidly.
But writing code is only a small part of software engineering.
The harder questions are:
- Did the agent identify the right problem?
- Is the root cause correct?
- Is the fix actually safe?
- Will it work across a large codebase?
- Can we trust the evaluation?
- Does more context actually help?
At LogicStar, we believe the future of software engineering will be determined by answering these questions, not by generating more code.
That's why we spend significant effort studying where agents fail.
Over the last several years, our team has built a series of benchmarks, each focused on a different weakness of software engineering agents.
FixedBench (COLM 2026)
Failure mode: Action bias.
Agents often modify code even when the correct action is to do nothing. FixedBench studies whether agents can distinguish between code that is broken and code that is already correct.
SWT-Bench (NeurIPS 2024)
Failure mode: Verification.
Can agents reproduce real-world bugs and generate tests that prove a fix actually works?
BaxBench (ICML 2025 Spotlight)
Failure mode: Security.
Can agents build backend systems that are not only functional but secure?
CodeTaste (ICML 2026)
Failure mode: Repository-scale refactoring.
Can agents perform large-scale code transformations while preserving behavior and maintainability?
SWA-Bench (ICML 2025)
Failure mode: Evaluation.
How do we automatically generate realistic software engineering tasks that accurately measure agent performance?
AgentMDBench (NeurIPS 2026)
Failure mode: Context overload.
Do repository-level instruction files actually improve outcomes, or do they simply add more context without improving understanding?
A Common Pattern
Across all six benchmarks, we found the same pattern.
Agents are increasingly capable of writing code.
But software maintenance requires much more than code generation.
It requires investigation.
Verification.
Prioritization.
Architectural understanding.
And evidence.
This observation became the foundation of LogicStar.
Rather than treating maintenance as a code-generation problem, we treat it as a software understanding problem.
LogicStar correlates production signals, customer reports, code structure, historical changes, and runtime behavior to identify what actually matters.
Every issue is investigated.
Every fix is validated.
Every recommendation is grounded in evidence.
The result is not an agent that simply writes code.
It is a system designed around the known failure modes of software engineering agents.
Because the future of autonomous software engineering will not be decided by who generates the most code.
It will be decided by who makes the best decisions.
We Study Where Agents Fail. Then We Design Around It.
Most teams focus on what AI coding agents can do. We focus on where they fail. Explore the six research benchmarks that shaped LogicStar's approach to building reliable autonomous software engineering.
.png)
Claude Code Leak: 10+ Security Issues Found in Minutes
Claude Code was recently leaked. We analyzed it using LogicStar AI and found multiple severe security issues, including remote code execution and permission bypasses.
Key Findings
Out of 169 total issues surfaced: 73 were security vulnerabilities, 96 were non-security defects (logic errors, reliability issues, unsafe assumptions)Below are a few representative examples:
- Headless mode (even with read-only tools) allows Remote Code Execution without any prompt or warning in untrusted repositories:
- Headless mode (even with read-only tools) allows Remote Code Execution without any prompt or warning in untrusted repositories:
echo "summarize this repo" | claude -p --tools "Read"
- The Claude MCP server allows arbitrary file writes. An undocumented tool call parameter enables writing files anywhere on the filesystem, without any visibility to the user.
- Permission model gaps allow access to sensitive files. We found multiple bypasses, including Grep and Glob enabling path traversal despite explicit deny rules.
With all the hype around Claude Mythos, which was likely built and tested on Claude Code, we expected severe vulnerabilities to be difficult to find.
Instead, our bug finder surfaced 169 issues within minutes.
Importantly, not all of these issues matter equally. The challenge is not finding bugs, but identifying which ones actually impact real systems.
This highlights the gap between raw model capability and production-grade system safety.
What This Means
AI coding tools are no longer just generating code. They are executing it.
This introduces new classes of risk:
- hidden execution paths
- implicit trust in configuration
- fragile permission models
As AI-generated code increases development speed, the number of potential defects grows, but only a small subset actually matters in production.
Takeaways for Developers
- Do not run AI coding tools on untrusted repositories without sandboxing
- Do not assume “read-only” modes are safe
About LogicStar
LogicStar surfaces bugs in your software and identifies which ones actually matter by correlating them with customer complaints, production alerts, and real usage.
It does the investigation, root cause analysis, and validation so your team can focus on fixing, not triaging.
Try it here: https://logicstar.ai/
For a limited time, the first 20 bugs are on us.
We responsibly disclosed all the critical secruity issues above and more through Claude Code’s HackerOne program.
Claude Code Leak: 169 Issues Found in Minutes (73 Security, 96 Non-Security)
We analyzed the leaked Claude Code using LogicStar AI and found 10+ critical security issues, including remote code execution and permission bypasses. Learn what this means for developers.

At LogicStar, our mission is to build a platform for self-healing applications. This relies on a strong bug-fixing backbone and review system working hand in hand to produce high-quality code fixes where possible, while abstaining rather than proposing incorrect fixes. We are therefore excited to announce that we not only have the best test generation system (announced last week) but also reached the state-of-the-art in fix generation with 76.8% accuracy on SWE-Bench Verified, the most competitive benchmark for automated bug fixing. Combining these systems, we achieve 80% precision, i.e., if our agent proposes a code fix, it is ready to merge 8 out of 10 times.
We are particularly proud that we achieved these results with our cost-effective production system rather than an agent carefully tuned for SWE-Bench and too expensive to ever run on customer problems. To achieve this, our L* Agent v1 leverages only the cost-effective OpenAI GPT-5 and GPT-5-mini, breaks down the bug fixing problem into clear sub-problems, and then orchestrates multiple sub-agents to investigate, reproduce, and fix the issue, before carefully reviewing and testing the generated code fix. All of this is enabled by our agent’s unique codebase understanding, powered by proprietary static analysis.
So, how does our L* Agent work and why is it so cost-effective? The main insight is to combine a strong model (GPT-5), generating baseline patches and tests, with diverse cheaper agents based on GPT-5-mini, to increase diversity before picking the best patch using our state-of-the-art tests. All of this is enabled by our static-analysis-powered codebase understanding, which boosts the performance of both the weak and strong models.
We prioritize correctness and validation over speed, processing all issues asynchronously, as soon as they appear in your bug backlog or observability. This approach ensures you don’t have to waste time manually triaging and reviewing issues but simply receive high-quality patches from LogicStar for the issues we can solve confidently. We are now turning this technology into a loveable product, and invite you to sign up as a design partner if you’d like to help us build a system that will reliably maintain your code. While SWE-Bench is an important benchmark, it’s only part of the story — we are developing our agents for real-world use and not only benchmarks, so be sure to follow us for more updates.
SWE-Bench Verified – Best Fix Generation at 76.8%
The L* agent achieves state-of-the-art results on SWE-Bench Verified using an ensemble of cheap agents and strong validation




.avif)





.avif)


























