June 12, 2026

Claude Code Can Find Bugs. LogicStar Finds the Ones That Matter.

Code agents are great at investigating and fixing known bugs. But are they good at finding the important ones across a real codebase?

To answer this question, we ran a case study comparing Claude Code to LogicStar’s bug finder on a snapshot of our own codebase.

Key result: LogicStar found 2x more unique high-impact bugs than Claude Code with LogicStar-generated rules, at less than one-third the cost per unique high-impact bug.

Setup

We compared three different bug finders:

  • LogicStar’s internal bug finding engine
  • Vanilla Claude Code using the standard code-review skill
  • Claude Code prompted with codebase knowledge and Bug Finding Rules provided by LogicStar

In all three settings, we used our bug validation engine to confirm whether the found bugs were true positives and assess their severity. This means the results are comparable across all settings. In our product, all bugs go through this validation before they are shown to customers.

To assess which approach works best, we considered six metrics:

  1. File coverage: what percentage of source files the approach inspected
  2. Surfaced issues: how many potential issues the approach reported
  3. Unique validated bugs: how many surfaced issues were confirmed as real bugs after deduplication
  4. Unique high-impact bugs: how many unique validated bugs were classified as high impact
  5. Total cost: token spend for one scan
  6. Cost per unique high-impact bug: token spend divided by unique high-impact bugs found

Results

LogicStar surfaced 40 issues across 43% of source files. After validation, 28 were confirmed as real bugs. Of those, 8 were high impact and 20 were medium impact. The remaining 12 surfaced issues were filtered out as false positives.

The scan used multiple specialized sub-agents and cost $47 in token spend.

Claude Code, in contrast, surfaced only 10 issues, 7 of which were bugs. None of them were high impact and only 3 were medium impact, with the rest being so low impact that we would filter them out instead of showing them to customers. It explored only 7% of the codebase, leading to a low cost of $5.

This is not unexpected. Code agents are not well suited for open, codebase-spanning tasks. They were designed and trained for concrete tasks that require locating the right part of the codebase as context and then executing. Bug finding requires scanning through most of the codebase and looking for anomalies while understanding the relevant context.

To help Claude Code, we prompted it with 20 of our Bug Finding Rules. These describe failure modes specific to the codebase under review, including the areas of the codebase where they are likely to appear. This significantly increased the depth of the scan, covering 29% of files and surfacing 166 issues at a cost of $78.

After validation and deduplication, 78 were unique real bugs, 4 of which were unique high-impact bugs.

Comparison

Approach | Files covered | Surfaced issues | Unique validated bugs | Unique high-impact bugs | Total cost | Cost per unique high-impact bug
LogicStar | 43% | 40 | 28 | 8 | $47 | $5.88
Vanilla Claude Code | 7% | 10 | 7 | 0 | $5 | N/A
Claude Code + LogicStar-generated rules | 29% | 166 | 78 | 4 | $78 | $19.50
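
The derived column can be recomputed from the raw counts. The following is a minimal sketch of that arithmetic, using the figures reported above. The `ScanResult` class and its method names are our own illustration, not part of any LogicStar API. The "validated share" is simply unique validated bugs divided by surfaced issues, a rough noise indicator rather than a metric defined in this study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanResult:
    """One row of the comparison (illustrative structure, not a LogicStar type)."""
    name: str
    surfaced: int      # issues reported before validation
    validated: int     # unique validated bugs
    high_impact: int   # unique high-impact bugs
    cost_usd: float    # token spend for one scan

    def cost_per_high_impact(self) -> Optional[float]:
        """Token spend divided by unique high-impact bugs; None when there are none."""
        if self.high_impact == 0:
            return None
        return round(self.cost_usd / self.high_impact, 2)

    def validated_share(self) -> float:
        """Unique validated bugs as a fraction of surfaced issues."""
        return round(self.validated / self.surfaced, 2)


RESULTS = [
    ScanResult("LogicStar", 40, 28, 8, 47.0),
    ScanResult("Vanilla Claude Code", 10, 7, 0, 5.0),
    ScanResult("Claude Code + LogicStar-generated rules", 166, 78, 4, 78.0),
]

if __name__ == "__main__":
    for result in RESULTS:
        print(result.name, result.cost_per_high_impact(), result.validated_share())
```

Returning `None` for the vanilla run mirrors the "N/A" entry: with zero high-impact bugs, a cost per bug is undefined rather than zero.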

What the Metrics Do Not Capture

LogicStar learns from your feedback, both explicit and implicit. It observes which bugs you actually end up fixing and focuses on showing you more bugs of these types.

LogicStar not only filters out false positives, but also remembers what slipped through and got designated as false positive by you. It can avoid repeatedly showing the same false positives, while a standalone Claude Code scan does not retain this product-level feedback memory by default.

Conclusion

Vanilla code agents are not enough for production-grade open-ended bug finding. They cover only a small part of the codebase and find few high-impact issues.

To unlock better performance, code agents need to be instructed to look for specific patterns in the right parts of the codebase. However, they still find many false positives and only surface a small fraction of high-impact bugs. This results in a lot of noise to dig through and comes at a higher cost compared to more optimized solutions.

LogicStar’s specialized bug finding system produced the most high-impact bugs with the least noise, covering the largest part of the codebase at moderate cost. In addition, it comes with the validation layer required to filter out false positives and the memory layer to learn from feedback and avoid repeatedly showing the same false positives.

Try the no-login LogicStar demo.

June 5, 2026

An anonymized medical application was preparing for wider hospital rollout.

The product had been built with high velocity. A significant portion of the application had been developed with AI coding agents, similar to how many new applications are now being built in 2026.

The workflows were in place.

The application appeared ready.

But not every line of code had been manually reviewed in depth. Not every trust boundary had been tested against the real operational model. Not every endpoint had been checked against how the application could fail once real staff, patients, schedules, permissions, exports, and clinical workflows were involved.

That is where production readiness risk usually appears.

Not in the obvious places.

It appears in the gaps between authentication and authorization. In service-role database access. In PDF ingestion assumptions. In patient ownership checks. In frontend session state. In timezone handling. In import and overwrite flows.

At a glance

The first 24 hours produced a concrete hardening cycle, not an abstract risk report.

The application owner fixed 23 production-relevant issues across frontend and backend workflows, with one finding reviewed and rejected as not applicable.

Figure 1: At-a-glance results from the pre-release hardening cycle. Within the first 24 hours after LogicStar was set up, the application owner fixed 23 production-relevant issues across backend and frontend workflows.

The result was not just a cleaner backlog.

It was a stronger release posture before broader exposure to hospital users.

The challenge

AI coding agents are changing how software is built.

They make it possible to generate large amounts of working product code quickly.

That is useful.

But faster software output does not automatically create production readiness.

In sensitive applications, the hard problems are often not visible in the UI. They are hidden in permission boundaries, role models, patient ownership checks, data mutation paths, session transitions, schedule handling, and operational edge cases.

For a medical application preparing for hospital rollout, those gaps matter.

They can affect privacy, auditability, clinical workflow integrity, staff trust, and release confidence.

What LogicStar surfaced

LogicStar identified issues across both backend and frontend workflows.

The findings clustered into six production-risk categories:

  • Authorization and role separation, including staff-to-admin privilege escalation risk.
  • Patient-linked data boundaries, including missing ownership checks and cross-context access paths.
  • Clinical workflow integrity, including protocol and schedule handling issues.
  • Data integrity, including partial-update and overwrite paths that could leave inconsistent state.
  • Frontend session state, including stale or misleading authentication and user-context behavior.
  • Release readiness, including issues that could create support escalations, emergency patches, or delayed rollout if found later.

This grouping matters because production risk is rarely caused by one isolated bug.

It usually appears when many small implementation assumptions meet real users, real data, real permissions, and real operational workflows.

Representative high-risk issue: staff could create admin accounts

Figure 2: A representative high-risk issue surfaced during hardening: a staff-only invitation flow could create full admin accounts because the endpoint accepted role = admin.

One representative high-risk issue was a staff-to-admin privilege escalation bug. The issue was not complex. It was a trust-boundary mistake. A staff-only invitation endpoint checked whether the caller was staff. But it failed to check whether that staff user should be allowed to create administrators.

The endpoint accepted: role = admin

It then used a service-role database client to invite the user and assign the admin role. That meant the normal database permission layer could not block the escalation.

Authentication passed.

Authorization failed.

The practical result was serious: Any staff user who could call the endpoint could create a new full-admin account.
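
The pattern can be illustrated with a small sketch. The application's real code is not shown in this case study, so every name here (`GRANTABLE_ROLES`, `Caller`, `invite_user_vulnerable`, `invite_user_checked`) is hypothetical. The sketch only shows the difference between checking who the caller is and checking what the caller is allowed to grant.

```python
from dataclasses import dataclass

# Hypothetical role model: which roles each caller role may grant via invitation.
GRANTABLE_ROLES = {
    "admin": {"admin", "staff"},
    "staff": {"staff"},
}


class Forbidden(Exception):
    """Raised when an authenticated caller is not allowed to perform the action."""


@dataclass(frozen=True)
class Caller:
    user_id: str
    role: str


def invite_user_vulnerable(caller: Caller, email: str, role: str) -> dict:
    # Checks only that the caller holds a staff-level role ...
    if caller.role not in GRANTABLE_ROLES:
        raise Forbidden("caller is not staff")
    # ... and then trusts the requested role as-is, including "admin".
    return {"email": email, "role": role}


def invite_user_checked(caller: Caller, email: str, role: str) -> dict:
    # Checks the requested role against what this caller may grant.
    allowed = GRANTABLE_ROLES.get(caller.role, set())
    if role not in allowed:
        raise Forbidden(f"{caller.role} may not grant role {role!r}")
    return {"email": email, "role": role}
```

Because the real endpoint used a service-role database client, a check of this kind has to live in the application layer. The database permission layer is bypassed by design in that setup.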

In a standard SaaS application, that is already a high-impact authorization bug. In a medical application, the risk is much larger. A full-admin account can potentially access sensitive operational workflows, patient-linked records, exports, configuration, staff administration, audit-relevant data, and internal system controls.

This kind of issue can quickly move from a software defect into an operational incident.

It can create:

  • unauthorized administrative access
  • exposure of sensitive medical or patient-linked data
  • privacy investigation risk
  • privacy, security, or breach-notification assessment
  • emergency patching before or after rollout
  • support escalations from clinical users
  • delayed hospital deployment
  • loss of trust in the application
  • additional audit and release-governance work

This is the type of issue that should be fixed before wider release, not discovered after real users are already depending on the system.

The broader pattern

The privilege escalation issue was not isolated.

LogicStar also surfaced issues across backend and frontend workflows that reflected the real risk profile of an AI-built medical application moving toward production.

Examples included:

  • A missing ownership check that could expose medication intake logs across patients.
  • A server-rendered export page that could load patient metadata before the client-side admin gate ran.
  • A protocol upload flow that could attach treatment schedules to the wrong patient.
  • An overwrite flow that could mutate previous protocol rounds instead of the active one.
  • Unchecked database mutation errors that could leave old and new protocol state coexisting.
  • Fixed UTC+1 timestamping that could shift Zurich medication reminders during daylight saving time.
  • PDF highlight scanning that could miss required blood-test or ultrasound monitoring items when the relevant table was not on the first page.
  • Frontend session and refresh-token handling that could put users into incorrect local authentication states.
  • Frontend consent and context flows that could bypass expected checks or expose stale state.
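
To make one of these concrete, here is a minimal sketch of the fixed UTC+1 problem. It is not the application's code. The function names are illustrative, and it assumes Python 3.9+ with time zone data available (the system database or the `tzdata` package). It compares a reminder scheduled with a hard-coded offset against one scheduled in the `Europe/Zurich` zone.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

FIXED_UTC_PLUS_1 = timezone(timedelta(hours=1))  # hard-coded offset, ignores DST
ZURICH = ZoneInfo("Europe/Zurich")               # follows CET in winter, CEST in summer


def reminder_utc_fixed(year: int, month: int, day: int, hour: int) -> datetime:
    """Interpret the local reminder time with a fixed UTC+1 offset."""
    local = datetime(year, month, day, hour, tzinfo=FIXED_UTC_PLUS_1)
    return local.astimezone(timezone.utc)


def reminder_utc_zoned(year: int, month: int, day: int, hour: int) -> datetime:
    """Interpret the local reminder time in the Europe/Zurich time zone."""
    local = datetime(year, month, day, hour, tzinfo=ZURICH)
    return local.astimezone(timezone.utc)


def drift(year: int, month: int, day: int, hour: int) -> timedelta:
    """How much later the fixed-offset reminder fires than the zone-aware one."""
    return reminder_utc_fixed(year, month, day, hour) - reminder_utc_zoned(
        year, month, day, hour
    )


if __name__ == "__main__":
    print("winter drift:", drift(2026, 1, 15, 8))
    print("summer drift:", drift(2026, 7, 1, 8))
```

In winter the two agree. During daylight saving time the fixed-offset version fires one hour late for an 08:00 reminder, which is the kind of shift described in the list above.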

Each issue looks like an implementation detail in isolation.

Together, they represent the difference between:

“The application works in a demo.” and “The application is ready for real clinical use.”

Why this mattered in a medical software context

In medical software, production defects can become more than bugs. They can create access-control risk, privacy review risk, auditability gaps, release-governance concerns, and certification-readiness blockers.

These issues may matter for GDPR, HIPAA, MDR, FDA medical-device software readiness, or internal hospital risk, security, and ethics review. LogicStar does not determine legal compliance or issue regulatory certification. It helps teams surface, prioritize, and remediate production-relevant issues earlier, creating stronger technical evidence for security reviews, privacy assessments, audit-readiness work, certification-readiness work, and release-governance decisions before wider rollouts or widespread incidents.

The goal is to reduce the chance that preventable software defects are first discovered by clinicians, patients, support teams, auditors, or incident responders.

What changed in the first 24 hours

The result was not an abstract risk report.

It was a concrete hardening cycle.

Within the first 24 hours:

  • 23 issues were fixed
  • 12 backend issues were fixed
  • 11 frontend issues were fixed
  • 1 issue was reviewed and rejected as not applicable
  • authorization and data-integrity issues were addressed
  • frontend session and workflow issues were corrected
  • the application moved closer to production readiness before broader hospital rollout

The rejected issue was reviewed and dismissed with a clear explanation:

“There are no manual entries allowed. There is also no way to enter manual entries.”

That is the right outcome.

The point is not to maximize the number of findings.

The point is to identify issues that are real, explain why they matter, and help the application owner decide what deserves engineering attention before the product reaches a wider user base.

What could have happened without this hardening step

If these issues had reached wider hospital usage, the risk would not have been limited to engineering inconvenience.

The application could have faced:

  • emergency fixes after launch
  • staff confusion from incorrect permissions or stale session state
  • patient data exposure investigation
  • incorrect or missing clinical schedule items
  • wrong-patient protocol linkage
  • medication reminder timing errors
  • audit trail inconsistency
  • support burden during rollout
  • delayed adoption by clinical teams
  • privacy and security review escalation

The cheapest time to find these issues is before release and before users are impacted.

The lesson

AI coding increases software output.

But software output is not the same as production readiness.

In medical applications, hidden gaps in authorization, patient-linked data boundaries, state handling, and workflow logic can create real privacy, auditability, and clinical operations risk.

LogicStar helps teams identify and fix the issues that matter before wider release.

Faster shipping.

Fewer surprises.

Safer production rollouts.

Preparing an AI-built application for production?

LogicStar helps engineering teams identify and fix release-critical issues before users, customers, or operational teams absorb the risk.

Request a production-readiness review:

request@logicstar.ai

Hardening an AI-Built Medical Application Before Hospital Release

An anonymized medical application was preparing for wider hospital rollout after being built with substantial AI assistance. In the first 24 hours after LogicStar was set up, the application owner fixed 23 production-relevant issues across frontend and backend workflows.

June 5, 2026

We Study Where Agents Fail. Then We Design Around It.

AI coding agents are improving rapidly.

But writing code is only a small part of software engineering.

The harder questions are:

  • Did the agent identify the right problem?
  • Is the root cause correct?
  • Is the fix actually safe?
  • Will it work across a large codebase?
  • Can we trust the evaluation?
  • Does more context actually help?

At LogicStar, we believe the future of software engineering will be determined by answering these questions, not by generating more code.

That's why we spend significant effort studying where agents fail.

Over the last several years, our team has built a series of benchmarks, each focused on a different weakness of software engineering agents.

FixedBench (COLM 2026)

Failure mode: Action bias.

Agents often modify code even when the correct action is to do nothing. FixedBench studies whether agents can distinguish between code that is broken and code that is already correct.

SWT-Bench (NeurIPS 2024)

Failure mode: Verification.

Can agents reproduce real-world bugs and generate tests that prove a fix actually works?

BaxBench (ICML 2025 Spotlight)

Failure mode: Security.

Can agents build backend systems that are not only functional but secure?

CodeTaste (ICML 2026)

Failure mode: Repository-scale refactoring.

Can agents perform large-scale code transformations while preserving behavior and maintainability?

SWA-Bench (ICML 2025)

Failure mode: Evaluation.

How do we automatically generate realistic software engineering tasks that accurately measure agent performance?

AgentMDBench (NeurIPS 2026)

Failure mode: Context overload.

Do repository-level instruction files actually improve outcomes, or do they simply add more context without improving understanding?

A Common Pattern

Across all six benchmarks, we found the same pattern.

Agents are increasingly capable of writing code.

But software maintenance requires much more than code generation.

It requires investigation.

Verification.

Prioritization.

Architectural understanding.

And evidence.

This observation became the foundation of LogicStar.

Rather than treating maintenance as a code-generation problem, we treat it as a software understanding problem.

LogicStar correlates production signals, customer reports, code structure, historical changes, and runtime behavior to identify what actually matters.

Every issue is investigated.

Every fix is validated.

Every recommendation is grounded in evidence.

The result is not an agent that simply writes code.

It is a system designed around the known failure modes of software engineering agents.

Because the future of autonomous software engineering will not be decided by who generates the most code.

It will be decided by who makes the best decisions.

We Study Where Agents Fail. Then We Design Around It.

Most teams focus on what AI coding agents can do. We focus on where they fail. Explore the six research benchmarks that shaped LogicStar's approach to building reliable autonomous software engineering.

April 17, 2026

Claude Code Leak: 10+ Security Issues Found in Minutes

Claude Code was recently leaked. We analyzed it using LogicStar AI and found multiple severe security issues, including remote code execution and permission bypasses.

Key Findings

Out of 169 total issues surfaced, 73 were security vulnerabilities and 96 were non-security defects (logic errors, reliability issues, unsafe assumptions). Below are a few representative examples:

  1. Headless mode (even with read-only tools) allows Remote Code Execution without any prompt or warning in untrusted repositories:

              echo "summarize this repo" | claude -p --tools "Read"

  2. The Claude MCP server allows arbitrary file writes. An undocumented tool call parameter enables writing files anywhere on the filesystem, without any visibility to the user.
  3. Permission model gaps allow access to sensitive files. We found multiple bypasses, including Grep and Glob enabling path traversal despite explicit deny rules.
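
As a generic illustration of how path traversal can slip past a deny rule, consider the sketch below. It is not Claude Code's implementation and does not reproduce the reported bypasses. The deny prefix and function names are hypothetical. It only shows why a rule matched against the raw path string behaves differently from one matched against the normalized path.

```python
import posixpath

DENY_PREFIXES = ["/repo/secrets"]  # hypothetical deny rule


def is_denied_naive(path: str) -> bool:
    # Compares the raw string, so "../" segments slip past the rule.
    return any(path.startswith(prefix) for prefix in DENY_PREFIXES)


def is_denied_normalized(path: str) -> bool:
    # Collapse "." and ".." segments before matching against the deny rules.
    normalized = posixpath.normpath(path)
    return any(
        normalized == prefix or normalized.startswith(prefix + "/")
        for prefix in DENY_PREFIXES
    )


if __name__ == "__main__":
    traversal = "/repo/src/../secrets/key.pem"
    print(is_denied_naive(traversal), is_denied_normalized(traversal))
```

Note that `posixpath.normpath` is purely lexical. A real permission check would also need to account for symlinks, for example by resolving the path on disk before matching.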

With all the hype around Claude Mythos, which was likely built and tested on Claude Code, we expected severe vulnerabilities to be difficult to find.

Instead, our bug finder surfaced 169 issues within minutes.

Importantly, not all of these issues matter equally. The challenge is not finding bugs, but identifying which ones actually impact real systems.

This highlights the gap between raw model capability and production-grade system safety.

What This Means

AI coding tools are no longer just generating code. They are executing it.

This introduces new classes of risk:

  • hidden execution paths
  • implicit trust in configuration
  • fragile permission models

As AI-generated code increases development speed, the number of potential defects grows, but only a small subset actually matters in production.

Takeaways for Developers

  • Do not run AI coding tools on untrusted repositories without sandboxing
  • Do not assume “read-only” modes are safe

About LogicStar

LogicStar surfaces bugs in your software and identifies which ones actually matter by correlating them with customer complaints, production alerts, and real usage.

It does the investigation, root cause analysis, and validation so your team can focus on fixing, not triaging.

Try it here: https://logicstar.ai/
For a limited time, the first 20 bugs are on us.

We responsibly disclosed all the critical security issues above and more through Claude Code’s HackerOne program.

Claude Code Leak: 169 Issues Found in Minutes (73 Security, 96 Non-Security)

We analyzed the leaked Claude Code using LogicStar AI and found 10+ critical security issues, including remote code execution and permission bypasses. Learn what this means for developers.

September 27, 2025

At LogicStar, our mission is to build a platform for self-healing applications. This relies on a strong bug-fixing backbone and review system working hand in hand to produce high-quality code fixes where possible, while abstaining rather than proposing incorrect fixes. We are therefore excited to announce that we not only have the best test generation system (announced last week) but have also reached the state of the art in fix generation with 76.8% accuracy on SWE-Bench Verified, the most competitive benchmark for automated bug fixing. Combining these systems, we achieve 80% precision, i.e., if our agent proposes a code fix, it is ready to merge 8 out of 10 times.

We are particularly proud that we achieved these results with our cost-effective production system rather than an agent carefully tuned for SWE-Bench and too expensive to ever run on customer problems. To achieve this, our L* Agent v1 leverages only the cost-effective OpenAI GPT-5 and GPT-5-mini, breaks down the bug fixing problem into clear sub-problems, and then orchestrates multiple sub-agents to investigate, reproduce, and fix the issue, before carefully reviewing and testing the generated code fix. All of this is enabled by our agent’s unique codebase understanding, powered by proprietary static analysis. 

So, how does our L* Agent work and why is it so cost-effective? The main insight is to combine a strong model (GPT-5), generating baseline patches and tests, with diverse cheaper agents based on GPT-5-mini, to increase diversity before picking the best patch using our state-of-the-art tests. All of this is enabled by our static-analysis-powered codebase understanding, which boosts the performance of both the weak and strong models.
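
The selection step can be sketched roughly as follows. This is a simplified illustration under our own assumptions, not the L* Agent's actual logic. The `Candidate` fields, the agent labels, and the `min_pass_rate` threshold are all hypothetical. The idea is to score each candidate patch by how many generated tests it passes, pick the best one, and abstain when no candidate clears the bar.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    source: str        # which agent produced the patch (hypothetical label)
    patch_id: str
    tests_passed: int
    tests_total: int

    @property
    def pass_rate(self) -> float:
        return self.tests_passed / self.tests_total if self.tests_total else 0.0


def select_patch(
    candidates: Sequence[Candidate], min_pass_rate: float = 1.0
) -> Optional[Candidate]:
    """Return the best-scoring candidate, or None to abstain."""
    eligible = [c for c in candidates if c.pass_rate >= min_pass_rate]
    if not eligible:
        return None  # abstain rather than propose a patch that fails validation
    # max() keeps the first candidate on ties, so list order acts as a tie-breaker.
    return max(eligible, key=lambda c: c.pass_rate)


def precision(proposed: int, ready_to_merge: int) -> float:
    """Share of proposed fixes that are ready to merge."""
    return ready_to_merge / proposed if proposed else 0.0
```

Abstaining lowers the number of proposed fixes but raises precision, which is the trade-off behind the "8 out of 10" figure quoted above.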

We prioritize correctness and validation over speed, processing all issues asynchronously, as soon as they appear in your bug backlog or observability. This approach ensures you don’t have to waste time manually triaging and reviewing issues but simply receive high-quality patches from LogicStar for the issues we can solve confidently. We are now turning this technology into a loveable product, and invite you to sign up as a design partner if you’d like to help us build a system that will reliably maintain your code. While SWE-Bench is an important benchmark, it’s only part of the story — we are developing our agents for real-world use and not only benchmarks, so be sure to follow us for more updates.

SWE-Bench Verified – Best Fix Generation at 76.8%

The L* agent achieves state-of-the-art results on SWE-Bench Verified using an ensemble of cheap agents and strong validation


June 19, 2026

A field-software company for the building trades runs an app that construction crews use on real job sites, offline and on the move. For software like this, reliability is not a feature, it is the product. Here is how LogicStar found a vulnerability in the authentication flow and the root causes behind real production failures, across the mobile app and the backend, and fixed them before the next worker hit them.

At a glance

  • 27: security and reliability defects fixed across the mobile app and backend, one of them an authentication vulnerability closed before it could be exploited.
  • 0: dedicated QA engineers on the team. LogicStar was the safety net.
  • 0: findings the team assessed as simply wrong.
  • Paid: converted from a trial to a paid subscription.

The challenge: your user will not retry

The company replaces paper, phone calls, and group chats for construction crews. A worker uses the app to log a task, document a defect, or pull the latest plans, directly from the site. Adoption depends on one thing above all: it has to work the first time, every time, under real field conditions.

That is what makes reliability essential. Field conditions leave no room for retries or error messages: a worker on a scaffold or in a basement with no signal will not stop to re-attempt a failed action or file a support ticket. If the app fails mid-report, the record is simply gone, unrecoverable when it is needed as evidence in a payment dispute or a defect claim. One failure in front of a crew can lose the account.

And the team is lean: around ten engineers covering a large and growing product surface without a dedicated QA function. That is exactly where reliability defects accumulate faster than a small team can find them by hand, especially in a mobile codebase where the same code behaves differently on Android than on iOS.

How LogicStar helped

Working across the mobile app and the backend, LogicStar surfaced security and reliability defects that mapped directly to real field failure modes, not to theoretical edge cases.

  • Identity that could be spoofed. A login path that did not properly validate client-supplied data, which made it possible to impersonate another user.
  • Failures that only fire in the field. Unhandled errors in push-notification setup, file cleanup, and the app's primary data-capture path: the core flows a worker relies on, often with little or no signal.
  • State that broke under real use. Uploads stuck after a swallowed lookup miss, and a second upload failing because an earlier one had already been cancelled.

Each finding came with an impact assessment and a full investigation of its root cause, context the team's existing alerts did not provide.

Representative issue: a vulnerability in the authentication flow

The most important finding of the trial was in the authentication mechanism. The backend was not validating some client-supplied data, which could have allowed one account to be accessed as another. There is no evidence it was ever exploited. This is the kind of defect that never shows up as a crash and never appears in a demo. It just sits in the code as an open door until someone finds it. It was found and fixed inside the trial.

"One big win was that the system found an issue in our authentication mechanism. That is something we would not have discovered on our own, and we were able to fix it quickly."

The customer's engineering manager
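
The case study does not describe the defect beyond the backend trusting some client-supplied data. As a generic illustration of this class of bug, and not the customer's code, the sketch below contrasts a handler that takes the user identity from the request body with one that derives it from a token the server signed itself. The key, field names, and token format are hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def sign(user_id: str) -> str:
    """Issue a token binding the user id to a server-side signature."""
    mac = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{mac}"


def current_user_vulnerable(request: dict) -> str:
    # Trusts whatever identity the client puts in the request body.
    return request["user_id"]


def current_user_checked(request: dict) -> str:
    # Derives identity only from a token whose signature the server can verify.
    user_id, _, mac = request["token"].rpartition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise PermissionError("invalid session token")
    return user_id
```

In the vulnerable variant, a client authenticated as one user can name another user in the request and be treated as that user. In the checked variant, the client-supplied `user_id` field is ignored entirely.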

The reliability defects were already hurting real users

The mobile findings were not hypothetical. A push-notification setup that failed without being caught was tied to real production errors across active users. A file-deletion crash showed up on Android, the platform most common on construction sites. A failure in the app's primary data-capture path is not a minor glitch when that path is the main way workers enter information. LogicStar did not just flag that these errors existed in monitoring; it found the latent code paths behind them and delivered the fixes.

Why this matters for field and deskless software

Software used in the field carries weight that office software does not. The record is contractual. The conditions are hostile. The user has no patience for friction and no path to recovery when something fails. A crash is not an inconvenience. It is lost data and lost trust, and trust with a hands-on workforce does not come back easily.

LogicStar does not replace a team's judgment about what to build. It gives a lean team a way to surface, prioritize, and fix the reliability and security issues that would otherwise reach a worker on site, where there is no second chance to get it right.

Outcome

  • Twenty-seven security and reliability issues found by LogicStar were fixed across the mobile app and the backend, including the authentication vulnerability, which was closed before it could be exploited.
  • The on-call engineer assessed the large majority of surfaced issues as real and important, and noted that none of the findings were simply wrong.
  • The customer converted from the trial to a paid subscription.

The lesson

AI-assisted development lets a small team ship a large, capable product. It does not make that product reliable or secure in the hands of a user who cannot tolerate failure. LogicStar fills that gap, and acts as a safety net for the issues a team would otherwise only find once a worker hits them on site.

Fewer field failures. Crashes fixed at the source. A safety net for a team without a QA function.

Do your users depend on software that has to work the first time?

LogicStar finds, reproduces, and fixes latent issues across your mobile app and backend, and delivers them as reviewed pull requests. See what it surfaces in your code: support@logicstar.ai.

When a Crash Is Not a Ticket, It Is a Lost Record

Field crews use the app offline, on site, where a crash means a lost record. How LogicStar found the bugs that mattered, including an authentication vulnerability, and turned production noise into root-cause fixes.

June 16, 2026

A developer-infrastructure company sells a product whose entire promise is fidelity: code that passes against it should behave identically in production. Here is how LogicStar finds the correctness gaps that hand-written tests and code review miss across a surface too large to cover by hand, and ships them as fixes the team's own maintainers merge.

At a glance

  • 6: of the product's highest-traffic areas where correctness gaps were surfaced.
  • 4.2h: from a finding raised to triaged and confirmed, against the hours a maintainer would spend reproducing a divergence by hand.
  • Merged: fixes committed by the customer's own maintainers, into their own codebase.

The challenge: correctness is the product

The company lets engineering teams reproduce a production environment on their own machines. A developer points their tooling at a local endpoint, and the product responds exactly as the real system would. Customers do not use it as a convenience. They use it as a trust gate. If a test passes against it, they ship, on the belief that it will pass in production.

That makes faithfulness the product, and faithfulness is hard for one reason above all others. The surface is enormous: a large, externally defined set of behaviors, each with its own validation rules, state transitions, error codes, and edge cases, all of which keep changing as the systems being reproduced evolve. A team can write tests for the behaviors it thought of. The defects live in the behaviors it did not.

When one of those gaps slips through, it surfaces in the worst possible place. A customer's continuous integration breaks. A public issue gets filed that reads "this does not behave like the real thing." An enterprise evaluation stalls on an edge case. And like most teams in 2026, the company ships with AI coding agents, which means the surface grows faster than any human can review it.

What LogicStar surfaced

LogicStar analyzed the product's behavior and concentrated its findings in six of its highest-complexity, highest-traffic areas, the exact places where matching the real system is hardest to hold. The findings were not style opinions. They were behavioral divergences from the system the product promises to reproduce.

  • Request validation stricter than the real system. Valid requests rejected because of an incorrect validation rule.
  • State that did not survive across calls. Create and describe operations reading from different internal state, so a follow-up call did not see what had just been written.
  • Fields written under one name and read under another. Stale values returned in responses because of a read-and-write mismatch.
  • Integration metadata dropped on refresh. Drift between what a customer configured and what the product reported back.
  • Edge inputs that produced errors instead of the expected behavior. Specific inputs triggering internal failures rather than the correct response.

Representative issue: a valid, standard configuration rejected as malformed

One finding sat in a configuration-validation path. The validator required an element that the real system, and the product's own published interface, both treat as optional. So a standard, valid configuration was rejected with an error before it was ever stored.

For a customer, the failure chain is the kind that does real damage:

  • the configuration was never persisted, so the behavior it defined was silently never applied,
  • the error appeared only when the developer ran the same workflow against production,
  • and nothing told the developer the difference came from the product rather than their own configuration.

A single incorrect condition in a validator, a few lines of code, was enough to break a workflow that would have worked perfectly in production. This is the long tail. No one writes a test for the rule shape they did not know was wrong.
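To make the failure shape concrete, here is a minimal sketch of how a single over-strict condition rejects a valid configuration. The field names (`name`, `target`, `filter`) and both functions are hypothetical and chosen only for illustration; they are not taken from the customer's codebase.

```python
# Hypothetical schema: "name" and "target" are required, "filter" is optional.
REQUIRED_KEYS = ("name", "target")
OPTIONAL_KEYS = ("filter",)


def validate_too_strict(config: dict) -> None:
    """Buggy shape: the optional key is checked as if it were required."""
    for key in REQUIRED_KEYS + OPTIONAL_KEYS:
        if key not in config:
            raise ValueError(f"malformed configuration: missing {key!r}")


def validate(config: dict) -> None:
    """Corrected shape: only required keys must be present."""
    for key in REQUIRED_KEYS:
        if key not in config:
            raise ValueError(f"malformed configuration: missing {key!r}")
    unknown = set(config) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)
    if unknown:
        raise ValueError(f"malformed configuration: unknown keys {sorted(unknown)}")


# A valid configuration that simply omits the optional element.
standard_config = {"name": "nightly", "target": "queue-a"}
validate(standard_config)  # accepted
```

Under these assumptions, `validate_too_strict(standard_config)` raises before anything is stored, which is the first link in the failure chain described above.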

Why this matters for products built on correctness

When other developers build on your product, a correctness gap is not a cosmetic bug. It is a broken promise, and the most dangerous version is the false negative. A test that passes locally and fails in production does more damage than a test that simply fails, because it ships broken code with your customer's confidence attached to it.

Hand-written test suites and code review only cover the cases a team already thought of. They cannot cover the long tail of edge cases that a large, fast-moving, externally defined surface generates, and that is precisely where the next customer-filed issue is hiding. LogicStar works that long tail continuously, and turns it into merged fixes instead of public bug reports.

What changed

  • Findings were triaged quickly, at an average of 4.2 hours each, far less than the manual reproduction a parity divergence usually demands, and accepted as real divergences rather than noise.
  • The team routed its own test suite and telemetry into the pipeline, so the system generated fixes from the signals the company already trusted.
  • Fixes were merged into the codebase as genuine commits by the company's own maintainers.
  • The team built an internal workflow around the engagement and tracked weekly merge results, and the evaluation converted into a paid plan.

The goal was never to maximize the number of findings. It was to surface the divergences that were real, specific, and worth a maintainer's time, and to deliver them as fixes that could be reviewed and merged.

"We are fixing these issues almost automatically, with minimal input from our engineers, and we are merging the fixes. We are literally moving forward."

The customer's engineering lead

What could have happened without this step

  • A public issue titled "this does not behave like the real thing," visible to every prospect evaluating the product.
  • A failed proof of concept when an enterprise evaluation hit an edge case the team had never tested.
  • Quiet churn from teams who concluded they kept hitting differences they could not explain.

The lesson

AI-assisted development grows your product surface faster than any team can review it. For a product whose contract with its customers is correctness, the gap between what you ship and what you can verify by hand is exactly where your next customer-filed bug is waiting.

A larger surface covered. The long tail caught early. Fixes merged, not issues filed.

Is correctness across a large surface your product's promise?

LogicStar finds, reproduces, and fixes the behavioral and correctness gaps in your codebase, including from your own test suites, and delivers them as reviewed pull requests. See what it surfaces in your code: support@logicstar.ai.

Behavioral Parity Across a Surface Too Large to Test by Hand

When your product's promise is that local behaves like production, a parity gap is a broken promise. How LogicStar finds the correctness gaps hand-written tests miss and ships them as merged fixes.

March 9, 2026

Beyond SWE-bench: The Hardest Problem in AI Software Engineering Isn’t Writing Code

Over the past two years, coding agents have made astonishing progress. Modern models can write entire functions, generate patches, and even implement large features. Benchmarks like SWE-bench have become a standard way to evaluate these capabilities. But something important is changing. Recently, OpenAI explained why they are moving away from evaluating models using SWE-bench Verified as the primary benchmark for AI software engineering systems. Their reasoning reflects a deeper shift in how the industry is thinking about AI-driven development. The problem is no longer just writing code. The real problem is deciding what code should change in the first place.

What SWE-bench measures well

SWE-bench has played an important role in advancing AI coding systems. The benchmark asks models to resolve real GitHub issues by producing patches that pass the project's test suite. In simplified form the evaluation looks like this: the agent receives a repository, reads an issue description, generates a patch, and the patch must pass the tests. This measures an important capability: can an AI system implement a fix once the problem is clearly defined? But this assumption hides an important simplification. In real software engineering the hardest part is rarely writing the patch. It is figuring out what the correct change should be.

Real engineering rarely starts with a clean issue

Benchmarks assume a well-formed problem statement. Real software development rarely looks like that. Instead engineers see signals coming from many different systems such as logs and observability platforms, incident alerts, bug trackers, security scanners, static analysis findings, failing CI tests, and customer reports. Each signal may represent only a symptom of a deeper issue. Before any fix can happen engineers must answer a much harder question: which issue actually matters right now? Answering this requires architectural understanding, system knowledge, and engineering judgment.

Coding agents are excellent executors

Recent research from ETH Zürich and LogicStar explored this challenge with a benchmark called CodeTaste. CodeTaste evaluates whether coding agents perform large-scale refactorings in ways that align with human engineers. Unlike many benchmarks CodeTaste focuses on architectural changes across large codebases. The benchmark contains one hundred real refactoring tasks extracted from open-source repositories across six programming languages and each task touches roughly ninety-one files on average.

Instead of measuring only correctness CodeTaste measures alignment, which rewards changes that match the structure chosen by the original human refactoring while preserving functional correctness. In other words it evaluates whether the automated change preserves the architectural intent behind the human change.

The results are revealing. When given a detailed refactoring blueprint, frontier models achieve alignment scores of up to 70%. When given only a high-level goal, alignment collapses to below 8%. Even when agents first propose a plan and then implement it alignment improves only to around 14%.

Instruction followers, not architects

These results highlight an important limitation. Coding agents today are extremely capable instruction followers. When the plan exists they can execute it, but they still struggle with engineering judgment. Experienced engineers do not just write code. They decide what problem actually needs to be solved, how large the change should be, which architectural trade-offs are acceptable, and how to maintain long-term system integrity. In other words the real challenge in software engineering is not writing the patch. It is identifying the right problem to solve and making the architectural trade-offs required to address it sustainably.

The missing layer in AI software systems

The industry has invested enormous effort in improving code generation. But the future of AI in the SDLC likely depends on solving a different problem. Between detection and fixing lies a critical layer: decision making. This layer determines which signals represent real problems, which issues have the highest impact, and what changes should actually be made. Without this layer AI systems remain tools that help engineers write code. With it they begin to approach autonomous software engineering systems.

A realistic AI engineering system therefore needs three capabilities working together:

  • Detection gathers signals from across the development lifecycle including static analysis, observability systems, CI pipelines, security scanners, and bug trackers.
  • Decision determines what should be fixed and why through signal correlation, root cause discovery, impact estimation, architectural reasoning, and prioritization.
  • Execution generates and validates the actual code changes through patch generation, refactoring, automated pull requests, and testing.

Most current AI tools focus primarily on the execution layer, but without the decision layer automation risks optimizing the wrong problems.

Most AI coding tools optimize execution. Real software engineering requires a decision layer that determines what should actually be fixed.

Toward truly autonomous software systems

The vision of autonomous software development is becoming increasingly realistic. Coding agents will continue to improve rapidly, but the next breakthrough may not come from models that write code faster. It will come from systems that understand what changes should happen and why. Future systems must preserve architectural intent when engineers guide them and will need to develop architectural foresight as automation increases. Benchmarks like SWE-bench helped the industry measure the first generation of AI coding capabilities, while research like CodeTaste begins to measure the next generation: the ability to align automated changes with human engineering judgment.

Research credits

CodeTaste was developed as part of an ETH Zürich MSc thesis by Alex Thillen, supervised by Niels Mündler, Martin Vechev, and Veselin Raychev.

Beyond SWE-bench: The Hardest Problem in AI Software Engineering Isn’t Writing Code

Coding agents can write code. But can they decide what actually matters in large software systems? We explore why the next generation of AI software tools must move beyond patch generation toward architectural judgment.

February 2, 2026

We release the SWE-Star model family: 7B, 14B, and 32B models based on Qwen2.5-Coder variants and trained on a dataset of 250,000 agent trajectories. Our largest model, SWE-Star-32B, reaches 57.1% on SWE-Bench Verified, setting a new state-of-the-art among open-data models in this size class. The 14B variant reaches 52.8%, significantly outperforming other models of its size. Finally, the 7B variant achieves 36.4% without any signs of saturation, showing the promise of even small models. Using only a single attempt instead of the standard OpenHands iterative protocol of at most 3 attempts until a solution is submitted, we achieve a Pass@1 of 52.4%, 49.8%, and 32.8%, respectively.

Figure 1. SWE-Bench Verified Performance; the top numbers use OpenHands’ iterative protocol of at most 3 attempts until a solution is submitted; the bottom numbers use only a single attempt. Evaluated with internet access.

We generate our dataset using a custom lightweight agent, Devstral-2-Small, and SWE-Smith environments. We used MareNostrum 5 (MN5), a European public supercomputer with 4,480 H100 GPUs, for all data generation, training, and evaluation. In this post, we describe how we scaled agentic data generation, training, and evaluation on its highly restricted HPC environment — no Docker, no outbound internet, and massive parallelism — and how these constraints shaped the system design. We also open-source our full agent scaffold, data generation pipeline, and training infrastructure so other researchers can build on this work on similar clusters.

Scaling Distillation for Agentic Coding

Ever since the original scaling laws paper, scaling has been the dominant recipe for improving models — more parameters, more data, more compute, at first focused on pretraining, and more recently mid- and post-training. 

Distilling from strong teacher models is an attractive alternative to scaling post-training because it promises to be a sample-efficient way to let smaller models learn long-horizon reasoning and tool-use behaviors without the overhead of full RL. Over the past year, several works have built SWE-style environments to enable this. Most notably, SWE-Smith introduced a scalable pipeline for injecting bugs into real codebases and back-translating them into realistic but synthetic issues. Using this pipeline, they created 5k agent trajectories using Claude Sonnet 3.7 and observed almost perfect log-linear scaling, pushing Qwen2.5-Coder-32B from 10% to 40%.

Figure 2. SWE-Bench Verified Performance, reported by SWE-Smith (blue) and hypothetical further scaling (gray).

With the best open-weight models approaching 70% on SWE-Bench Verified, we asked: How much of their agentic capability can we distill into smaller, cheaper, and easier-to-deploy models using SFT alone?

Infrastructure for experiments at scale

With a single training run on 100k trajectories consuming roughly 4,500 H100-hours at ~$4 each (about $18,000 per run) and our intent to run large-scale ablations, we did not want to simply rent a cluster. So we applied for a EuroHPC grant, an EU initiative that provides access to Europe’s largest supercomputers for researchers and startups, including MareNostrum 5, and were awarded 50k hours after just one week.

Constraints of an HPC environment

MN5 offers lots of compute, but it comes with some unique constraints. For historical reasons common in HPC environments, the cluster has no outbound internet access, and the only interface is SSH access to two login nodes. The system is managed by SLURM, and compute jobs run in a highly restricted user mode. This is very different from typical cloud VMs, where you have full system control. In addition, MN5 uses nodes of 4 H100s with 64GB VRAM each instead of the more common nodes of 8 H100s with 80GB each.

This implies:

  • Setting up dependencies, models, and datasets is non-trivial (no outbound internet).
  • Both the agent and the inference engine must run entirely on the cluster.
  • Standard Docker setups are unavailable; only restricted user-mode Podman is allowed.
  • Most existing agent scaffolds assume internet access, use Docker, and do not scale to hundreds of parallel environments.
  • Most existing configurations for hosting and training models are optimized for a larger per-node memory footprint.

To overcome these constraints, we built a custom agent scaffold, forked from mini-swe-agent, that supports OpenHands tooling and scales efficiently under MN5’s constraints. Expert models are hosted via SGLang, data generation is orchestrated through SLURM submissions, and post-training is done with torchtune. The pipeline supports massive parallel data generation and hundreds of concurrent training runs for systematic scaling studies.

Our Agent Scaffold

OpenHands is currently the most popular open-source ReAct-style agent scaffold, providing basic tools for editing and browsing codebases as well as context condensation. While large proprietary models perform reasonably well with minimal tooling, smaller models with limited context windows (e.g., 32k tokens) struggle without structured editing and condensation.

Our design mirrors OpenHands in both tooling and condensation. The agent has access to four tools: think, execute_bash, str_replace_editor, and submit. When the context limit is reached, older observations are masked until the condensed context fits back into the model’s window while preserving space for reasoning and tool calls. We use XML-style tool calls for simplicity, since Qwen2.5-Coder does not support native tool-calling tokens.
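The condensation step described above can be sketched as follows. This is a simplified illustration, not the released scaffold: the `Message` type, the `MASK` placeholder text, and the whitespace-based `count_tokens` are assumptions made for the example, and a real setup would count tokens with the model's tokenizer.

```python
from dataclasses import dataclass

MASK = "[observation masked]"  # hypothetical placeholder text


@dataclass
class Message:
    role: str     # "system", "assistant", or "observation"
    content: str


def count_tokens(text: str) -> int:
    # Crude stand-in for a tokenizer: one token per whitespace-separated word.
    return len(text.split())


def condense(history: list[Message], context_limit: int, reserve: int) -> list[Message]:
    """Mask the oldest observations until the history fits the budget.

    `reserve` keeps room for the model's next reasoning step and tool call.
    The input list is left untouched; a condensed copy is returned.
    """
    budget = context_limit - reserve
    condensed = [Message(m.role, m.content) for m in history]
    total = sum(count_tokens(m.content) for m in condensed)
    for message in condensed:  # oldest first
        if total <= budget:
            break
        saved = count_tokens(message.content) - count_tokens(MASK)
        if message.role == "observation" and saved > 0:
            message.content = MASK
            total -= saved
    return condensed
```

Masking from the oldest observation forward keeps the most recent tool outputs intact, which are usually the ones the next action depends on.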

Due to MN5’s restricted user mode, each agent runs inside a single-UID Podman container, communicating through two interactive Bash sessions. This differs from common execution-server designs, which require privileged container builds. We translate all str_replace_editor calls into equivalent Bash operations (e.g., first reading a file, editing the file on the host side, and writing it back via cat). A separate dedicated long-running Bash session handles all execute_bash commands.
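As a rough sketch of that translation, the edit itself can be a pure function on the host, with only a read command and a write command sent to the container's Bash session. The helper names, the heredoc marker, and the rule that the old string must match exactly once are assumptions for this example rather than the scaffold's actual code.

```python
import shlex


def apply_str_replace(text: str, old: str, new: str) -> str:
    """Host-side edit: replace `old` with `new`, requiring a unique match."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return text.replace(old, new, 1)


def read_command(path: str) -> str:
    """Bash command that prints the file so the host can capture it."""
    return f"cat {shlex.quote(path)}"


def write_command(path: str, content: str, marker: str = "EOF_EDIT") -> str:
    """Bash command that writes `content` back through a quoted heredoc.

    Quoting the marker disables variable and command expansion in the body.
    A trailing newline is added if the content does not already end with one.
    """
    if marker in content.splitlines():
        raise ValueError("content contains the heredoc marker on its own line")
    body = content if content.endswith("\n") else content + "\n"
    return f"cat > {shlex.quote(path)} <<'{marker}'\n{body}{marker}"


original = "a = 1\nb = 2\n"
edited = apply_str_replace(original, "b = 2", "b = 3")
command = write_command("/tmp/my file.py", edited)
```

Keeping the edit logic on the host means the container only needs `cat`, which fits a restricted single-UID environment.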

Generating a Large-Scale Dataset

SWE-Smith created 10k problem statements from which they obtained 5k trajectories after filtering. As we wanted to scale to at least 100k trajectories, we first created problem statements for the remaining 40k instances in the SWE-Smith dataset. Then we had to unroll 250k agent trajectories to be left with 100k after filtering.

Because everything had to run on MN5, we self-hosted our teacher models. Shortly before our project began, Mistral released Devstral-2-Small, a 24B model achieving up to 68% on SWE-Bench Verified with their own agent scaffold. In our offline OpenHands setup, we achieved around 60%, which still provides a strong margin over the ~40% baseline we aimed to surpass. Our ablations also suggested that teacher strength is secondary during early scaling.

Devstral-2-Small fits efficiently on a single 4×H100 node (256 GB VRAM) using SGLang. In agentic workloads, the main bottleneck is the KV cache memory. With up to ~100 turns per trajectory, re-prefilling the same prefix repeatedly severely degrades throughput. A full 32k context occupies ~5.4 GB, and we found ~40 parallel agents per node to be a good trade-off between cache reuse and decode batch size. We further used N-gram speculative decoding, which proved highly effective due to repetitive code patterns.

Each node can unroll roughly 200–300 trajectories per hour. Sequentially generating 250k trajectories would thus take over a month — so we parallelized aggressively. With ~200 nodes, the entire dataset can be generated in under five hours. Each node operates independently, making job scheduling and dataset partitioning straightforward:

Figure 3. We unroll 250k trajectories in parallel across 250 nodes, each node unrolling multiple trajectories in parallel.
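The back-of-the-envelope numbers above can be reproduced with a few lines. The function names and the strided sharding scheme are illustrative assumptions; the throughput values are the 200–300 trajectories per node-hour quoted in the text.

```python
def wall_clock_hours(total_trajectories: int, per_node_per_hour: float, nodes: int) -> float:
    """Idealized wall-clock time, ignoring queueing and startup overhead."""
    return total_trajectories / (per_node_per_hour * nodes)


def shard(task_ids: list[int], node_index: int, num_nodes: int) -> list[int]:
    """Strided partition: each node gets a disjoint slice, no coordination needed."""
    return task_ids[node_index::num_nodes]


TOTAL = 250_000
for rate in (200, 250, 300):
    single = wall_clock_hours(TOTAL, rate, nodes=1)
    parallel = wall_clock_hours(TOTAL, rate, nodes=200)
    print(f"{rate}/h per node: {single / 24:.0f} days on 1 node, {parallel:.1f} h on 200 nodes")
```

At the midpoint of 250 trajectories per node-hour, one node would need 1,000 hours, while 200 nodes bring the idealized estimate to about five hours. Because the shards are disjoint, each node can run as an independent job.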

Training with Torchtune

We filtered the 250k trajectories to retain only those that passed the final SWE-Smith tests. Because Devstral-2-Small supports contexts up to 256k tokens while Qwen2.5-Coder is trained on 32k, we segmented long traces into approximations of what the agent would observe under context condensation:

Figure 4. To achieve a better match between training and inference distribution despite a mismatch in context length, we split the full trajectory such that every chunk fits into a 32k token context, with all observations visible in a previous chunk being explicitly masked in consecutive chunks.
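One way to read the splitting rule in Figure 4 is sketched below. It is a simplified interpretation under stated assumptions: steps are reduced to token counts, a masked observation costs a fixed `mask_tokens`, and the function only decides which observations are visible in each chunk. The released pipeline may differ in these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action_tokens: int       # tokens in the agent's reasoning and tool call
    observation_tokens: int  # tokens in the tool output that follows


def split_trajectory(steps: list[Step], limit: int = 32_000, mask_tokens: int = 8):
    """Return chunks as lists of (step_index, observation_visible)."""
    chunks = []
    start = 0  # first step whose observation has not been shown in full yet
    while start < len(steps):
        # Earlier steps are replayed with their observations masked.
        used = sum(s.action_tokens + mask_tokens for s in steps[:start])
        end = start
        while end < len(steps):
            cost = steps[end].action_tokens + steps[end].observation_tokens
            if used + cost > limit:
                break
            used += cost
            end += 1
        if end == start:
            raise ValueError(f"step {start} does not fit into the context limit")
        chunks.append([(i, i >= start) for i in range(end)])
        start = end
    return chunks


trajectory = [Step(action_tokens=100, observation_tokens=10_000)] * 5
chunks = split_trajectory(trajectory)
```

For this toy trajectory the first chunk shows steps 0 to 2 in full, and the second chunk replays them with masked observations before showing steps 3 and 4.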

We chose torchtune for supervised fine-tuning due to its simplicity, memory efficiency, and FSDP2 support. Each H100 installed in MN5 provides only 64 GB of VRAM, so we trained across four nodes (16 GPUs total) with full sharding of weights, gradients, and optimizer state in bf16. All models used a learning rate of 5e-5 with a cosine schedule. Activation checkpointing and offloading were necessary to support full 32k context training under these memory constraints.
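A rough memory estimate shows why a single node is not enough for the largest model. The sketch assumes an Adam-style optimizer with two state buffers per parameter and a round 32 billion parameters; the post only states that weights, gradients, and optimizer state are fully sharded in bf16, so both are assumptions.

```python
def sharded_state_gb(
    num_params: float,
    num_gpus: int,
    bytes_per_value: int = 2,    # bf16
    optimizer_buffers: int = 2,  # assumption: Adam-style first and second moments
) -> float:
    """Per-GPU memory for weights, gradients, and optimizer state, in GB (1e9 bytes)."""
    copies = 1 + 1 + optimizer_buffers  # weights + gradients + optimizer buffers
    total_bytes = num_params * bytes_per_value * copies
    return total_bytes / num_gpus / 1e9


one_node = sharded_state_gb(32e9, num_gpus=4)
four_nodes = sharded_state_gb(32e9, num_gpus=16)
print(f"4 GPUs: {one_node:.0f} GB per GPU, 16 GPUs: {four_nodes:.0f} GB per GPU")
```

Under these assumptions the static state alone would fill all 64 GB on each GPU of a single node, while 16 GPUs leave roughly three quarters of each card for activations, which are still large at a 32k context.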

Results

Figure 5. SWE-Bench Verified Pass@1 over the number of training trajectories across distillation methods. Our distillation approach (blue) achieves much faster scaling than prior work. In combination with using a self-hosted Devstral-2-Small as a teacher, this allows us to achieve the same performance at a fraction of the cost. Ours was evaluated on MN5 without internet access.

Interestingly, we observed much more efficient initial scaling compared to SWE-Smith, despite similar teacher performance. However, this quickly saturated, reaching about 40% SWE-Bench Verified resolution rate with only 800 trajectories for the 32B model. From there on, scaling continues after a short plateau at a significantly slower rate.

Figure 6. SWE-Bench Verified Pass@1 over the number of training trajectories across model sizes. All model sizes show similar training dynamics with quick growth preceding a plateau at around 800 trajectories before a second, slower growth regime that does not saturate at the maximum 100k trajectories we consider. Evaluated on MN5 without internet access.

These dynamics are relatively consistent across model sizes: all improve quickly before plateauing at around 800 trajectories and then grow more slowly from ~1600 trajectories onward. The slope of this second stage varies, though. The 14B model comes surprisingly close to the 32B model given sufficient training data, and even the 7B model shows no clear signs of saturation at 100k trajectories.

We hypothesize that these training dynamics are caused by two different training regimes. In the first regime, the model mostly learns how to use the available tools and agent framework effectively. In the second regime, the model then actually learns how to resolve issues more effectively.

Figure 7. SWE-Bench Verified Pass@k over the number of attempts across model sizes. Pass@k scales log-linearly with the number of attempts k for all models, with our 32B model reaching 75.5% Pass@16. Evaluated on MN5 without internet access.

Analysing how resolution rates change with more attempts, we see an improvement of ~15 percentage points with just 3 attempts, and our 32B model reaches 75.5% Pass@16. This indicates that even these small models can solve most tasks with relatively few attempts but lack the high-level guidance to choose the right approach every time. This is a promising sign for a potential RL post-training stage, as it shows that the models did not suffer a mode collapse.
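The post does not say how Pass@k is estimated. A common choice, shown here as an assumption, is the unbiased estimator introduced with the HumanEval benchmark: given n sampled attempts per task of which c are correct, it computes the probability that at least one of k attempts drawn without replacement succeeds.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one task: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # every possible draw of k attempts contains a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Benchmark-level score: average the per-task estimate."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

A task solved in 4 of 16 attempts contributes 0.25 to Pass@1 and 1.0 to Pass@16, which is why the gap between the two reflects tasks the model can solve but does not solve reliably.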

Comparison to Concurrent Work

Concurrently with this work, multiple other groups also scaled SFT for agentic coding, achieving slightly worse results with the same context window and comparable results with larger context windows and better base models. Wang et al. create more issues by translating them across repositories, achieving 52.2% and 22.8% (compared to our 57.1% and 36.4%) on SWE-Bench Verified with their 32B and 7B models, respectively. Tao et al. use a more involved SFT approach that masks incorrect steps, together with the stronger Qwen3 family as base model and a 4x larger 128k context, to achieve 52.6% and 42.2% with their 32B and 8B variants, respectively. Shen et al. introduce soft verification and build on Qwen3 to achieve 49.5% and 31.7% at a 32k context with their 32B and 8B variants, respectively.

Final Thoughts

As we scaled training data 20x compared to SWE-Smith and improved performance by over 15 percentage points on SWE-Bench Verified, we quickly saw the near log-linear scaling described in earlier work saturate, with improvements beyond ~40% becoming super-exponentially harder.

We hope our work helps demystify large-scale agentic coding distillation and encourages more open experimentation in this space. To this end, we release our training and data generation pipeline on GitHub and our models and dataset on Huggingface.

What Comes Next?

If you find yourself wondering: Is masking observations really necessary? Is rejection sampling actually helpful? Are we bottlenecked by environment diversity or trajectory quality? Does unrolling each task multiple times help or hurt? — These are exactly the questions we explore in part 2 of this blog post.

Authors: Christian Mürtz & Mark Niklas Müller

A big thank you to Christian Mürtz, who explored this topic during his Master's Thesis at LogicStar, together with our CTO Mark Müller.

This project was built using MareNostrum 5 ACC, one of Europe’s largest operational GPU clusters with 4,480 H100s. All European researchers and startups can apply for 5,000–50,000 H100-hours via EuroHPC AI Factory calls to reproduce, extend, and improve this work. The grant process is fast and straightforward!

SWE-Star: Best-in-Class Agentic Coding Models

We scale distillation of Agentic Coding Capabilities efficiently, to train a family of best-in-class coding models.

November 21, 2025

Modern engineering teams face a sharp rise in code written by AI tools, yet the rate of software failures continues to grow. Backlogs expand, incidents slip through, and valuable engineering time is burned on maintenance instead of product development.

LogicStar changes this dynamic with fully validated, autonomous bug fixing.

In this short walk-through, our co-founder Mark Müller shows a real example from our own services. The LogicStar Agent detects a tricky concurrency bug, reproduces it in our sandbox, evaluates candidate fixes, validates the correct one with targeted tests, and ships a merge-ready pull request.

Watch the demo video:
“Interested in Self-Healing Software? Check out this walk-through where I demonstrate how the LogicStar Agent finds, reproduces, and fixes a tricky concurrency issue in our codebase, providing me with a merge-ready and well-tested PR.”

Our team now sees several such fixes every day, all generated before we even open our laptops. As one comment noted:

“It is amazing that we are getting a couple of such bug fixes every day and 95 percent of the work is done before we even log in in the morning.”

The result is simple. Faster recovery, fewer regressions, and engineering teams that can finally focus on shipping meaningful product improvements instead of combing through incidents.

If you want to explore what self-healing applications can unlock for your team, get in touch for a trial.

How LogicStar Autonomously Finds and Fixes A Real Bug in Our Production Code

LogicStar Autonomously Finds and Fixes A Real Bug in Our Production Code

November 17, 2025

Over the past year, agentic coding tools like Cursor, Claude Code, and Codex have been adopted at remarkable speed. They already account for roughly 20% of public GitHub PRs [1] and teams using them report up to 50% productivity gains [2] in the early phases of adoption. But as review workloads spike and larger, more complex changes land faster than teams can absorb them, code quality begins to slip. The long-term benefits are far less clear.

In this post, we examine why today’s AI-assisted development workflows hit a wall and how Self-Healing Software can break through it.

[1] insights.logicstar.ai

[2] The AI Productivity Paradox Report

Speed at the Cost of Quality

Analysis of the effects of adopting Cursor over time. The number of commits and added lines increases significantly in the first two months after adoption but falls back to baseline levels afterward. Signs of technical debt (static analysis warnings and code complexity), however, remain high. Reproduced from He, Hao, et al. "Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development."

A recent CMU study [3] analyzing over 800 GitHub repositories that adopted Cursor identified a consistent pattern:

  • 3–5× more code added in the first month
  • ~30% increase in static analysis warnings
  • ~40% increase in code complexity
  • After two months, velocity returned to baseline, while technical debt indicators stayed high

The takeaway is clear: when software can be produced faster than it can be reviewed, tested, and consolidated, quality becomes the limiting factor.

[3] He, Hao, et al. "Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development." arXiv 2025

Why Speed Alone Isn’t Enough

Distribution of the ratio of added and removed lines across GitHub pull requests depending on whether the PR was written by a human or code agent. Humans generally remove and modify more lines compared to agents, which tend to add more new lines. Modified from insights.logicstar.ai.

Agentic coding tools don’t just help developers write code faster; they encourage writing more new code.

Analysing all public PRs on GitHub over the last 6 months, we find that AI-generated PRs tend to add significantly more lines than human-authored ones [1]. This is not just because LLMs generate verbose solutions. It reflects a deeper architectural problem:

  • Understanding and reusing existing code requires a lot of codebase context
  • Code agents can’t persist this context across problems, but have to gather it from scratch every time
  • Generating new code is often easier for the agent than building this context

Effect of AI adoption on developer productivity metrics. While task throughput and PR merge rate increase, the median review time also almost doubled. Reproduced from The AI Productivity Paradox Report.

In parallel, human reviewers now face larger, more complex PRs. Review quality drops, subtle bugs slip through, and duplicated patterns proliferate. The result is predictable: a burst of short-term acceleration followed by a plateau, or even slowdown, as technical debt accumulates and the codebase becomes harder to navigate and context more difficult to gather [3].

How Self-Healing Applications Close the Loop

To achieve sustained acceleration, it isn’t enough for AI to write new features faster. We need AI that also maintains the ever-growing, ever-more-complex codebase.

This means building systems that can automatically:

  • Detect functional, security, and code-quality issues
  • Generate high-quality fixes
  • Validate these fixes for correctness and side effects

In other words, software must be able to self-heal. As a result, development velocity will not just spike briefly before grinding to a halt, but grow sustainably as features get added while issues get automatically resolved.

How LogicStar AI Fits In

At LogicStar, we build exactly this missing piece, a platform for self-healing applications.

Our platform continuously analyzes applications, identifies real issues, generates candidate fixes, and verifies them using rigorous programmatic reasoning. This enables applications to become increasingly resilient, even as AI agents generate more of the underlying code.

A key advantage of LogicStar’s approach is how we understand the codebase. While most code agents use simple search tools like grep to explore a codebase, LogicStar builds a static-analysis–driven knowledge graph of the entire codebase. This persistent representation captures data flows, control flows, invariants, and component relationships that traditional agents must rediscover from scratch on every run. As a result, LogicStar can reason about bugs and validate fixes with far greater efficiency, depth, and consistency.
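To illustrate the general idea of a persistent structural representation, here is a toy call graph built with Python's standard `ast` module. It is only a sketch of the concept and not LogicStar's implementation: a real knowledge graph would also track data flow, control flow, and invariants across files and languages.

```python
import ast
from collections import defaultdict


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function name to the plain function names it calls.

    Calls inside nested functions are attributed to the enclosing function too,
    and method calls such as obj.method() are ignored in this toy version.
    """
    tree = ast.parse(source)
    graph: dict[str, set[str]] = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = graph[node.name]  # creates an entry even without calls
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
    return dict(graph)


example = """
def load():
    return parse(read())

def parse(text):
    return text.strip()

def read():
    return " data "
"""
graph = build_call_graph(example)
```

Once such a graph exists, a question like "what calls `parse`?" becomes a lookup instead of a fresh text search on every run.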

By leveraging this understanding to give software the ability to repair itself, we turn AI-driven feature development from a short-lived boost into long-term, compounding productivity.


Author: Mark Niklas Müller

Closing the Agentic Coding Loop with Self-Healing Software

AI coding agents accelerate development but also drive up complexity and technical debt, causing early productivity gains to fade. Self-Healing Software closes this gap by automatically detecting and fixing issues as fast as new code is generated. LogicStar provides this capability, keeping codebases healthy and velocity sustainable.

September 26, 2025

Evaluating coding agents shouldn’t feel like watching paint dry. Yet with SWE-Bench Verified, it often does—hundreds of Docker images totaling 240 GiB, throttled by rate limits*, turn the first setup on a new machine into a 30-hour ordeal. Want to test across a broader, less overfitted, and more representative set of repositories, by also using our SWA-Bench and SWEE-Bench or your own environments? Good luck; things only get slower.

So we decided to fix that. By restructuring layers, trimming unnecessary files, and compressing the results, we shrank SWE-Bench Verified from 240 GiB to just 5 GiB. Now it downloads in under a minute, making large-scale evaluation and trace generation on cloud machines fast and painless.

*100 images per 6h as an unauthenticated user, 200 as an authenticated user without Docker Hub Pro

Background

Evaluating SWE-Bench Verified requires 500 containerized environments, one for each issue across twelve repositories. Your options are either to build all of them from scratch (and pray all dependencies were pinned) or to pull the prebuilt images from Docker Hub. Neither choice is great. Building takes hours and can introduce inconsistencies. Pulling requires downloading more than 100 GiB of compressed layers and expanding them into 240 GiB of local storage. Even with a Docker Hub Pro subscription and a fast connection, this process takes anywhere from half an hour to several hours. Without a Pro account, rate limits make it even worse—you can spend 30 hours just waiting for pulls to finish.
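The 30-hour figure follows from the rate limits in the footnote above. As a back-of-the-envelope sketch, assuming one pull per instance image and that every 6-hour window is used in full (the function name is ours, purely for illustration):

```python
import math


def hours_to_pull(num_images: int, pulls_per_window: int, window_hours: float = 6.0) -> float:
    """Rough wall-clock estimate when pulls are capped per rate-limit window."""
    windows = math.ceil(num_images / pulls_per_window)
    return windows * window_hours


print(hours_to_pull(500, 100))  # unauthenticated user: 30.0
print(hours_to_pull(500, 200))  # authenticated user without Pro: 18.0
```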

The situation becomes truly painful if you want to evaluate more instances at scale on ephemeral cloud machines. Copying 100s of GiB around the world hundreds of times adds up quickly. So we set out to make the environment images light enough to be dropped onto a fresh machine in minutes.

The Layering Problem

At the core of every Docker image lies a stack of layers representing filesystem changes. When a container runs, Docker (via OverlayFS) looks for the topmost layer containing a requested file and reads it from there. The container itself adds a thin writable layer on top: when you modify a file, Docker copies it into this writable layer so changes never affect the underlying image layers.

This design is clever because it makes image storage and distribution efficient. If two images share a base like ubuntu:latest, they can both use the same base layer and only add their own differences on top. However, every file that is modified is fully duplicated.
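The lookup and copy-on-write behavior described above can be sketched with a toy model. This is not how OverlayFS is implemented; the class and its methods are hypothetical, layers are plain dictionaries, and deletions (whiteouts) are ignored:

```python
from typing import Optional


class OverlayView:
    """Toy model of a layered filesystem: read-only image layers plus one writable layer."""

    def __init__(self, image_layers: list[dict[str, bytes]]):
        self.image_layers = image_layers      # bottom-most layer first
        self.writable: dict[str, bytes] = {}  # thin per-container layer

    def read(self, path: str) -> Optional[bytes]:
        # Search from the top: writable layer first, then image layers top-down.
        for layer in [self.writable, *reversed(self.image_layers)]:
            if path in layer:
                return layer[path]
        return None

    def append(self, path: str, data: bytes) -> None:
        # Copy-on-write: the whole file is copied up before it is modified.
        current = self.read(path) or b""
        self.writable[path] = current + data


base = {"/etc/os-release": b"ubuntu 22.04\n", "/app/main.py": b"print('v1')\n"}
env = {"/app/main.py": b"print('v2')\n"}
view = OverlayView([base, env])

print(view.read("/app/main.py"))          # the topmost image layer wins
view.append("/etc/os-release", b"# edited\n")
print(base["/etc/os-release"])            # the image layer is unchanged
print(view.writable["/etc/os-release"])   # full copy of the file plus the edit
```

Note how a one-line edit stores the entire file again in the upper layer, which is the duplication the rest of this post works around.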

For SWE-Bench, every image starts with ubuntu:22.04. Then comes one of 63 distinct “environment” layers that set up dependencies, and finally one of 500 "instance" layers, including the repository checkout at the right commit.

The problem is that while the environment layers share many dependencies and repositories change very little between commits, the resulting layers are still different. As a result, full copies are created every time. While every checkout is only a few hundred megabytes, that quickly adds up when multiplied by 500 instances.

In short, the way SWE-Bench (Verified) images are constructed leads to hundreds of near-duplicate layers adding up to 240 GiB.

Fixing the Layering Problem

To resolve this, we introduce a technique we call delta layering. Instead of creating a single layer for every checkout containing a full copy of the repository, we post-process the images so that each instance layer only adds the difference (the delta) relative to the previous commit in the chain.

The intuition is simple: two snapshots of the same repository taken only a few weeks apart are nearly identical. Yet in the default layering scheme, both snapshots get packaged as full copies; delta layering removes that duplication.

We build chronological chains—one per repository—where each instance builds directly on top of the previous one. The resulting layers become small changes between commits (including potential dependency changes), instead of big, redundant snapshots. Only Django had so many instances that we had to split it into two chains due to Docker’s hard limit of 125 layers per image.
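To make the idea concrete, here is a minimal sketch of a chronological chain of deltas. It is not our post-processing tool: snapshots are modeled as hypothetical path-to-content dictionaries, and deletions are returned as a plain list, whereas real image layers record them as whiteout entries.

```python
def delta_layer(previous: dict[str, bytes], current: dict[str, bytes]):
    """Return (files to write, paths to delete) that turn `previous` into `current`."""
    upserts = {path: data for path, data in current.items() if previous.get(path) != data}
    deletions = sorted(path for path in previous if path not in current)
    return upserts, deletions


def build_chain(snapshots):
    """Yield one delta per snapshot; snapshots must be in chronological order."""
    previous: dict[str, bytes] = {}
    for snapshot in snapshots:
        yield delta_layer(previous, snapshot)
        previous = snapshot


# Three toy snapshots of the same repository at consecutive commits.
snapshots = [
    {"README": b"v1", "src/a.py": b"a = 1", "src/b.py": b"b = 1"},
    {"README": b"v1", "src/a.py": b"a = 2", "src/b.py": b"b = 1"},
    {"README": b"v1", "src/a.py": b"a = 2", "src/c.py": b"c = 1"},
]
chain = list(build_chain(snapshots))

full = sum(len(data) for snap in snapshots for data in snap.values())
delta = sum(len(data) for upserts, _ in chain for data in upserts.values())
print(f"full copies: {full} bytes, delta chain: {delta} bytes")
```

Only the first link in the chain carries a full copy; every later link stores just the files that changed.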

All of these chains share a common base layer that holds the truly universal pieces - Ubuntu 22.04, Conda, and other system-level dependencies. 

Could we get the same result by just cloning the chronologically last state of the repo and then checking out the right commit? Unfortunately, no. This would leave future commits in the git history, which can and did get exploited by agents to cheat.

Git History and Packfiles

Delta layering solves much of the duplication problem, but there’s a hidden complication: git history. Each SWE-Bench image includes the full git history of the repository up to the point when the issue was created. In principle, this shouldn’t be a huge deal. Git stores its data as a key–value database of objects: commits, trees, and blobs. Adding a new commit just creates a few new objects: the changed files, the changed directories, and the commit object itself. If everything were stored as loose zlib-compressed files in .git/objects, delta layering could simply capture the handful of new objects.

But in practice, git uses packfiles. A packfile bundles thousands of objects into a single large file and applies compression across them. This is great for efficiency, but the problem is that every time a new packfile is generated, that’s an entirely new multi-hundred-megabyte file from Docker's perspective. As a result, all the benefits of delta layering vanish.

To resolve this problem, we restructured the packfiles, creating one per instance, containing all additional git objects. We do lose some of git’s internal compression, but the trade-off is worth it: small, incremental layers instead of massive redundant packfiles.
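The cost difference can be illustrated with simple arithmetic. The sketch below uses made-up sizes (a 300 MB initial history and 1 MB of new objects per instance), not measurements from the benchmark, and the function name is hypothetical:

```python
def chain_layer_cost(new_object_sizes: list[int], repack_every_instance: bool) -> int:
    """Total size added across the instance layers of one chronological chain."""
    total = 0
    history = 0
    for added in new_object_sizes:
        history += added
        # A regenerated packfile is a brand-new file for the layer model,
        # so the whole history so far is stored again in that layer.
        total += history if repack_every_instance else added
    return total


# Hypothetical sizes in MB: initial history, then nine small increments.
sizes_mb = [300] + [1] * 9
print(chain_layer_cost(sizes_mb, repack_every_instance=True))   # one big pack per instance
print(chain_layer_cost(sizes_mb, repack_every_instance=False))  # one small pack per instance
```

With one incremental packfile per instance, the total grows with the amount of new history rather than with the number of instances times the full history.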

Removing Build Artifacts

Many of the images contained leftovers from the build process that were never needed at runtime—installers, caches, etc. For example, the Miniconda installer alone added 136 MB to every image. Pip and Conda caches consumed even more. Removing these shaves off gigabytes at essentially no cost.

Final Compression

In addition to making each layer as small as possible, we also apply cross-layer compression. While Docker’s layer model copies the entire file when a single line changes, compression algorithms are very good at spotting such repeated data.

We chose zstd because it’s fast, highly parallel, and supports very large compression windows. To give the compressor the best shot, we sorted the layers by their chronological chain order. That way, nearly identical layers sit next to each other in the input stream. As a result, the entire benchmark, 240 GiB of raw images, now fits into a single 5 GiB archive.

Using 100 cores, the compression process below takes around ten minutes. Decompression, however, is extremely fast—about forty seconds on a single core.

zstd -T100 -19 --long=31 layers.tar
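Why ordering and window size matter can be shown with a small experiment. The sketch below uses Python's built-in zlib as a stand-in for zstd (an assumption made so the example needs no extra packages): zlib's window is only 32 KiB, whereas `--long=31` gives zstd a 2 GiB window, but the principle is the same. Repeated data is only deduplicated if the earlier copy is still inside the window. The byte strings are random placeholders, not real layers.

```python
import random
import zlib

rng = random.Random(0)
layer_a = rng.randbytes(20_000)                        # stand-in for one layer
layer_b = layer_a[:10_000] + b"!" + layer_a[10_000:]   # near-identical neighbor
unrelated = rng.randbytes(200_000)                     # stand-in for unrelated layers


def compressed_size(*parts: bytes) -> int:
    """Compress the concatenated parts as one stream and return the size in bytes."""
    return len(zlib.compress(b"".join(parts), 9))


adjacent = compressed_size(layer_a, layer_b, unrelated)   # similar layers side by side
separated = compressed_size(layer_a, unrelated, layer_b)  # similar layers far apart
print(f"adjacent: {adjacent} bytes, separated: {separated} bytes")
```

When the two near-identical blobs are adjacent, the second one is encoded almost entirely as back-references; when they are separated by more than the window, it is compressed as if it were new data.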

                         Original Size   Our Size
Uncompressed (Podman)    240 GiB         31 GiB
Compressed (Registry)    106 GiB         12.4 GiB
Compressed II (Zstd)     n/a             5.0 GiB

Summary

All told, our optimizations bring SWE-Bench Verified down from 240 GiB of raw layers to just 31 GiB uncompressed and, with the right compression, to a single archive of only 5 GiB. That archive is small enough to download and unpack in about five minutes on any modern machine. Best of all, the core of our optimization, delta layering, is not SWE-Bench specific and can easily be applied to any other series of execution environments. Because Docker and Podman can’t natively load compressed bundles, we’ve provided helper scripts on GitHub. The final archive itself is hosted on Hugging Face, supporting fast downloads.

If all you care about is the quickest way to set up SWE-Bench Verified, here it is:

curl -L -# "https://huggingface.co/LogicStar/SWE-Bench-Verified-Compressed/resolve/main/saved.tar.zst?download=true" | zstd -d --long=31 --stdout | docker load

What's Next?

Execution environments are not only essential for evaluating code agents but also for training code models. Regardless of whether you do RL or SFT, generating high-quality training data requires diverse agent traces, which in turn require a large number of execution environments. Execution environments which we can now efficiently store and distribute to a large number of ephemeral machines to generate a large number of traces…

Stay tuned to learn more about what comes next.

Authors: Christian Mürtz & Mark Niklas Müller

How We Made SWE-Bench 50x Smaller

We optimized the OCI layer structure of code execution environments to improve storage and distribution at scale

September 10, 2025

In a series of posts, we will outline some of the core technologies behind LogicStar.

At LogicStar AI, we are building the platform for self-healing software applications, leveraging agentic systems to autonomously identify, reproduce, and fix bugs. This requires rigorous testing and thorough validation of every application behavior to avoid introducing new issues or wasting reviewer time. Therefore, test generation is an area of key importance at LogicStar.

Our vision is to deliver substantial value for commercial applications; rather than flashy AI demos, we design LogicStar to avoid wasting developer time in reviewing partial or almost correct pull requests.

To drive innovation in test generation, we have developed and open-sourced SWT-Bench, which was also published at NeurIPS 2024. While the popular SWE-Bench requires code agents to fix given issues, SWT-Bench tests their ability to generate effective tests. This allows us to develop agents that excel at test generation. Within LogicStar, we orchestrate these test and code generation agents, which collaboratively produce well-tested patches for every bug we address.

This system allows our agents to score 84% on the SWT-Bench, beating the previous state-of-the-art of 75.8%, held by the OpenHands team. We achieve this performance by combining multiple agents and models, iteratively refining both code and tests. The seamless orchestration of these agents heavily relies on our proprietary technology, including advanced static analysis tools used directly by our agents. As our agents do not rely on Internet access, there is no risk of leaking your source code, secrets, or your customers' data. Instead, our agents leverage advanced code search capabilities, iterative feedback driven by code execution with coverage metrics, and static analysis tools developed by LogicStar for building codebase understanding.

We are rolling out our latest agent advancements with selected design partners who share our vision for self-healing applications and are helping us shape the future of this technology. Their collaboration ensures that our research delivers immediate value for commercial software. If you also believe in this direction and work with Python, JavaScript or TypeScript repositories, we invite you to sign up here. We will support you through onboarding and ensure full SOC2 compliance.

SWT-Bench Verified – Best Test Generation at 84%

The L* Agent achieves a new state-of-the-art of 84% on SWT-Bench Verified

May 22, 2025

At LogicStar AI, trust, security, and operational excellence are foundational to how we build and deliver our autonomous software maintenance platform.

We’re proud to share that LogicStar has successfully completed a SOC 2 audit conducted by an independent third-party firm, validating the design and implementation of our security controls in alignment with the AICPA Trust Services Criteria.

This achievement reflects our commitment to safeguarding customer data and building secure systems from day one.

Importantly, we’ve also implemented continuous monitoring processes that ensure our controls remain active and effective — not just at a point in time, but throughout our operations.

SOC 2 compliance is one step in our broader mission to build infrastructure our customers and partners can rely on with confidence.

If you’re a customer or vendor and would like to receive a copy of our SOC 2 audit report, please reach out to: info@logicstar.ai

March 3, 2025

We’re excited to announce that LogicStar AI has officially joined the ETH Zurich AI Center as an affiliate member🎉! This partnership is a significant milestone for us, especially since many of our team members have deep roots in AI research at ETH.

As a pioneering AI company focused on autonomous software maintenance, this collaboration strengthens our commitment to advancing AI research and innovation. Partnering with one of the world’s leading AI research hubs at ETH Zurich will accelerate our efforts in building cutting-edge AI agents that autonomously detect, reproduce, and fix software bugs, transforming the way engineers maintain commercial applications. At LogicStar, we combine AI with classical computer science so that engineering teams and AI agents can autonomously maintain commercial applications, resolving issues faster and freeing engineers to focus on innovation by automating tedious application maintenance tasks.

What This Means for LogicStar AI & Our Community:
✅ Access to World-Class Research & Talent - Collaborating with ETH Zurich’s AI experts, faculty, and students to push the boundaries of AI-powered software development.
✅ Advancing AI Reliability & Explainability - Working alongside top researchers to refine AI verification and validation techniques, ensuring robust and trustworthy autonomous coding agents.
✅ Stronger AI Ecosystem - Engaging with startups, industry leaders, and academia to shape the future of self-healing software and AI-driven code maintenance.

This partnership marks a major milestone in our mission to revolutionize software reliability. We’re excited about the journey ahead and look forward to working with ETH Zurich’s brilliant minds to make AI a seamless, dependable partner for engineering teams.

As AI is central to our mission, we are at the forefront of AI research and innovation. Being affiliated with the ETH Zurich AI Center allows us to do just that. We’re excited to collaborate as we advance the field of agentic AI for application maintenance together!

ETH AI Center Affiliation

LogicStar AI Joins the ETH AI Center as an Affiliate! 🚀

February 24, 2025

LLM-generated applications are here. Some well-known tools now offer to turn anyone into an app developer, while others aim to make current developers more productive. Given the concerns about the security, support, and future development of these quickly made apps, we wanted to measure this with a benchmark. Our focus wasn’t on any specific tools but on the LLMs that power them. We present BaxBench, a benchmark of 392 instances: 28 scenarios for LLMs to implement using 14 different frameworks, such as Django, ExpressJS, Flask, and Ruby on Rails, across 6 languages. For the leaderboard and more benchmark details, we have created baxbench.com together with the team at ETH Zurich.

We conducted an analysis of backend applications generated by LLMs with a specific focus on assessing their vulnerability exposure to real security exploits. Backends are integral to many systems and vary in complexity; some are designed to manage the entire state of an application, while others are constructed by integrating multiple specialized services, known as microservices. Applications rely on one or more security critical backends to perform tasks such as handling logins, managing application states, and storing user data. To evaluate these systems, we developed BaxBench, which consists of small and frequently seen tasks for application backends. These backends are granted access to a database for storage, and large language models (LLMs) are tasked with generating their logic based on predefined OpenAPI specifications. Our findings revealed that many of these backends were not secure, as we were able to execute actual attacks against them. This goes beyond mere analysis or tool-generated warnings about hypothetical security issues - we successfully executed real exploits, including SQL injection, path traversal, and user impersonation. It is crucial to emphasize that our specifications did not suggest any vulnerabilities; the vulnerabilities we exploited arose from the outputs generated by the LLMs.

One interesting point is that we can ask LLMs to fix these vulnerabilities, and they manage to solve many of them. They do best when we tell them exactly what we will be trying to exploit. However, even then, not every vulnerability goes away and there is a trade-off - when security issues are fixed, we measure that some apps stop working properly. This creates a big opportunity for tools like the ones we are building at LogicStar, which can both identify and fix these security issues. And of course, the benchmark is open source, so security and application development experts can help us add more scenarios or new ways to exploit vulnerabilities. In addition to this, we expect that LLMs will also get better thanks to benchmarks like BaxBench.

Looking deeper, it’s clear that correctness and security aren’t the only challenges. LLMs also struggle to create reliable code in different backend frameworks, especially those that aren’t the most popular ones. Engineers see firsthand that LLMs can have trouble with complex and varied tasks, which means the results aren’t always perfect. Sometimes, even the best tools get stuck and can’t improve an app by just using more LLM prompts. However, to make progress, you have to start by measuring the problem. With BaxBench, we looked at security. Going forward, at LogicStar we are focusing on checking and improving how well the models can understand existing apps, fix real problems in supporting them, and ultimately make their and your end users happier.

For more work on maintaining software, you can follow our research, blogs and social media. If you’re running an app with Python backends, we’d love to talk about how our early access product can help maintain that app by fixing bugs.

Need more information? Have a look at the paper and the baxbench.com website.

Please file issues or contribute to the benchmark code.

Introducing BaxBench

BaxBench: Can LLMs Generate Secure and Correct Backends?

February 4, 2025

On 4th February, 2025, TechCrunch’s senior reporter Natasha Lomas wrote this article about LogicStar.

The text of the article is quoted below: “ Swiss startup LogicStar is bent on joining the AI agent game. The summer 2024-founded startup has bagged $3 million in pre-seed funding to bring tools to the developer market that can do autonomous maintenance of software applications, rather than the more typical AI agent use-case of code co-development.

LogicStar CEO and co-founder Boris Paskalev (pictured top right, in the feature image, with his fellow co-founders) suggests the startup’s AI agents could end up partnering with code development agents - such as, say, the likes of Cognition Labs’ Devin - in a business win-win.

Code fidelity is an issue for AI agents building and deploying software, just as it is for human developers, and LogicStar wants to do its bit to grease the development wheel by automatically picking up and fixing bugs wherever they may crop up in deployed code.

As it stands, Paskalev suggests that “even the best models and agents” out there are unable to resolve the majority of bugs they’re presented with - hence the team spying an opportunity for an AI startup that’s dedicated to improving these odds and delivering on the dream of less tedious app maintenance.

To this end, they are building atop large language models (LLMs) - such as OpenAI’s GPT or even China’s DeepSeek - taking a model-agnostic approach for their platform. This allows LogicStar to dip into different LLMs and maximize its AI agents’ utility, based on which foundational model works best for resolving a particular code issue.

Paskalev contends that the founding team has the technical and domain-specific knowledge to build a platform that can resolve programming problems which can challenge or outfox LLMs working alone. They also have past entrepreneurial success to point to: he sold his prior code review startup, DeepCode, to cybersecurity giant Snyk back in September 2020.

“In the beginning we were thinking about actually building a large language model for code,” he told TechCrunch. “Then we realized that that will quickly become a commodity… Now we’re building assuming all those large language models are there. Assuming there’s some actually decent [AI] agents for code, how do we extract the maximum business value from them?”

He said that the idea built on the team’s understanding of how to analyze software applications. “Combine that with large language models - then focus into grounding and verifying what those large language models and the AI agent actually suggest.”

Test-driven development

What does that mean in practice? Paskalev says LogicStar performs an analysis of each application that its tech is deployed on - using “classical computer science methods” - in order to build a “knowledge base”. This gives its AI agent a comprehensive map of the software’s inputs and outputs; how variables link to functions; and any other linkages and dependencies etc.

Then, for every bug it’s presented with, the AI agent is able to determine which parts of the application are impacted - allowing LogicStar to narrow down the functions needing to be simulated in order to test scores of potential fixes.

Per Paskalev, this “minimized execution environment” allows the AI agent to run “thousands” of tests aimed at reproducing bugs to identify a “failing test”, and - through this “test-driven development” approach - ultimately land on a fix that sticks.

He confirms that the actual bug fixes are sourced from the LLMs. But because LogicStar’s platform enables this “very fast executive environment” its AI agents can work at scale to separate the wheat from the chaff, as it were, and serve its users with a shortcut to the best that LLMs can offer.

“What we see is [LLMs are] great for prototyping, testing things, etc, but it’s absolutely not great for [code] production, commercial applications. I think we’re far from there, and this is what our platform delivers,” he argued. “To be able to extract those capabilities of the models today, we can actually safely extract commercial value and actually save time for developers to really focus on the important stuff.”

Enterprises are set to be LogicStar’s initial target. Its “silicon agents” are intended to be put to work alongside corporate dev teams, albeit at a fraction of the salary required to hire a human developer, handling a range of app upkeep tasks and freeing up engineering talent for more creative and/or challenging work. (Or, well, at least until LLMs and AI agents get a lot more capable.)

While the startup’s pitch touts a “fully autonomous” app maintenance capability, Paskalev confirms that the platform will allow human developers to review (and otherwise oversee) the fixes its AI agents call up. So trust can be - and must be - earned first.

“The accuracy that a human developer delivers ranges between 80 to 90%. Our goal [for our AI agents] is to be exactly there,” he adds.

It’s still early days for LogicStar: an alpha version of its technology is in testing with a number of undisclosed companies which Paskalev refers to as “design partners”. Currently the tech only supports Python - but expansions to Typescript, Javascript and Java are billed as “coming soon”.

“The main goal [with the pre-seed funding] is to actually show the technology works with our design partners - focusing on Python,” adds Paskalev. “We already spent a year on it, and we have lots of opportunity to actually expand. And that’s why we’re trying to focus it first, to show the value in one case.”

The startup’s pre-seed raise was led by European VC firm Northzone, with angel investors from DeepMind, Fleet, Sequoia scouts, Snyk and Spotify also joining the round.

In a statement, Michiel Kotting, partner at Northzone, said: “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense. and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

LogicStar is operating a waiting list for potential customers wanting to express interest in getting early access. It told us a beta release is planned for later this year. “

TechCrunch Article About LogicStar

A TechCrunch article about us titled LogicStar is building AI agents for app maintenance

December 18, 2024

SWT-Bench: Benchmarking CodeAgents’ Test Generation Capabilities

As the complexity of modern software systems grows, so does the challenge of ensuring their reliability. To this end, rigorous testing plays a critical role in maintaining high software quality. However, while the rise of Large Language Models (LLMs) has catalyzed advancements in code generation, their potential in test automation remains underexplored. Enter SWT-Bench, a novel benchmark for test generation based on real-world GitHub issues, developed in collaboration with ETH Zurich. With the release of a public leaderboard at swtbench.com, we aim to spark a similar push from the research community on test generation as SWE-Bench caused for code generation.

What is SWT-Bench?

SWT-Bench is a test generation benchmark based on real-world GitHub issues. The objective is to generate a test reproducing the described issue given the full codebase. We determine whether a test reproduces an issue by checking whether it fails on the original codebase but passes after a human-written ground truth fix, taken from a corresponding pull request (PR), has been applied. We call the share of issues reproduced this way the success rate $\mathcal{S}$. Additionally, we measure the coverage $\Delta \mathcal{C}$ of the lines modified in this ground truth bug fix to further assess the test quality.
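The reproduction criterion can be written down in a few lines. This is a minimal sketch of the fail-then-pass check only, not the benchmark harness; the `Outcome` type and function names are hypothetical, and the coverage metric is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    fails_before_fix: bool  # generated test fails on the original codebase
    passes_after_fix: bool  # generated test passes once the ground truth fix is applied


def reproduces_issue(outcome: Outcome) -> bool:
    """A test reproduces the issue only if it fails before and passes after the fix."""
    return outcome.fails_before_fix and outcome.passes_after_fix


def success_rate(outcomes: list[Outcome]) -> float:
    """Fraction of instances whose generated test reproduces the issue."""
    if not outcomes:
        return 0.0
    return sum(reproduces_issue(o) for o in outcomes) / len(outcomes)


outcomes = [
    Outcome(True, True),    # reproduces the issue
    Outcome(False, True),   # passes even without the fix, so it tests nothing new
    Outcome(True, False),   # still fails after the fix
    Outcome(True, True),    # reproduces the issue
]
print(success_rate(outcomes))  # 0.5
```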

How did we create SWT-Bench?

Starting with over 90,000 PRs from 12 popular GitHub repositories, we applied rigorous filtering to obtain 1,900 diverse and high-quality instances. SWT-Bench thus reflects the complexity of modern software ecosystems, challenging AI systems to navigate large codebases (up to 700k lines), interpret nuanced issue descriptions (320 words average), and integrate tests into diverse existing test suites and frameworks (from pytest to tox to custom frameworks).

First Results

Performance of Code Agents

We found that Code Agents, originally designed for program repair (e.g. SWE-Agent), perform well on test-generation tasks, even outperforming dedicated test-generation methods (LIBRO). However, even minimal modifications like explicitly instructing the agent to execute the generated tests (SWE-Agent+) significantly improve performance further. This highlights the potential of dedicated Code Agents for test generation.

A new Patch Format for Test Generation

Based on the insight that test generation is typically solved by adding a new (test) function or class, we propose a novel patch format tailored for fault tolerance and simplicity. This format alone allows vanilla LLMs to generate executable tests in twice as many cases (ZeroShot vs ZeroShotPlus), leading to almost 3 times as many solved instances.

Utility of generated tests:
Automatically generating high-quality tests not only allows developers to focus on (test-driven) development generating real business value but can also boost the performance of code generation agents. In particular, the generated tests can guide them along the whole generation process from informing context localization to bug fix validation. Early results show that simply using generated tests to filter proposed bug-fixes can more than double the achieved precision.
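The filtering idea can be sketched as follows. The patch names and outcomes below are invented purely for illustration and are not results from the paper; in practice the `test_passes` signal would come from running the generated test against each patched codebase.

```python
def precision(patches: list[str], is_correct: dict[str, bool]) -> float:
    """Fraction of the given patches that are actually correct."""
    if not patches:
        return 0.0
    return sum(is_correct[p] for p in patches) / len(patches)


# Hypothetical candidate fixes for one issue, with made-up ground truth.
is_correct = {"p1": True, "p2": False, "p3": False, "p4": True, "p5": False}
# Hypothetical outcome of running the generated reproducing test on each patch.
test_passes = {"p1": True, "p2": False, "p3": True, "p4": True, "p5": False}

candidates = list(is_correct)
kept = [p for p in candidates if test_passes[p]]  # keep only patches the test accepts

print(f"before filtering: {precision(candidates, is_correct):.2f}")
print(f"after filtering:  {precision(kept, is_correct):.2f}")
```

The generated test is an imperfect oracle (it accepts the incorrect `p3` here), so filtering raises precision without guaranteeing correctness.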

Correlation of Test and Fix Generation
While we observe that Code Agents that perform well on code generation also perform well on test generation, we interestingly do not see such a correlation for individual issues. That is, an issue that is easy to fix is not necessarily easy to test, and vice versa. Indeed, we see no statistically significant correlation between the hardness/resolution rate of these tasks, highlighting the unique challenges of test generation.

Implications for the Future of Software Maintenance

SWT-Bench demonstrates the capability of LLMs to interpret and formalize the intent of natural language issue descriptions into tests. This has the potential to significantly improve software quality in the long run by making thorough testing attainable without significant manual effort. As a next step, this can even enable self-healing systems by automatically detecting, reproducing, and resolving issues in real-time, as they appear, minimizing downtime and increasing reliability.

We at LogicStar AI believe that reliable automated testing is the key to unlocking the real potential of Code Agents and will be essential to push the frontier in automated application maintenance. Therefore, we are extra excited to see the great interest of the community in SWT-Bench and hope that our public leaderboard can make it even more accessible.

For more details, check out our NeurIPS paper (https://arxiv.org/pdf/2406.12952) or our open-source code (https://github.com/logic-star-ai/swt-bench).

Introducing the SWT-Bench Leaderboard!

SWT-Bench: Benchmarking CodeAgents' Test Generation Capabilities

December 5, 2024

Researchers and entrepreneurs from INSAIT and ETH Zurich have launched LogicStar AI, a new deep-tech startup, which is building fully autonomous agentic AI that helps teams maintain their software.

Founding Team
The founding team behind LogicStar AI is star-studded and includes the founders of DeepCode.ai (now Snyk Code) which is currently delivering more than $100M ARR for Snyk. The LogicStar AI founders are:

🌟 Boris Paskalev (CEO), formerly CEO of DeepCode, then Director of Product AI at Snyk, and currently a Strategic Entrepreneurship Advisor at INSAIT.
🌟 Dr. Mark Niklas Müller (CTO), AI PhD from ETH Zurich, formerly an engineer at Porsche and Mercedes AMG Petronas F1 Team.
🌟 Dr. Veselin Raychev (Chief Architect), formerly CTO of DeepCode, then Head of AI at Snyk and researcher at INSAIT.
🌟 Prof. Martin Vechev (Adviser), full professor at ETH Zurich, founder and scientific director of INSAIT.

LogicStar AI Mission
LogicStar AI is building an agentic AI platform for automatically validating, reproducing, and fixing bugs with high precision. Their technology empowers engineering teams to focus on creating new features, driving growth and innovation, by reducing the burden of maintenance and debugging issues. With LogicStar AI, developers can thus spend more of their time delivering real business value, while LogicStar AI reliably addresses software maintenance problems without manual intervention.

INSAIT’s mission
INSAIT (https://insait.ai) is a world-class computer science and AI research institution, founded in 2022 in partnership with Switzerland’s ETH Zurich and EPFL. The focus of INSAIT is on conducting world-class research and attracting outstanding faculty, research scientists, postdocs, and PhD students. In the short time since its inception, INSAIT has published over 50 papers in all major AI venues, as well as in premier theory conferences.

Join the Journey
As LogicStar AI embarks on this exciting new chapter, we invite talented individuals to join our team and shape the future of reliable AI for code and software applications. For more information on career opportunities, partnerships, or to learn about our innovative solutions, please visit our website and follow LogicStar on LinkedIn. LogicStar AI has offices in both Sofia, Bulgaria and Zurich, Switzerland.

Agentic AI from INSAIT and ETH Zurich

INSAIT and ETH Zurich Entrepreneurs launch LogicStar AI, a new Agentic AI startup

Read more
October 17, 2024

We are excited to share SWT-Bench, the first benchmark for reproducing bugs and validating their fixes based on GitHub issue descriptions. We presented SWT-Bench at two ICML workshops and want to thank everyone who stopped by for their interest, enthusiasm, and the great discussions we had. We now see a community trend to focus not only on fixing bugs but also on generating tests that effectively reproduce them and validate that proposed fixes truly resolve the issues. We believe this is essential for achieving truly autonomous bug fixing, which is what LogicStar delivers.

In our paper, we demonstrate how any code repair benchmark with a known ground truth solution can be transformed into a test generation and issue reproduction benchmark. There, the goal is to create a “reproducing test” that fails on the original codebase and passes after the ground truth fix has been applied. Our analysis shows that Code Agents excel in this task and outperform dedicated LLM-based test generation methods. Leveraging these tests for code repair further allows us to significantly enhance precision. To learn more, please check out our preprint paper.
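The "reproducing test" criterion above can be sketched in a few lines of Python. This is only an illustration of the fail-before, pass-after rule under our own assumptions, not the SWT-Bench harness: the names `Outcome`, `GeneratedTestRun`, `is_reproducing`, and `reproducing_tests` are hypothetical, and the sample results are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"


@dataclass(frozen=True)
class GeneratedTestRun:
    """Result of one generated test on both versions of the codebase (hypothetical type)."""
    name: str
    before_fix: Outcome  # outcome on the original codebase
    after_fix: Outcome   # outcome once the ground truth fix has been applied


def is_reproducing(run: GeneratedTestRun) -> bool:
    """A test reproduces the issue if it fails before the fix and passes after it."""
    return run.before_fix is Outcome.FAIL and run.after_fix is Outcome.PASS


def reproducing_tests(runs: list[GeneratedTestRun]) -> list[str]:
    """Return the names of all tests that satisfy the reproducing-test criterion."""
    return [run.name for run in runs if is_reproducing(run)]


# Illustrative, made-up results for three generated tests.
runs = [
    GeneratedTestRun("test_issue_repro", Outcome.FAIL, Outcome.PASS),
    GeneratedTestRun("test_unrelated", Outcome.PASS, Outcome.PASS),
    GeneratedTestRun("test_still_broken", Outcome.FAIL, Outcome.FAIL),
]
print(reproducing_tests(runs))
```

Only the first test counts here: a test that passes on both versions says nothing about the issue, and a test that fails on both versions does not confirm the fix.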

LogicStar AI builds on this research to achieve truly autonomous bug fixing that you can trust as much as you trust your top engineers.

SWT-Bench

A Benchmark for Testing and Validating Bugfixes

Read more
July 1, 2024

Zurich, Switzerland - 4th February, 2025 - LogicStar, the AI agent for fully autonomous software maintenance, has raised $3M (CHF 2.6M) in a pre-seed funding round led by Northzone, with angel investors from DeepMind, Snyk, Spotify, Fleet and Sequoia scouts. LogicStar empowers engineering teams to focus on innovation by automating tedious application maintenance tasks.

LogicStar is revolutionising software maintenance with its autonomous code agent, designed to deliver self-healing software applications that empower engineers to focus on innovation and growth. LogicStar works seamlessly alongside human developers, autonomously reproducing application issues, testing solutions and proposing precise fixes without the need for human oversight. The world’s rapid adoption of AI has sparked a wave of pivotal trends that are reshaping industries and workflows. Organisations relying on custom software spend considerable resources and time on maintenance and bug fixes, which divert developers from innovation. AI coding agents, despite performing well on benchmarks and simple tasks, tend to introduce errors in complex settings, leaving teams stuck with tedious maintenance tasks.

The founding team consists of Boris Paskalev, Veselin Raychev, Mark Müller, and Prof. Dr. Martin Vechev. Boris, Veselin, and Martin previously built DeepCode.ai (acquired by Snyk and now called Snyk Code), a technology trusted by millions of developers, and scaled it to over $100M ARR. Martin also leads the Secure, Reliable, and Intelligent Systems (SRI) lab at ETH Zurich and is the Founder and Scientific Director of INSAIT. The unique technology behind LogicStar draws from the team’s deep research background and expertise from ETH Zurich, MIT, TRIUM, and INSAIT, resulting in over 20,000 citations and 350 top publications in AI and program analysis, particularly in large codebases and software development.

LogicStar has already released SWT-Bench to support the development of code agents and demonstrated that existing code agents are not up to the challenge of enterprise codebases, failing on >95% of issues. Using an advanced mock execution environment, LogicStar swiftly runs generated tests to reproduce issues and confirm solutions, spotting errors before you’re aware of them. At the core of LogicStar’s technology lies a blend of the latest advancements in LLMs for code and classical computer science techniques. The platform is rapidly evolving, with Python support already available and expansions to TypeScript, JavaScript, and Java coming soon. Technology leaders managing commercial software systems are invited to join the waiting list to experience the benefits of LogicStar firsthand.

Boris Paskalev comments, “I am excited that developers can focus on innovation and creativity while automation handles the burden of application maintenance. Our platform eliminates the need to oversee the current generation of agents and LLMs in maintaining commercial software, providing an evolving solution that seamlessly grows alongside LLM advancements and maximises successful task completion.”

Michiel Kotting, partner at Northzone, adds, “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense, and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

About LogicStar
LogicStar is the AI agent for fully autonomous application maintenance. Founded in 2024, the company is headquartered in Zurich, Switzerland and is backed by global venture firm Northzone, as well as angels from DeepMind, Snyk, Spotify, Fleet, and Sequoia Scouts.

About Northzone
Northzone (northzone.com) is a global venture capital fund built on experience spanning multiple economic and disruptive technology cycles. Founded in 1996, Northzone has raised more than ten funds to date, with its most recent fundraise in excess of $1.2 billion, and has invested in more than 175 companies, including category-defining businesses such as Trustpilot, Spotify, Klarna, iZettle, Kahoot!, Personio, TrueLayer, and Spring Health, amongst others. Northzone is a full-stack investor from Seed to Growth stage, with transatlantic hubs out of London, New York, Amsterdam, Berlin, Stockholm and Oslo.

LogicStar AI raised a $3m round led by Northzone

LogicStar, building the AI agent for fully autonomous application maintenance, raised a $3m round led by Northzone.

Read more
July 1, 2024

LogicStar AI is looking for passionate software engineers to join our team. We are a team of researchers, engineers, and product people who focus on cutting-edge research and on quickly turning it into a product. If you are interested in working with us, please review our jobs on our careers page and send your resume to jobs@logicstar.ai.

Jobs

We are looking for passionate software engineers to join our team

Read more
April 11, 2024

🚀 We are thrilled to introduce LogicStar, a pioneering deep-tech startup based in Switzerland, revolutionizing application monitoring and maintenance.

Our cutting-edge platform blends AI with proven computer science methodologies to create agentic AI-tailored application mocks. These mocks reproduce software bugs in an AI-powered mock execution environment, enabling scalable evaluation and verification of AI-driven fix suggestions.

Our exceptional team comprises experts and top researchers from ETH Zurich, INSAIT, and MIT, alongside seasoned serial entrepreneurs, united by a shared mission to redefine the future of software reliability.

✨ Join us on this transformative journey as we push the boundaries of application monitoring and maintenance with groundbreaking innovation.

Introducing LogicStar

We are excited to announce the launch of LogicStar AI, our startup to revolutionize application monitoring.

Read more

Stop guessing what to fix

Start fixing what matters

LogicStar shows the bugs impacting customers and revenue, ranked and ready to act on.

No workflow changes. Results in ~1 hour.

Screenshot of LogicStar generating production-ready pull requests with 100 percent test coverage, static analysis, and regression validation