September 27, 2025

At LogicStar, our mission is to build a platform for self-healing applications. This relies on a strong bug-fixing backbone and review system working hand in hand to produce high-quality code fixes where possible, while abstaining rather than proposing incorrect fixes. We are therefore excited to announce that, in addition to the best test generation system (announced last week), we have now reached the state of the art in fix generation with 76.8% accuracy on SWE-Bench Verified, the most competitive benchmark for automated bug fixing. Combining these systems, we achieve 80% precision, i.e., when our agent proposes a code fix, it is ready to merge 8 out of 10 times.

We are particularly proud that we achieved these results with our cost-effective production system rather than an agent carefully tuned for SWE-Bench and too expensive to ever run on customer problems. To achieve this, our L* Agent v1 leverages only the cost-effective OpenAI GPT-5 and GPT-5-mini models, breaks the bug-fixing problem down into clear sub-problems, and then orchestrates multiple sub-agents to investigate, reproduce, and fix the issue, before carefully reviewing and testing the generated code fix. All of this is enabled by our agent’s unique codebase understanding, powered by proprietary static analysis.

So, how does our L* Agent work, and why is it so cost-effective? The main insight is to combine a strong model (GPT-5), which generates baseline patches and tests, with multiple cheaper agents based on GPT-5-mini that increase diversity, before picking the best patch using our state-of-the-art tests. All of this is enabled by our static-analysis-powered codebase understanding, which boosts the performance of both the weak and strong models.
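In pseudocode, the selection step can be sketched as follows. This is a minimal illustration with hypothetical names and signatures, not our production API: it assumes candidate patches and validation tests have already been generated, plus a run_test callback that executes one test against a patched checkout.

from typing import Callable, Optional, Sequence

def select_patch(
    candidates: Sequence[str],             # patches from the GPT-5 and GPT-5-mini agents
    tests: Sequence[str],                  # validation tests generated by the test agents
    run_test: Callable[[str, str], bool],  # (patch, test) -> True if the test passes
) -> Optional[str]:
    """Pick the candidate patch that passes the most validation tests,
    and abstain (return None) unless one passes all of them."""
    best_score, best = -1, None
    for patch in candidates:
        score = sum(run_test(patch, test) for test in tests)
        if score > best_score:
            best_score, best = score, patch
    # Abstaining on partial passes trades recall for precision: a proposed
    # fix should be merge-ready, not merely plausible.
    return best if best_score == len(tests) else None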

We prioritize correctness and validation over speed, processing all issues asynchronously as soon as they appear in your bug backlog or observability stack. This approach ensures you don’t have to waste time manually triaging and reviewing issues but simply receive high-quality patches from LogicStar for the issues we can solve confidently. We are now turning this technology into a lovable product, and invite you to sign up as a design partner if you’d like to help us build a system that will reliably maintain your code. While SWE-Bench is an important benchmark, it’s only part of the story — we are developing our agents for real-world use, not only benchmarks, so be sure to follow us for more updates.

SWE-Bench Verified – Best Fix Generation at 76.8%

The L* agent achieves state-of-the-art results on SWE-Bench Verified using an ensemble of cheap agents and strong validation



March 9, 2026

Beyond SWE-bench: The Hardest Problem in AI Software Engineering Isn’t Writing Code

Over the past two years, coding agents have made astonishing progress. Modern models can write entire functions, generate patches, and even implement large features. Benchmarks like SWE-bench have become a standard way to evaluate these capabilities. But something important is changing. Recently, OpenAI explained why they are moving away from evaluating models using SWE-bench Verified as the primary benchmark for AI software engineering systems. Their reasoning reflects a deeper shift in how the industry is thinking about AI-driven development. The problem is no longer just writing code. The real problem is deciding what code should change in the first place.

What SWE-bench measures well

SWE-bench has played an important role in advancing AI coding systems. The benchmark asks models to resolve real GitHub issues by producing patches that pass the project's test suite. In simplified form, the evaluation looks like this: the agent receives a repository, reads an issue description, generates a patch, and the patch must pass the tests. This measures an important capability: can an AI system implement a fix once the problem is clearly defined? But this setup hides an important simplification. In real software engineering, the hardest part is rarely writing the patch. It is figuring out what the correct change should be.
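Schematically, and with hypothetical helper names (the real harness builds per-instance containers and runs designated fail-to-pass and pass-to-pass tests), that contract looks like this:

from typing import Callable

def evaluate_instance(
    generate_patch: Callable[[str, str], str],  # agent: (repo_path, issue_text) -> diff
    apply_patch: Callable[[str, str], None],    # applies the diff to the checkout
    tests_pass: Callable[[str], bool],          # runs the project's test suite
    repo_path: str,
    issue_text: str,
) -> bool:
    """The SWE-bench contract in miniature: the problem statement is handed
    to the agent fully formed, and success is a passing test suite."""
    patch = generate_patch(repo_path, issue_text)
    apply_patch(repo_path, patch)
    return tests_pass(repo_path)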

Real engineering rarely starts with a clean issue

Benchmarks assume a well-formed problem statement. Real software development rarely looks like that. Instead, engineers see signals coming from many different systems: logs and observability platforms, incident alerts, bug trackers, security scanners, static analysis findings, failing CI tests, and customer reports. Each signal may represent only a symptom of a deeper issue. Before any fix can happen, engineers must answer a much harder question: which issue actually matters right now? Answering this requires architectural understanding, system knowledge, and engineering judgment.

Coding agents are excellent executors

Recent research from ETH Zürich and LogicStar explored this challenge with a benchmark called CodeTaste. CodeTaste evaluates whether coding agents perform large-scale refactorings in ways that align with human engineers. Unlike many benchmarks, CodeTaste focuses on architectural changes across large codebases. The benchmark contains one hundred real refactoring tasks extracted from open-source repositories across six programming languages, and each task touches roughly ninety-one files on average. Instead of measuring only correctness, CodeTaste measures alignment, which rewards changes that match the structure chosen by the original human refactoring while preserving functional correctness. In other words, it evaluates whether the automated change preserves the architectural intent behind the human change. The results are revealing. When given a detailed refactoring blueprint, frontier models achieve alignment scores of up to 70%. When given only a high-level goal, alignment collapses to below 8%. Even when agents first propose a plan and then implement it, alignment improves only to around 14%.

Instruction followers, not architects

These results highlight an important limitation. Coding agents today are extremely capable instruction followers. When a plan exists, they can execute it, but they still struggle with engineering judgment. Experienced engineers do not just write code. They decide what problem actually needs to be solved, how large the change should be, which architectural trade-offs are acceptable, and how to maintain long-term system integrity. In other words, the real challenge in software engineering is not writing the patch. It is identifying the right problem to solve and making the architectural trade-offs required to address it sustainably.

The missing layer in AI software systems

The industry has invested enormous effort in improving code generation. But the future of AI in the SDLC likely depends on solving a different problem. Between detection and fixing lies a critical layer: decision making. This layer determines which signals represent real problems, which issues have the highest impact, and what changes should actually be made. Without this layer, AI systems remain tools that help engineers write code. With it, they begin to approach autonomous software engineering systems. A realistic AI engineering system therefore needs three capabilities working together (see the sketch after this list):

  • Detection gathers signals from across the development lifecycle, including static analysis, observability systems, CI pipelines, security scanners, and bug trackers
  • Decision determines what should be fixed and why, through signal correlation, root cause discovery, impact estimation, architectural reasoning, and prioritization
  • Execution generates and validates the actual code changes through patch generation, refactoring, automated pull requests, and testing

Most current AI tools focus primarily on the execution layer, but without the decision layer, automation risks optimizing the wrong problems.
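As a minimal sketch of how these layers compose, with hypothetical types and function names (nothing here reflects a specific product API):

from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

@dataclass
class Signal:
    source: str    # e.g. "ci", "observability", "security-scanner", "bug-tracker"
    payload: str

@dataclass
class Decision:
    root_cause: str
    priority: float

def run_pipeline(
    detect: Callable[[], Iterable[Signal]],                # detection layer
    decide: Callable[[List[Signal]], Optional[Decision]],  # decision layer
    execute: Callable[[Decision], str],                    # execution layer
) -> Optional[str]:
    """One pass of detect -> decide -> execute; returns a validated change
    (e.g. a pull request) or None when nothing is worth fixing right now."""
    signals = list(detect())
    decision = decide(signals)  # correlate signals, find root causes, prioritize
    if decision is None:
        return None             # abstaining is a first-class outcome
    return execute(decision)    # generate and validate the actual code change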

Most AI coding tools optimize execution. Real software engineering requires a decision layer that determines what should actually be fixed.

Toward truly autonomous software systems

The vision of autonomous software development is becoming increasingly realistic. Coding agents will continue to improve rapidly, but the next breakthrough may not come from models that write code faster. It will come from systems that understand what changes should happen and why. Future systems must preserve architectural intent when engineers guide them and will need to develop architectural foresight as automation increases. Benchmarks like SWE-bench helped the industry measure the first generation of AI coding capabilities, while research like CodeTaste begins to measure the next generation: the ability to align automated changes with human engineering judgment.

Research credits

CodeTaste was developed as part of an ETH Zürich MSc thesis by Alex Thillen, supervised by Niels Mündler, Martin Vechev, and Veselin Raychev.


Coding agents can write code. But can they decide what actually matters in large software systems? We explore why the next generation of AI software tools must move beyond patch generation toward architectural judgment.

February 2, 2026

We release the SWE-Star model family: 7B, 14B, and 32B models based on Qwen2.5-Coder variants and trained on a dataset of 250,000 agent trajectories. Our largest model, SWE-Star-32B, reaches 57.1% on SWE-Bench Verified, setting a new state-of-the-art among open-data models in this size class. The 14B variant reaches 52.8%, significantly outperforming other models of its size. Finally, the 7B variant achieves 36.4% without any signs of saturation, showing the promise of even small models. Using only a single attempt instead of the standard OpenHands iterative protocol of at most 3 attempts until a solution is submitted, we achieve a Pass@1 of 52.4%, 49.8%, and 32.8%, respectively.

Figure 1. SWE-Bench Verified Performance; the top numbers use OpenHands’ iterative protocol of at most 3 attempts until a solution is submitted; the bottom numbers use only a single attempt. Evaluated with internet access.

We generate our dataset using a custom lightweight agent, driven by Devstral-2-Small as the teacher model, on SWE-Smith environments. We used MareNostrum 5 (MN5), a European public supercomputer with 4,480 H100 GPUs, for all data generation, training, and evaluation. In this post, we describe how we scaled agentic data generation, training, and evaluation in its highly restricted HPC environment (no Docker, no outbound internet, and massive parallelism) and how these constraints shaped the system design. We also open-source our full agent scaffold, data generation pipeline, and training infrastructure so other researchers can build on this work on similar clusters.

Scaling Distillation for Agentic Coding

Ever since the original scaling laws paper, scaling has been the dominant recipe for improving models: more parameters, more data, and more compute, at first focused on pretraining and more recently on mid- and post-training.

Distilling from strong teacher models is an attractive alternative to scaling post-training because it promises to be a sample-efficient way to let smaller models learn long-horizon reasoning and tool-use behaviors without the overhead of full RL. Over the past year, several works have built SWE-style environments to enable this. Most notably, SWE-Smith introduced a scalable pipeline for injecting bugs into real codebases and back-translating them into realistic but synthetic issues. Using this pipeline, they created 5k agent trajectories with Claude 3.7 Sonnet and observed almost perfect log-linear scaling, pushing Qwen2.5-Coder-32B from 10% to 40%.

Figure 2. SWE-Bench Verified Performance, reported by SWE-Smith (blue) and hypothetical further scaling (gray).

With the best open-weight models approaching 70% on SWE-Bench Verified, we asked: How much of their agentic capability can we distill into smaller, cheaper, and easier-to-deploy models using SFT alone?

Infrastructure for experiments at scale

With a single training run on 100k trajectories consuming roughly 4,500 H100-hours at ~$4 each, and given our intent to run large-scale ablations, we did not want to simply rent a cluster. So we applied for a EuroHPC grant, an EU initiative that provides access to Europe’s largest supercomputers for researchers and startups, including MareNostrum 5, and were awarded 50k hours after just one week.

Constraints of an HPC environment

MN5 offers lots of compute, but it comes with some unique constraints. For historical reasons common in HPC environments, the cluster has no outbound internet access, and the only interface is SSH access to two login nodes. The system is managed by SLURM, and compute jobs run in a highly restricted user mode. This is very different from typical cloud VMs, where you have full system control. In addition, MN5 uses nodes of 4 H100s with 64GB VRAM each instead of the more common nodes of 8 H100s with 80GB each.

This implies:

  • Setting up dependencies, models, and datasets is non-trivial (no outbound internet).
  • Both the agent and the inference engine must run entirely on the cluster.
  • Standard Docker setups are unavailable; only restricted user-mode Podman is allowed.
  • Most existing agent scaffolds assume internet access, use Docker, and do not scale to hundreds of parallel environments.
  • Most existing configurations for hosting and training models are optimized for a larger per-node memory footprint.

To overcome these constraints, we built a custom agent scaffold, forked from mini-swe-agent, that supports OpenHands tooling and scales efficiently under MN5’s constraints. Teacher models are hosted via SGLang, data generation is orchestrated through SLURM submissions, and post-training is done with torchtune. The pipeline supports massively parallel data generation and hundreds of concurrent training runs for systematic scaling studies.

Our Agent Scaffold

OpenHands is currently the most popular open-source ReAct-style agent scaffold, providing basic tools for editing and browsing codebases as well as context condensation. While large proprietary models perform reasonably well with minimal tooling, smaller models with limited context windows (e.g., 32k tokens) struggle without structured editing and condensation.

Our design mirrors OpenHands in both tooling and condensation. The agent has access to four tools: think, execute_bash, str_replace_editor, and submit. When the context limit is reached, older observations are masked until the condensed context fits back into the model’s window while preserving space for reasoning and tool calls. We use XML-style tool calls for simplicity, since Qwen2.5-Coder does not support native tool-calling tokens.
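To make the format concrete, here is a toy example of how such an XML-style call might look in a completion and be parsed; the exact tag names are our illustrative assumption, not the scaffold's published format:

import re

# A model completion containing an XML-style tool call (tag names assumed).
completion = """\
The test imports a module that was renamed; let me confirm.
<tool_call>
  <name>execute_bash</name>
  <args>grep -rn "old_module" src/ | head</args>
</tool_call>
"""

pattern = re.compile(
    r"<tool_call>\s*<name>(?P<name>.*?)</name>\s*<args>(?P<args>.*?)</args>\s*</tool_call>",
    re.DOTALL,
)
m = pattern.search(completion)
if m:
    print(m.group("name"))          # execute_bash
    print(m.group("args").strip())  # grep -rn "old_module" src/ | head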

Due to MN5’s restricted user mode, each agent runs inside a single-UID Podman container, communicating through two interactive Bash sessions. This differs from common execution-server designs, which require privileged container builds. We translate all str_replace_editor calls into equivalent Bash operations (e.g., first reading a file, editing the file on the host side, and writing it back via cat). A separate dedicated long-running Bash session handles all execute_bash commands.
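A minimal sketch of that translation, assuming a container reachable via podman exec (the function name and error handling are illustrative):

import shlex
import subprocess

def str_replace(container: str, path: str, old: str, new: str) -> None:
    """Emulate a str_replace_editor edit with plain Bash in a user-mode
    Podman container: read the file out, edit host-side, write it back
    via cat (a sketch of the approach described above)."""
    read = subprocess.run(
        ["podman", "exec", container, "cat", path],
        capture_output=True, text=True, check=True,
    )
    content = read.stdout
    if content.count(old) != 1:
        raise ValueError("old string must occur exactly once for a safe edit")
    content = content.replace(old, new)
    subprocess.run(
        ["podman", "exec", "-i", container, "sh", "-c", f"cat > {shlex.quote(path)}"],
        input=content, text=True, check=True,
    )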

Generating a Large-Scale Dataset

SWE-Smith created 10k problem statements from which they obtained 5k trajectories after filtering. As we wanted to scale to at least 100k trajectories, we first created problem statements for the remaining 40k instances in the SWE-Smith dataset. Then we had to unroll 250k agent trajectories to be left with 100k after filtering.

Because everything had to run on MN5, we self-hosted our teacher models. Shortly before our project began, Mistral released Devstral-2-Small, a 24B model achieving up to 68% on SWE-Bench Verified with their own agent scaffold. In our offline OpenHands setup, we achieved around 60%, which still provides a strong margin over the ~40% baseline we aimed to surpass. Our ablations also suggested that teacher strength is secondary during early scaling.

Devstral-2-Small fits efficiently on a single 4×H100 node (256 GB VRAM) using SGLang. In agentic workloads, the main bottleneck is the KV cache memory. With up to ~100 turns per trajectory, re-prefilling the same prefix repeatedly severely degrades throughput. A full 32k context occupies ~5.4 GB, and we found ~40 parallel agents per node to be a good trade-off between cache reuse and decode batch size. We further used N-gram speculative decoding, which proved highly effective due to repetitive code patterns.
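The ~5.4 GB figure is consistent with a back-of-the-envelope calculation under an assumed Mistral-Small-class architecture; the numbers below (40 layers, 8 KV heads of dimension 128, bf16) are our assumption, not published model internals:

# Back-of-the-envelope KV-cache footprint (architecture numbers assumed).
layers, kv_heads, head_dim, dtype_bytes = 40, 8, 128, 2      # bf16
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V planes
context = 32_768
print(f"{per_token / 1024:.0f} KiB per token")               # 160 KiB
print(f"{per_token * context / 1e9:.1f} GB at 32k context")  # ~5.4 GB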

Each node can unroll roughly 200–300 trajectories per hour. Sequentially generating 250k trajectories would thus take over a month — so we parallelized aggressively. With ~200 nodes, the entire dataset can be generated in under five hours. Each node operates independently, making job scheduling and dataset partitioning straightforward:

Figure 3. We unroll 250k trajectories in parallel across 250 nodes, each node unrolling multiple trajectories in parallel.

Training with Torchtune

We filtered the 250k trajectories to retain only those that passed the final SWE-Smith tests. Because Devstral-2-Small supports contexts up to 256k tokens while Qwen2.5-Coder is trained on 32k, we segmented long traces into approximations of what the agent would observe under context condensation:

Figure 4. To achieve a better match between training and inference distribution despite a mismatch in context length, we split the full trajectory such that every chunk fits into a 32k token context, with all observations visible in a previous chunk being explicitly masked in consecutive chunks.
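A simplified version of this chunking, with a pluggable token counter and an illustrative message schema, might look like this:

from typing import Callable, Dict, List

Msg = Dict[str, str]  # e.g. {"role": "assistant", "content": "..."}

def chunk_trajectory(
    messages: List[Msg],
    count: Callable[[Msg], int],        # token counter (tokenizer-dependent)
    limit: int = 32_768,
    mask: str = "[observation masked]",
) -> List[List[Msg]]:
    """Split a long teacher trajectory into chunks that each fit a 32k
    context. Observations already visible in an earlier chunk appear only
    in masked form later, mimicking inference-time context condensation.
    Illustrative sketch of the scheme in Figure 4, not our exact code."""
    chunks: List[List[Msg]] = []
    boundary = 0                        # messages fully covered by prior chunks
    while boundary < len(messages):
        # Previously seen observations (tool outputs) are masked.
        prefix = [
            {**m, "content": mask} if m["role"] == "tool" else m
            for m in messages[:boundary]
        ]
        used = sum(count(m) for m in prefix)
        end = boundary
        while end < len(messages) and used + count(messages[end]) <= limit:
            used += count(messages[end])
            end += 1
        chunks.append(prefix + messages[boundary:end])
        boundary = max(end, boundary + 1)   # guarantee forward progress
    return chunks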

We chose torchtune for supervised fine-tuning due to its simplicity, memory efficiency, and FSDP2 support. Each H100 installed in MN5 provides only 64 GB of VRAM, so we trained across four nodes (16 GPUs total) with full sharding of weights, gradients, and optimizer state in bf16. All models used a learning rate of 5e-5 with a cosine schedule. Activation checkpointing and offloading were necessary to support full 32k context training under these memory constraints.

Results

Figure 5. SWE-Bench Verified Pass@1 over the number of training trajectories across distillation methods. Our distillation approach (blue) achieves much faster scaling than prior work. In combination with using a self-hosted Devstral 2.0 as a teacher, this allows us to achieve the same performance at a fraction of the cost. Ours was evaluated on MN5 without internet access.

Interestingly, we observed much more efficient initial scaling compared to SWE-Smith, despite similar teacher performance. However, this quickly saturated, reaching about 40% SWE-Bench Verified resolution rate with only 800 trajectories for the 32B model. From there on, scaling continues at a significantly slower rate after a short plateau.

Figure 6. SWE-Bench Verified Pass@1 over the number of training trajectories across model sizes. All model sizes show similar training dynamics, with quick growth preceding a plateau at around 800 trajectories, followed by a second, slower growth regime that does not saturate at the maximum 100k trajectories we consider. Evaluated on MN5 without internet access.

Interestingly, these dynamics are relatively consistent across model sizes, all realizing quick improvements before plateauing at around 800 trajectories and growing more slowly from ~1600 trajectories onward. The slope of this second stage varies, though, with the 14B model coming surprisingly close to the 32B model given sufficient training data, and the 7B model showing no clear signs of saturation even at 100k trajectories.

We hypothesize that these training dynamics are caused by two different training regimes. In the first regime, the model mostly learns how to use the available tools and agent framework effectively. In the second regime, the model then actually learns how to resolve issues more effectively.

Figure 7. SWE-Bench Verified Pass@k over the number of attempts across model sizes. Pass@k scales log-linearly with the number of attempts k for all models, with our 32B model reaching 75.5% Pass@16. Evaluated on MN5 without internet access.

Analysing how resolution rates change with more attempts, we see a ~15 percentage-point improvement with just 3 attempts, and our 32B model reaching 75.5% Pass@16. This indicates that even these small models can solve most tasks within relatively few attempts but lack the high-level guidance to choose the right approach every time. This is a promising sign for a potential RL post-training stage, as it shows that the models did not suffer mode collapse.
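For reference, Pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); that this exact estimator was used here is our assumption:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples, drawn without replacement from n generations
    of which c are correct, solves the task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. an instance with 12 of 16 attempts correct, evaluated at k=3:
print(round(pass_at_k(16, 12, 3), 3))  # 0.993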

Comparison to Concurrent Work

Concurrently with this work, multiple other groups also scaled SFT for agentic coding, achieving slightly worse results with the same context window and comparable results with larger context windows and better base models: Wang et al. create more issues by translating them across repositories, achieving 52.2% and 22.8% (compared to our 57.1% and 36.4%) on SWE-Bench Verified with their 32B and 7B models, respectively. Tao et al. use a more involved SFT approach, masking incorrect steps and building on the stronger Qwen3 family with a 4x larger 128k context, to achieve 52.6% and 42.2% with their 32B and 8B variants, respectively. Shen et al. introduce soft verification and build on Qwen3 to achieve 49.5% and 31.7% at a 32k context with their 32B and 8B variants, respectively.

Final Thoughts

As we scaled training data 20x compared to SWE-Smith and improved performance by over 15 percentage points on SWE-Bench Verified, we observed the near log-linear scaling described in earlier work quickly saturate, with improvements beyond ~40% becoming super-exponentially harder.

We hope our work helps demystify large-scale agentic coding distillation and encourages more open experimentation in this space. To this end, we release our training and data generation pipeline on GitHub and our models and dataset on Huggingface.

What Comes Next?

If you find yourself wondering: Is masking observations really necessary? Is rejection sampling actually helpful? Are we bottlenecked by environment diversity or trajectory quality? Does unrolling each task multiple times help or hurt? These are exactly the questions we explore in part 2 of this blog post.

Authors: Christian Mürtz & Mark Niklas Müller

A big thank you to Christian Mürtz, who explored this topic during his Master's Thesis at LogicStar, together with our CTO Mark Müller.

This project was built using MareNostrum 5 ACC, one of Europe’s largest operational GPU clusters with 4,480 H100s. All European researchers and startups can apply for 5,000–50,000 H100-hours via EuroHPC AI Factory calls to reproduce, extend, and improve this work. The grant process is fast and straightforward!

SWE-Star: Best-in-Class Agentic Coding Models

We efficiently scale distillation of agentic coding capabilities to train a family of best-in-class coding models.

November 24, 2025

The next decade of software development will be defined by autonomous systems that prevent failures before they reach users. LogicStar is building this future with fully autonomous self-healing capabilities for backend applications. We are opening the LogicStar Visionaries Program, a selective group of engineering teams who want to help shape this new category.

Unlike the LocalStack-specific offer, the Visionaries Program is open to any company, regardless of tooling or cloud setup.

Program benefits
Members receive:

  • Early access to new LogicStar platform features
  • Priority tuning for their codebase
  • Reserved channels for feedback and collaboration
  • Support from the core LogicStar engineering team
  • Optional public recognition on our site and events
  • Preferential discounts on pricing

This program is designed for teams that want to be part of the shift from reactive maintenance to autonomous operations for their commercial products and software. It is not limited by company size, though we expect an engineering team of at least 8 people. What matters is your willingness to collaborate and shape the future.

Who should apply
Teams that:

  • Build backend services in Python, JavaScript, or TypeScript (more to come ... join our waiting list)
  • Want to reduce time spent on maintenance and support
  • Care about reliability and stability at scale
  • Want early access to new autonomous capabilities
  • Are willing to provide structured feedback to the LogicStar engineering team

How to join
Email visionaries@logicstar.ai with:

  • Company name
  • Backend stack (Python, JS, TS)
  • One to two sentences describing your main reliability or maintenance pain points
  • Optional: a link to a representative repository

A LogicStar team member will follow up with next steps.

Why now
Software is changing. Code volume is rising, AI assistance is accelerating development, and production issues are growing as a result. Most incidents already have a signal before they reach customers. LogicStar is built to detect these signals, enrich them, reproduce the issue, and deliver validated fixes without human involvement.

We want to collaborate with teams who see the same future and want to influence how it is built.

Announcing the LogicStar Visionaries Program

LogicStar's Visionaries Program: A selective group shaping the next generation of autonomous software maintenance.

November 24, 2025

The LocalStack community is one of the most active engineering groups pushing the boundaries of cloud development. Many teams rely on LocalStack to move fast, validate changes locally, and keep their services stable. LogicStar now extends an exclusive offer to this community to help teams go even faster while staying reliable.

The Offer
All LocalStack users receive 50% off for twelve months when adopting LogicStar. This applies to any LogicStar plan and covers backend projects built with Python, JavaScript, or TypeScript.

This offer is valid for thirty days from the date of the webinar announcement.

Why this matters
Backend systems are scaling faster than ever. Teams face increasing pressure to deliver new features while keeping production clean, safe, and stable. LogicStar’s autonomous agents run continuously, find problems, reproduce them, validate fixes, and prepare merge-ready pull requests. As a result, teams spend less time firefighting and more time building.

How to claim the offer
Email localstack-offer@logicstar.ai with the following:

  • Your company name
  • Your LocalStack usage details
  • Your backend language (Python, JS, TS)

A LogicStar engineer will onboard your project and activate the discount.

Who is eligible

  • Any team that uses LocalStack
  • Backend applications written in Python, JavaScript, or TypeScript
  • New LogicStar users who join within thirty days of the offer (by Dec 24, 2025)

What you get from LogicStar

  • Continuous autonomous detection of serious backend issues
  • Automatic reproduction inside a secure dedicated sandboxed environment
  • Validated fixes delivered as pull requests
  • Reduced noise and faster recovery
  • A platform that operates before your team even logs in

If your team is growing and you want to keep quality under control while shipping faster, this offer is designed for you.

If you want to be even more involved and be among the first to define and use the latest advances, check out our Visionaries Program.

Fifty Percent Off LogicStar for LocalStack Users

A limited offer for LocalStack users building their backends with Python, JavaScript, or TypeScript.

November 21, 2025

Modern engineering teams face a sharp rise in code written by AI tools, yet the rate of software failures continues to grow. Backlogs expand, incidents slip through, and valuable engineering time is burned on maintenance instead of product development.

LogicStar changes this dynamic with fully validated, autonomous bug fixing.

In this short walk-through, our co-founder Mark Müller shows a real example from our own services. The LogicStar Agent detects a tricky concurrency bug, reproduces it in our sandbox, evaluates candidate fixes, validates the correct one with targeted tests, and ships a merge-ready pull request.

Watch the demo video:
“Interested in Self-Healing Software? Check out this walk-through where I demonstrate how the LogicStar Agent finds, reproduces, and fixes a tricky concurrency issue in our codebase, providing me with a merge-ready and well-tested PR.”

Our team now sees several such fixes every day, all generated before we even open our laptops. As one comment noted:

“It is amazing that we are getting a couple of such bug fixes every day and 95 percent of the work is done before we even log in in the morning.”

The result is simple. Faster recovery, fewer regressions, and engineering teams that can finally focus on shipping meaningful product improvements instead of combing through incidents.

If you want to explore what self-healing applications can unlock for your team, get in touch for a trial.

How LogicStar Autonomously Finds and Fixes A Real Bug in Our Production Code


November 17, 2025

Over the past year, agentic coding tools like Cursor, Claude Code, and Codex have been adopted at remarkable speed. They already account for roughly 20% of public GitHub PRs [1] and teams using them report up to 50% productivity gains [2] in the early phases of adoption. But as review workloads spike and larger, more complex changes land faster than teams can absorb them, code quality begins to slip. The long-term benefits are far less clear.

In this post, we examine why today’s AI-assisted development workflows hit a wall and how Self-Healing Software can break through it.

[1] insights.logicstar.ai

[2] The AI Productivity Paradox Report

Speed at the Cost of Quality

Analysis of the effects of adopting Cursor over time. The number of commits and added lines is significantly increased in the first two months after adoption but falls back to baseline levels afterward. Signs of technical debt (static analysis warnings and code complexity), however, remain high. Reproduced from He, Hao, et al. "Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development."

A recent CMU study [3] analyzing over 800 GitHub repositories that adopted Cursor identified a consistent pattern:

  • 3–5× more code added in the first month
  • ~30% increase in static analysis warnings
  • ~40% increase in code complexity
  • After two months, velocity returned to baseline, while technical debt indicators stayed high

The takeaway is clear: when software can be produced faster than it can be reviewed, tested, and consolidated, quality becomes the limiting factor.

[3] He, Hao, et al. "Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development." arXiv 2025

Why Speed Alone Isn’t Enough

Distribution of the ratio of added and removed lines across GitHub pull requests depending on whether the PR was written by a human or code agent. Humans generally remove and modify more lines compared to agents, which tend to add more new lines. Modified from insights.logicstar.ai.

Agentic coding tools don’t just help developers write code faster; they encourage writing more new code.

Analysing all public PRs on GitHub over the last 6 months, we find that AI-generated PRs tend to add significantly more lines than human-authored ones [1]. This is not just because LLMs generate verbose solutions. It reflects a deeper architectural problem:

  • Understanding and reusing existing code requires a lot of codebase context
  • Code agents can’t persist this context across problems, but have to gather it from scratch every time
  • Generating new code is often easier for the agent than building this context

Effect of AI adoption on developer productivity metrics. While task throughput and PR merge rate increase, the median review time also almost doubled. Reproduced from The AI Productivity Paradox Report.

In parallel, human reviewers now face larger, more complex PRs. Review quality drops, subtle bugs slip through, and duplicated patterns proliferate. The result is predictable: a burst of short-term acceleration followed by a plateau, or even slowdown, as technical debt accumulates, the codebase becomes harder to navigate, and context becomes more difficult to gather [3].

How Self-Healing Applications Close the Loop

To achieve sustained acceleration, it isn’t enough for AI to write new features faster. We need AI that also maintains the ever-growing, ever-more-complex codebase.

This means building systems that can automatically:

  • Detect functional, security, and code-quality issues
  • Generate high-quality fixes
  • Validate these fixes for correctness and side effects

In other words, software must be able to self-heal. As a result, development velocity will not just spike briefly before grinding to a halt, but grow sustainably as features get added while issues get automatically resolved.

How LogicStar AI Fits In

At LogicStar, we build exactly this missing piece, a platform for self-healing applications.

Our platform continuously analyzes applications, identifies real issues, generates candidate fixes, and verifies them using rigorous programmatic reasoning. This enables applications to become increasingly resilient, even as AI agents generate more of the underlying code.

A key advantage of LogicStar’s approach is how we understand the codebase. While most code agents use simple search tools like grep to explore a codebase, LogicStar builds a static-analysis–driven knowledge graph of the entire codebase. This persistent representation captures data flows, control flows, invariants, and component relationships that traditional agents must rediscover from scratch on every run. As a result, LogicStar can reason about bugs and validate fixes with far greater efficiency, depth, and consistency.

By leveraging this understanding to give software the ability to repair itself, we turn AI-driven feature development from a short-lived boost into long-term, compounding productivity.


Author: Mark Niklas Müller

Closing the Agentic Coding Loop with Self-Healing Software

AI coding agents accelerate development but also drive up complexity and technical debt, causing early productivity gains to fade. Self-Healing Software closes this gap by automatically detecting and fixing issues as fast as new code is generated. LogicStar provides this capability, keeping codebases healthy and velocity sustainable.

September 26, 2025

Evaluating coding agents shouldn’t feel like watching paint dry. Yet with SWE-Bench Verified, it often does: hundreds of Docker images totaling 240 GiB, throttled by rate limits*, turn the first setup on a new machine into a 30-hour ordeal. Want to test across a broader, less overfitted, and more representative set of repositories by also using our SWA-Bench and SWEE-Bench or your own environments? Good luck; things only get slower.

So we decided to fix that. By restructuring layers, trimming unnecessary files, and compressing the results, we shrank SWE-Bench Verified from 240 GiB to just 5 GiB. Now it downloads in under a minute, making large-scale evaluation and trace generation on cloud machines fast and painless.

*100 images per 6h as an unauthenticated user, 200 as an authenticated user without Docker Hub Pro

Background

Evaluating SWE-Bench Verified requires 500 containerized environments, one for each issue across twelve repositories. Your options are either to build all of them from scratch (and pray all dependencies were pinned) or to pull the prebuilt images from Docker Hub. Neither choice is great. Building takes hours and can introduce inconsistencies. Pulling requires downloading more than 100 GiB of compressed layers and expanding them into 240 GiB of local storage. Even with a Docker Hub Pro subscription and a fast connection, this process takes anywhere from half an hour to several hours. Without a Pro account, rate limits make it even worse—you can spend 30 hours just waiting for pulls to finish.

The situation becomes truly painful if you want to evaluate more instances at scale on ephemeral cloud machines. Copying 100s of GiB around the world hundreds of times adds up quickly. So we set out to make the environment images light enough to be dropped onto a fresh machine in minutes.

The Layering Problem

At the core of every Docker image lies a stack of layers representing filesystem changes. When a container runs, Docker (via OverlayFS) looks for the topmost layer containing a requested file and reads it from there. The container itself adds a thin writable layer on top: when you modify a file, Docker copies it into this writable layer so changes never affect the underlying image layers.

This design is clever because it makes image storage and distribution efficient. If two images share a base like ubuntu:latest, they can both use the same base layer and only add their own differences on top. However, every file that is modified is fully duplicated.

For SWE-Bench, every image starts with ubuntu:22.04. Then comes one of 63 distinct “environment” layers that set up dependencies, and finally one of 500 "instance" layers, including the repository checkout at the right commit.

The problem is that while the environment layers share many dependencies and repositories change very little between commits, the resulting layers are still different. As a result, full copies are created every time. While every checkout is only a few hundred megabytes, that quickly adds up when multiplied by 500 instances.

In short, the way SWE-Bench (Verified) images are constructed leads to hundreds of near-duplicate layers adding up to 240 GiB.

Fixing the Layering Problem

To resolve this, we introduce a technique we call delta layering. Instead of creating a single layer for every checkout containing a full copy of the repository, we post-process the images so that each instance layer only adds the difference, the delta, relative to the commit before.

The intuition is simple: two snapshots of the same repository taken only a few weeks apart are nearly identical. Yet in the default layering scheme, both snapshots get packaged as full copies; delta layering removes that duplication.

We build chronological chains—one per repository—where each instance builds directly on top of the previous one. The resulting layers become small changes between commits (including potential dependency changes), instead of big, redundant snapshots. Only Django had so many instances that we had to split it into two chains due to Docker’s hard limit of 125 layers per image.

All of these chains share a common base layer that holds the truly universal pieces: Ubuntu 22.04, Conda, and other system-level dependencies.
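To see why the deltas are so small, one can diff chronologically adjacent checkouts; a minimal sketch (the real pipeline post-processes image layers, it does not run git at evaluation time):

import subprocess

def print_chain_deltas(repo_dir: str, commits: list) -> None:
    """Illustrate delta layering: list how few files actually change
    between chronologically adjacent instance checkouts."""
    for prev, curr in zip(commits, commits[1:]):   # commits in chronological order
        changed = subprocess.run(
            ["git", "-C", repo_dir, "diff", "--name-only", prev, curr],
            capture_output=True, text=True, check=True,
        ).stdout.splitlines()
        print(f"{prev[:8]} -> {curr[:8]}: {len(changed)} files changed")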

Could we get the same result by just cloning the chronologically last state of the repo and then checking out the right commit? Unfortunately, no. This would leave future commits in the git history, which can and did get exploited by agents to cheat.

Git History and Packfiles

Delta layering solves much of the duplication problem, but there’s a hidden complication: git history. Each SWE-Bench image includes the full git history of the repository up to the point when the issue was created. In principle, this shouldn’t be a huge deal. Git stores its data as a key–value database of objects: commits, trees, and blobs. Adding a new commit just creates a few new objects: the changed files, changed directories, and the commit object itself. If everything were stored as loose zlib-compressed files in .git/objects, delta layering could simply capture the handful of new objects.

But in practice, git uses packfiles. A packfile bundles thousands of objects into a single large file and applies compression across them. This is great for efficiency, but the problem is that every time a new packfile is generated, that’s an entirely new multi-hundred-megabyte file from Docker's perspective. As a result, all the benefits of delta layering vanish.

To resolve this problem, we restructured the packfiles, creating one per instance, containing all additional git objects. We do lose some of git’s internal compression, but the trade-off is worth it: small, incremental layers instead of massive redundant packfiles.
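Conceptually, this can be done with stock git plumbing; a sketch of the idea (our actual restructuring tooling differs in its details):

import subprocess

def pack_new_objects(repo_dir: str, prev: str, curr: str, pack_prefix: str) -> None:
    """Bundle only the git objects introduced between two adjacent instances
    into their own packfile. `git pack-objects` writes
    `<pack_prefix>-<hash>.pack` plus its index."""
    # List objects reachable from `curr` but not from `prev`...
    objects = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--objects", f"{prev}..{curr}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # ...and pack exactly those objects into a small incremental packfile.
    subprocess.run(
        ["git", "-C", repo_dir, "pack-objects", pack_prefix],
        input=objects, text=True, check=True,
    )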

Removing Build Artifacts

Many of the images contained leftovers from the build process that were never needed at runtime—installers, caches, etc. For example, the Miniconda installer alone added 136 MB to every image. Pip and Conda caches consumed even more. Removing these shaves off gigabytes at essentially no cost.

Final Compression

In addition to making each layer as small as possible, we also apply cross-layer compression. While Docker’s layer model copies the entire file when a single line changes, compression algorithms are very good at spotting such repeated data.

We chose zstd because it’s fast, highly parallel, and supports very large compression windows. To give the compressor the best shot, we sorted the layers by their chronological chain order. That way, nearly identical layers sit next to each other in the input stream. As a result, the entire benchmark, 240 GiB of raw images, now fits into a single 5 GiB archive.

Using 100 cores, the compression process below takes around ten minutes. Decompression, however, is extremely fast—about forty seconds on a single core.

zstd -T100 -19 --long=31 layers.tar

                         Original Size    Our Size
Uncompressed (Podman)    240 GiB          31 GiB
Compressed (Registry)    106 GiB          12.4 GiB
Compressed II (Zstd)     n/a              5.0 GiB

Summary

All told, our optimizations bring SWE-Bench Verified down from 240 GiB of raw layers to just 31 GiB uncompressed — and with the right compression, a single archive of only 5 GiB. That archive is small enough to download and unpack in about five minutes on any modern machine. Best of all, the core of our optimization, delta layering, is not SWE-Bench-specific and can easily be applied to any other series of execution environments. Because Docker and Podman can’t natively load compressed bundles, we’ve provided helper scripts on GitHub. The final archive itself is hosted on Hugging Face, supporting fast downloads.

If all you care about is the quickest way to set up SWE-Bench Verified, here it is:

curl -L -# https://huggingface.co/LogicStar/SWE-Bench-Verified-Compressed/resolve/main/saved.tar.zst?download=true | zstd -d --long=31 --stdout | docker load

What's Next?

Execution environments are not only essential for evaluating code agents but also for training code models. Regardless of whether you do RL or SFT, generating high-quality training data requires diverse agent traces, which in turn require a large number of execution environments. These are environments we can now efficiently store and distribute to a large fleet of ephemeral machines to generate traces at scale…

Stay tuned to learn more about what comes next.

Authors: Christian Mürtz & Mark Niklas Müller

How We Made SWE-Bench 50x Smaller

We optimized the OCI layer structure of code execution environments to improve storage and distribution at scale

September 10, 2025

In a series of posts, we will outline some of the core technologies behind LogicStar.

At LogicStar AI, we are building the platform for self-healing software applications, leveraging agentic systems to autonomously identify, reproduce, and fix bugs. This requires rigorous testing and thorough validation of every application behavior to avoid introducing new issues or wasting reviewer time. Therefore, test generation is an area of key importance at LogicStar.

Our vision is to deliver substantial value for commercial applications rather than flashy AI demos; we design LogicStar to avoid wasting developer time on reviewing partial or almost-correct pull requests.

To drive innovation in test generation, we have developed and open-sourced SWT-Bench, also published at NeurIPS 2024. While the popular SWE-Bench requires code agents to fix given issues, SWT-Bench tests their ability to generate effective tests. This allows us to develop agents that excel at test generation. Within LogicStar, we orchestrate these test and code generation agents, which collaboratively produce well-tested patches for every bug we address.

This system allows our agents to score 84% on SWT-Bench, beating the previous state-of-the-art of 75.8%, held by the OpenHands team. We achieve this performance by combining multiple agents and models, iteratively refining both code and tests. The seamless orchestration of these agents heavily relies on our proprietary technology, including advanced static analysis tools used directly by our agents. As our agents do not rely on Internet access, there is no risk of leaking your source code, secrets, or your customers' data. Instead, our agents leverage advanced code search capabilities, iterative feedback driven by code execution with coverage metrics, and static analysis tools developed by LogicStar for building codebase understanding.
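The coverage-driven refinement loop can be sketched as follows; the callback names and feedback format are illustrative assumptions, not our production interfaces:

from typing import Callable, Optional, Tuple

def refine_test(
    propose: Callable[[str, Optional[str]], str],  # agent: (issue, feedback) -> test code
    execute: Callable[[str], Tuple[bool, str]],    # sandbox: test -> (reproduces_bug, coverage)
    issue: str,
    max_rounds: int = 5,
) -> Optional[str]:
    """Coverage-guided test refinement in miniature: keep improving a
    candidate test, feeding execution results and coverage back to the
    model, until it reproduces the reported issue."""
    feedback = None
    for _ in range(max_rounds):
        test = propose(issue, feedback)
        reproduces, coverage = execute(test)
        if reproduces:
            return test          # an effective fail-to-pass test
        feedback = f"Test did not reproduce the issue. Coverage:\n{coverage}"
    return None                  # abstain rather than ship a weak test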

We are rolling out our latest agent advancements with selected design partners who share our vision for self-healing applications and are helping us shape the future of this technology. Their collaboration ensures that our research delivers immediate value for commercial software. If you also believe in this direction and work with Python, JavaScript, or TypeScript repositories, we invite you to sign up here. We will support you through onboarding and ensure full SOC2 compliance.

SWT-Bench Verified – Best Test Generation at 84%

The L* Agent achieves a new state-of-the-art of 84% on SWT-Bench Verified

May 22, 2025

At LogicStar AI, trust, security, and operational excellence are foundational to how we build and deliver our autonomous software maintenance platform.

We’re proud to share that LogicStar has successfully completed a SOC 2 audit conducted by an independent third-party firm, validating the design and implementation of our security controls in alignment with the AICPA Trust Services Criteria.

This achievement reflects our commitment to safeguarding customer data and building secure systems from day one.

Importantly, we’ve also implemented continuous monitoring processes that ensure our controls remain active and effective — not just at a point in time, but throughout our operations.

SOC 2 compliance is one step in our broader mission to build infrastructure our customers and partners can rely on with confidence.

If you’re a customer or vendor and would like to receive a copy of our SOC 2 audit report, please reach out to: info@logicstar.ai

March 3, 2025

We’re excited to announce that LogicStar AI has officially joined the ETH Zurich AI Center as an affiliate member🎉! This partnership is a significant milestone for us, especially since many of our team members have deep roots in AI research at ETH.

As a pioneering AI company focused on autonomous software maintenance, this collaboration strengthens our commitment to advancing AI research and innovation. Partnering with one of the world’s leading AI research hubs at ETH Zurich will accelerate our efforts in building cutting-edge AI agents that autonomously detect, reproduce, and fix software bugs, transforming the way engineers maintain commercial applications. At LogicStar, we harness AI alongside classical computer science to autonomously maintain commercial applications, enabling faster resolution of issues and freeing engineering teams to focus on innovation by automating tedious maintenance tasks.

What This Means for LogicStar AI & Our Community:
✅ Access to World-Class Research & Talent - Collaborating with ETH Zurich’s AI experts, faculty, and students to push the boundaries of AI-powered software development.
✅ Advancing AI Reliability & Explainability - Working alongside top researchers to refine AI verification and validation techniques, ensuring robust and trustworthy autonomous coding agents.
✅ Stronger AI Ecosystem - Engaging with startups, industry leaders, and academia to shape the future of self-healing software and AI-driven code maintenance.

This partnership marks a major milestone in our mission to revolutionize software reliability. We’re excited about the journey ahead and look forward to working with ETH Zurich’s brilliant minds to make AI a seamless, dependable partner for engineering teams.

As AI is central to our mission, we are at the forefront of AI research and innovation. Being affiliated with the ETH Zurich AI Center allows us to do just that. We’re excited to collaborate as we advance the field of agentic AI for application maintenance together!

ETH AI Center Affiliation

LogicStar AI Joins the ETH AI Center as an Affiliate! 🚀

February 24, 2025

LLM-generated applications are here. Some well-known tools now offer to turn anyone into an app developer, while others aim to make current developers more productive. Given the concerns about the security, support, and future development of these quickly made apps, we wanted to measure this with a benchmark. Our focus wasn’t on any specific tools but on the LLM models that power them. We present BaxBench, a benchmark of 392 instances: 28 scenarios for LLMs to implement using 14 different frameworks across 6 languages, such as Django, ExpressJS, Flask, Ruby on Rails, and others. Together with the team at ETH Zurich, we have created baxbench.com for the leaderboard and more benchmark details.

We conducted an analysis of backend applications generated by LLMs with a specific focus on assessing their vulnerability exposure to real security exploits. Backends are integral to many systems and vary in complexity; some are designed to manage the entire state of an application, while others are constructed by integrating multiple specialized services, known as microservices. Applications rely on one or more security-critical backends to perform tasks such as handling logins, managing application states, and storing user data. To evaluate these systems, we developed BaxBench, which consists of small and frequently seen tasks for application backends. These backends are granted access to a database for storage, and LLMs are tasked with generating their logic based on predefined OpenAPI specifications. Our findings revealed that many of these backends were not secure, as we were able to execute actual attacks against them. This goes beyond mere analysis or tool-generated warnings about hypothetical security issues: we successfully executed real exploits, including SQL injection, path traversal, and user impersonation. It is crucial to emphasize that our specifications did not suggest any vulnerabilities; the vulnerabilities we exploited arose from the outputs generated by the LLMs.

One interesting point is that we can ask LLMs to fix these vulnerabilities, and they manage to solve many of them. They do best when we tell them exactly what we will be trying to exploit. However, even then, not every vulnerability goes away, and there is a trade-off: when security issues are fixed, we measure that some apps stop working properly. This creates a big opportunity for tools like the ones we are building at LogicStar, which can both identify and fix these security issues. And of course, the benchmark is open source, so security and application development experts can help us add more scenarios or new ways to exploit vulnerabilities. In addition to this, we expect that LLMs will also get better thanks to benchmarks like BaxBench.

Looking deeper, it’s clear that correctness and security aren’t the only challenges. LLMs also struggle to create reliable code in different backend frameworks, especially those that aren’t the most popular ones. Engineers see firsthand that LLMs can have trouble with complex and varied tasks, which means the results aren’t always perfect. Sometimes, even the best tools get stuck and can’t improve an app by just using more LLM prompts. However, to make progress, you have to start by measuring the problem. With BaxBench, we looked at security; going forward, at LogicStar we are focusing on checking and improving how well the models can understand existing apps, fix real problems in supporting them, and ultimately make their end users, and yours, happier.

For more work on maintaining software, you can follow our research, blogs and social media. If you’re running an app with Python backends, we’d love to talk about how our early access product can help maintain that app by fixing bugs.

Need more information? Have a look at the paper and the baxbench.com website.

Please file issues or contribute to the benchmark code.

Introducing BaxBench

BaxBench: Can LLMs Generate Secure and Correct Backends?

February 4, 2025

On February 4, 2025, TechCrunch’s senior reporter Natasha Lomas wrote this article about LogicStar.

The text of the article is quoted below: “Swiss startup LogicStar is bent on joining the AI agent game. The summer 2024-founded startup has bagged $3 million in pre-seed funding to bring tools to the developer market that can do autonomous maintenance of software applications, rather than the more typical AI agent use-case of code co-development.

LogicStar CEO and co-founder Boris Paskalev (pictured top right, in the feature image, with his fellow co-founders) suggests the startup’s AI agents could end up partnering with code development agents - such as, say, the likes of Cognition Labs’ Devin - in a business win-win.

Code fidelity is an issue for AI agents building and deploying software, just as it is for human developers, and LogicStar wants to do its bit to grease the development wheel by automatically picking up and fixing bugs wherever they may crop up in deployed code.

As it stands, Paskalev suggests that “even the best models and agents” out there are unable to resolve the majority of bugs they’re presented with - hence the team spying an opportunity for an AI startup that’s dedicated to improving these odds and delivering on the dream of less tedious app maintenance.

To this end, they are building atop large language models (LLMs) - such as OpenAI’s GPT or even China’s DeepSeek - taking a model-agnostic approach for their platform. This allows LogicStar to dip into different LLMs and maximize its AI agents’ utility, based on which foundational model works best for resolving a particular code issue.

Paskalev contends that the founding team has the technical and domain-specific knowledge to build a platform that can resolve programming problems which can challenge or outfox LLMs working alone. They also have past entrepreneurial success to point to: he sold his prior code review startup, DeepCode, to cybersecurity giant Snyk back in September 2020.

“In the beginning we were thinking about actually building a large language model for code,” he told TechCrunch. “Then we realized that that will quickly become a commodity… Now we’re building assuming all those large language models are there. Assuming there’s some actually decent [AI] agents for code, how do we extract the maximum business value from them?”

He said that the idea built on the team’s understanding of how to analyze software applications. “Combine that with large language models - then focus into grounding and verifying what those large language models and the AI agent actually suggest.”

Test-driven development

What does that mean in practice? Paskalev says LogicStar performs an analysis of each application that its tech is deployed on - using “classical computer science methods” - in order to build a “knowledge base”. This gives its AI agent a comprehensive map of the software’s inputs and outputs; how variables link to functions; and any other linkages and dependencies etc.

Then, for every bug it’s presented with, the AI agent is able to determine which parts of the application are impacted - allowing LogicStar to narrow down the functions needing to be simulated in order to test scores of potential fixes.

Per Paskalev, this “minimized execution environment” allows the AI agent to run “thousands” of tests aimed at reproducing bugs to identify a “failing test”, and - through this “test-driven development” approach - ultimately land on a fix that sticks.

He confirms that the actual bug fixes are sourced from the LLMs. But because LogicStar’s platform enables this “very fast executive environment” its AI agents can work at scale to separate the wheat from the chaff, as it were, and serve its users with a shortcut to the best that LLMs can offer.

“What we see is [LLMs are] great for prototyping, testing things, etc, but it’s absolutely not great for [code] production, commercial applications. I think we’re far from there, and this is what our platform delivers,” he argued. “To be able to extract those capabilities of the models today, we can actually safely extract commercial value and actually save time for developers to really focus on the important stuff.”

Enterprises are set to be LogicStar’s initial target. Its “silicon agents” are intended to be put to work alongside corporate dev teams, albeit at a fraction of the salary required to hire a human developer, handling a range of app upkeep tasks and freeing up engineering talent for more creative and/or challenging work. (Or, well, at least until LLMs and AI agents get a lot more capable.)

While the startup’s pitch touts a “fully autonomous” app maintenance capability, Paskalev confirms that the platform will allow human developers to review (and otherwise oversee) the fixes its AI agents call up. So trust can be - and must be - earned first.

“The accuracy that a human developer delivers ranges between 80 to 90%. Our goal [for our AI agents] is to be exactly there,” he adds.

It’s still early days for LogicStar: an alpha version of its technology is in testing with a number of undisclosed companies which Paskalev refers to as “design partners”. Currently the tech only supports Python - but expansions to Typescript, Javascript and Java are billed as “coming soon”.

“The main goal [with the pre-seed funding] is to actually show the technology works with our design partners - focusing on Python,” adds Paskalev. “We already spent a year on it, and we have lots of opportunity to actually expand. And that’s why we’re trying to focus it first, to show the value in one case.”

The startup’s pre-seed raise was led by European VC firm Northzone, with angel investors from DeepMind, Fleet, Sequoia scouts, Snyk and Spotify also joining the round.

In a statement, Michiel Kotting, partner at Northzone, said: “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense, and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

LogicStar is operating a waiting list for potential customers wanting to express interest in getting early access. It told us a beta release is planned for later this year.”

TechCrunch Article About LogicStar

A TechCrunch article about us, titled “LogicStar is building AI agents for app maintenance”

December 18, 2024

SWT-Bench: Benchmarking CodeAgents’ Test Generation Capabilities

As the complexity of modern software systems grows, so does the challenge of ensuring their reliability. To this end, rigorous testing plays a critical role in maintaining high software quality. However, while the rise of Large Language Models (LLMs) has catalyzed advancements in code generation, their potential in test automation remains underexplored. Enter SWT-Bench, a novel benchmark for test generation based on real-world GitHub issues, developed in collaboration with ETH Zurich. With the release of a public leaderboard at swtbench.com, we aim to spark a similar push from the research community on test generation as SWE-Bench caused for code generation.

What is SWT-Bench?

SWT-Bench is a test generation benchmark based on real-world GitHub issues. The objective is to generate a test reproducing the described issue, given the full codebase. We determine whether a test reproduces an issue by checking whether it fails on the original codebase but passes after a human-written ground-truth fix, taken from the corresponding pull request (PR), has been applied. We call this the success rate \mathcal{S}. Additionally, we measure the coverage \Delta \mathcal{C} of the lines modified in this ground-truth bug fix to further assess test quality.
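
To make this success criterion concrete, here is a minimal Python sketch of the fail-to-pass check. The helper names and the plain pytest/git invocations are our own illustrative assumptions; the actual SWT-Bench harness is more involved.

```python
import subprocess

def run_test(repo_dir: str, test_id: str) -> bool:
    """Run a single test with pytest; return True iff it passes."""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", test_id],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

def apply_patch(repo_dir: str, patch_file: str) -> None:
    """Apply the human-written ground-truth fix taken from the PR."""
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

def reproduces_issue(repo_dir: str, test_id: str, gold_patch: str) -> bool:
    """Success criterion S: the generated test must fail on the
    original codebase and pass once the ground-truth fix is applied."""
    fails_before = not run_test(repo_dir, test_id)
    apply_patch(repo_dir, gold_patch)
    passes_after = run_test(repo_dir, test_id)
    return fails_before and passes_after
```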

How did we create SWT-Bench?

Starting with over 90,000 PRs from 12 popular GitHub repositories, we applied rigorous filtering to obtain 1,900 diverse and high-quality instances. SWT-Bench thus reflects the complexity of modern software ecosystems, challenging AI systems to navigate large codebases (up to 700k lines), interpret nuanced issue descriptions (320 words on average), and integrate tests into diverse existing test suites and frameworks (from pytest to tox to custom frameworks).

First Results

Performance of Code Agents

We found that Code Agents originally designed for program repair (e.g. SWE-Agent) perform well on test-generation tasks, even outperforming dedicated test-generation methods (LIBRO). However, even minimal modifications, such as explicitly instructing the agent to execute the generated tests (SWE-Agent+), improve performance significantly further. This highlights the potential of dedicated Code Agents for test generation.

A new Patch Format for Test Generation

Based on the insight that test generation is typically solved by adding a new test function or class, we propose a novel patch format tailored for fault tolerance and simplicity. This format alone allows vanilla LLMs to generate executable tests in twice as many cases (ZeroShot vs. ZeroShotPlus), leading to almost three times as many solved instances.
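
As a concrete illustration, a patch in this spirit names only the target file and supplies the complete new test function, so the model never has to produce the brittle line numbers and context hunks of a unified diff. The sketch below is our own simplification with made-up file and function names; the exact syntax in the SWT-Bench paper differs.

```
diff
tests/test_parser.py
insert
def test_issue_empty_input():
    # Reproduces the reported crash: parse("") raised an
    # IndexError instead of returning an empty result.
    from mypkg.parser import parse
    assert parse("") == []
end diff
```

Because the patch carries one self-contained function, a single off-by-one error no longer invalidates it: the harness only needs to append the function to the named file.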

Utility of Generated Tests

Automatically generating high-quality tests not only frees developers to focus on (test-driven) development that generates real business value, but can also boost the performance of code generation agents. In particular, generated tests can guide agents along the whole generation process, from informing context localization to validating bug fixes. Early results show that simply using generated tests to filter proposed bug fixes can more than double the achieved precision.
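
A minimal sketch of this filtering step, reusing the illustrative run_test helper from the earlier example (again, hypothetical names rather than our production pipeline):

```python
import subprocess

def git_apply(repo_dir: str, patch_file: str, revert: bool = False) -> None:
    """Apply or revert a candidate patch with git (illustrative helper)."""
    cmd = ["git", "apply"] + (["-R"] if revert else []) + [patch_file]
    subprocess.run(cmd, cwd=repo_dir, check=True)

def filter_patches(repo_dir: str, patches: list[str], test_id: str) -> list[str]:
    """Keep only the candidate fixes under which the generated
    reproducing test passes; all other patches are discarded."""
    validated = []
    for patch in patches:
        git_apply(repo_dir, patch)
        if run_test(repo_dir, test_id):  # run_test as in the sketch above
            validated.append(patch)
        git_apply(repo_dir, patch, revert=True)  # restore the original code
    return validated
```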

Correlation of Test and Fix Generation

While Code Agents that perform well on code generation also tend to perform well on test generation, we interestingly do not see such a correlation for individual issues. That is, an issue that is easy to fix is not necessarily easy to test, and vice versa. Indeed, we see no statistically significant correlation between the resolution rates of the two tasks, highlighting the unique challenges of test generation.

Implications for the Future of Software Maintenance

SWT-Bench demonstrates the capability of LLMs to interpret and formalize the intent of natural-language issue descriptions into tests. In the long run, this has the potential to significantly improve software quality by making thorough testing attainable without significant manual effort. As a next step, it can even enable self-healing systems that automatically detect, reproduce, and resolve issues in real time, as they appear, minimizing downtime and increasing reliability.

We at LogicStar AI believe that reliable automated testing is key to unlocking the real potential of Code Agents and will be essential to pushing the frontier in automated application maintenance. We are therefore especially excited to see the community’s great interest in SWT-Bench and hope that our public leaderboard makes it even more accessible.

For more details, check out our NeurIPS paper (https://arxiv.org/pdf/2406.12952) or our open-source code (https://github.com/logic-star-ai/swt-bench).

Introducing the SWT-Bench Leaderboard!

SWT-Bench: Benchmarking CodeAgents’ Test Generation Capabilities

December 5, 2024

Researchers and entrepreneurs from INSAIT and ETH Zurich have launched LogicStar AI, a new deep-tech startup, which is building fully autonomous agentic AI that helps teams maintain their software.

Founding Team
The founding team behind LogicStar AI is star-studded and includes the founders of DeepCode.ai (now Snyk Code), which currently delivers more than $100M ARR for Snyk. The LogicStar AI founders are:

🌟 Boris Paskalev (CEO), formerly CEO of DeepCode, then Director of Product AI at Snyk, and currently a Strategic Entrepreneurship Advisor at INSAIT.
🌟 Dr. Mark Niklas Müller (CTO), AI PhD from ETH Zurich, formerly an engineer at Porsche and Mercedes AMG Petronas F1 Team.
🌟 Dr. Veselin Raychev (Chief Architect), formerly CTO of DeepCode, then Head of AI at Snyk and researcher at INSAIT.
🌟 Prof. Martin Vechev (Adviser), full professor at ETH Zurich, founder and scientific director of INSAIT.

LogicStar AI Mission
LogicStar AI is building an agentic AI platform for automatically validating, reproducing, and fixing bugs with high precision. Their technology empowers engineering teams to focus on creating new features, driving growth and innovation, by reducing the burden of maintenance and debugging. With LogicStar AI, developers can thus spend more of their time delivering real business value, while LogicStar AI reliably addresses software maintenance problems without manual intervention.

INSAIT’s mission
INSAIT (https://insait.ai) is a world-class computer science and AI research institution, founded in 2022 in partnership with Switzerland’s ETH Zurich and EPFL. The focus of INSAIT is on conducting world-class research and attracting outstanding faculty, research scientists, postdocs, and PhD students. In the short time since its inception, INSAIT has published over 50 papers in all major AI venues, as well as in premier theory conferences.

Join the Journey
As LogicStar AI embarks on this exciting new chapter, we invite talented individuals to join our team and shape the future of reliable AI for code and software applications. For more information on career opportunities, partnerships, or to learn about our innovative solutions, please visit our website and follow LogicStar on LinkedIn. LogicStar AI has offices in both Sofia, Bulgaria and Zurich, Switzerland.

Agentic AI from INSAIT and ETH Zurich

INSAIT and ETH Zurich entrepreneurs launch LogicStar AI, a new agentic AI startup

October 17, 2024

We are excited to share SWT-Bench, the first benchmark for reproducing bugs and validating their fixes based on GitHub issue descriptions. We presented SWT-Bench at two ICML workshops and want to thank everyone who stopped by for their interest, enthusiasm, and the great discussions we had. We now see a community trend to focus not only on fixing bugs but also on generating tests that can effectively reproduce them and validate that proposed fixes truly resolve the issues. We believe this is essential for achieving truly autonomous bug fixing, which is what LogicStar delivers.

In our paper, we demonstrate how any code repair benchmark with a known ground truth solution can be transformed into a test generation and issue reproduction benchmark. There, the goal is to create a “reproducing test” that fails on the original codebase and passes after the ground truth fix has been applied. Our analysis shows that Code Agents excel in this task and outperform dedicated LLM-based test generation methods. Leveraging these tests for code repair further allows us to significantly enhance precision. To learn more, please check out our preprint paper.

LogicStar AI builds on this research to deliver truly autonomous bug fixing that you can trust as you trust your top engineers.

SWT-Bench

A Benchmark for Testing and Validating Bugfixes

July 1, 2024

Zurich, Switzerland - 4th February, 2025 - LogicStar, the AI agent for fully autonomous software maintenance, has raised $3M (CHF 2.6M) in a pre-seed funding round led by Northzone, with angel investors from DeepMind, Snyk, Spotify, Fleet and Sequoia scouts. LogicStar empowers engineering teams to focus on innovation by automating tedious application maintenance tasks.

LogicStar is revolutionising software maintenance with its autonomous code agent, designed to deliver self-healing software applications that empower engineers to focus on innovation and growth. LogicStar works seamlessly alongside human developers, autonomously reproducing application issues, testing solutions and proposing precise fixes without the need for human oversight. The world’s rapid adoption of AI has sparked a wave of pivotal trends that are reshaping industries and workflows. Organisations relying on custom software spend considerable resources and time on maintenance and bug fixes, which divert developers from innovation. AI coding agents, despite performing well on benchmarks and simple tasks, tend to introduce errors in complex settings, leaving teams stuck with tedious maintenance tasks.

The founding team consists of Boris Paskalev, Veselin Raychev, Mark Müller, and Prof. Dr. Martin Vechev. Boris, Veselin, and Martin previously built DeepCode.ai (acquired by Snyk and now called Snyk Code) and scaled it to over $100M ARR: a technology trusted by millions of developers. Martin also leads the Secure, Reliable, and Intelligent Systems (SRI) lab at ETH Zurich and is the Founder and Scientific Director of INSAIT. The unique technology behind LogicStar draws from the team’s deep research background and expertise from ETH Zurich, MIT, TRIUM, and INSAIT, resulting in over 20,000 citations and 350 top publications in AI and program analysis, particularly in large codebases and software development.

LogicStar has already released SWT-Bench to support the development of code agents and demonstrated that existing code agents are not up to the challenge of enterprise code bases, failing on >95% of issues. Using an advanced mock execution environment, LogicStar swiftly runs generated tests to reproduce issues and confirm solutions - spotting errors before you’re aware of them. At the core of LogicStar’s technology lies a blend of the latest advancements in LLMs for code and classical computer science techniques. The platform is rapidly evolving, with Python support already available and expansions to Typescript, Javascript, and Java coming soon. Technology leaders managing commercial software systems are invited to join the waiting list to experience the benefits of LogicStar firsthand.

Boris Paskalev comments, “I am excited that developers can focus on innovation and creativity while automation handles the burden of application maintenance. Our platform eliminates the need to oversee the current generation of agents and LLMs in maintaining commercial software, providing an evolving solution that seamlessly grows alongside LLM advancements and maximises successful task completion.”

Michiel Kotting, partner at Northzone, adds, “AI-driven code generation is still in its early stages, but the productivity gains we’re already seeing are revolutionary. The potential for this technology to streamline development processes, reduce costs, and accelerate innovation is immense, and the team’s vast technical expertise and proven track record position them to deliver real, impactful results. The future of software development is being reshaped, and LogicStar will play a crucial role in software maintenance.”

About LogicStar
LogicStar is the AI agent for fully autonomous application maintenance. Founded in 2024, the company is headquartered in Zurich, Switzerland and is backed by global venture firm Northzone, as well as angels from DeepMind, Snyk, Spotify, Fleet, and Sequoia Scouts.

About Northzone
Northzone (northzone.com) is a global venture capital fund built on experience spanning multiple economic and disruptive technology cycles. Founded in 1996, Northzone has raised more than ten funds to date, with its most recent fundraise in excess of $1.2 billion, and has invested in more than 175 companies, including category-defining businesses such as Trustpilot, Spotify, Klarna, iZettle, Kahoot!, Personio, TrueLayer, and Spring Health, amongst others. Northzone is a full-stack investor from Seed to Growth stage, with transatlantic hubs in London, New York, Amsterdam, Berlin, Stockholm and Oslo.

LogicStar AI raised a $3M round led by Northzone

LogicStar, building the AI agent for fully autonomous application maintenance, raised a $3M round led by Northzone.

July 1, 2024

LogicStar AI is looking for passionate software engineers to join our team. We are a team of researchers, engineers, and product people focused on cutting-edge research and bringing it quickly to product. If you are interested in working with us, please review the openings on our careers page and send your resume to jobs@logicstar.ai.

Jobs

We are looking for passionate software engineers to join our team

April 11, 2024

🚀 We are thrilled to introduce LogicStar, a pioneering deep-tech startup based in Switzerland, revolutionizing application monitoring and maintenance.

Our cutting-edge platform blends AI with proven computer science methodologies to create application mocks tailored to agentic AI. These mocks reproduce software bugs in an AI-powered mock execution environment, enabling scalable evaluation and verification of AI-driven fix suggestions.

Our exceptional team comprises experts and top researchers from ETH Zurich, INSAIT, and MIT, alongside seasoned serial entrepreneurs, united by a shared mission to redefine the future of software reliability.

✨ Join us on this transformative journey as we push the boundaries of application monitoring and maintenance with groundbreaking innovation.

Introducing LogicStar

We are excited to announce the launch of LogicStar AI, our startup to revolutionize application monitoring.

Stop Drowning in Bugs. Start Shipping Features Faster.

Join the beta and let LogicStar AI clear your backlog while your team stays focused on what matters.

No workflow changes and no risky AI guesses. Only validated fixes you can trust.

Screenshot of LogicStar generating production-ready pull requests with 100 percent test coverage, static analysis, and regression validation