Published Research

2026

Test Flimsiness: Characterizing Flakiness Induced by Mutation to the Code Under Test

Parry, Kapfhammer, Hilton, McMinn · ICSE

About

Flaky tests, those that fail non-deterministically, are a major hurdle in software engineering. This paper introduces test FLIMsiness: flakiness induced specifically by mutations to the code under test. In a study of 28 Python projects, flimsiness appeared in 54% of cases. Mutating the code under test uncovered significantly more flaky tests than standard rerunning strategies (a median of 740 vs. 163), suggesting that mutation is a powerful, overlooked tool for exposing hidden non-determinism.
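To make the idea concrete, here is a minimal Python sketch of a test that is stable against the original code but becomes flaky once the code under test is mutated. It is an illustration only, not the paper's tooling: the function `retry_delay`, its mutant, the helper `outcomes`, and the use of `run % 3` as a stand-in for uncontrolled environmental noise are all hypothetical.

```python
def retry_delay(attempt, jitter, cap=8):
    # Original code: the cap hides the jitter once 2**attempt reaches it.
    return min(2 ** attempt + jitter, cap)


def retry_delay_mutant(attempt, jitter, cap=8):
    # Hypothetical mutant: `min` replaced by `max`, so jitter leaks through.
    return max(2 ** attempt + jitter, cap)


def make_test(func):
    # The same test body is run against the original and the mutant.
    def test(jitter):
        assert func(3, jitter) == 8
    return test


def outcomes(test, runs=9):
    # Rerun the test and collect the distinct outcomes that were observed.
    seen = set()
    for run in range(runs):
        jitter = run % 3  # assumption: stands in for environmental noise
        try:
            test(jitter)
            seen.add("pass")
        except AssertionError:
            seen.add("fail")
    return seen


original_outcomes = outcomes(make_test(retry_delay))
mutant_outcomes = outcomes(make_test(retry_delay_mutant))
```

Against the original code the test only ever passes, so rerunning alone would never flag it. Against the mutant it both passes and fails across reruns, which is the kind of mutation-induced flakiness described above.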

2025

Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures

Parry, Kapfhammer, Hilton, McMinn · EASE

About

This paper introduces the concept of systemic flakiness—the phenomenon where flaky tests fail in clusters due to shared root causes rather than in isolation. By analyzing 24 Java projects, the researchers found that 75% of flaky tests belong to a cluster, with an average of 13.5 tests per group. This challenges the long-held assumption that flaky tests are independent events. The study identifies networking issues and unstable external dependencies as the primary drivers of these clusters. To help developers save time and costs, the paper proposes a machine learning approach using static test distance measures to identify these clusters efficiently, allowing teams to fix multiple flaky tests simultaneously by addressing a single root cause.
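As a rough illustration of what a cluster of co-occurring flaky failures looks like, the Python sketch below groups tests whose failures tend to land in the same runs. It is not the paper's approach, which predicts clusters from static test distance measures; this sketch assumes a failure history is already available and uses Jaccard distance with a hypothetical threshold `max_distance`. The test names and run ids are invented.

```python
from itertools import combinations


def jaccard_distance(a, b):
    # 0.0 means the two tests failed in exactly the same runs.
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_cooccurrence(failures, max_distance=0.5):
    """Group tests whose failing runs overlap.

    failures maps a test name to the set of run ids in which it failed.
    """
    parent = {test: test for test in failures}

    def find(test):
        # Union-find lookup with path compression.
        while parent[test] != test:
            parent[test] = parent[parent[test]]
            test = parent[test]
        return test

    for a, b in combinations(failures, 2):
        if jaccard_distance(failures[a], failures[b]) <= max_distance:
            parent[find(a)] = find(b)

    clusters = {}
    for test in failures:
        clusters.setdefault(find(test), set()).add(test)
    # Largest clusters first, names sorted within each cluster.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Invented failure history: run ids in which each flaky test failed.
failures = {
    "test_login": {1, 4, 7},
    "test_logout": {1, 4, 7},
    "test_fetch_profile": {1, 4},
    "test_parse_date": {3},
}
clusters = cluster_by_cooccurrence(failures)
```

Here the three tests that fail together in runs 1 and 4 end up in one cluster, while `test_parse_date` stays on its own. A cluster like this suggests a shared root cause worth investigating once rather than test by test.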

2024

Do Automatic Test Generation Tools Generate Flaky Tests?

Gruber et al. · ICSE

About

While most research focuses on developer-written tests, this paper investigates flakiness in tests produced by automatic test generation tools (EvoSuite and Pynguin). Analyzing 6,356 projects, the study finds that generated tests are as flaky as developer-written ones but fail for different reasons, such as internal randomness and runtime optimizations. The authors demonstrate that suppression mechanisms can reduce this flakiness by 71.7%.