Published Research
Test Flimsiness: Characterizing Flakiness Induced by Mutation to the Code Under Test
About
Flaky tests, which pass or fail non-deterministically on the same code, are a major hurdle in software engineering. This paper introduces test flimsiness: flakiness induced specifically by mutations to the code under test. A study of 28 Python projects reveals that flimsiness arises in 54% of cases and that mutating the code under test uncovers significantly more flaky tests than standard rerunning strategies (a median of 740 vs. 163), suggesting that mutation is a powerful, overlooked tool for exposing hidden non-determinism.
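To make the idea concrete, here is a minimal sketch of the detection loop, assuming a pytest-based project; the mutation operator, paths, and rerun count are illustrative stand-ins rather than the paper's actual tooling.

```python
import ast
import subprocess

class FlipComparison(ast.NodeTransformer):
    """A classic mutation operator: weaken '<' to '<=' in the code under test."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node

def mutate_file(path):
    """Overwrite a source file with its mutated form (keep a backup first)."""
    with open(path) as f:
        tree = ast.parse(f.read())
    with open(path, "w") as f:
        f.write(ast.unparse(FlipComparison().visit(tree)))

def outcomes(test_id, runs=10):
    """Rerun one test against the mutant; collect whether it passed each time."""
    return {subprocess.run(["pytest", test_id, "-q"],
                           capture_output=True).returncode == 0
            for _ in range(runs)}

# A test is "flimsy" if its outcome varies across reruns of the *same* mutant:
# mutate_file("pkg/module.py")
# flimsy = len(outcomes("tests/test_module.py::test_x")) > 1
```

The key difference from plain rerunning is that the repeated executions happen against mutated code, which can surface non-determinism the original code never triggers.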
Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures
About
This paper introduces the concept of systemic flakiness: the phenomenon where flaky tests fail in clusters due to shared root causes rather than in isolation. Analyzing 24 Java projects, the researchers found that 75% of flaky tests belong to a cluster, with an average of 13.5 tests per cluster, challenging the long-held assumption that flaky tests fail independently. The study identifies networking issues and unstable external dependencies as the primary drivers of these clusters. To save developers time and cost, the paper proposes a machine learning approach that uses static test distance measures to identify clusters efficiently, allowing teams to fix multiple flaky tests at once by addressing a single root cause.
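As a rough illustration of the clustering idea, the toy sketch below groups failing tests by a single cheap static measure (Jaccard distance over test-name tokens); the paper evaluates richer distance measures and a full machine learning pipeline, and all names here are hypothetical.

```python
from itertools import combinations

def token_distance(a, b):
    """Jaccard distance over name tokens: one cheap, static distance measure."""
    ta, tb = set(a.lower().split("_")), set(b.lower().split("_"))
    return 1.0 - len(ta & tb) / len(ta | tb)

def cluster_failures(test_names, threshold=0.6):
    """Single-link grouping: tests closer than the threshold share a cluster."""
    clusters = [{name} for name in test_names]
    for a, b in combinations(test_names, 2):
        if token_distance(a, b) < threshold:
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                ca |= cb
                clusters.remove(cb)
    return clusters

# cluster_failures(["test_http_timeout", "test_http_retry", "test_parse_date"])
# groups the two networking tests together and leaves the parser test alone.
```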
Do Automatic Test Generation Tools Generate Flaky Tests?
About
While most research focuses on developer-written tests, this paper investigates flakiness in automatically generated tests (EvoSuite and Pynguin). Analyzing 6,356 projects, the study finds that generated tests are as flaky as developer-written ones but fail for different reasons, such as internal randomness and runtime optimizations. It demonstrates that suppression mechanisms can reduce this flakiness by 71.7%.
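In spirit, a suppression mechanism pins a source of non-determinism before each test executes. The pytest-flavored sketch below shows the general shape for seeded randomness; the paper's actual mechanisms target the internals of EvoSuite and Pynguin and differ in detail.

```python
import random

import pytest

@pytest.fixture(autouse=True)
def suppress_randomness():
    """Pin the random module's seed before every test so assertions that
    depend on sampled values become repeatable. (Hash randomization must
    instead be pinned via PYTHONHASHSEED before the interpreter starts.)"""
    random.seed(1234)
    yield
```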
Empirically Evaluating Flaky Test Detection Techniques Combining Test Case Rerunning and Machine Learning Models
About
Developers usually choose between slow, accurate rerunning and fast, approximate machine learning for flaky test detection. This paper introduces CANNIER, a hybrid approach that reduces the time cost of rerunning-based detection by an order of magnitude. Evaluating nearly 90,000 test cases, the study demonstrates that CANNIER maintains high detection accuracy while significantly outperforming pure machine learning models.
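The hybrid intuition fits in a few lines: trust the model's confident predictions and pay for reruns only when it is unsure. The sketch below is a simplified reading of that idea; the function names and thresholds are hypothetical, not CANNIER's actual interface.

```python
def hybrid_detect(tests, predict_proba, rerun_verdict, low=0.1, high=0.9):
    """Trust the model at the confident extremes; fall back to expensive
    rerunning only inside the ambiguous middle band."""
    verdicts = {}
    for test in tests:
        p = predict_proba(test)               # fast ML estimate of flakiness
        if p <= low:
            verdicts[test] = False            # confidently non-flaky
        elif p >= high:
            verdicts[test] = True             # confidently flaky
        else:
            verdicts[test] = rerun_verdict(test)  # slow but accurate reruns
    return verdicts
```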
What Do Developer-Repaired Flaky Tests Tell Us About the Effectiveness of Automated Flaky Test Detection?
About
This study questions the effectiveness of automated rerunning, the industry-standard baseline for detecting flakiness. Analyzing 75 real-world developer commits that repaired flaky tests, it found that rerunning detected the underlying flakiness in only 40% of cases, suggesting a major gap between automated detection and the flaky tests that developers actually prioritize for repair.
Surveying the Developer Experience of Flaky Tests
About
This survey bridges the gap between academic research and industry experience through a multi-source study of 170 developers and 38 Stack Overflow threads. Key findings reveal that developers view setup and teardown issues as the primary cause of flakiness, and that frequent exposure to flaky tests leads to a dangerous "normalization of deviance" in which developers begin to ignore genuine test failures.
Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests
About
Machine learning offers a fast alternative to expensive flaky test detection, but its potential is often limited by poor data encoding. This research introduces FLAKE16, a new feature set designed to detect both non-order-dependent and order-dependent (OD) flaky tests. In an evaluation on 26 Python projects, FLAKE16 outperformed existing feature sets, increasing detection accuracy (F1 score) by 13% for non-order-dependent tests and by 17% for order-dependent tests.
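To give a flavor of what such an encoding looks like, the sketch below defines a handful of per-test measurements and flattens them into the numeric row a classifier consumes; the names and selection are simplified illustrations, not the actual sixteen FLAKE16 features.

```python
from dataclasses import astuple, dataclass

@dataclass
class TestFeatures:
    """A few illustrative per-test measurements in the spirit of FLAKE16."""
    exec_time: float        # wall-clock duration of the test run
    covered_lines: int      # lines executed in the code under test
    max_memory: int         # peak memory during the run
    context_switches: int   # a cheap signal of concurrency effects

def to_vector(features):
    """Flatten into the numeric row a classifier (e.g. a random forest) takes."""
    return [float(v) for v in astuple(features)]
```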
A Survey of Flaky Tests
About
This study provides a comprehensive overview of the flaky test landscape by systematically analyzing 76 core research papers. It categorizes the current body of knowledge into four pillars: root causes, costs and consequences, detection strategies, and repair approaches. The survey serves as a foundational resource for practitioners and researchers seeking to understand how flakiness threatens the validity of modern software testing.
Flake It ’Till You Make It: Using Automated Repair to Induce and Fix Latent Test Flakiness
About
The most efficient time to fix a flaky test is at the moment of its creation, yet many tests possess "latent flakiness": non-determinism that exists but has not yet manifested. The paper argues that ignoring latent flakiness leads to higher long-term costs and degraded test suites, and explores how automated program repair (APR) techniques can proactively surface and fix these hidden issues before they disrupt the development pipeline.
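A hypothetical example (not drawn from the paper) shows what latent flakiness looks like and how a small repair removes it:

```python
def format_tags(tags):
    return ", ".join(tags)

def test_format_tags_latent():
    # Latent flakiness: this passes whenever the set happens to iterate in
    # insertion order, but set order varies with hash randomization, so the
    # assertion can start failing without any code change.
    assert format_tags({"a", "b"}) == "a, b"

def test_format_tags_repaired():
    # One repair: make the order explicit, removing the hidden dependence.
    assert format_tags(sorted({"a", "b"})) == "a, b"
```

The first test can pass for months and then fail when hash randomization changes the set's iteration order; surfacing such defects at creation time is exactly the opportunity the paper targets.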