Automatically detecting redundant tests

A toy worker in high-visibility clothing sweeping up the seeds of a dandelion. The dandelion looms over the worker.
Image by magica from Pixabay

It’s common for test suites to grow over time, but removing tests is rare. Because of this, some tests run every time without adding any value. All these tests do is waste time and increase the test maintenance load. With this in mind, I attempted to find a way to automatically detect redundant tests by harnessing the power of mutation testing.

What is a redundant test?

A redundant test adds no value to our test suite and, by extension, our application. It does, however, take time to run and maintain.

I consider a test to be redundant when one of the following statements is true:

  1. The test can’t fail
  2. The test has a duplicate test covering the same logic

Both of these conditions are hard to detect automatically, but I’ll try in this experiment. My hypothesis is:

Hypothesis: Infallible and duplicate tests never add value to a test suite and, by extension, the quality of the application under test.

The power of mutation testing

Mutation testing is a big topic in itself, but here are the basics. Mutation testing is a testing approach where we automatically introduce bugs (called mutants) into our source code and check whether any of our unit tests fail because of them. If no test fails, we might want to add a test to cover that situation. The mutation test runner generates many mutants in a test run and runs many tests against each mutant. Because of this, mutation testing is very CPU-intensive, even with all the excellent optimizations made by the tooling authors.
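
As a minimal illustration (the function below is made up, not taken from Express), a mutation test runner might flip a comparison operator and rerun the unit tests:

// Original production code
function isAdult(age) {
  return age >= 18;
}

// An example mutant: the runner changes `age >= 18` into `age < 18` and runs
// the unit tests again. If every test still passes, the mutant "survives",
// which means nothing actually verifies the age check.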

Mutation testing produces a dataset that records, for each test, which mutants it covers. Normally, you use this dataset to calculate mutation coverage. However, we can also use the dataset for other purposes. Most relevant for this experiment is that the data can tell us the following (a simplified sketch of this per-test view follows the list):

  • For each test, how often did it fail?
    The mutation test runner generates many bugs during a run. Under these conditions, any test that can fail will fail.
    Exceptions apply: Mutation test tools generate most bug types but not all.
  • For each test, which mutants does it cover?
    This allows us to compare tests and find duplicates. Two tests that cover the same mutants cover the same logic and are thus duplicates.
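
A heavily simplified sketch of the data involved, limited to the fields the analysis script at the end of this article actually uses (mutant ids, test ids, and the coveredBy relation between them); the ids and names here are made up:

// Simplified sketch of the relevant pieces of a mutation report.
const mutants = [
  { id: "m1", coveredBy: ["t1", "t2"] }, // covered by two tests
  { id: "m2", coveredBy: ["t1", "t2"] },
  { id: "m3", coveredBy: [] }, // covered by no test at all
];

const tests = [
  { id: "t1", name: "sends a string body" },
  { id: "t2", name: "sends a string body (copy)" },
  { id: "t3", name: "exposes the public API" },
];

// Per test, we can derive the set of covered mutants:
//   t1 -> { m1, m2 }   t2 -> { m1, m2 }   t3 -> { }
// t3 never covers a mutant (it can never fail), and t1/t2 cover identical
// sets (they are candidates for being duplicates).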

Barriers to success

The biggest barrier to success is that mutation testing is slow. Practically speaking, it’s too slow for anything but unit tests. Future hardware improvements and tooling optimizations can alleviate this issue a bit, but test duration will always be a concern due to the fundamental nature of mutation testing.

Secondly, mutation testing is slow. Yes, again, that’s how slow it is. There are stories about large codebases with mature test suites where mutation test runs take days. At the same time, large codebases with mature test suites would benefit most from detecting redundant tests, making this a difficult proposition in the real world.

Let’s give it a shot

Enough theory. I wrote a proof-of-concept for detecting both infallible and duplicate tests.

To test the concept, I took the codebase of the open-source project Express and added mutation testing with Stryker.

Express is a popular HTTP server framework with a 12-year-old JavaScript codebase containing an excellent unit test suite of 1145 tests (at the time of writing). I picked Express because I know its test suite is comprehensive, giving the experiment a better chance of success.

Stryker is a mutation test runner. It’s a project for mutation testing in JavaScript, C#, and Scala. I used the JavaScript version of Stryker.

Because of how Stryker works, the mutation coverage can change slightly between runs. To counter this, I ran each experiment 10 times and included the measurement uncertainty in the results below. More data would be better, but each of the ten runs takes between 10 and 20 minutes. It adds up.
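
A minimal sketch of how such an uncertainty can be computed, assuming it is the sample standard deviation over the ten coverage numbers (the exact definition is an implementation detail, and the values below are made up):

// Mean ± sample standard deviation over a list of coverage percentages.
const coverages = [88.41, 88.52, 88.47, 88.39, 88.5, 88.44, 88.48, 88.42, 88.46, 88.51];

const mean = coverages.reduce((sum, c) => sum + c, 0) / coverages.length;
const variance =
  coverages.reduce((sum, c) => sum + (c - mean) ** 2, 0) / (coverages.length - 1);

console.log(mean.toFixed(2) + " ± " + Math.sqrt(variance).toFixed(2) + "%");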

Test 1: Baseline

Run all tests without any modifications.

|                                                  | Result      |
| ------------------------------------------------ | ----------- |
| Total tests                                      | 1145        |
| Total mutants                                    | 1963        |
| Mutation coverage                                | 88.46±0.08% |
| Tests that didn't fail (0 mutants covered)       | 13          |
| Tests with duplicate test (same mutants covered) | 821         |

A mutation coverage of 88% is a very high score. It is rare for an application to get close to even 80% mutation coverage, as that requires rigorous testing standards.

We can also see that some tests never failed, and there are a lot of duplicate tests.

Test 2: Skip the tests that never failed

In the first test, 13 tests never failed. I disabled these tests.
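
Express runs its tests with Mocha, so disabling a test can be as simple as switching it to it.skip (the test name below is a placeholder):

// `it.skip` keeps the test in the file but excludes it from the run;
// `describe.skip` does the same for a whole block of tests.
it.skip('a test that never failed during mutation testing', function () {
  // original test body stays untouched
});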

|                                                  | Result      | Change     |
| ------------------------------------------------ | ----------- | ---------- |
| Total tests                                      | 1132        | -13        |
| Total mutants                                    | 1963        | 0          |
| Mutation coverage                                | 88.41±0.08% | -0.05±0.11 |
| Tests that didn't fail (0 mutants covered)       | 0           | -13        |
| Tests with duplicate test (same mutants covered) | 821         | 0          |

Disabling these 13 tests does not change the mutation coverage significantly. The difference is within measurement uncertainty.

This data supports the hypothesis. However, we need more data before we can definitively state that removing infallible tests does not impact test suite quality.

Most of the disabled tests ensure that the public API of Express exists. Stryker can’t generate this bug type. A bug in the public API would be caught by other types of tests that actively use the public API. The remaining disabled tests cover the functionality of a dependency.
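
A hypothetical example of such a test (not copied from the Express suite): it only asserts that part of the public API exists, and since Stryker doesn’t generate a “remove this export” mutant, no mutant can make it fail.

// Hypothetical API-existence test; nothing here exercises behaviour that
// Stryker can mutate, so it can never fail during a mutation run.
var assert = require('assert');
var express = require('..');

describe('exports', function () {
  it('should expose Router', function () {
    assert.strictEqual(typeof express.Router, 'function');
  });
});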

None of the disabled tests seem incorrectly marked as infallible.

Test 3: Skip some duplicate tests

In the previous tests, there were a lot of duplicate tests. I disabled 10 of those tests.

For this proof-of-concept, I picked only simple duplicates: situations where two tests cover the exact same mutants. From each such pair, I disabled one of the two tests. Detecting complex duplicates is possible but requires more implementation effort, so I’ll leave it for now.
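
As an illustration (again not taken from the Express suite; the function trimStart and the test names are placeholders), two tests like these would typically cover exactly the same mutants, making one of them a candidate for removal:

// Both tests run the same code path and make the same assertion,
// so they cover (and kill) the same set of mutants.
it('trims leading whitespace', function () {
  assert.strictEqual(trimStart('  hello'), 'hello');
});

it('removes spaces at the start of a string', function () {
  assert.strictEqual(trimStart('  hello'), 'hello');
});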

|                                                  | Result      | Change     |
| ------------------------------------------------ | ----------- | ---------- |
| Total tests                                      | 1122        | -10        |
| Total mutants                                    | 1963        | 0          |
| Mutation coverage                                | 88.46±0.08% | +0.05±0.11 |
| Tests that didn't fail (0 mutants covered)       | 0           | 0          |
| Tests with duplicate test (same mutants covered) | 801         | -20        |

Disabling these ten tests does not change the mutation coverage significantly. The difference is within measurement uncertainty.

This data supports the hypothesis. However, we need more data before we can definitively state that removing duplicate tests does not impact test suite quality.

Data summary

The gathered data supports the hypothesis. The mutation coverage has barely changed after disabling 23 tests (13 unfailed + 10 duplicate). However, the dataset is small, so we should not draw hard conclusions before gathering more data.

Summary of all data. The image contains no new data points.

Conclusion

The data gathered during this experiment suggests that we can create a tool to help us reduce the size of our low-level test suites without significantly compromising suite quality. We can do this by using our definition of redundant tests and the data generated by mutation testing.

The usefulness of such a tool in the real world is debatable. The codebases that would benefit most will also have the hardest time using this tool. The CPU-intensive nature of mutation testing means this tool can take hours or even days to run. Throughout this experiment, I’ve been going back and forth on whether building this tool is worth the effort.

This approach is too CPU-intensive for day-to-day use, but we might use it occasionally. It might make sense to run this tool once a year. A test suite spring cleaning, if you will.

Do you think it’s worth investing time in building this tool? What would you use it for?

Reproduce it yourself
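
I added StrykerJS to the Express codebase with roughly the following configuration (the JSON form of a Stryker configuration file; the json reporter produces the report that the analysis script reads):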

{
 "testRunner": "mocha",
 "coverageAnalysis": "perTest",
 "reporters": ["html", "progress", "json"],
 "ignoreStatic": true
}
  1. Run StrykerJS via npx stryker run
  2. Run the script below with Node.js
import { readFile } from "node:fs/promises";
import { calculateMetrics } from "mutation-testing-metrics"; // A dependency of Stryker

const inputPath = "./reports/mutation/mutation.json";
const input = JSON.parse(await readFile(inputPath, "utf-8"));

const metrics = calculateMetrics(input.files).metrics;
console.log("Total mutants: " + metrics.totalValid);
console.log("Mutation coverage: " + metrics.mutationScore.toFixed(3) + "%");

// -----------------
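// Flatten the report: collect every mutant, and attach to every test the
// list of mutants it covered during the run.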

const allMutants = Object.values(input.files)
  .map((file) => file.mutants)
  .flat();

const allTests = Object.entries(input.testFiles)
  .map(([filePath, file]) =>
    file.tests.map((t) => {
      return { ...t, filePath };
    })
  )
  .flat()
  .map((test) => {
    return {
      ...test,
      killedMutants: allMutants
        // `coveredBy` can be absent for mutants that no test covers
        .filter((mutant) => (mutant.coveredBy ?? []).includes(test.id))
        // Sort by id so the duplicate comparison below is order-independent
        .sort((a, b) => {
          if (a.id > b.id) return -1;
          if (a.id < b.id) return 1;
          return 0;
        }),
    };
  });
console.log("Total tests: " + allTests.length);

// -----------------
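// Tests that covered zero mutants could never fail during the mutation run.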

const neverFailed = allTests.filter((t) => t.killedMutants.length === 0);

console.log("Tests that can't fail (0 mutants killed): " + neverFailed.length);
// console.log(neverFailed);

// -----------------
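// Tests whose covered-mutant set is identical to that of another test are
// flagged as (simple) duplicates.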

const testsWithDuplicates = allTests
  .filter((t) => t.killedMutants.length > 0)
  .map((thisTest) => {
    return {
      ...thisTest,
      duplicates: allTests.filter((otherTest) => {
        if (otherTest.id === thisTest.id) {
          return false;
        }
        if (otherTest.killedMutants.length !== thisTest.killedMutants.length) {
          return false;
        }

        // Crude and inefficient way to compare which mutants are covered by each test
        const thisKilledMutants = JSON.stringify(
          thisTest.killedMutants.map((m) => m.id)
        );
        const otherKilledMutants = JSON.stringify(
          otherTest.killedMutants.map((m) => m.id)
        );
        const match = thisKilledMutants === otherKilledMutants;

        return match;
      }),
    };
  })
  .filter((test) => test.duplicates.length > 0);

console.log(
  "Tests with at least 1 duplicate (same mutants covered): " +
    testsWithDuplicates.length
);
// console.log(testsWithDuplicates);