
To test or not to test?

As software developers, we strive to produce the best quality code: code that not only accomplishes its goal, but does so in a reliable, maintainable way. That’s why we write tests: to prove that our work is solid and satisfies expectations, even when its inputs are inconsistent. As we know, there are many different types of tests, each suited to a particular layer of the application or a specific level of abstraction: unit tests, system tests, integration tests, end-to-end tests, snapshot tests, etc. Regardless of its type, though, we can agree that any test should examine and prove a specific behaviour, independently of its implementation. That is where many of us lose track of what tests really are.

In the following sections, I will show how several types of tests can be abused, and how the pain points of each can be addressed through a shift in mentality.


[Image: A new bug]

The overzealous mocker

Over the years, we have all been taught the same thing, especially in the context of object-oriented programming: unit tests are the simplest and most fundamental way to assess a piece of software. A unit is a well-isolated, single-minded piece of code that addresses a well-defined concern. Naturally, everyone agrees that the smallest testable unit is a class. Now, I would like to challenge that assumption.
Let’s take a big application as our example. The backend is a Java web server that exposes a number of APIs to access and modify the system’s data. It is written using all the canonical tools for the job: a dependency-injection framework, an ORM-backed persistence layer and the usual supporting libraries.

Given the breadth and size of an application of this kind, dependency injection stands out as an important tool for easily accessing various bits of functionality across the codebase. The dependency container is akin to glue that holds all classes together, while at the same time giving each of them a way to communicate with its neighbours.
The interconnectedness of such an application, however, is detrimental to the testing experience. If we are to enact the good practice of unit testing, how are we going to test a single unit when it is so tightly connected to other units via dependency injection? The answer comes roaring in under the fierce title of “mocking”. Mocking allows us to replace the real implementation of one or more of a class’s dependencies with a fake one that has limited or no actual functionality. A unit test then morphs into a different shape:

Here is an example similar to code I have worked on:

Mocking bananas
import static org.assertj.core.api.Assertions.assertThat;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.when;

import java.util.Optional;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

@ExtendWith(MockitoExtension.class)
class BananaServiceTest {

    // The repository is replaced with a Mockito mock...
    @Mock
    private BananaRepository bananaRepository;

    // ...and injected into the service under test.
    @InjectMocks
    private BananaService bananaService;

    @Test
    public void testFetch() {
        var targetBanana = new Banana(89L, "Cavendish");
        when(bananaRepository.findById(targetBanana.getId())).thenReturn(Optional.of(targetBanana));

        var fetchedBanana = bananaService.fetch(targetBanana.getId());
        assertThat(fetchedBanana).isEqualTo(targetBanana);

        // Note: the test also asserts *how* the service used its dependency.
        verify(bananaRepository).findById(targetBanana.getId());
        verifyNoMoreInteractions(bananaRepository);
    }
}

So far so good. Now let’s say this application has many unit tests with an abundance of mocking. Eventually we reach a point where we are no longer satisfied with the design of the codebase and decide to refactor it. This is quite a common occurrence, whether due to accrued technical debt or simply to the natural growth of the code. By definition, refactoring is a process that restructures the implementation, hopefully improving code quality, while leaving behaviour unchanged.

Here’s the crux, though. If we refactor the application’s code, we are bound to change how dependencies are used, either by altering the number or type of those dependencies, or by changing which of their methods are invoked. In the example above, we might replace findById with a default interface method called findOne that automatically handles missing results; this is exactly what happened on a project I worked on. Since our unit tests are heavily mocked, the refactoring inevitably breaks them: bananaRepository is a mock, so once the service calls the new findOne method, the stubbed findById is never invoked. But isn’t testing supposed to test behaviour? Refactoring does not change behaviour, so what is going on?
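For illustration, the refactored repository might look something like this (a sketch, assuming BananaRepository is a plain interface rather than a full Spring Data repository):

A refactoring that breaks mocks

import java.util.NoSuchElementException;
import java.util.Optional;

public interface BananaRepository {

    Optional<Banana> findById(Long id);

    // New default method: callers no longer deal with Optional directly.
    default Banana findOne(Long id) {
        return findById(id)
            .orElseThrow(() -> new NoSuchElementException("No banana with id " + id));
    }
}

Once BananaService switches from findById to findOne, the test above fails: Mockito mocks replace default methods too, so the mocked findOne returns null, the stubbed findById is never called and the verification blows up, even though the observable behaviour of the service is unchanged.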

The real and harsh truth is that through the obsessive practice of unit testing single classes we ended up testing the implementation, not the behaviour!

That raises the question: are we actually testing the right thing? You may not care now, but the amount of time spent rewriting tests will inevitably grow and slow you down; and if there is one metric people care about a lot, it is velocity. To be clear: taking time to ensure proper code quality and correctness is good. Rewriting the same test suite for the seventh time, however, is not a smart use of anyone’s time!

So, how do we fix this? We should change our perspective on testing. Coverage is just a number: regardless of its magnitude, it gives no functional guarantee. By analogy, you may have visited all of Europe, but if you did so from inside a bus that never stopped, you merely saw it; you never really appreciated or experienced its cities, landscapes and culture.

Unit testing is useful, but a unit is not necessarily a single class. It may be an ensemble of classes that are tightly coupled together. Find those connected components in your code graph and treat them as a single unit. Remember good design principles: what changes together belongs together! With this paradigm shift, we may find that some tightly bound units cannot actually be unit tested because they depend on external components: the most prominent example is persistence, as in the code above. In that case, we inevitably have to accept that some sections of our code are best tested through integration or system testing. There is no shame in not having unit tests for everything. Actually, it may be counter-productive at times.
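To make this concrete, here is a sketch of the same test written against a small hand-rolled fake instead of a mock (assuming BananaService takes its repository as a constructor argument and Banana implements equals):

Testing behaviour with a fake

import static org.assertj.core.api.Assertions.assertThat;

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

import org.junit.jupiter.api.Test;

class BananaServiceBehaviourTest {

    // A real, if simplified, implementation of the repository contract:
    // no stubbing and no verification of internal calls.
    static class InMemoryBananaRepository implements BananaRepository {
        private final Map<Long, Banana> store = new HashMap<>();

        @Override
        public Optional<Banana> findById(Long id) {
            return Optional.ofNullable(store.get(id));
        }

        void save(Banana banana) {
            store.put(banana.getId(), banana);
        }
    }

    @Test
    void fetchReturnsStoredBanana() {
        var repository = new InMemoryBananaRepository();
        var service = new BananaService(repository);
        repository.save(new Banana(89L, "Cavendish"));

        // Only observable behaviour is asserted: whether the service calls
        // findById or findOne internally is irrelevant to this test.
        assertThat(service.fetch(89L)).isEqualTo(new Banana(89L, "Cavendish"));
    }
}

This version survives the findById-to-findOne refactoring untouched, because it pins down what the unit does rather than how it does it.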

Don’t be an overzealous tester: write tests that provide real value, not just coverage. Most importantly, write tests that fail!

Forgotten snapshots

In a snapshot test, the code under test produces some kind of resource that can be serialized into a plain-text representation called a snapshot. The latter is then compared to a pre-computed snapshot stored in the codebase, with the expectation that the two be equal. A difference implies that something changed unexpectedly. This does not necessarily indicate a bug, but it is something that must be reviewed before committing any code update.

Snapshot testing has a variety of uses, but here I want to focus mainly on UI snapshots, which most developers will be familiar with thanks to Jest. To contextualize the examples, I will assume the UI is built with a component-centric front-end framework like React; given the overwhelming popularity of this approach, I am confident this causes no loss of generality.

“Snapshot tests are a very useful tool whenever you want to make sure your UI does not change unexpectedly”. This comes from the Jest documentation, and while it is essentially true, its usage easily deteriorates into a chore. Most of your tests will probably look the same:

Snapshot testing
import React from 'react';
import renderer from 'react-test-renderer';

const BananaComponent = (props) => (
  <>
    <span>This is a {props.kind} banana</span>
    <a href={`https://bananas.all/${props.id}`}>Click here for more info</a>
  </>
);

// Renders the component with a fixed set of props and compares the
// serialized tree against the stored snapshot.
it('renders correctly', () => {
  const tree = renderer
    .create(<BananaComponent id={89} kind="Cavendish" />)
    .toJSON();
  expect(tree).toMatchSnapshot();
});

Each test case usually has a different combination of props, in order to make sure that each set of inputs produces a consistent output. Here’s the problem, though. The snapshot does not summarize the behaviour of a component: it only contains its actual “implementation”. I am using the term “implementation” in a broader sense here, to indicate the HTML structure of the rendered output. While some may argue that the visualization is not an implementation, I would counter that in this day and age it effectively is, especially if you are using additional libraries and frameworks that inject dynamic styles and elements into it.

What if I want to swap the positions of the span and the link? The behaviour is the same, but the snapshot test will fail because the implementation changed. Sure, the UI changed as well. However, snapshot tests are tricky in that you cannot feasibly update the test before you touch the code. The feedback is one-way: first you update the component, then you watch the test fail, and finally you regenerate the snapshot. This process can become so dull that eventually you stop caring: any time a change is made, you simply update every snapshot without minding the actual output. Especially in a large application, reviewing every snapshot diff for correctness is unthinkable. Reviewers and QA engineers will simply run the application and look at it, making sure it appears and behaves correctly. Thus the value of snapshots is completely lost.

I think snapshot tests do have value in some cases, but for UI… simply use a living style guide like Storybook. It gives reviewers easy visual feedback without adding thousands of lines of snapshot diffs to each of your pull/merge requests. Additionally, you will not need to wait half an hour for your CI pipeline to run, only to discover you forgot to update your snapshots again.
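For the component above, a minimal story might look like this (Component Story Format; the file name and import path are illustrative):

A Storybook story

// BananaComponent.stories.jsx
import { BananaComponent } from './BananaComponent';

export default {
  title: 'Components/BananaComponent',
  component: BananaComponent,
};

// Each named export is one visual state, browsable in the Storybook UI,
// so reviewers inspect rendered output instead of serialized HTML diffs.
export const Cavendish = {
  args: { id: 89, kind: 'Cavendish' },
};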

How fast can you go?

At times, I have seen unit tests that asserted whether a cache was being called. I have mixed feelings about that. First, we can agree that optimizations should not change the actual behaviour of the system: they should not affect the quality or nature of the data being produced or ingested, only the circumstances of its consumption, e.g. the speed and memory required to accomplish the task. Thus, adding a caching layer should leave all your tests green; if that doesn’t happen, either you were too zealous in your mocking (as described above) or there was an implementation mishap somewhere, which should be easy to fix.
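To make the point concrete, a cache can often be introduced as a decorator around an existing interface, which is exactly why behavioural tests should not notice it. A sketch, reusing the hypothetical banana classes from earlier:

A transparent caching layer

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Wraps any BananaRepository and memoizes lookups: callers observe the same
// data as before; only the speed of repeated reads changes.
class CachingBananaRepository implements BananaRepository {

    private final BananaRepository delegate;
    private final Map<Long, Banana> cache = new ConcurrentHashMap<>();

    CachingBananaRepository(BananaRepository delegate) {
        this.delegate = delegate;
    }

    @Override
    public Optional<Banana> findById(Long id) {
        var cached = cache.get(id);
        if (cached != null) {
            return Optional.of(cached);
        }
        var result = delegate.findById(id);
        result.ifPresent(banana -> cache.put(id, banana));
        return result;
    }
}

A test that only inspects what fetch returns stays green with or without the decorator; a test that verifies which methods were invoked on a mock does not.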

Speed is a non-functional requirement. It does not dictate what a system is supposed to do, but rather how it should do it. As such, I think its meaning cannot be grasped by a unit test. Even then, it’s challenging to come up with measures that are consistent and objective. Processes run on a CI are subject to constant changes in load, orchestration and runner scaling: all these factors make a single measurement of speed, e.g. the time taken, unsuitable for assertions. The same is true for naive assumptions like “all invocations after the first should be faster”: you still need to account for cache invalidation, concurrent threads and other factors that are nearly impossible to predict.

In the end, I am still unsure how performance should be accounted for. Sure, auditing and monitoring are powerful tools to continuously assess the performance of a running system. Yet, is there something we can do to ascertain those metrics before actually deploying the application? Some options: dedicated micro-benchmark suites run under controlled conditions, load tests against a production-like staging environment, or profiling sessions baked into the release process.
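For the first option, a minimal micro-benchmark on the JVM might look like this (a JMH sketch, reusing the hypothetical in-memory fake from earlier):

A JMH micro-benchmark

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// JMH handles warm-up, forking and repeated measurement: the controlled
// conditions a CI-bound unit test cannot provide.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class BananaServiceBenchmark {

    private BananaService service;

    @Setup
    public void setUp() {
        var repository = new InMemoryBananaRepository();
        repository.save(new Banana(89L, "Cavendish"));
        service = new BananaService(repository);
    }

    @Benchmark
    public Banana fetch() {
        return service.fetch(89L);
    }
}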

A final reminder: one test per bug

Through years of experience, I have realized that one thing always provides value: tests that originate from bugs. I think those are among the most important code you can write when addressing an outstanding problem. When you are working on a bug fix, write a test for it first. Make it obvious that it fails because the existing code behaves incorrectly. It may take time, but I assure you it’s worth it. This way, you will obtain a reliable reproduction of the problem, proof that your fix actually addresses it, and a permanent guard against the same regression creeping back in.

Notice I didn’t explicitly specify what kind of test it should be. Some bugs inherently originate from the interactions between multiple components within the same application, possibly even across different microservices. You may end up having to write an integration or end-to-end test.
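As a closing illustration, still in the hypothetical banana domain: imagine a bug report stating that fetching a missing banana crashed the endpoint. The regression test is written first, watched to fail, and only then is the fix applied:

A regression test born from a bug

import static org.assertj.core.api.Assertions.assertThatCode;

import org.junit.jupiter.api.Test;

class BananaServiceRegressionTest {

    // The name references the (made-up) ticket so future readers can
    // trace the test back to the original bug report.
    @Test
    void fetchHandlesMissingBananaGracefully_bug1234() {
        var service = new BananaService(new InMemoryBananaRepository());

        // Before the fix, fetching an absent id threw an unhandled
        // exception all the way up to the API layer.
        assertThatCode(() -> service.fetch(404L)).doesNotThrowAnyException();
    }
}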