OddThinking

A blog for odd things and odd thoughts.

There’s only one thing worse than having unit tests…

As parts of my current project mature, I realise that there are exactly two types of unit tests in my suite, corresponding to the two common types of change that I am making.

The first is the fragile test. A typical example would have 60 carefully hand-calculated result values. When I tweak an unimportant threshold value that couldn’t possibly trigger any real bugs, it just tweaks performance (e.g. now run 15 threads in the thread pool rather than 10), each of those results must change and require recalculating by hand.

The second is the robust test case. This is the type that passes despite the fact I have just made major improvements to the code without touching the corresponding test-case. If the old code and the new code both make the test case pass, what is the test case actually testing?

Most commonly, it is because I have encountered an unexpected exception in later tests, or in the field, due to corrupt data being passed on by another component, and that provokes me to (a) improve the handling of the corrupt data nearer the source and (b) improve the handling of unexpected exceptions. Neither of these paths happens to be tested by the old test code. Until 100% branch coverage is suddenly discovered to be a sensible investment of time, they probably never will be.

There is actually a third set of test cases – the ones that correspond to the times when I don’t make any changes. These are the ones that test some non-deterministic behaviour, and just fail randomly now and again (on perfectly good output) because the scheduler happened to take a different path to the one it did when the predicted output was determined. It might be possible to make these tests more robust, but the effort isn’t justified – especially if the component under test is stable and believed to be bug-free (by manual inspection of the differing outputs).


Of course, this is a very cynical view; it ignores the benefits I get from my tests. I should be happy with the revealed opportunities to improve the tests to make them both more robust and to have better coverage.

But mainly, it helps get me annoyed at the people who evangelise automated unit-tests as though they are a magic panacea.

I got very frustrated with the first book on Extreme Programming that I read. The author continually harped on about how exciting it was to have unit-tests, and that he would often press the button to run the whole suite when he sat down at his computer.

I couldn’t understand what he was talking about; it made no sense to my understanding of development.

It wasn’t until very late in the book he admitted it was really only practical if your unit tests ran in under 20 seconds. Twenty seconds? What project has unit tests running that quickly?

I am on a small project, and my unit-tests for the main program alone take over five minutes. Several of the individual modules take far more than 20 seconds to test. Running the unit-tests is a productivity killer, because it forces me to get distracted from my main development focus while I wait.

At the time I read the book, I worked on a large, old project which, in my opinion, had woefully low unit-test coverage. Running all the automated unit-tests it did have took hours.

It left me shaking my head at the real-world experience this author must have had.


To summarise in the vernacular, unit-tests suck – I wouldn’t bother with them if it wasn’t that the fraction of bugs they do manage to catch, suck even more.


Comments

  1. NOTE: There are tools that analyse your code and rerun only the tests for the things impacted by the things you’ve changed – unfortunately I don’t think I’ve seen it for Python.

    Why do you have to wait for the tests – why can’t you run them in the background and move onto something else, and come back when/if the tests fail?

  2. When I tweak an unimportant threshold value that couldn’t possibly trigger any real bugs, it just tweaks performance (e.g. now run 15 threads in the thread pool rather than 10), each of those results must change and require recalculating by hand.

    Isn’t that, well, a bug?

    I don’t know your exact situation but my preference is to not use threads in a unit test environment. Threading adds a level of non-determinism which is exactly the sort of thing that a unit test should be preventing. In fact, by mocking out the inter-thread communication you can test all sorts of race conditions which are ordinarily very rare.

    If your test cases are fragile in this way, it suggests to me that the tests aren’t granular enough. In other words you are testing more than just individual “units” of software.

    If the old code and the new code both make the test case pass, what is the test case actually testing?

    It’s testing that you haven’t introduced any regressions with the new code … (Sorry this is such an obvious point that I think I must be missing something here?)

    But mainly, it helps get me annoyed at the people who evangelise automated unit-tests as though they are a magic panacea.

    Evangelists are often annoying, but hopefully my own position on this topic isn’t too extreme.

    Twenty seconds? What project has unit tests running that quickly?

    I know of a project with about 1M lines of code, roughly 40% of which is unit tests. The unit tests run in under 1 minute. So yes, these projects exist.

  3. Rohan, moving on to something else is exactly what I do, with the terrible productivity impact; the something else tends to be Facebook, Google Reader or getting some snack food I don’t really need! Or in this particular case, writing an ill-considered rant on OddThinking at 2 am.

    I need a bell to play when the unit tests finish to bring me back to work.
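A bell like that can be sketched in a few lines of Python. This is only an illustration: `run_and_ring` is a made-up helper name, and it assumes the tests are started from a terminal that reacts to the ASCII BEL character. The `run` parameter is there so that the helper can itself be tested without launching a real process.

```python
import subprocess
import sys


def run_and_ring(command, bell=sys.stderr, run=subprocess.call):
    """Run `command`, ring the terminal bell, and return the exit code.

    Hypothetical helper for illustration; not part of any test framework.
    """
    exit_code = run(command)
    # "\a" is the ASCII BEL character; most terminals beep or flash on it.
    bell.write("\a")
    bell.flush()
    return exit_code
```

For example, `run_and_ring([sys.executable, "-m", "unittest", "discover"])` would run test discovery and beep when it finishes, whether the suite passed or failed.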

    You got me thinking about the source of context-switches, and there’s another post coming.

  4. The first question you have to ask yourself is, “what are the unit tests testing?” “The code” is the wrong answer. In reality, the unit tests check that you have an understanding of what your code should do. They ask you to describe your code, from the other side.

    Realise that, in theory, you could write the code very thoroughly and have (fragile) unit tests automatically generated for you. Similarly, you could write unit tests very thoroughly and use some genetic algorithm (for example) to generate the code! A unit test is the same thing as the code, but backwards. You’re doing it so you can prove to yourself that you know what you’re doing.

    Maybe I’m not that smart, but I’ll often find that in the course of writing my test, I’ll realise that I’ve done something wrong — either the code is incorrect, or the interface doesn’t make any sense, or the problem is in fact impossible to solve. Whatever it is, it’s thinking of a problem from both ends that adds rigour.

    If your unit tests fail occasionally, then do you really understand what your code is meant to do? I mean, you wrote the unit test as a declaration of “my code will definitely do this”, but it doesn’t do it!

    Some white box testing is what you seem to describe as fragile. There are reasons for and against writing these sorts of tests at all. Fragile tests generally point to crystalline code. Inflexible things which are going to cause you problems later. It means you should re-think your design. OTOH maybe you’re just really thorough, and want to make sure that a particular (necessarily complex) object undergoes state changes in the right ways.

    Unit tests change the way you code, to make the code better. They also give you confidence that the code works, so you can be bolder in your refactoring. Regression testing aside, these are by far their biggest advantage.

  5. Alastair, I am forced to agree with everything you say, and where I don’t, it was because I was unclear. I hope I did make it clear that, on the whole, I am pro-unit tests. I just want someone else to do them for me, especially as they are generally on the critical path, because they aren’t much fun. Is that too much to ask? 🙂

    Isn’t that, well, a bug?

    I was unclear, because I was trying to abstract away from the real-life actual example I had. My software will accept some odds if they are, say, within a 10% range of a given number. After the tweak, it will accept the odds if they are within a 15% range of a prediction. The code displays the range of odds on dozens of bets – each one now needs recalculating – a laborious effort when I am very confident that the code change was successful – all I changed was a constant from 10 to 15.

    The unit test could calculate this value rather than me doing it by hand, but that risks the old problem of the code and the unit-test having the exact same bug, so I have stuck to hand-calculated values.

    Sticking to the original example I gave in the post, suppose the mythical code used to divide some work into work packages to be shared between the tasks, and printed its progress as each task completed:
    Package processed. 10.0% done.
    Package processed. 20.0% done.
    ...

    Then the new code would output
    Package processed. 6.7% done.
    Package processed. 13.3% done.
    ...

    A trivial change – one that I am confident wouldn’t be affected by a tweak of a constant, yet with dozens of instances, it is a pain to update.

    I don’t know your exact situation but my preference is to not use threads in a unit test environment. Threading adds a level of non-determinism which is exactly the sort of thing that a unit test should be preventing. In fact, by mocking out the inter-thread communication you can test all sorts of race conditions which are ordinarily very rare.

    Agree in principle. Difficult to do in practice.

    I try to make the thready parts do the least they can. I do have mock-ups for quite a bit of the inter-thread communication. I am aware I could improve this. Only yesterday, I refactored some code in one of the threads, so a substantial part of the complexity was in a separate module where it could be unit-tested deterministically.

    I also try to stick to well-known idioms of threading (e.g. producer/consumer) to reduce the risk of race-conditions, deadlocks etc.

    My top-level thready component is probably still too complex, making unit-testing difficult. However, I am reaching a point of diminishing returns.

    If your test cases are fragile in this way, it suggests to me that the tests aren’t granular enough. In other words you are testing more than just individual “units” of software.

    This is a tricky one to answer. My testing approach follows my coding approach. Try to push things into loosely-coupled modules. Test the modules individually. If one module sits on another, test the submodule, then test the other module – it is bigger than the smallest possible unit, but still a unit test. If the submodule is (or is talking to) a substantially different component (particularly other servers), mock it out. Mocking is relatively expensive and is only done for high-risk areas, but a very valuable strategy.

    I inspect code coverage every couple of months or so. I don’t have a particular target percentage in mind, I just check that my strategy isn’t surprisingly off-base.

    I don’t think any of this is particularly unusual.

    It’s testing that you haven’t introduced any regressions with the new code … (Sorry this is such an obvious point that I think I must be missing something here?)

    Yeah, again I wasn’t terribly clear.

    If it was a black-box test, yes, it shows no regressions. That is valuable.

    However, the test didn’t find the original bug, so it shows no progress either. Best practice would be to first write a test that fails, then make the change, re-run the test and show it works. I plead guilty to not always doing this.

    Why? In some cases, because the mock object doesn’t happen to have the ability to generate the exact corruption/unexpected exception that was seen in the field, and the time taken to add that would be measured in hours or days, and I can’t justify it for a once in a while intermittent bug whose expected impact costs “the customer” (me!) less than a dollar per month.
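Where the mock is built with Python’s standard `unittest.mock`, making it raise a chosen exception can be cheap. The sketch below is an assumption-laden illustration, not the code under discussion: `CorruptFeedError`, `latest_price` and the `fetch` method are invented names, and a hand-written mock object may well need more work than this.

```python
import unittest
from unittest import mock


class CorruptFeedError(Exception):
    """Hypothetical error raised when upstream data cannot be parsed."""


def latest_price(feed):
    """Hypothetical unit: return the latest price, or None if the feed is corrupt."""
    try:
        return feed.fetch()
    except CorruptFeedError:
        return None


class LatestPriceTest(unittest.TestCase):
    def test_corrupt_feed_is_handled(self):
        feed = mock.Mock()
        # side_effect makes the mocked call raise instead of returning a value.
        feed.fetch.side_effect = CorruptFeedError("truncated packet")
        self.assertIsNone(latest_price(feed))
        feed.fetch.assert_called_once_with()

    def test_healthy_feed_passes_value_through(self):
        feed = mock.Mock()
        feed.fetch.return_value = 2.5
        self.assertEqual(latest_price(feed), 2.5)
```

Written before the fix, the first test would fail with the uncaught exception, which gives the "fail first, then pass" evidence described above.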

    Another aspect I am struggling with at the moment, as I move away from the inappropriate use of doctest, is that Python’s unittest framework, by default, pays no attention to all the little log messages my units produce (from DEBUG through to CRITICAL messages) while doctest, by default, gets caught up on every single one.

    I am having trouble determining which is better:

    I used to create fragile tests with doctest that broke every time I made a trivial change that didn’t affect the unit’s real output.

    Now, I am writing robust black-box tests that don’t notice that I have just improved the logging of certain events to make the operational side easier. The ability for me to use the product is affected by the quality of the logging on the production system, and to have my unit-tests miss this key requirement leaves me, again, wondering whether the test is actually testing what it should.

    (Having the tests ignore CRITICAL error messages is bad, no matter which way you slice it, but not yet a solved problem for me with the unittest framework. I guess I need to mock the logging framework, to send (only) some types of log to a data structure where they can be inspected by the unit-test. On my To Do Someday list.)
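One way to inspect only some types of log from a unit test, without mocking the whole logging framework, is the `assertLogs` context manager that `unittest.TestCase` has provided since Python 3.4. The sketch below is a guess at how it might apply here: the logger name `"odds"` and the function `accept_odds` are invented for illustration.

```python
import logging
import unittest

logger = logging.getLogger("odds")  # hypothetical logger name


def accept_odds(offered, predicted, tolerance=0.10):
    """Hypothetical unit: accept odds within `tolerance` of the prediction."""
    if predicted <= 0:
        logger.critical("Corrupt prediction: %r", predicted)
        return False
    ok = abs(offered - predicted) <= tolerance * predicted
    logger.debug("offered=%s predicted=%s ok=%s", offered, predicted, ok)
    return ok


class AcceptOddsLoggingTest(unittest.TestCase):
    def test_corrupt_prediction_logs_critical(self):
        # Only records at CRITICAL or above on the "odds" logger are captured.
        with self.assertLogs("odds", level="CRITICAL") as captured:
            self.assertFalse(accept_odds(2.0, -1.0))
        self.assertEqual(len(captured.records), 1)
        self.assertIn("Corrupt prediction", captured.output[0])

    def test_normal_decision_logs_only_debug(self):
        with self.assertLogs("odds", level="DEBUG") as captured:
            self.assertTrue(accept_odds(2.1, 2.0))
        self.assertEqual([r.levelname for r in captured.records], ["DEBUG"])
```

The second test checks the levels rather than the message text, so rewording a DEBUG message would not break it, while an unexpected CRITICAL message would.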

    Evangelists are often annoying, but hopefully my own position on this topic isn’t too extreme.

    Your stated position is that unit tests are a clear win for the customer. I agree.

    I think that, on balance, unit tests are a win for the programmer too, but not a magical cost-free win. The length of this response demonstrates, I hope, that I am neither ignoring unit tests, nor permitting myself to get too sidetracked gold-plating them, but struggling to find the right balance.

    I know of a project with about 1M lines of code, roughly 40% of which is unit tests. The unit tests run in under 1 minute. So yes, these projects exist.

    Wow. I’m trying to imagine half-a-million lines of code that could run in under a minute. Can’t have significant thread usage. Can’t have significant I/O (network or disk), can’t be compute intensive (e.g. graphics processing), presumably little or no UI. Am I missing something? Doesn’t sound like a very common pattern – but probably a great project for meeting quality requirements!

    Compare that to my code.

    There are some severely compute intensive parts – but I don’t normally touch those parts, and their unit tests aren’t run very often.

    There is some non-deterministic/thready code – the tests often run slowly because the test thread needs to be confident that the other thread has had an opportunity to finish before inspecting the results.

    There is plenty of code that tests networked interactions to machines scattered around the globe. Ping times are a bastard, and just waiting for the account to log in is slow.

    There is also a lot of time-based code, even if it is deterministic. For example, the scheduling/throttling code that ensures that a particular call to the server occurs at most x times per y seconds. Another component sends keep-alive messages to a server. It takes time to run such tests… unless I go and mock Python’s time module – let me add that to my To Do Someday list.
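An alternative to mocking the `time` module wholesale is to pass the clock and the sleep function in as parameters, so that a test can substitute fakes. The sketch below assumes a throttle of the "at most x calls per y seconds" kind; `Throttle` and `FakeClock` are invented names, not the real code.

```python
import collections
import time


class Throttle:
    """Hypothetical throttle: allow at most `max_calls` per `per_seconds`."""

    def __init__(self, max_calls, per_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._per_seconds = per_seconds
        self._clock = clock
        self._sleep = sleep
        self._stamps = collections.deque()

    def _prune(self, now):
        # Forget calls that have dropped out of the sliding window.
        while self._stamps and now - self._stamps[0] >= self._per_seconds:
            self._stamps.popleft()

    def wait(self):
        """Block (via the injected sleep) until another call is allowed."""
        now = self._clock()
        self._prune(now)
        if len(self._stamps) >= self._max_calls:
            self._sleep(self._per_seconds - (now - self._stamps[0]))
            now = self._clock()
            self._prune(now)
        self._stamps.append(now)


class FakeClock:
    """Test double: 'sleeping' advances the clock instantly."""

    def __init__(self):
        self.now = 0.0
        self.sleeps = []

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.sleeps.append(seconds)
        self.now += seconds


clock = FakeClock()
throttle = Throttle(max_calls=2, per_seconds=10.0,
                    clock=clock.time, sleep=clock.sleep)
for _ in range(3):
    throttle.wait()
print(clock.sleeps)  # [10.0]: only the third call had to wait
```

With the fake clock, a test covering ten "seconds" of throttling finishes immediately and gives the same answer on every run.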

  6. Sunny,

    In reality, the unit tests check that you have an understanding of what your code should do.

    Yes, this is a very TDD approach to unit testing. I am at a point beyond the initial implementation of the code, so my understanding of what the code should do is pretty good. I am no longer discovering what the code should do. Now I am at the stage of tweaking the code, and making sure I didn’t make a silly mistake.

    Similarly, you could write unit tests very thoroughly and use some genetic algorithm (for example) to generate the code!

    I love this idea. Very provocative.

    Maybe I’m not that smart, but I’ll often find that in the course of writing my test, I’ll realise that I’ve done something wrong — either the code is incorrect, or the interface doesn’t make any sense, or the problem is in fact impossible to solve. Whatever it is, it’s thinking of a problem from both ends that adds rigour.

    Agreed; it is a very powerful technique to force yourself to think through your design and implementation.

    If your unit tests fail occasionally, then do you really understand what your code is meant to do? I mean, you wrote the unit test as a declaration of “my code will definitely do this”, but it doesn’t do it!

    With non-deterministic code, that becomes trickier. Sometimes my test case is merely saying “My code should do this, most of the time.”

    An example: This code should send a keep-alive message every 5 seconds. Run it for a minute, and the keep-alive count should be 12. Actually, 11-13, due to possible fence-post errors and inaccurate clocks. But most of the time 12. Or it could be 10, if the processor load is high and the thread under test was starved of CPU, or 14 in the unusual case that the test thread, alone, was starved and didn’t get to stop the test in time. Or zero if there is a network outage (which isn’t the unit’s fault).

    In practice, I write the test to check for 11 or 12, with a note on failure that this is non-deterministic, and to only get worried if it repeatedly fails.

    OTOH maybe you’re just really thorough, and want to make sure that a particular (necessarily complex) object undergoes state changes in the right ways.

    This is a particular concern. In one place in my code, an important GO/NO GO decision is made. The code merely returns a Boolean, but logs its reasons for doing so. I test lots of different cases, but I really want to forgo black-box testing and check that the reasoning is good. It would be a shame if the test case for condition B is accidentally passing the test because it gets the same return value for reason A – leaving the B condition untested. I am making the test fragile to convince myself I have covered all the branches – if the branches change, the test should fail to ensure I add new code for the new branches.

    Unit tests change the way you code, to make the code better. They also give you confidence that the code works, so you can be bolder in your refactoring. Regression testing aside, these are by far their biggest advantage.

    It is a good point. I have a similar view about Code Review. Knowing someone else is going to be looking at your code makes you lift your game to avoid embarrassment.

  7. With non-deterministic code, that becomes trickier. Sometimes my test case is merely saying “My code should do this, most of the time.”

    With non-deterministic code the way I solve the problem is to use a factory (or dependency injection) to pull out the “random” part. This will usually either be a fake timer or a fake call to random(). Your unit test can then mock that out, and you can test the code properly.

  8. Knowing someone else is going to be looking at your code makes you lift your game to avoid embarrassment.

    I can’t help but be reminded of your earlier idea, and wondering whether the same wouldn’t work with code reviews…

    More than a million dollar idea, I think. Now all we need to do is turn it into a franchise and we’ll NEVER WORK AGAIN. (Do you like how it’s now all about “we”?)

  9. I think that, on balance, unit tests are a win for the programmer too, but not a magical cost-free win.

    Only fools believe in magical cost-free wins.

    This code should send a keep-alive message every 5 seconds.

    Exactly. You don’t want to test whether it ran 12 times, you want to test whether it sends a keep-alive at a rate of every 5 seconds, within a reasonable ε. (That’s assuming you can’t factor out the non-deterministic part and mock it, as Sunny said. In this case you should be able to.)
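Checking the rate rather than the count might look like the sketch below. It assumes the test can record the time of each keep-alive (for instance from a mocked server); `intervals_within` and the sample timestamps are invented for illustration.

```python
def intervals_within(timestamps, period, epsilon):
    """Return True if every gap between consecutive timestamps is period ± epsilon."""
    gaps = [later - earlier
            for earlier, later in zip(timestamps, timestamps[1:])]
    # An empty list of gaps proves nothing, so treat it as a failure.
    return bool(gaps) and all(abs(gap - period) <= epsilon for gap in gaps)


# Send times in seconds, as a test might record them.
steady = [0.0, 5.02, 9.97, 15.01, 20.0]
starved = [0.0, 5.0, 12.5, 15.0]

print(intervals_within(steady, period=5.0, epsilon=0.25))   # True
print(intervals_within(starved, period=5.0, epsilon=0.25))  # False
```

The result no longer depends on exactly when the test thread stops the run, which removes the fence-post cases from the count-based version.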

    The code merely returns a Boolean, but logs its reasons for doing so.

    Sounds like you should factor out and mock, once again. Instead of returning a boolean, return a more structured value, and have a wrapper turn that into a boolean for use by regular code, which is bypassed by the tests. Or factor out the individual branches, test them individually, and mock them out when testing the method that combines them. (It’s hard to say what the best course would be, given your problem statement.)

    Code has to be explicitly written to be testable. (However, testing evangelists point out, by and large, easily testable code is also easily usable and adaptable code.)
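The "structured value plus wrapper" suggestion above could be sketched as follows. The actual GO/NO GO conditions are not given in the thread, so the odds-range rule, the `Reason` names, and the functions `decide` and `should_go` are all assumptions made for illustration.

```python
import enum


class Reason(enum.Enum):
    """Hypothetical reasons behind a GO/NO GO decision."""
    ODDS_IN_RANGE = "odds within tolerance"
    ODDS_TOO_LOW = "odds below range"
    ODDS_TOO_HIGH = "odds above range"


def decide(offered, predicted, tolerance=0.10):
    """Return the Reason for the decision rather than a bare boolean."""
    if offered < predicted * (1 - tolerance):
        return Reason.ODDS_TOO_LOW
    if offered > predicted * (1 + tolerance):
        return Reason.ODDS_TOO_HIGH
    return Reason.ODDS_IN_RANGE


def should_go(offered, predicted, tolerance=0.10):
    """Thin wrapper for regular callers, which only need GO/NO GO."""
    return decide(offered, predicted, tolerance) is Reason.ODDS_IN_RANGE


# Tests target decide(), so two NO GO cases cannot pass for the same reason.
assert decide(1.5, 2.0) is Reason.ODDS_TOO_LOW
assert decide(2.5, 2.0) is Reason.ODDS_TOO_HIGH
assert should_go(2.1, 2.0)
```

Because each branch returns a distinct value, a test for condition B fails if the code actually took branch A, without the test having to parse log text.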


Web Mentions

  1. OddThinking » Is the Compiler a Distraction?