

Goldilocks and the Three Test Sizes

So, she walked into the living room where she saw three chairs. Goldilocks sat in the first chair to rest her feet.

“This chair is too big!” she exclaimed.

So she sat in the second chair.

“This chair is too big, too!” she whined.

So she tried the last and smallest chair.

“Ahhh, this chair is just right,” she sighed.

If we assume that everything belonging to Papa Bear was a functional test, everything belonging to Mama Bear was an integration test, and everything belonging to Baby Bear was a unit test, then Goldilocks was a big fan of the unit test.

If you have read much of the literature on continuous deployment, you might have noticed that many authors believe that favoring automated functional tests over unit tests is what enables them to deploy quickly and with confidence.

I have taught many about the Testing Pyramid in new hire orientations. The Testing Pyramid both supports and rejects this assumption.

The Testing Pyramid is to be read like the USDA Food Pyramid circa 1990. The larger the area on the pyramid, the more servings you should have. The smaller the area on the pyramid, the more you should restrict consumption. This means that if we interpreted the Testing Pyramid the same way as the USDA Food Pyramid, we should use automated functional tests almost never, but that is not how the Testing Pyramid is taught. Functional tests are not treated like sweets and fats; they are treated more like meat or dairy.

When you look at a more annotated version of the Testing Pyramid, this is where the assumption is supported. As you make your test sizes smaller and smaller you support debugging, and as you make your tests bigger and bigger you gain more confidence in the entire system. Scratch that, not the entire system, more like the particular use case that the test covers.

One notion that tends to go hand-in-hand with continuous deployment is pushing small changes frequently. If you are pushing small changes, there is very little code to inspect in the event that the change breaks the site. So why would you need unit tests? The whole point of unit tests was to make it easier to debug my code, wasn’t it?

Let’s step back, in time.

In 2006, I was in my first professional position as a Software Engineer in Test. I was assigned to a project that was in its last few months before being launched to the world. The same day I started, there was a change in leadership of the testing organization, and there was a decree:

Thou shalt do automated testing from the browser-level down.

…or in other words, functional testing from the browser.

If you remember what tools existed to test from the browser-level down back in 2006, then maybe you already agree with me, or you can continue reading and begin to agree with me. If you cannot think of any tools that existed in 2006, then you can just go with that.

The tool that was immediately available to me was an internal tool that did not feel internal. It was developed by a team in an office that people rarely visited and that had high employee turnover. Anyway, this tool was written in C++ and it interfaced with COM to drive Windows and IE.

Forget your Gherkin feature stories that even non-programmers can read. Forget your pretty dynamic languages. You had to write your functional tests in C++ and compile them, or maybe you could have worked on making some clean intermediate layer, but that would be too much development and not enough testing. Sure, Selenium was developed in 2004, but it wasn’t until later in 2006 that the Selenium IDE became a Firefox extension, whose existence helped Selenium truly emerge as an option, even though it could be considered undesirable today.

The tools available were not ideal, but there was another knock against these tools. At the time you really needed to use element ids or xpath; you did not really have the option that is popular today, CSS selectors.

ids seemed to be a pretty good option because you were simply testing whether an element was there or not, and it did not matter if its path in the DOM changed. Unfortunately, ids were not an option for the project I was working on, and they were unlikely to be an option for anyone. The incarnation of IE in 2006 would render pages significantly slower if even a remotely reasonable number of ids were present, or so I was told. Whatever the slowdown factor was, it outweighed the foreseen benefit of having browser-level functional tests.

The other option was xpath. xpath was not ideal because UX changes would usually break the xpath, and you would likely spend more time updating your automated tests than you would have spent manually testing the scenario.
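To make the trade-off concrete, here is a minimal sketch using Python's standard library rather than any 2006-era tool. The markup, the id "submit", and the helper name found are all hypothetical, made up for illustration. It shows a positional path breaking after a small layout change while an id lookup keeps working.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after a UX change that wraps the form
# in an extra <section> element.
BEFORE = "<html><body><div><form><input id='submit'/></form></div></body></html>"
AFTER = ("<html><body><div><section><form><input id='submit'/></form>"
         "</section></div></body></html>")

POSITIONAL = "./body/div/form/input"  # depends on the exact DOM structure
BY_ID = ".//*[@id='submit']"          # depends only on the element's id


def found(markup: str, path: str) -> bool:
    """Return True if the path locates an element in the markup."""
    return ET.fromstring(markup).find(path) is not None


for label, markup in (("before", BEFORE), ("after", AFTER)):
    print(label, "positional:", found(markup, POSITIONAL),
          "by id:", found(markup, BY_ID))
```

The positional lookup stops finding the element once the wrapper is added, which is the kind of test maintenance described above.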

As it became obvious that the time to write tests would be too long to satisfy the immediate deadline, and the maintenance costs would be too high to even provide proper regression, I sought solutions involving simple cURL and regular expressions to provide a reasonable approximation to browser-level functional testing.
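For a rough idea of what that approach looks like, here is a sketch in Python instead of cURL. It is not the code from that project. The function names fetch and page_matches, the sample markup, and the pattern are assumptions made for illustration.

```python
import re
import urllib.request


def fetch(url: str, timeout: float = 10.0) -> str:
    """Fetch a page over HTTP and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_matches(html: str, pattern: str) -> bool:
    """Return True if the regular expression matches anywhere in the page."""
    return re.search(pattern, html, flags=re.IGNORECASE | re.DOTALL) is not None


# Hypothetical response body, used here so the sketch runs without a network.
sample = '<form action="/login"><input name="user"></form>'
print(page_matches(sample, r'<form[^>]+action="/login"'))
```

Against a real server you would pass the result of fetch(...) to page_matches(...). The check never runs JavaScript, so it only approximates what a browser would see.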

In case you are curious, I taught our manual testers how to work with JMeter (to make HTTP requests in a less scary fashion) and Paros (for investigating the protocol) so that they could automate some tests, and we found several bugs where we were checking things in the JavaScript but not on the server. Over my tenure at that job, they continued to find more bugs of that ilk.

You could be saying to yourself right now, “but the tools are so much better now,” and you would be right. The tools are good enough to actually allow product to be more involved in developer testing practices, which is a topic to be discussed in depth itself.

Even though the cost of creating and maintaining browser-level functional test suites has decreased with newer versions of Selenium and the advent of Gherkin, there is still an issue that these runners will probably never be able to overcome.

Functional tests have too many points of failure that are unrelated to the correctness of your system under test.

Functional tests test use cases, which involve a larger execution vertical than an integration test. Integration tests, as I have discussed, typically test your code against external dependencies. Many times it is against external dependencies which you cannot control. Yes, you may have installed the database, placed that file in the correct directory, wired your own network, written the other service this code is talking to, etc. But do you really have the power to make sure that the database responds correctly 100% of the time? Do you really have the power to make sure the operating system works with the filesystem correctly all the time? Can you be sure your network never drops a packet? Can you really write a service that is up 100% of the time and responds correctly 100% of the time?

A system can perform no better than the least performant piece.

What I am getting at is that your functional tests will be about as deterministic as the least deterministic, or most flaky, piece of the system under test and the test environment. After all, a functional test involves not only the service that you are testing, but also the test script, the test runner, the browser, the network, etc.

As an example, think of a system that has a piece that flakes out at a rate of 5%, picking a whole number percent for easy math. If I have one functional test that uses that piece, then approximately 1 out of every 20 executions will fail for reasons unrelated to the code push. This is of course assuming everything else in the system and test environment is completely deterministic. If you execute the test 20 times a day as a beacon to indicate when to deploy and when not to, then you will likely stop progress once a day. How much will once a day cost you?
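The arithmetic behind "once a day" can be sketched in a few lines. The variable names are made up for illustration, and the calculation assumes each run flakes independently of the others.

```python
runs_per_day = 20
flake_rate = 0.05  # 5% chance that a single run fails for unrelated reasons

# Average number of false failures per day.
expected_false_failures = runs_per_day * flake_rate

# Chance that at least one of the day's runs fails for unrelated reasons.
p_at_least_one = 1 - (1 - flake_rate) ** runs_per_day

print(f"expected false failures per day: {expected_false_failures:.1f}")
print(f"chance of at least one per day: {p_at_least_one:.1%}")
```

On average that is one false stop per day, and on roughly two days out of three there is at least one.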

If the time it takes to do a test run is very short, you will likely just re-run the test, and hope that you are hitting the roughly 95% chance it will pass, rather than the 5% chance it will not. If the time it takes to run the test is very long, you will likely do the same, but start investigating the test failure while you wait for the test to run again. Either way, if you hit the point of investigation, now you not only have to think about the time to re-run the test, but the time it will take to investigate.

If you wrote the test and know whether or not you changed code that the test touches, you will likely have an extremely short investigation. You could re-build, or just roll back while you determine whether or not this was a legitimate failure. If the test execution included changes from others and you have not determined the source of the failure yet, you will investigate the change logs or maybe even have to work with everyone who committed changes to determine if any code changes affected the outcome of the test. All of this isn’t too bad, unless your test suite takes forever and/or flakes a lot.

Investigation can be very short when you have knowledge of the test and when the number of people involved is small.

If functional testing is the way to go, every person who contributes will have at least one test of their own. If you have 5 contributors and you all write the same number of tests, there is likely a 1 in 5 chance you will know what a given test actually does. But since everyone is your friend and it’s a small shop, your investigations shouldn’t take too long because you can simply turn around and get the answer.

However, if you have a shop with a hundred contributors, you will likely only have a 1% chance of knowing what a given functional test does. You also are likely to not be in close proximity to more than 5 of the other 99 contributors. You are less likely to know them well enough to know when they are around, so it is likely going to take longer to track them down. And because you rarely work with that person you might have the issue of whether or not you can communicate efficiently with that person.

You could see this as an argument for small team sizes, but would you really constrain the size of your team because you couldn’t debug non-deterministic tests efficiently?

You may also be thinking that the worst piece in your system and test environment flakes more like 0.1% of the time or less, whereas this example focused on a single test with a 5% chance of flaking. So assume you have 100 tests because you have 100 contributors and each of them wrote only a single test. Then the chance the entire test suite would flake is 9.52%, assuming the 0.1% flakiness occurs independently per test. Contributors are not likely to write just a single test, so let’s assume an average of 5 tests each, which would probably still be too low if functional was the way to go. If you had 500 tests that could flake independently at a rate of 0.1%, the chance your whole test run would flake is more like 39.36%.
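Those percentages come from the chance that at least one of n independent tests flakes, 1 - (1 - p)^n. Here is a short sketch to reproduce them; the function name suite_flake_probability is made up for illustration.

```python
def suite_flake_probability(tests: int, per_test_flake_rate: float) -> float:
    """Chance that at least one of `tests` independent tests flakes in a run."""
    return 1.0 - (1.0 - per_test_flake_rate) ** tests


for tests in (100, 500):
    chance = suite_flake_probability(tests, 0.001)
    print(f"{tests} tests at 0.1% each: {chance:.2%}")
```

This prints 9.52% for 100 tests and 39.36% for 500 tests, matching the figures above.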

Unit tests, that are actually unit tests, tend to be far more deterministic because the sources of flake are far fewer and with lower chance of flaking individually, and so it is easy to have a suite of thousands of tests and use the suite as a beacon of yes or no for deployment.

It was mentioned before that if you are in a continuous deployment environment, your code pushes are likely so small that you should be able to isolate the code you debug to something manageable, and so using unit tests for debugging might be redundant. But automated tests that are executed continuously past the initial deployment turn into regression tests. As your shop gets bigger, people will begin to know a smaller and smaller portion of the entire code base. It will become more difficult for anyone to understand all the nuances of particular functions.

What you call cruft was a bug fix.

One of the best times to write a test is when responding to a bug. If you have dealt with a decent-sized bug database, you will know it is not easy to know every bug, so think of having a unit test to help safeguard against regression.

But just as she settled down into the chair to rest, it broke into pieces!

But unit tests still cannot always give you coverage of a whole use case. Unit tests test that your unit meets its expectations assuming that all the units it will collaborate with meet their own expectations. Most of the time your assumptions will be right, but not all the time.
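As a minimal sketch of that assumption, consider a hypothetical unit price_with_tax and a hypothetical collaborator tax_service with a rate_for method. None of these names come from a real project. The unit test stubs the collaborator with the behavior the unit expects.

```python
from unittest.mock import Mock


def price_with_tax(amount: float, tax_service) -> float:
    """Unit under test: assumes rate_for returns a fraction such as 0.10."""
    rate = tax_service.rate_for("US")
    return round(amount * (1 + rate), 2)


def test_price_with_tax_uses_collaborator_rate():
    # The stub encodes our assumption about how the collaborator behaves.
    tax_service = Mock()
    tax_service.rate_for.return_value = 0.10
    assert price_with_tax(100.0, tax_service) == 110.0
    tax_service.rate_for.assert_called_once_with("US")


test_price_with_tax_uses_collaborator_rate()
```

The test passes as long as the stub matches the assumption. If the real collaborator returned 10 instead of 0.10, this unit test would stay green while the use case as a whole produced the wrong price, and only a larger test would notice.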

Functional tests are not bad, but automated functional tests just don’t scale. You cannot just keep adding automated functional tests without wasting a lot of time on dealing with the flakiness. Most techniques for maintaining functional test suites actually involve reducing the number of tests in the suite, but that is a whole other article.

Just remember, when you are on your staging environment, assuming you have one, and you are poking around the site, that is functional testing. Most will do it manually, but there is no reason that it couldn’t be automated. While you were developing you could have written a quick automated functional test to save your fingers, but please avoid subjecting everyone else to being at its mercy, unless they ask for it.

PHP UK Conference 2012

Slides for my talk “Scaling Communication via Continuous Integration” are now available on Slide Share.

As presented at PHPUK2012 in London.

PHP Community Conference 2011

Slides for my talk “Is It Handmade Code If You Use Power Tools?” are now available on Slide Share.

As presented at the inaugural PHPComCon in Nashville, TN.

