
Dave Farley has been having fun with computers for nearly 30 years. Over that period he has worked on most types of software, from firmware, through tinkering with operating systems and device drivers, to writing games and commercial applications of all shapes and sizes. Dave was an early adopter of Agile development techniques, employing iterative development, continuous integration and significant levels of automated testing on commercial projects from the early 1990s. Dave is currently Head of Software Development for LMAX Ltd, an organization that is building one of the highest-performance financial exchanges in the world. Follow him on Twitter: @davefarley77. Book: Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Addison-Wesley Signature Series (Fowler)).

Hypothesis-based development

08.31.2011

I think the reason that agile development works is that it is the application of the scientific method to software development.

A fundamental aspect of that is the importance of forming a hypothesis before you start, so that you can understand the results that you observe. By this I don’t mean some grand-unified-theory kind of hypothesis at the beginning of a project, but the small, everyday, fine-grained use of predictions of outcomes before you commence some software development activity.

I have adopted this approach over the years and I think it has probably become a fairly ingrained part of my approach to things. It is probably most obvious when we are debugging. When discussing a bug with a pair, or a small group of people, I find myself regularly pausing and restating what we know to be facts, and discussing what our current theory is. The outcome of this is invariably that we can more clearly identify the next experiment that will move our understanding on a step or two.

But it is more than that. Almost always, before running a failing test, which I always do before writing the code that will make it pass, I state out loud the nature of the failure that I expect to see. That is, I state my hypothesis (the nature of the failure), then carry out the experiment (run the test). I then observe the results (look at the output of the test) and see if they match my hypothesis. Often they don’t, which means I have got the test wrong, so I correct it until I see the failure that I expect. This is the main reason that we run the test before we write the code.
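To make that concrete, here is a minimal sketch of the cycle in Java with JUnit. The Order class, its cancel method and the test itself are hypothetical, invented purely for illustration; the point is the comment stating the predicted failure before the test is ever run.

import org.junit.Test;
import static org.junit.Assert.assertEquals;

// Hypothetical example, used only to illustrate stating a hypothesis
// before running a failing test.
public class OrderCancellationTest {

    // Hypothesis: this test will fail with an assertion error along the
    // lines of "expected:<CANCELLED> but was:<OPEN>", because cancel()
    // is deliberately not implemented yet.
    @Test
    public void cancellingAnOpenOrderMarksItCancelled() {
        Order order = new Order(Order.Status.OPEN);

        order.cancel();

        assertEquals(Order.Status.CANCELLED, order.getStatus());
    }

    // A deliberately incomplete Order: cancel() does nothing yet, so the
    // test should fail first, in exactly the way predicted above.
    static class Order {
        enum Status { OPEN, CANCELLED }

        private Status status;

        Order(Status initialStatus) {
            this.status = initialStatus;
        }

        void cancel() {
            // intentionally empty until the failing test has been observed
        }

        Status getStatus() {
            return status;
        }
    }
}

If the test fails in some other way, say with a NullPointerException rather than the predicted assertion message, then my hypothesis was wrong and it is the test, not the production code, that needs correcting first.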

Ok, ok, I know that both of these examples are a bit simplistic, but this is an enormously powerful approach to problem solving; in fact, it is undisputedly the most proven model of problem solving in human history.

My bet is that, as I describe it in my simple examples here, everyone who reads this is nodding sagely and thinking “of course, that is how we always work”. But is it really? My observation is that this is actually a fairly uncommon approach. Certainly agile development in general, and TDD in particular, applies a gentle pressure in this direction, but it has been my experience that, all too often, we wander around problems in loose, unstructured ways. We often randomly prod at things to see what happens. We frequently jump to conclusions and hare off implementing solutions that we don’t know that we need.

We had a good example of this at work today. We have recently made some improvements to some of our high-performance messaging code. We put this into the system at the start of the iteration to give us time to see if we had introduced any errors. Our continuous delivery system, which includes some sophisticated functional and performance tests as part of our deployment pipeline, found no problems for the whole iteration, until today.

Today is our end of iteration (I’ll explain why we finish on Wednesdays another time), so we had taken our release candidate for the iteration and it was undergoing final checks before release. This morning one of our colleagues, Darren, told us at stand-up that he had seen a weird messaging failure on his development workstation when running our suite of API acceptance tests. He had apparently seen a thread that was blocked in our underlying third-party pub-sub messaging code. He tried to reproduce it, and could, but only on that particular pairing-station. Hmmmm.

Later this afternoon, we had started work on the new iteration. Almost immediately our build grid showed a dramatic failure, with lots of acceptance tests failing. We started exploring what was happening and noticed that one of our services was showing a very high CPU load (unusual; our software is generally pretty efficient). On further investigation we noticed that our new messaging code was apparently stuck. Damn! This must be what Darren saw; clearly we have a problem with our new messaging code!

We reacted immediately. We went and told the business that the release candidate we had taken might not be ready for release. We asked the QA folks, who were doing their final sanity checks before the release, to wait until we had finished our investigation before making any decisions. We started to think that we might have to take a branch, something that we generally try to avoid, and back out our messaging changes.

We did all of this before we stopped and thought about it properly. “Hang on, this doesn’t make any sense, we have been running this code for more than a week and we have now seen this failure three times in a couple of hours.”

So we stopped, talked through what we knew and collected our facts: we had upgraded the messaging at the start of the iteration; we had a thread dump that showed the messaging stalled; Darren had one too, but his looked stalled in a different place; and we had been running all of these tests in our deployment pipeline, repeatedly and successfully, for more than a week with the messaging changes in place. At this point we were stuck. Our hypothesis, failing messaging, didn’t fit the facts. We needed more facts so that we could build a new hypothesis. We started where we usually start, but had omitted to earlier because the conclusion looked so obvious: we looked at the log files. Of course, you have guessed it, we found an exception that clearly pointed the finger at some brand-new code.

To cut a long story short, the messaging problem was a symptom, not a cause. We were actually looking at a thread dump of a thread that was in a waiting state and working as it should. What had really happened was that we had found a threading bug in some new code. It was obvious and simple to fix, and we would have found it in five minutes with no fuss if we hadn’t jumped to the conclusion that it was a messaging problem; in fact we did fix it in five minutes once we stopped to think and built our hypothesis based on the facts that we had. It was then that we realized that the conclusions we had jumped to really didn’t fit the facts. It was this, and this alone, that prompted us to go and gather more facts, enough to solve the problem that we had rather than the problem that we imagined we had.
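As an aside, the thing that misled us, a thread that is merely waiting versus one that is genuinely stuck, is something the JVM can report directly. The sketch below uses the standard ThreadMXBean API; it is a general illustration of checking the facts, not a description of the tooling we actually used.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// A small diagnostic sketch: a WAITING thread is often just an idle
// consumer parked on a queue, whereas findDeadlockedThreads() reports
// only threads that genuinely cannot make progress.
public class ThreadStateReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("No deadlocked threads; WAITING threads here may simply be idle.");
        } else {
            for (ThreadInfo info : threads.getThreadInfo(deadlocked)) {
                System.out.println("Deadlocked: " + info.getThreadName()
                        + " waiting on " + info.getLockName());
            }
        }

        // Summarise every thread's state, so "stuck" is judged on facts
        // rather than on a single alarming-looking stack trace.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            System.out.println(info.getThreadName() + " -> " + info.getThreadState());
        }
    }
}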

We have a sophisticated automated test system, and yet we ignored the obvious. It was obvious that we must have committed something that broke the build. Instead we joined together various facts and jumped to the wrong conclusion, because there was a sequence of events that led us down that path. We built a theory on sand, not validating as we went, but building new guesses on top of old. It created an initially plausible, seemingly “obvious” cause, except that it was completely wrong.

Science works! Make a hypothesis. Figure out how to prove or disprove it. Carry out the experiment. Observe the results and see if they match your hypothesis. Repeat!


Comments

Wojciech Kudla replied on Wed, 2011/08/31 - 10:09am

Dave, thanks for this great read. Another painfully true conclusion about how we cope with things we do not entirely understand.
It's interesting that one very often falls into this trap of false assumptions even with the right tools in hand. Sometimes we spend a lot of time figuring out the cause of a problem in debugging sessions that are supposed to get us closer to the solution, but instead follow the wrong path. In the end it always turns out that it's us and our assumptions that are to blame for spending long hours instead of solving the problem in five minutes.
