DevOps Zone is brought to you in partnership with:

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j building sophisticated solutions to challenging data problems. When he's not with customers Mark is a developer on Neo4j and writes his experiences of being a graphista on a popular blog at He tweets at @markhneedham. Mark is a DZone MVB and is not an employee of DZone and has posted 553 posts at DZone. You can read more from them at their website. View Full User Profile

The 5 whys: Another attempt

  • submit to reddit
Towards the end of the week before last and the beginning of last week we’d been having quite a few problems with our QA environment to the point where we were unable to deploy anything to it for 3 days.

A few weeks ago I wrote about a 5 whys exercise that we did in a retrospective and in our weekly code review we decided to give it a go and see what we could learn.

We started with the question ‘Why was there a mess?‘ and then branched out the first level whys since it was fairly clear that there wasn’t only one thing which had contributed to our problems.

Mess lil

We ended up with 4 answers to the first why:

  • There was a DNS change
  • Volume was deleted from our QA server
  • System tests failing
  • Change in one project hanging QA deployment
  • Main build broken for a while

We then worked across the whiteboard taking each of these in turn.

I think our approach allowed us to avoid part of ‘the cult of the root cause‘ which Don Reinertsen wrote about.

It still wasn’t quite spot on due to some mistakes I made while facilitating but these were my observations:

  • Once we got to answering the whys for the 4th and 5th first level whys the whiteboard was way too cluttered and it had become quite difficult to see exactly where we’d got up to.

    As a result we lost the discipline around answering the question why and drifted off into general discussion around the original question but stopped drilling down further looking for a potential root cause.

    The next time I think it would probably work better to look for the first why and collect any potential other whys on the same level in a ‘parking lot’ type area which we could then go to later on.

  • Having said that, a neat thing about having the whys alongside each other was that we were able to see that the first two whys were linked to each other.

    Both changes had been done by someone in the operations team based on conversations they had with people on our team.

    We realised that our communication with the operations team hadn’t been entirely clear and had left room for doubt which had led to unexpected changes being made to the servers.

    This was an example of us stopping before we’d drilled down to 5 levels having realised that we could influence the situation positively even if we hadn’t found the root cause of the problem.

  • Drilling down into the ‘System tests failing’ led to the most interesting insights:
    • System tests failing
      • Noone cares about them
        • We can push to QA even if they’re broken
        • Used to them failing
          • Perception amongst devs that they’re flaky
            • There had previously been a time when data changed frequently and broke them.
        • Seen as being owned by the QAs
          • The tests were defined by QAs
        • The time from checkin to system tests failing is quite long

    Looking back at this now we probably should have drilled a bit further down on some of the whys.

    We actually ended up discussing the perception amongst the developers that the tests were flaky and it was pointed out that most of the failures were actually real.

    We don’t currently have a ‘stop the line’ mentality if the systems tests fail but have agreed to adopt that approach for the next iteration and check at the end of this week to see if we’ve improved.

  • Even though I didn’t facilitate the exercise perfectly I think there was still a far greater level of analysis done by the team in this exercise than in others that I’ve seen.

    I’ve noticed that a lot of retrospective type exercises tend to only encourage surface level analysis so we never really go deeper into a subject and see if we can actually make some useful changes to the way that we work.

Published at DZone with permission of Mark Needham, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)



Phil Kirkham replied on Mon, 2011/11/14 - 8:29am

Refer back to a blog on this actual site for how Douglas Squirrel does root cause ( I went to this presentation, it was good ), goes along with a lot of what you're saying here “You know you are at the fifth why because it hurts, and there is usually a pause ”

Paul Russel replied on Sun, 2012/06/10 - 8:37am

I'd suggest trying with Post-Its or cards rather than whiteboard.  Or using Flying Logic Pro with a projector

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.