DevOps Zone is brought to you in partnership with:

By day I'm a build and release engineer in London, but by night I'm a normal person! If anyone ever asks me what I do, I usually generarlise and say "I'm in I.T." and then check to see if they've already stopped listening. When I'm not working or blogging I can be found playing rugby or cycling around the countryside on my bike, in an attempt to keep fit and fool myself into thinking I'm still young. James is a DZone MVB and is not an employee of DZone and has posted 50 posts at DZone. You can read more from them at their website. View Full User Profile

The Boy Who Cried Wolf – Why Broken Builds Must Not Be Ignored!

07.24.2011
| 8075 views |
  • submit to reddit

We’ve all heard the fable of the boy who cried wolf, an old tale written by a Greek slave who lived a long time ago and liked telling stories. I must stress thought that this was just a story, but like many good stories (Star Wars, Gremlins 2) it’s based on true events. It’s a little known fact (which I have just made up) that the story of the Boy Who Cried Wolf is actually based on an ancient Continuous Integration system (possibly belonging to Spartatech inc, a leading software development company of from around 600BCE).

The are only very sketchy records of what actually happened during this historical event, but here’s what we know for sure*:

  • The Continuous Integration system compiled code and ran unit tests.
  • The unit tests were followed by acceptance tests, which in turn were followed by integration tests.
  • The integration tests were followed by cross-platform tests.

One day, the C.I. system alerted the developers, and anyone else who would listen, that one of the builds was failing!

THE BUILD IS FAILING!!!!!!

THE BUILD IS FAILING!!!!!!

All the workers gathered around the C.I. system to take a look at the error, only to discover that in actual fact, the problem was that the machine on which the tests were running had restarted, thanks to a windows update! The developers went back to work and thought no more of it.

The very next day, the C.I. system went ballistic again, alerting the developers of yet another failure. Once again the team took notice and looked into the error. This time they discovered that the test machine was running slowly and their unit test had timed out. They restarted the machine and it all passed.

By the following week, the C.I. system had alerted the team to 5 other “failed” builds, which were simply a result of the test servers behaving oddly, and no change to the code was required at all. By the end of that week the dev team had stopped paying attention to these alerts, because they all had proper jobs to do, and a lot of work to get through by the end of the sprint (they were agile, even back then – so if you still work according to “Waterfall” THAT’S how far behind the times you are).

Then, one day, a developer checked in a piece of code which failed a unit test. The C.I. system rightly alerted the development team again, but this time nobody came running, in fact, nobody paid the blindest bit of notice. They were all so used to ignoring the C.I. system that when a real problem finally arose, it wasn’t picked up by anyone. Anyway, cutting an already long story a bit shorter, the bug was a biggie, and the next release of their software was so poor that Spartatech inc, the biggest employer in Sparta, went out of business making everyone redundant. This is what basically led to the collapse of Sparta, not sure if you knew that. And here concludes your exceedingly dodgy history lesson.

Back to the present-time: We all know that false-negatives are the enemy of a good Continuous Integration system, they lead us down a path from which it’s increasingly hard to recover. The problem is that it’s just so easy to get into that situation.  Thankfully, there are a few “Continuous Integration best practices” which we can follow, which can help us keep our system in good nick:

  • Make sure the servers on which you run your builds are cleaned regularly, preferably before every build. I would suggest a clean image be deployed each morning. This obviously means that deploying and configuring your system will need to be automatic.
  • Use tools such as VMWare, VirtualBox etc to manage your Virtual Machines.
  • Use tools like Puppet, Chef and Vagrant to automate the deployment and configuration of these VMs.
  • Add the deployment of the VMs to your C.I system!
  • If there are any randomly failing tests, which also randomly fail on your local machine, remove them or rewrite them to make them more reliable.
  • When doing sprint planning, make sure time is dedicated to investigating and fixing broken or flaky builds. It’s very important to ensure the Product and Project Managers are aware of the importance of the C.I. system, and the risk of not maintaining it properly.
  • Treat the rules of Continuous Integration as sacrosanct – if a build fails, fix it as soon as possible.

In an earlier post I wrote about the difference between having a Continuous Integration server and practicing Continuous Integration. If you start to get into the situation where you’re allowing broken builds to go un-fixed, you’re slowly slipping away from actually practicing continuous integration, and soon you’ll just be a development team who has a continuous integration server which is delivering much less value than it ought to.

*I may have got my chronology a little mixed up

References
Published at DZone with permission of James Betteley, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)