
I am an experienced software development manager, project manager and CTO focused on hard problems in software development and maintenance, software quality and security. For the last 15 years I have been managing teams building electronic trading platforms for stock exchanges and investment banks around the world. My special interest is how small teams can be most effective at building real software: high-quality, secure systems at the extreme limits of reliability, performance, and adaptability.

Lessons in Software Reliability

08.15.2011
What does it take to build reliable and stable enterprise software?

First, stop writing lousy code

It’s unfortunate that few developers are familiar with The MITRE Corporation’s Common Weakness Enumeration list of common software problems. The CWE is a fascinating and valuable resource, not just to the software security community, but to the broader development community. Reading through the CWE, it is disappointing to see how many common problems in software, problems that lead to serious security vulnerabilities and other serious failures, are caused by sloppy coding: not missing the requirements, not getting the design wrong or messing up an interface, but simple, fundamental, stupid construction errors. The CWE is full of mistakes like null pointers, missing initialization, resource leaks, string handling mistakes, arithmetic errors, bounds violations, bad error handling, leaving debugging code enabled, and language-specific and framework-specific errors and bad practices – not understanding, or improperly using, the frameworks and APIs. OK, there are some more subtle problems too, especially concurrency problems, although we should reasonably expect developers by now to understand and follow the rules of multi-threading to avoid race conditions and deadlocks.
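To make this concrete, here is a small, hypothetical Java sketch of the kind of construction error the CWE catalogues – a resource leak combined with swallowed exceptions – next to a safer version of the same routine:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

class FirstLineReader {

    // Sloppy version: if readLine() throws, the reader is never closed (a resource leak),
    // and the empty-handed catch hides the failure from the caller.
    static String readFirstLine(File file) {
        try {
            BufferedReader reader = new BufferedReader(new FileReader(file));
            return reader.readLine();
        } catch (IOException e) {
            return null;   // callers now have to guess why they got nothing back
        }
    }

    // Safer version: try-with-resources guarantees the reader is closed,
    // and the failure is reported instead of silently discarded.
    static String readFirstLineSafely(File file) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line = reader.readLine();
            return (line != null) ? line : "";   // never hand a surprise null back to callers
        }
    }
}
```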

The solutions to this class of problems are simple, although they require discipline:

- Hire good developers and give them enough time to do a good job, including time to review and refactor.

- Make sure the development team has training on the basics, that they understand the language and frameworks.

- Hold regular code reviews (or pair program, if you’re into it) for correctness and safety.

- Use static analysis tools to find common coding mistakes and bug patterns.

Design for failure

Failures will happen: make sure that your design anticipates and handles failures. Identify failures; contain, retry, recover, restart. Ensure that failures don’t cascade. Fail safe. Look for the simplest HA design alternative: do you need enterprise-wide clustering or virtual synchrony-based messaging, or can you rely on simpler active/standby shadowing with fast failover?
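As a minimal sketch of “identify, contain, retry, recover”, the helper below retries a failing call a bounded number of times and then falls back to a safe default instead of letting the failure cascade to callers. The names, limits and backoff strategy are illustrative only:

```java
import java.util.concurrent.Callable;

final class Retry {

    // Retry a call up to maxAttempts times with a simple linear backoff;
    // if it still fails, return a known-good default rather than propagating the failure.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long backoffMillis, T safeDefault) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                // Contain the failure: record it, wait briefly, try again.
                System.err.println("Attempt " + attempt + " failed: " + e);
                try {
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;   // stop retrying if we have been asked to shut down
                }
            }
        }
        return safeDefault;   // fail safe: degrade gracefully instead of cascading the failure
    }
}
```

A call site might look like `Retry.callWithRetry(() -> priceFeed.lastQuote("ACME"), 3, 100, Quote.STALE)` – the `priceFeed` and `Quote` names are invented for the example.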

Use design reviews to hunt down potential failures and look for ways to reduce the risk, prevent failure, or recover. Microsoft’s The Practical Guide to Defect Prevention, while academic at times, includes a good overview of Failure Modes and Effects Analysis (FMEA), a structured design review and risk discovery method similar to security threat modeling, focused on identifying potential causes of failures, then designing them out of the solution, or reducing their risk (impact or probability).

Cornell University’s College of Engineering also includes a course on risk management and failure modes analysis in its new online education program on Systems Approach to Product and Service Design.

Keep it Simple

Attack complexity: where possible, apply Occam’s Razor and choose the simplest path in design, construction, and implementation. Simplify your technology stack, collapse the stack, minimize the number of layers and servers.

Use static analysis to measure code complexity (cyclomatic complexity or other metrics) and to track the trend: is the code getting more or less complex over time? There is a correlation between complexity and quality (and security) problems. Identify code that is over-complex, look for ways to simplify it, and in the short term increase test coverage.
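As a small illustration of what “over-complex” looks like at the method level, here are two versions of the same made-up validation routine (the `Account` and `Order` types are invented for the example). The second flattens the nesting with guard clauses: the raw branch count is similar, but the paths are far easier to see, test and review, and nesting-sensitive metrics score it better:

```java
interface Account { boolean isActive(); double buyingPower(); }
interface Order   { int quantity(); double cost(); }

class TradeChecks {

    // Deeply nested conditionals: every extra level makes the method harder to follow and test.
    static boolean canTrade(Account account, Order order) {
        if (account != null) {
            if (account.isActive()) {
                if (order != null) {
                    if (order.quantity() > 0) {
                        return account.buyingPower() >= order.cost();
                    }
                }
            }
        }
        return false;
    }

    // Same behaviour, flattened with guard clauses.
    static boolean canTradeSimplified(Account account, Order order) {
        if (account == null || !account.isActive()) return false;
        if (order == null || order.quantity() <= 0) return false;
        return account.buyingPower() >= order.cost();
    }
}
```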

Test… test… test….

Testing for reliability goes beyond unit testing, functional and regression testing, integration, usability and UAT. You need to test everything you can, in every way you can think of or can afford.

A key idea behind Software Reliability Engineering (SRE) is to identify the most important and most used scenarios for a product, and to test the system the way it is going to be used, as close as possible to real-life conditions: scale, configuration, data, workload and use patterns. This gives you a better chance of finding and fixing real problems.
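One concrete way to “test the system the way it is going to be used” is to drive the test workload from an operational profile: a weighting of scenarios by how often they occur in production. A minimal sketch, with scenario names and weights invented for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Pick test scenarios in proportion to real-world usage, so test effort concentrates
// on the paths customers actually exercise most.
final class OperationalProfile {

    private final Map<String, Double> weights = new LinkedHashMap<String, Double>();
    private final Random random = new Random();

    void add(String scenario, double weight) {
        weights.put(scenario, weight);
    }

    String nextScenario() {
        double total = 0;
        for (double w : weights.values()) total += w;
        double pick = random.nextDouble() * total;
        for (Map.Entry<String, Double> entry : weights.entrySet()) {
            pick -= entry.getValue();
            if (pick <= 0) return entry.getKey();
        }
        return weights.keySet().iterator().next();   // guard against rounding at the boundary
    }

    public static void main(String[] args) {
        OperationalProfile profile = new OperationalProfile();
        profile.add("submit order", 0.60);        // weights are illustrative only
        profile.add("amend order", 0.25);
        profile.add("cancel order", 0.10);
        profile.add("end-of-day report", 0.05);
        for (int i = 0; i < 10; i++) {
            System.out.println(profile.nextScenario());   // feed each pick into your test driver
        }
    }
}
```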

One of the best investments that we made was building a reference test environment, as big as, and as close to, the production deployment configuration as we could afford. This allowed us to do representative system testing with production or production-like workloads, as well as variable load and stress testing, operations simulations and trials.

Stress testing is especially important: identifying the real performance limits of the system, pushing the system to, and beyond, design limits, looking for bottlenecks and saturation points, concurrency problems – race conditions and deadlocks – and observing failure of the system under load. Watching the system melt down under extreme load can give you insight into architecture, design and implementation weaknesses.
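A bare-bones sketch of a stress test driver, assuming a hypothetical `placeOrder()` call into the system under test: ramp the number of concurrent workers up in steps and watch where throughput flattens out or errors start to climb – that is your saturation point.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class StressDriver {

    // Hypothetical stand-in for a request into the system under test.
    static void placeOrder() throws Exception {
        Thread.sleep(5);   // simulate a call that takes a few milliseconds
    }

    public static void main(String[] args) throws Exception {
        for (int workers = 8; workers <= 256; workers *= 2) {
            final AtomicLong completed = new AtomicLong();
            final AtomicLong failed = new AtomicLong();
            final long end = System.currentTimeMillis() + 10000;   // run each step for 10 seconds
            ExecutorService pool = Executors.newFixedThreadPool(workers);

            for (int i = 0; i < workers; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        while (System.currentTimeMillis() < end) {
                            try {
                                placeOrder();
                                completed.incrementAndGet();
                            } catch (Exception e) {
                                failed.incrementAndGet();
                            }
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            System.out.printf("%d workers: %d ok, %d failed, %.0f requests/sec%n",
                    workers, completed.get(), failed.get(), completed.get() / 10.0);
        }
    }
}
```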

Other types of testing that are critical in building reliable software:

- Regression testing – relying especially on strong automated testing safety nets to ensure that changes can be made safely.

- Multi-user simulations – unstructured or loosely structured group exploratory testing sessions.

- Failure handling and failover testing – creating controlled failure conditions and checking that failure detection and failure handling mechanisms work correctly.

- Soak testing (testing standard workloads for extended periods of time) and accelerated testing (playing at x times real-life load conditions) to see what breaks, what changes, and what leaks.

- Destructive testing – take the attacker’s perspective, purposefully set out to attack the system and cause exceptions and failures. Learn How to Break Software.

- Fuzz testing – simple, brute-force automated attacks on interfaces, a testing technique that is valuable for both reliability and security. Read Jonathan Kohl’s recent post on fuzz testing.
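To give a flavour of how simple fuzzing can be, here is a minimal sketch that brute-forces one parsing routine with random bytes and records anything that makes it blow up. The `parseOrderMessage` method is a hypothetical stand-in for whatever input-handling interface you want to harden:

```java
import java.util.Random;

final class SimpleFuzzer {

    // Hypothetical stand-in for the interface under test.
    static void parseOrderMessage(String message) {
        // e.g. orderParser.parse(message);
        if (message.isEmpty()) {
            throw new IllegalStateException("empty message");   // planted bug for demonstration
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);   // fixed seed so any failure can be reproduced
        for (int i = 0; i < 100000; i++) {
            byte[] junk = new byte[random.nextInt(64)];
            random.nextBytes(junk);
            String input = new String(junk);   // deliberately malformed input
            try {
                parseOrderMessage(input);
            } catch (Exception e) {
                System.err.println("Input #" + i + " (" + junk.length + " bytes) caused: " + e);
            }
        }
    }
}
```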

Get in the trenches with ops

Get the development team, especially your senior technical leaders, working closely with operations staff: understanding operations' challenges, the risks that they face, the steps that they have to go through to get their jobs done. What information do they need to troubleshoot and investigate problems? Are the error messages clear? Are you logging enough useful information? How easy is it to start up, shut down, recover and restart? The more steps, the more problems. Make it hard for operations to make mistakes: add checks and balances. Run through deployment, configuration and upgrades together: what seems straightforward in development may have problems in the real world.

Build in health checks – simple ways to determine that the system is in a healthy, consistent state, to be used before startup, after recovery / restart, after an upgrade. Make sure operations has visibility into system state, instrumentation, logs, alerts – make sure ops know what is going on and why.
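A health check does not have to be elaborate. Here is a minimal sketch; the individual checks (disk space, database connectivity, reference data) are placeholders to be wired up to your own dependencies, and the non-zero exit code lets startup scripts refuse to proceed when something is wrong:

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

final class HealthCheck {

    static Map<String, Boolean> runChecks() {
        Map<String, Boolean> results = new LinkedHashMap<String, Boolean>();
        results.put("disk space available", new File("/").getUsableSpace() > 1000000000L);
        results.put("database reachable", pingDatabase());            // placeholder check
        results.put("reference data loaded", referenceDataLoaded());  // placeholder check
        return results;
    }

    // Placeholders: replace with real checks against your own infrastructure.
    static boolean pingDatabase() { return true; }
    static boolean referenceDataLoaded() { return true; }

    public static void main(String[] args) {
        boolean healthy = true;
        for (Map.Entry<String, Boolean> entry : runChecks().entrySet()) {
            System.out.println((entry.getValue() ? "OK   " : "FAIL ") + entry.getKey());
            healthy = healthy && entry.getValue();
        }
        System.exit(healthy ? 0 : 1);   // non-zero exit tells the startup script not to proceed
    }
}
```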

When you encounter a failure in production, work together with the operations team to complete a Root Cause Analysis: a structured investigation in which the team searches for direct and contributing factors to the failure and defines corrective and preventive actions. Dig deep, look past immediate causes, keep asking why. Ask: how did this get past your checks and reviews and testing? What needs to be changed in the product? In the way that it is developed? In the way that it is implemented? Operated?

And ensure that you follow up on your corrective action plan. A properly managed RCA is a powerful tool for organizational learning and improvement: it forces you to think and to work together, and it creates a sense of accountability and transparency.

Change is bad…. but change is good

You don’t need to become an expert in ITIL, but if you have anything to do with developing or supporting enterprise software, at least spend a day reading Visible Ops. This brief overview of IT operations management explains how to get control over your operations environment. The key messages are:

Poor change management is the single leading cause of failures: 80% of IT system outages are caused by bad changes by operations staff or developers. 80% of recovery time (MTTR) is spent determining what changed.

The corollary: control over change not only improves reliability, it makes the system cheaper to operate, and more secure.

Change can be good: as long as changes are incremental, controlled, carefully managed and supported by good tools and practices. When the scope of change is contained, it is easier to get your head around, review and test. And with frequent change, everyone knows the drill – the team understands the problems and is better prepared if any problems come up.

Implement change control and release management practices. Include backout and rollback planning and testing. Take compatibility into account in your design and implementation. Create checklists and reviews.

Safety First

Reliable software, like secure software, doesn’t come for free, especially up front, when you need to make changes and put in more controls. You must have support from management and from customers. You need to change the team’s way of thinking: use risk management to drive priorities and to shape design, implementation and planning. Get your best people to understand and commit: the rest will follow.

Keep in mind of course that there are limits, and that tradeoffs need to be made: most of us are not building software for the space shuttle. In Software Quality at Top Speed, Steve McConnell shows that development teams that build better quality, more reliable software actually deliver faster, with delivery speed peaking at roughly 95% of defects removed before production release. Beyond that point you hit rapidly diminishing returns: pushing towards 100% defect-free software drives costs and schedule up significantly.

Timeboxing is an effective technique to contain scope and cost: do as much as you can, as well as you can, within a hard time limit. Following Japanese manufacturing principles, make sure that anyone on the team can pull the cord and postpone a release or cancel a feature because it is unstable.

It is sobering, almost frightening, how easy and how natural it is for developers and managers to short-change quality practices and to place feature delivery ahead of reliability, especially under pressure. Ensure that you build support across the organization and build a culture that puts reliability first. Like any change, it will require patience, commitment, and unrelenting follow-up.
Published at DZone with permission of Jim Bird, author and DZone MVB.


Comments

Mike Dzone replied on Mon, 2011/08/15 - 8:10am

This article really points out the fragile state Java enterprise software development is in these days. Your comments about "First, stop writing lousy code", "Keep it Simple", and "Test… test… test…" are the complete opposite of what the standard is for enterprise software now. The standard now is to write lousy code, make it as complex as possible, and completely ignore the real testing which is needed. Let's take a look at the first two, "First, stop writing lousy code" and "Keep it Simple". I blame these two on the ever increasing number and complexity of frameworks – Spring and Hibernate taking most of the blame. I've never written more lousy or complex code than when I have to deal with these frameworks. Although they are supposed to help, what they really end up doing is causing developers to desperately Google for examples of how to use them, blindly cut & paste the examples, and hack them up to get them to do what's needed. This makes for some really lousy code. It also makes for very overly complex code, since frameworks are really the golden hammer anti-pattern gone awry. Then there is "Test...test...test...". The test-driven and unit-testing craze has had an extremely detrimental effect on software testing. Why? Because it has created a culture where testing is pushed onto developers and organizations no longer believe staffing a good testing team/environment is needed. After all, if all the unit tests pass then the software must work, right?

Jonathan Fisher replied on Mon, 2011/08/15 - 10:08am

I think Java has little to do with it. If you look around the internet, you'll find articles showing "How fast it is to develop with X language and NoSQL", where X is one of Ruby, Python, Scala, Haskell, Erlang, Lua, Clojure, etc.

A passion to always use the bleeding edge contributes a large share of software reliability problems. Even if Scala is "mature", it takes a few years of active use before the community and knowledge base are vetted out. Java, being a generation behind, has all of these things conquered.

NoSQL contributes another large problem. Relational will "work" for 85% of projects, but a large majority of projects these days are throwing the baby out with the bathwater.

But all things in context: if you're trying to be first to market and don't care about the longevity of your software, write it in Scala and use MongoDB. Chances are your business model will change before you need to do maintenance on your software, and your code will become obsolete within a few months.

Mark Unknown replied on Mon, 2011/08/15 - 8:01pm in response to: Mike Dzone

Quite the opposite of my experience. Hibernate and Spring (and the like) have helped me write better and less code. They have allowed me to do more complex things and keep MY code simpler. The problem is that we don't have good mentoring and training. Everyone thinks that because they can do Access or Excel they can program. If they are not googling how to use Hibernate/Spring, they are hacking out tons of code or reinventing the wheel. Of course you are right in that the frameworks are seen as golden hammers. Nothing can substitute for having talent and knowing what you are doing. I have seen the same thing happen with patterns. As for a testing team... they really have to be good and know what they are doing. If they are just following a script and not thinking, it just adds to your workload. So, if they know what they are doing, why are they not developers? And honestly, if testing can be automated, why not? Human testing should be out-of-the-box testing. The reality is that most companies don't have the budget or aren't big enough to support all that is really needed in software development. If I have to choose, I'll take TDD, BDD, Hibernate and Spring and other tooling, and developers who know how to use them, over a BA, DBA and tester (etc).
