I am an experienced software development manager, project manager and CTO focused on hard problems in software development and maintenance, software quality and security. For the last 15 years I have been managing teams building electronic trading platforms for stock exchanges and investment banks around the world. My special interest is how small teams can be most effective at building real software: high-quality, secure systems at the extreme limits of reliability, performance, and adaptability. Jim is a DZone MVB and is not an employee of DZone and has posted 99 posts at DZone.
What does it take to build reliable and stable enterprise software?
First, stop writing lousy code
It’s unfortunate that few developers are familiar with The MITRE Corporation’s Common Weakness Enumeration
list of common software problems. The CWE is a fascinating and valuable
resource, not just to the software security community, but to the
broader development community. Reading through the CWE, it is
disappointing to see how many common problems in software, problems that
lead to serious security vulnerabilities and other serious problems,
are caused by sloppy coding: not missing the requirements, not getting
the design wrong or messing up an interface, but simple, fundamental,
stupid construction errors. The CWE is full of mistakes like: null
pointers, missing initialization, resource leaks, string handling
mistakes, arithmetic errors, bounds violations, bad error handling,
leaving debugging code enabled, and language-specific and
framework-specific errors and bad practices – not understanding, or
improperly using, the frameworks and APIs. OK, there are some more subtle
problems too, especially concurrency problems, although by now we should
reasonably expect developers to understand and follow the rules
of multi-threading to avoid race conditions and deadlocks.
The solutions to this class of problems are simple, although they require discipline:
- Hire good developers and give them enough time to do a good job, including time to review and refactor.
- Make sure the development team has training on the basics, that they understand the language and frameworks.
- Regular code reviews (or pair programming, if you’re into it) for correctness and safety.
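As a small illustration of the kind of construction error a review should catch, here is a sketch of a resource leak and its fix in Python. The function names are hypothetical and chosen only for this example. The first version leaves the file handle open if reading fails; the second closes it on every exit path and makes the "nothing to read" case explicit instead of leaving callers to guess.

```python
from typing import Optional


def read_first_line_leaky(path: str) -> str:
    # Weakness: if readline() raises, close() is never reached and the
    # handle stays open until the garbage collector gets to it.
    f = open(path)
    line = f.readline()
    f.close()
    return line.strip()


def read_first_line(path: str) -> Optional[str]:
    """Return the first line without its newline, or None for an empty file."""
    # The context manager closes the handle on success and on error.
    with open(path, encoding="utf-8") as f:
        line = f.readline()
    # readline() returns "" only at end of file, so "" means "empty file".
    return line.rstrip("\n") if line else None
```

Neither version is clever. The point is that the safe form costs nothing extra to write once the team knows the idiom.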
Failures will happen: make sure that your design anticipates and handles
failures. Identify failures, contain, retry, recover, restart. Contain
failures so that they don’t cascade. Fail safe. Look for the
simplest HA design alternative: do you need enterprise-wide clustering
or virtual synchrony-based messaging, or can you rely on simpler
active/standby shadowing with fast failover?
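To make "contain, retry, recover" and simple active/standby failover concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name `call_with_failover`, the choice of `ConnectionError` as the expected failure, and the exponential backoff values. It retries the active endpoint a bounded number of times, then fails over to the standby, and gives up loudly if every endpoint fails.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    endpoints: Sequence[Callable[[], T]],
    retries_per_endpoint: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each endpoint in order (active first, then standby)."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        for attempt in range(retries_per_endpoint):
            try:
                return endpoint()
            except ConnectionError as exc:
                # Contain only the failure we expect; anything else propagates.
                last_error = exc
                if attempt < retries_per_endpoint - 1:
                    # Back off before retrying the same endpoint.
                    sleep(base_delay * 2 ** attempt)
    # Every endpoint is exhausted: fail explicitly rather than hang or guess.
    raise RuntimeError("all endpoints failed") from last_error
```

Bounding the retries matters as much as the retry itself: unbounded retries against a struggling service are one of the ways a failure cascades.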
Use design reviews
to hunt down potential failures and look for ways to reduce the risk,
prevent failure, or recover. Microsoft’s The Practical Guide to Defect Prevention, while academic at times, includes a good overview of Failure Modes and Effects Analysis
(FMEA), a structured design review and risk discovery method similar to
security threat modeling, focused on identifying potential causes of
failures, then designing them out of the solution, or reducing their
risk (impact or probability).
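One conventional way to rank the output of an FMEA session is the risk priority number (RPN): severity times occurrence times detection, each scored on a small scale. The sketch below assumes 1 to 10 scales and is not taken from the Microsoft book; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 = negligible impact, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = always caught early, 10 = effectively undetectable

    @property
    def rpn(self) -> int:
        """Risk priority number: higher means deal with it sooner."""
        return self.severity * self.occurrence * self.detection


def rank(modes: Iterable[FailureMode]) -> List[FailureMode]:
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)
```

The scores themselves are judgment calls. The value of the exercise is in forcing the team to list the failure modes and argue about them, not in the arithmetic.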
Attack complexity: where possible, apply Occam’s Razor,
and choose the simplest path in design or construction or
implementation. Simplify your technology stack, collapse the stack,
minimize the number of layers and servers.
Use static analysis to
measure code complexity (cyclomatic complexity or other metrics) and its
trend: is the code getting more or less complex over time? There is a
correlation between complexity and quality (and security) problems.
Identify code that is over-complex, look for ways to simplify it, and in
the short term increase test coverage.
Test… test… test….
Testing for reliability goes beyond unit testing, functional and regression
testing, integration, usability and UAT. You need to test everything you
can every way you can think of or can afford to.
A key idea behind Software Reliability Engineering
(SRE) is to identify the most important and most used scenarios for a
product, and to test the system the way it is going to be used, as close
as possible to real-life conditions: scale, configuration, data,
workload and use patterns. This gives you a better chance of finding and
fixing real problems.
One of the best investments that we made
was building a reference test environment, as big as, and as close to
the production deployment configuration, as we could afford. This
allowed us to do representative system testing with production or
production-like workloads, as well as variable load and stress testing,
operations simulations and trials.
Stress testing is especially
important: identifying the real performance limits of the system,
pushing the system to, and beyond, design limits, looking for
bottlenecks and saturation points, concurrency problems – race
conditions and deadlocks – and observing failure of the system under
load. Watching the system melt down under extreme load can give you
insight into architecture, design and implementation weaknesses.
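The mechanics of a stress ramp can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a load testing tool: `error_rate_at` and `ramp_until_failure` are hypothetical names, the operation under test is whatever callable you pass in, and a real stress test would also track latency, queue depths and resource usage rather than error rate alone.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Optional, Tuple


def error_rate_at(operation: Callable[[], object],
                  concurrency: int, requests: int) -> float:
    """Run `requests` calls across `concurrency` threads; return the failure ratio."""
    if requests <= 0:
        raise ValueError("requests must be positive")

    def attempt(_: int) -> bool:
        try:
            operation()
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(attempt, range(requests)))
    return outcomes.count(False) / requests


def ramp_until_failure(measure: Callable[[int], float],
                       levels: Iterable[int],
                       max_error_rate: float = 0.01,
                       ) -> Tuple[Optional[int], Optional[int]]:
    """Step up the load; return (last healthy level, first failing level)."""
    last_healthy: Optional[int] = None
    for level in levels:
        if measure(level) > max_error_rate:
            return last_healthy, level
        last_healthy = level
    return last_healthy, None  # never reached the breaking point
```

A typical call would be `ramp_until_failure(lambda c: error_rate_at(op, c, 1000), [10, 20, 40, 80])`, where `op` stands for a call into the system under test. If the second value comes back as `None`, you have not pushed hard enough yet.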
Other types of testing that are critical in building reliable software:
- Regression testing – relying especially on strong automated testing safety nets to ensure that changes can be made safely.
- Multi-user simulations – unstructured, or loosely structured group exploratory testing sessions.
- Failure handling and failover testing – creating controlled failure
conditions and checking that failure detection and failure handling
mechanisms work correctly.
- Soak testing (testing standard
workloads for extended periods of time) and accelerated testing (playing
at x times real-life load conditions) to see what breaks, what changes,
and what leaks.
- Destructive testing – take the attacker’s
perspective, purposefully set out to attack the system and cause
exceptions and failures. Learn How to Break Software.
- Fuzz testing: simple, brute force automated attacks on interfaces, a
testing technique that is valuable for reliability and security. Read
Jonathan Kohl’s recent post on fuzz testing.
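Fuzz testing in particular needs very little machinery to get started. The sketch below is a deliberately naive random-byte fuzzer in Python, assuming the interface under test accepts `bytes` and signals bad input with `ValueError`. Both `fuzz` and `parse_record` are hypothetical names invented for this example, and `parse_record` contains a planted bug.

```python
import random
from typing import Callable, List, Tuple, Type


def fuzz(target: Callable[[bytes], object],
         runs: int = 1000,
         max_len: int = 64,
         seed: int = 0,
         expected: Tuple[Type[Exception], ...] = (ValueError,),
         ) -> List[Tuple[bytes, Exception]]:
    """Throw random inputs at `target`; return inputs that caused unexpected errors."""
    rng = random.Random(seed)  # seeded so that any finding can be reproduced
    findings: List[Tuple[bytes, Exception]] = []
    for _ in range(runs):
        size = rng.randrange(max_len + 1)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except expected:
            pass  # a clean rejection of bad input is correct behaviour
        except Exception as exc:
            findings.append((data, exc))  # anything else is a bug to investigate
    return findings


def parse_record(data: bytes) -> bytes:
    """Hypothetical parser: the first byte gives the payload length."""
    length = data[0]  # planted bug: IndexError on empty input
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated record")
    return payload
```

Calling `fuzz(parse_record)` should surface the empty-input case as an `IndexError` finding once the generator happens to produce an empty input. Keeping the seed and the failing input is what turns a fuzz finding into a regression test.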
Get in the trenches with ops
Get the development team, especially your senior technical leaders, working
closely with operations staff: understanding operations' challenges,
the risks that they face, the steps that they have to go through to get
their jobs done. What information do they need to troubleshoot, to
investigate problems? Are the error messages clear, are you logging
enough useful information? How easy is it to start up, shut down, recover
and restart – the more steps, the more problems. Make it hard for
operations to make mistakes: add checks and balances. Run through
deployment, configuration and upgrades together: what seems
straightforward in development may have problems in the real world.
Build in health checks – simple ways to determine that the system is in a
healthy, consistent state, to be used before startup, after recovery /
restart, after an upgrade. Make sure operations has visibility into
system state, instrumentation, logs, alerts – make sure ops know what is
going on and why.
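A health check does not need to be elaborate to be useful. Here is a minimal sketch in Python of a check runner that operations could invoke before startup, after a restart or after an upgrade. The names `HealthReport` and `run_health_checks` are hypothetical, and the individual checks (database reachable, queues drained, configuration consistent, and so on) are whatever callables your system supplies.

```python
from typing import Callable, Dict, NamedTuple


class HealthReport(NamedTuple):
    healthy: bool
    details: Dict[str, str]  # check name -> "ok", "failed" or "error: ..."


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> HealthReport:
    """Run every check, even if an earlier one fails, and report on all of them."""
    details: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            details[name] = "ok" if check() else "failed"
        except Exception as exc:
            # A check that blows up is itself a sign of an unhealthy system.
            details[name] = f"error: {exc}"
    healthy = all(status == "ok" for status in details.values())
    return HealthReport(healthy, details)
```

Running all of the checks rather than stopping at the first failure gives ops the full picture in one pass, which is what they need when they are deciding whether to proceed or back out.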
When you encounter a failure in production, work together with the operations team to complete a Root Cause Analysis,
a structured investigation where the team searches for direct and
contributing factors to the failure, defines corrective and preventative
actions. Dig deep, look past immediate causes, keep asking why. Ask:
how did this get past your checks and reviews and testing? What needs to
be changed in the product? In the way that it is developed? In the way
that it is implemented? Operated?
And ensure that you follow up on
your corrective action plan. A properly managed RCA is a powerful tool
for organizational learning and improvement: it forces you to think, to
work together, creates a sense of accountability and transparency.
Change is bad…. but change is good
You don’t need to become an expert in ITIL, but if you have anything to do with developing or supporting enterprise software, at least spend a day reading Visible Ops.
This brief overview of IT operations management explains how to get
control over your operations environment. The key messages are:
- Change management is the single leading cause of failures: 80% of IT
system outages are caused by bad changes by operations staff or
developers.
- 80% of recovery time (MTTR) is spent determining what changed.
The corollary: control over change not only improves reliability, it makes the system cheaper to operate, and more secure.
Change can be good: as long as changes are incremental, controlled, carefully
managed and supported by good tools and practices. When the scope of
change is contained, it is easier to get your head around, review and
test. And with frequent change, everyone knows the drill – the team
understands the problems and is better prepared if any problems come up.
Put in place change control and release management practices. Include backout
planning, rollback planning and testing. Take compatibility into
account in your design and implementation. Create checklists and reviews.
Reliable software, like secure software, doesn’t come for free, especially up
front, when you need to effect changes, put in more controls. You must
have management, and customer, support. You need to change the team’s
way of thinking: to use risk management to drive priorities, shape
design and implementation and planning. Get your best people to
understand and commit: the rest will follow.
Keep in mind of course that there are limits, that tradeoffs need to be made: most of us are not building software for the space shuttle. In Software Quality at Top Speed,
Steve McConnell shows that development teams that build better quality,
more reliable software actually deliver faster, up to a peak efficiency
of 95% defects removed before production release. However, you reach a
point of rapidly diminishing returns as you approach the end of the
curve, attempting to hit 100% defect-free software, where costs and
schedule increase significantly.
Timeboxing is an effective
technique to contain scope and cost: do as much as you can, as well as
you can, within a hard time limit. Following Japanese manufacturing
principles, make sure that anyone on the team can pull the cord and
postpone a release or cancel a feature because it is unstable.
It is sobering, almost frightening, how easy it is, how natural it is, for
developers and managers to short-change quality practices, to place
feature delivery ahead of reliability, especially under pressure. Ensure
that you build support across the organization, build a culture that
puts reliability first. Like any change, it will require patience,
commitment, and unrelenting follow-up.
Published at DZone with permission of Jim Bird, author and DZone MVB.