
Save the baby code

05.24.2011

Code reviews are both a standard part of the development process and the biggest wasted resource in software engineering.

Approaches vary from face-to-face discussions to online systems like Review Board.  They share two things: They’re arguably the most effective way to assess code quality, and they’re expensive. 

Yet even as we pay experts to evaluate the actual code, we manage with metrics like code-coverage and defect counts that provide indirect (and possibly delayed) signals about its health. If we could somehow quantify those reviews, the insights could lead to improvement. 

Faced with a similar problem, Virginia Apgar published a paper in 1953 titled "A Proposal for a New Method of Evaluation of the Newborn Infant" that changed obstetrics and neonatal practice around the world. She did it by devising a simple 10-point scale that rated newborns on five categories (muscle tone, color, and so on), awarding 0 to 2 points for each.

In the words of Atul Gawande, Apgar's score "turned an intangible and impressionistic clinical concept, the health of newborn babies, into numbers that people could collect and compare." This led to two kinds of innovations: one produced new techniques to save babies with low scores; the second brought advances that raised average scores. The result was a 16x improvement in infant mortality and 140,000 lives saved each year in the US alone.

To do this, Apgar first demonstrated that her score was a true measure of newborn health. She divided 2,096 newborns into three groups according to their scores. Mortality for the middle group was an order of magnitude worse than for the best group, while the lowest group's mortality was an order of magnitude worse still:

  • Infants scoring 0, 1, or 2: 14%
  • Infants scoring 3 through 7: 1.1%
  • Infants scoring 8, 9, or 10: 0.13%

Having established the score’s effectiveness, she went on to demonstrate the advantages of one technique over another by comparing the scores they produced. The results for ways to deliver anesthesia, for example

  • Spinal anesthesia: 8.0 
  • General anesthesia: 5.0  
  • Epidural or caudal: 6.3

showed clear differences between the techniques. The result was the widespread adoption of the rating system and ongoing competition among doctors, hospitals, and researchers for improved scores.

What does this have to do with code reviews?  The health of newborn code is also an “intangible and impressionistic” concept. It needs an Apgar score so that teams can learn and improve.  

[Image: Virginia Apgar and a newborn]

There are complications: First, a baby is a baby, but checkins vary from a one-line bug fix to a huge body of code. This can potentially be addressed by normalizing scores with respect to the amount of code reviewed.  Second, no single attribute of code health is as unambiguous as death. This is more troubling, but it can be approached the way Apgar approached infant health: devising a score and comparing it to actual results. In this case, the results might be defect counts and other measures of quality. 
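
One way to picture that normalization (a sketch only; treating defects per 100 changed lines as the unit is my own assumption, not part of any established scheme):

```python
def normalized_defect_rate(defect_count: int, lines_changed: int) -> float:
    """Defects per 100 changed lines, so that a one-line bug fix and a
    large body of new code can be compared on roughly equal footing."""
    if lines_changed == 0:
        return 0.0
    return 100.0 * defect_count / lines_changed
```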

Here is my first pass, based on conversations with a few hackers: first, I would measure correctness as a raw count of identified defects. For the remaining criteria, I would assign a rating of 0, 1, or 2 points. The categories (with a rough code sketch after the list) are:

  • Readability: (Inadequately documented or poor naming; Acceptable or N/A; Clearly documented, well-chosen names)
  • Test coverage: (Inadequate; N/A or marginal; Fully covered)
  • Simplicity: (More complex than necessary; Acceptable or N/A; Complexity appropriate to requirements)
  • Performance: (Inadequate and material; N/A or immaterial; Appropriate to requirements)
  • Reuse: (Inadequate or inappropriate use of existing code; N/A; Appropriate use of existing code)
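
To make the rubric above concrete, here is a minimal sketch in Python. The five category names mirror the list; the class, field names, and the idea of summing the 0-2 ratings into a single 0-10 number are my own assumptions about how a team might record the score, not part of the proposal itself.

```python
from dataclasses import dataclass

# Each category is rated 0 (inadequate), 1 (acceptable or N/A), or 2 (clearly good),
# mirroring the three options listed for each bullet above.
CATEGORIES = ("readability", "test_coverage", "simplicity", "performance", "reuse")

@dataclass
class ReviewScore:
    defect_count: int    # correctness: raw count of identified defects
    readability: int
    test_coverage: int
    simplicity: int
    performance: int
    reuse: int

    def total(self) -> int:
        """Sum of the five 0-2 ratings: a 0-10 number, like an Apgar score."""
        return sum(getattr(self, c) for c in CATEGORIES)
```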

Like babies and their Apgar scores, the code would be rated twice: once on first submission and once at approval (unless, of course, it was approved on first review).
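
Continuing the sketch above with made-up numbers, the two ratings for a single checkin might be compared like this:

```python
# Hypothetical scores for one checkin, before and after review feedback.
first_submission = ReviewScore(defect_count=3, readability=1, test_coverage=0,
                               simplicity=1, performance=2, reuse=1)
at_approval      = ReviewScore(defect_count=0, readability=2, test_coverage=2,
                               simplicity=1, performance=2, reuse=1)

improvement = at_approval.total() - first_submission.total()  # 8 - 5 = 3
```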

Other, better approaches are possible.  What would you do? 

By themselves the scores do nothing to improve your process, just as Apgar scores alone don’t improve an infant’s health.  The important step, one that will challenge your knowledge and creativity, is to relate them to your other data, understand what this tells you about your process and invent ways to improve things.
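
As a starting point, and only as an illustration (the numbers are invented and a plain Pearson correlation is my own choice), you might check whether approval-time scores track defects found later:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical per-checkin data collected over a quarter: the score at
# approval, and defects later traced back to that checkin.
approval_scores = [8, 5, 9, 6, 10, 7]
later_defects   = [1, 4, 0, 3, 0, 2]

# A strong negative correlation would suggest the score measures something
# real about code health, much as Apgar's score did for newborn health.
print(correlation(approval_scores, later_defects))
```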

Published at DZone with permission of its author, Larry White.


Comments

Philopator Ptolemy replied on Tue, 2011/05/24 - 8:19am

This is a very interesting idea. I agree that there is a lot of subjectivism and uncertainty in code reviews.

What is needed for a more quantitative approach to reviews is empirical data that would prove the model. "To do this, Apgar first demonstrated that her score was a true measure of newborn health."

There are tools like Sonar that try to quantify whatever can be deduced by machine analysis. It would be interesting to see whether Sonar's stats truly correlate with "newborn health".

Larry White replied on Wed, 2011/05/25 - 8:37pm in response to: Philopator Ptolemy

Thanks. That's exactly what I'm hoping to do at Google - when I'm not too busy with my day job - trying to ship code ;)
