
I am a programmer and architect (the kind that writes code) with a focus on testing and open source; I maintain the PHPUnit_Selenium project. I believe programming is one of the hardest and most beautiful jobs in the world.

Development of Latex documents

09.24.2012

They say learning a new programming language makes you see problems in a new light (at least if the two languages differ more than Java and C# do). So I took the occasion to write my thesis in Latex, figuring that changing the build target from an application to a document would teach me what's really important in a "creation" process.

I found out that many concepts and principles you know as Agile and Lean practices (e.g. an automated build) can be applied to authoring Latex documents. Latex itself shares some design principles with programming...

Compilation

Compilation is the process of transforming an abstract artifact into a more concrete one, which in the case of programs can be directly executed. Even in the case of interpreted languages, there is still a build process that, instead of compiling, produces a new copy of the program, with a clean database and the right file permissions.

Compilation goes from a human-readable representation, optimized for editing and writing, to a machine-readable one, often binary, and optimized for execution (or rendering). I think that both the PDF files produced by Latex and the machine code are pretty unreadable for humans.

The more we abstract and remove duplication, the more we build new representations detached from the physical media: it doesn't matter if a target is an Instruction Set Architecture or a PDF reader.

The power of plain text

The original Pragmatic Programmer book contained a chapter titled The Power of Plain Text, which discussed the superiority of plain text formats over binary documents in the .doc and .xls formats.

Plain text files are editable with minimal tools like Vim or nano; they can be parsed and filtered with grep and awk; they can be diffed when they change. As a consequence, they can be put under version control with Git or other SCMs efficiently.
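As a throwaway sketch of that tooling at work, consider a tiny Latex source (the file contents here are invented for illustration): grep finds every section heading with its line number, and awk pulls out just the titles.

```shell
# A hypothetical three-line Latex source, to show the tooling at work
cat > thesis.tex <<'EOF'
\section{Introduction}
As \citet{Schleyer_requirementsfor} says...
\section{Results}
EOF

# grep locates every section heading, with line numbers
grep -n '\\section' thesis.tex

# awk splits on braces and prints just the heading titles
awk -F'[{}]' '/\\section/ { print $2 }' thesis.tex
```

Try doing that to a .doc file.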

Plain text formats also allow more explicit control over presentation and layout, and over anything that has to be clearly expressed with a directive. You would never think about programming with diagrams, yet many people are attracted to deciding the paragraph spacing of a text with an obscure menu instead of specifying it with a directive. I guess typesetting doesn't always need the same bit-level precision as programming.
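To make "specifying it with a directive" concrete, here is what paragraph spacing looks like in a Latex preamble: two explicit, diffable lines instead of a buried menu option.

```latex
% Paragraph layout controlled by explicit directives in the preamble
\setlength{\parskip}{1em}    % vertical space between paragraphs
\setlength{\parindent}{0pt}  % no first-line indentation
```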

Different models

When I first learned about Latex, I tried to compare it with Cascading Style Sheets in my mental model of how layouts and formatting work. However, the document models of Latex and CSS differ:

  • Latex abstracts away the final file format, while retaining presentation details (center, right, large text, put a table on the next page).
  • HTML and CSS abstract away all presentation logic (into the CSS) and focus on the logical components of a web page: where the menu and the navigation bar are, which text needs emphasis, and where the links to other pages are.

I don't know if Latex style sheets exist, but they would be a welcome addition in its quest for duplication removal. Both HTML pages and scientific documents are reworked so many times that they benefit from all the synthesis available.
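Something close to a style sheet can be approximated in Latex by defining semantic commands whose presentation lives in a single place. A minimal sketch (the \keyword and \filename commands are hypothetical names I'm introducing for illustration):

```latex
% Semantic markup: the document says *what* a word is,
% a single definition decides *how* it looks.
\documentclass{article}

% One place controls the presentation of every keyword...
\newcommand{\keyword}[1]{\textbf{#1}}
% ...and of every file name.
\newcommand{\filename}[1]{\texttt{#1}}

\begin{document}
The \keyword{build} reads every \filename{.bib} file.
\end{document}
```

Changing \textbf to \emph in that one definition later restyles every keyword in the document at once, much like editing a single CSS rule.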

The build process

Programs, and documents written with Latex, often need multiple commands chained together to get from the sources to the compiled output. In the case of programs, it's obvious what a build does: compilation, configuration, packaging and even deployment.

But a textual document where all duplication is removed (and every bit of text is defined once and only once) needs automation to rebuild a human-readable version. As humans we like to find the same information repeated in different places, both for linking purposes and for convenience.

Consider the bibliography, the set of papers and books that a Latex document cites. Bibliographies are organized as sets of .bib files that may span multiple documents. Each .bib file represents a single citation and contains some metadata on it, like the author, title and year.
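A single such file might look like this (an entirely hypothetical entry, made up for illustration):

```bibtex
% bib/doe_plaintext.bib -- one citation per file
@article{doe_plaintext_2010,
  author  = {Doe, Jane and Smith, John},
  title   = {The Power of Plain Text in Scientific Writing},
  journal = {Journal of Hypothetical Examples},
  year    = {2010}
}
```

Since each entry is plain text in its own file, the same bibliography can be version-controlled, grepped, and shared between the thesis and any papers extracted from it.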

When building a document, I concatenate all my *.bib files into a single one; pdflatex is then able (after being run multiple times) to incorporate several pages of nicely formatted bibliography into my document. Only the documents I cite with \citet{author_title_code} are included by the build:

#!/bin/bash
# Clean up artifacts from previous runs (-f: don't fail on missing files)
rm -f citations.bib texput.log thesis.aux thesis.bbl thesis.blg \
      thesis.lof thesis.log thesis.lot thesis.out thesis.pdf thesis.toc
# Merge every per-citation file into a single bibliography
bib2bib -ob citations.bib bib/*.bib
# First pass: collect \citet references into thesis.aux
pdflatex thesis.tex
# Resolve the collected citations against the merged bibliography
bibtex thesis.aux
# Second and third passes: fix cross-references and numbering
pdflatex thesis.tex
pdflatex thesis.tex > output.log 2> error.log
# Surface only warnings and errors from the verbose Latex output
grep -iE 'warning|error' output.log
cp thesis.pdf 2012_10_Sironi.pdf

This build (a single build.sh command) also incorporates some cleaning of logs from previous executions, and parses the output to show only the useful information, like errors.

So you can easily get from

As \citet{Schleyer_requirementsfor} says...

to

As Schleyer et al. [30] says...

along with an automatically regenerated bibliography at the end of your document. The feeling is the same as when you performed your first Extract Method refactoring...

Lessons learned

Crafting a Latex document and writing a program are similar in many respects, from abstract representations to compilation and automated builds. When we abstract away a particular language, operating system, or artifact, we come closer to learning some more general principle about programming: the neverending quest for reducing duplication and for defining higher-level abstractions over the details we are burdened with (Laws of Simple Design, anyone?)

That isn't to say that we should always strive for general principles, but that we can ignore the overly specific, hyped practices prescribed as a panacea once we find fields where they don't really matter. For example, I won't think automated tests are the most important thing to write first, ahead of duplication removal, since we cannot write computer graphics code or documents test-first.

Published at DZone with permission of Giorgio Sironi, author and DZone MVB.

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)