Agile Zone is brought to you in partnership with:

Sasha Goldshtein is a Senior Consultant for Sela Group, an Israeli company specializing in training, consulting and outsourcing to local and international customers.Sasha's work is divided across these three primary disciplines. He consults for clients on architecture, development, debugging and performance issues; he actively develops code using the latest bits of technology from Microsoft; and he conducts training classes on a variety of topics, from Windows Internals to .NET Performance. You can read more about Sasha's work and his latest ventures at his blog: http://blogs.microsoft.co.il/blogs/sasha. Sasha writes from Herzliya, Israel. Sasha is a DZone MVB and is not an employee of DZone and has posted 204 posts at DZone. You can read more from them at their website. View Full User Profile

Writing a Compiler in C#: C Code Generation, Part 1

01.03.2011
| 6153 views |
  • submit to reddit

After having discussed in some detail the lexical analysis and parsing phases, it’s time to get our hands dirty with actual code generation. Theoretically speaking, our parser emits an intermediate representation of the parsed program—the code-generator interface, shown below, can be used to construct an actual tree depicting the structure of the program.

For the practical purpose of translating a Jack program to C or assembly language, there’s no need to maintain in memory a real parse tree. By using the symbol state and a small set of auxiliary data structures, we can implement a code generator that emits legal C code. This C code can be compiled by the C compiler to obtain an executable program. (If the idea of compiling Jack to C and then relying on the C compiler seems like cheating to you, consult the history of the C++ programming language. The very first C++ compiler, Cfront by Bjarne Stroustrup, converted C++ to C.)

Let’s take a look at the interface the code generator has to implement. (Obviously, there are some convenience-based choices here, making it easy to develop the code generator specifically for Jack.)


internal interface ICodeGenerator
{

void SetOptions(CodeGeneratorOptions options);
void InitSymbolTables(
SymbolTable classSymTable,
SymbolTable methodSymTable);
void EmitEnvironment();
void EmitBootstrapper();

void StaticDeclaration(Symbol variable);
void FieldDeclaration(Symbol variable);
void ConstructorDeclaration(Subroutine subroutine);
void FunctionDeclaration(Subroutine subroutine);
void MethodDeclaration(Subroutine subroutine);
void EndSubroutine();
void Return();
void BeginWhile();
void WhileCondition();
void EndWhile();
void BeginIf();
void PossibleElse();
void EndIf();
void Assignment(Token varName, bool withArrayIndex);
void Add();
void Sub();
void Mul();
void Div();
void Mod();
void And();
void Or();
void Less();
void Greater();
void Equal();
void LessOrEqual();
void GreaterOrEqual();
void NotEqual();
void IntConst(int value);
void StrConst(string value);
void True();
void False();
void Null();
void This();
void Negate();
void Not();
void VariableRead(Token varName, bool withArrayIndex);
void Call(string className, string subroutineName);
void DiscardReturnValueFromLastCall();
void BeginClass(string className);
void EndClass();
}

If you recall the Jack expression parser we developed a few installments ago, it effectively converts a Jack expression into postfix form by calling the code generator in a post-order traversal of the parse tree. For example, if x and y are terminals (e.g. local variables), then the parser processes the expression x + y by performing the following calls to the code generator:

VariableRead(x, withArrayIndex:false)
VariableRead(y, withArrayIndex:false)
Add()

Another example to drive the point home—the expression true|this==null&(x[3]-y)<13 results in the following calls, indented for easier understanding:

True()
This()
Null()
Equal()
IntConst(3)
VariableRead(x, withArrayIndex:true)
VariableRead(y, withArrayIndex:false)
Sub()
IntConst(13)
Less()
And()
Or()

Evaluating this expression at runtime is best modeled by a stack. Operations like IntConst, VariableRead, This push a value onto the stack; operations like Sub, Less, And pop two operands off the stack and push the result of the operation onto the stack; and so on.

When we’ll develop the x86 assembly language code generator for Jack, we’ll use the assembly PUSH and POP instructions to manipulate the thread’s explicit stack. To emulate a similar process for C code generation, we’ll use a global array of words, called __STACK, and emit a couple of intrinsic macros, __PUSH and __POP, that manipulate the stack.

Assuming that the local variables x and y in the Jack program are represented by the local variables x and y in the resulting C function, compiling x + y to C boils down to:

__PUSH(x);
__PUSH(y);
__SCRATCH1 = __POP();
__PUSH(__POP() + __SCRATCH1);

(Note that I’m using __SCRATCH1 as a scratch register—it’s simply a global variable of type __WORD, the only type we’ll be dealing with unless we want to address type checking.)

This is really wasteful, you say—we could compile the whole thing to __PUSH(x+y) and save a bunch of instructions! That’s true, but we’re going to leave optimization off the table for now, especially considering that we’re using the C compiler in the next phase—and it’s already pretty good at optimizing things. (Again, if the idea of not doing optimization at the intermediate compilation phase strikes you as lazy, consider what the C# compiler does—it emits IL instructions that manipulate the IL evaluation stack, and leaves it up to the JIT to decide whether some stack operations can be optimized or even replaced by register manipulation altogether.)

With that said, we’re ready for the part of the code generator that deals with expressions and the assignment statement—let. (Control constructs, subroutines, and top-level program structure will be dealt with next.)

There are a couple of translation decisions that you need to be aware of prior to reading this code:

  • __WORD is the single type we use for everything. In this version of the compiler, it’s simply int.
  • A Jack class is mapped to a C struct.
  • Jack class fields are mapped to C struct fields.
  • Jack class statics are mapped to C global variables.
  • Jack subroutines are mapped to C functions.
  • Jack subroutines that return void are mapped to C functions that return __WORD. This return value is 0 by convention, and is ignored by the caller.
  • Within a method or constructor, THIS is a local variable that holds the value of this.

public override void Assignment(
Token varName, bool withArrayIndex)
{
if (MethodSymTable.HasSymbol(varName.Value))
{
Symbol symbol =
MethodSymTable.GetSymbol(varName.Value);

Output.WriteLine(
"__PUSH((__WORD)&{0});", symbol.Name);
}
else
{
Symbol symbol =
ClassSymTable.GetSymbol(varName.Value);

if (symbol.Kind == SymbolKind.Static)
{
Output.WriteLine("__PUSH((__WORD)&{0});",
FormatStaticName(symbol.Name));
}
else if (symbol.Kind == SymbolKind.Field)
{
Output.WriteLine(
"__PUSH((__WORD)&((({0}*)THIS)->{1}));",
_currentClassName, symbol.Name);
}
}

//If it's an array, obtain the address of the right
//element by adding the index which is on the stack.
//We need a scratch location because issuing two __POP()
//calls in the same statement does not guarantee
//left-to-right evaluation.
if (withArrayIndex)
{
//The array address is now on the stack, but we
//really need the address of the first element.
//Hence the dereference:
Output.WriteLine("__SCRATCH2 = *(__WORD*)__POP();");
//This is the RHS value that we ought to put in
//the array element:
Output.WriteLine("__SCRATCH1 = __POP();");
//Finally, the top of the stack contains the value
//of the array indexing expression, i.e. the
//element index:
Output.WriteLine(
"*(((__WORD*)__SCRATCH2) + __POP()) = __SCRATCH1;");
}
else
{
Output.WriteLine(
"__SCRATCH1 = __POP();"); //This is the LHS
Output.WriteLine(
"*((__WORD*)__SCRATCH1) = __POP();");
}
}

public override void Add()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__SCRATCH1 + __POP());");
}
public override void Sub()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() - __SCRATCH1);");
}
public override void Mul()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() * __SCRATCH1);");
}
public override void Div()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() / __SCRATCH1);");
}

public override void Mod()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() % __SCRATCH1);");
}
public override void And()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() & __SCRATCH1);");
}

public override void Or()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine("__PUSH(__POP() | __SCRATCH1);");
}
public override void Less()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() < __SCRATCH1 ? -1 : 0);");
}

public override void Greater()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() > __SCRATCH1 ? -1 : 0);");
}

public override void Equal()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() == __SCRATCH1 ? -1 : 0);");
}

public override void LessOrEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() <= __SCRATCH1 ? -1 : 0);");
}

public override void GreaterOrEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() >= __SCRATCH1 ? -1 : 0);");
}

public override void NotEqual()
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH(__POP() != __SCRATCH1 ? -1 : 0);");
}

public override void IntConst(int value)
{
Output.WriteLine("__PUSH({0});", value);
}

public override void StrConst(string value)
{
Output.WriteLine("__PUSH((__WORD)\"{0}\");", value);
}

public override void True()
{
IntConst(-1);
}
public override void False()
{
IntConst(0);
}

public override void Null()
{
IntConst(0);
}
public override void This()
{
Output.WriteLine("__PUSH(THIS);");
}

public override void Negate()
{
Output.WriteLine("__PUSH(-__POP());");
}

public override void Not()
{
Output.WriteLine("__PUSH(~__POP());");
}

public override void VariableRead(
Token varName, bool withArrayIndex)
{
//Put the value of the variable on the top of
//the stack. If it's an array, the value is the
//address of the array's first element.
if (MethodSymTable.HasSymbol(varName.Value))
{
Symbol symbol =
MethodSymTable.GetSymbol(varName.Value);
Output.WriteLine("__PUSH({0});", symbol.Name);
}
else
{
Symbol symbol =
ClassSymTable.GetSymbol(varName.Value);
if (symbol.Kind == SymbolKind.Static)
{
Output.WriteLine("__PUSH({0});",
FormatStaticName(symbol.Name));
}
else if (symbol.Kind == SymbolKind.Field)
{
Output.WriteLine(
"__PUSH((({0}*)THIS)->{1});",
_currentClassName, symbol.Name);
}
}
//If it's an array, dereference it using [].
if (withArrayIndex)
{
Output.WriteLine("__SCRATCH1 = __POP();");
Output.WriteLine(
"__PUSH( ((__WORD*)__SCRATCH1)[ __POP() ] );");
}
}

As always, most of the code (other than the part that deals with variables and assignments) writes itself. This gives us the foundation upon which we can build control statements and the general program structure, up next.

References
Published at DZone with permission of Sasha Goldshtein, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)