How stuff works: the RB compiler


It continually surprises me how frequently I get into a conversation about the RB compiler and realize that the person I'm speaking to doesn't really know all the pieces and parts. I'm so used to thinking about it that it's easy for me to forget that it's not second nature to most people. So I want to spend a little while explaining the high-level architecture of the RB compiler. This should help you to understand how your REALbasic projects turn into executable files.

The first thing to understand is that the compiler deals only with plain old source code -- so everything about your project eventually needs to turn into plain text. All of the windows, pictures, classes, etc. are converted into REALbasic source code before being sent to the compiler. This process is called "rendering", and is performed by the IDE. When a compilation pass starts, the IDE takes all of the project items and asks them to render themselves to source code, and the end result is a big, flat-file chunk of text. During this rendering process, project items can also tell the compiler about various pieces of resource data such as files on disk, icons, etc. Those are "rendering attachments", which aren't really compiled so much as copied to where they need to be. "Rendering errors" are errors that occur during the rendering phase, and they halt compilation before the compiler is even involved. You get a rendering error when something is so horribly wrong that we know the compiler couldn't possibly cope with it. For instance, you use an external code item, but the data doesn't exist on disk -- the project item cannot render itself to source code in that situation, and so there's no point in even trying to compile the project.
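To make the rendering step a little more concrete, here is a minimal sketch in Python. It is only a model of the idea described above, not the IDE's actual code: the names ProjectItem, RenderResult, RenderingError and render_project are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


class RenderingError(Exception):
    """Hypothetical error: a project item could not be turned into source."""


@dataclass
class RenderResult:
    source: str = ""                                       # the flat chunk of text
    attachments: List[str] = field(default_factory=list)   # copied, not compiled


@dataclass
class ProjectItem:
    name: str
    code: str                                              # source text for this item
    attachments: List[str] = field(default_factory=list)
    data_missing: bool = False                             # e.g. external item with no file on disk

    def render(self) -> str:
        # A rendering error stops everything before the compiler is involved.
        if self.data_missing:
            raise RenderingError(f"{self.name}: external data not found on disk")
        return self.code


def render_project(items: List[ProjectItem]) -> RenderResult:
    """Ask every item to render itself and join the results into one text."""
    result = RenderResult()
    chunks = []
    for item in items:
        chunks.append(item.render())
        result.attachments.extend(item.attachments)
    result.source = "\n".join(chunks)
    return result
```

In this model, a single item that cannot render raises RenderingError and no source text is produced at all, which mirrors the "halt before the compiler is involved" behavior.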

I've spoken before about the fact that windows are really converted into namespaces with all sorts of magic behind them to make them work. However, windows aren't the only project items that have special rendering work associated with them. Any sort of resource project item ends up being rendered out as a module with special methods. For instance, let's say you drag a picture into your project. At render time, a module is created, and that module has a global method with the project item's name which returns a Picture object. This way, you can use the project item name in your code, and it behaves just like a Picture object.
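As a rough illustration of the picture case, the following Python sketch produces the kind of text such a rendering might yield. The generated module name, the LoadPictureResource call, and the resource id are all invented for the example; the real rendered source is not shown in this post and almost certainly looks different.

```python
def render_picture_item(item_name: str, resource_id: int) -> str:
    """Return hypothetical rendered source for a picture project item.

    The module exposes a global method named after the item, so code that
    mentions the item name ends up calling a function that returns a Picture.
    """
    lines = [
        f"Module {item_name}Module",                      # assumed naming scheme
        f"  Function {item_name}() As Picture",
        f"    Return LoadPictureResource({resource_id})",  # assumed helper, not a real API
        "  End Function",
        "End Module",
    ]
    return "\n".join(lines)
```

Calling render_picture_item("Logo", 128) returns a five-line module whose only method is Logo() As Picture.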

After the data has been rendered into plain source code, it is passed to the compiler along with a set of build options and rendering attachments. The build options tell the compiler things like "please compile for this platform" and "please include function names", etc. Basically, all of the special properties in the blessed App class's properties list are passed along to the compiler as build options. Since the build options and rendering attachments are pretty obvious, I won't say much more about them. The source code is the interesting part, right?

The REALbasic compiler is a multi-pass compiler, which gives it more power than if it were just a single-pass compiler (which the old, pre-5.0 REALbasic compiler essentially was). The first pass parses the source code looking for declarations, and the second pass actually compiles the source code. This allows your source code to refer to declarations without worrying about the order of things -- it allows for what's called "forward referencing." You can test this out yourself in RBScript with something like the following:


Class Foo
Inherits Bar
End Class

Class Bar
End Class


Without first collecting all of the declarations, this code couldn't possibly compile. Oh, and I should note that by declarations, I mean just code item declarations: classes, modules, methods, constants, etc. I don't mean method-level declarations (like local variables).

As I mentioned, the first pass parses all of the declarations and is called the "declaration parser." While making this pass, the compiler parses through the source code line by line. As it discovers a declaration, it creates an appropriate symbol and adds it to the global symbol table. This pass skips over non-declaration source code entirely, and it doesn't report any errors except for the most severe. So, for instance, if the parser sees a Sub declaration, it skips over lines of source code until it finds an End Sub line. The only errors that the declaration parser will report are the ones which we absolutely cannot recover from (like unmatched #if/#endif lines). If the declaration is malformed, then the parser tries to recover from it as best as possible. The vast majority of error reporting happens during the second pass.
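Here is a small Python sketch of what a declaration pass along these lines could look like. It is a toy that handles only Class, Inherits and Sub lines, and the function name and symbol table layout are assumptions made for the example, not the compiler's real data structures.

```python
def collect_declarations(source: str) -> dict:
    """Toy first pass: record class and method declarations, skip Sub bodies."""
    symbols = {}          # stand-in for the global symbol table
    current = None        # class currently being declared, if any
    in_sub = False        # True while skipping the body of a Sub
    for raw in source.splitlines():
        line = raw.strip()
        lower = line.lower()
        if in_sub:
            # Non-declaration code is ignored until the matching End Sub.
            if lower == "end sub":
                in_sub = False
            continue
        if lower.startswith("sub "):
            name = line[4:].split("(")[0].strip()
            if current is not None:
                symbols[current]["methods"].append(name)
            in_sub = True
        elif lower.startswith("class "):
            current = line[6:].strip()
            symbols[current] = {"parent": None, "methods": []}
        elif lower.startswith("inherits ") and current is not None:
            # The parent is stored by name only; it may be declared later.
            symbols[current]["parent"] = line[9:].strip()
        elif lower == "end class":
            current = None
    return symbols
```

Because the parent class is recorded by name and not looked up right away, the Foo/Bar example above works regardless of declaration order. Anything inside a Sub body, even a line that looks like a declaration, is never seen by this pass.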

Once the declaration parsing pass completes, a little bit of bookkeeping is performed before starting the second pass. The compiler does things like computing inheritance data so that the next pass can run more quickly. After this is done, the second pass is started. The part of the compiler that does the work of this pass is known as the "semantics engine," as it's responsible for the semantics of proper source code. The code is parsed a second time in its entirety. As declaration lines are parsed, we do some sanity checking to make sure the symbols exist and are correct. As non-declaration lines are encountered, they are "compiled." This is where the interesting work happens that most people think of as the compiler's job.
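The bookkeeping step can be pictured with a short Python sketch. It assumes the symbol table has been boiled down to a mapping from class name to parent name, and it precomputes each class's ancestor list so that later "is this a subclass of that?" checks are cheap. The function names are hypothetical, and raising errors here is a simplification: as described above, most error reporting really belongs to the second pass.

```python
def compute_ancestors(parents: dict) -> dict:
    """Map each class name to its list of ancestors, nearest first.

    parents maps a class name to its parent's name, or None for a root class.
    """
    ancestors = {}
    for name in parents:
        chain = []
        seen = {name}
        parent = parents[name]
        while parent is not None:
            if parent not in parents:
                raise LookupError(f"{name}: unknown parent class {parent}")
            if parent in seen:
                raise ValueError(f"{name}: circular inheritance")
            chain.append(parent)
            seen.add(parent)
            parent = parents[parent]
        ancestors[name] = chain
    return ancestors


def is_a(ancestors: dict, cls: str, other: str) -> bool:
    """Cheap subclass test using the precomputed ancestor lists."""
    return cls == other or other in ancestors[cls]
```

With the ancestor lists in hand, a check such as is_a(ancestors, "Foo", "Bar") no longer has to walk the parent links every time it is asked.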

Let's say that the parser notices a "for" token, which is the start of a for loop construct. The compiler ensures that all of the pieces of the for loop are correct -- for instance, that the loop variable is a number. Then the compiler continues to parse and compile each of the lines in the body of the for loop. Finally, when the "next" token is discovered, the semantics engine can actually compile the loop itself. It does so by emitting an intermediate code called "TAC" code. TAC stands for "three address code", and is like a high-level assembly code. The entire job of the semantics engine is to translate RB source code into TAC code. So let's say that you have the code:


foo = new SomeClass

The compiler will create some TAC code that allocates enough space to hold a SomeClass object. Then it will generate some TAC code to call SomeClass.Constructor if that exists. Finally, it will generate some TAC code to perform the assignment to the foo symbol.
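Those three steps can be sketched in Python as follows. The instruction names (alloc, param, call, assign), the tuple layout and the instance size are assumptions chosen for readability; the compiler's actual TAC instruction set is not described here.

```python
from itertools import count


def emit_new_assignment(target, class_name, instance_size, has_constructor):
    """Emit toy TAC tuples for `target = new class_name`."""
    temps = count(1)
    tac = []
    obj = f"t{next(temps)}"                       # temporary holding the new object
    tac.append(("alloc", obj, instance_size))     # 1. allocate space for the object
    if has_constructor:
        tac.append(("param", obj))                # pass the object to its constructor
        tac.append(("call", f"{class_name}.Constructor", 1))  # 2. run the constructor
    tac.append(("assign", target, obj))           # 3. store the result in the target
    return tac
```

For a class with a constructor, emit_new_assignment("foo", "SomeClass", 16, True) yields four instructions; without one, only the alloc and assign remain.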

The entire process from declaration parsing until now is collectively known as the "frontend" of the compiler. From here on out, we're going to be talking about the "backend" of the compiler.

Once the semantics pass is complete, the compiler has transformed all of the lines of REALbasic source code into TAC code. Either that, or it has reported an error to the IDE if the source code isn't correct in some fashion. Now the compiler can generate some metadata TAC instructions for doing things like registering a class' description with the runtime so that calls to "new" function properly. This metadata stage is really just more bookkeeping, kind of like what happens after the declaration parsing stage.

The next job is for the "code generator" to take the TAC code and convert it into machine code based on the appropriate CPU architecture specified by the build options. Currently, that's either PPC or IA-32 machine code. Note that we don't change the TAC code into assembly code and then run it through an assembler -- there's no need to do that. All an assembler does is take the assembly and turn it into machine code, so we simply skip that step and go straight to machine code.

Since TAC code doesn't directly correspond to assembly code, the code generators have to do a bit of housekeeping work to ensure that they write out correct code. For instance, one architecture may pass parameters to a method by storing them into registers, while another might push the parameters onto the stack. The TAC code generated by the semantics engine doesn't care about that stuff -- it simply generates a TAC instruction that says "here's another parameter" and the code generator can deal with the details.
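A minimal Python sketch of that division of labor is below. It lowers the same "param" and "call" TAC tuples in two different ways depending on a calling convention. The output is readable pseudo-assembly text instead of machine code, and the instruction names, register names and right-to-left push order are illustrative assumptions, not a description of the real PPC or IA-32 code generators.

```python
def lower_call(tac, convention):
    """Lower toy param/call TAC tuples using the given calling convention."""
    out = []
    pending = []                                  # parameters seen since the last call
    for instr in tac:
        op = instr[0]
        if op == "param":
            pending.append(instr[1])              # just remember it for now
        elif op == "call":
            if convention == "registers":
                # Each parameter goes into the next argument register.
                for index, value in enumerate(pending):
                    out.append(f"move arg_reg{index}, {value}")
            elif convention == "stack":
                # Parameters are pushed, last one first.
                for value in reversed(pending):
                    out.append(f"push {value}")
            else:
                raise ValueError(f"unknown convention: {convention}")
            out.append(f"call {instr[1]}")
            pending = []
        else:
            raise ValueError(f"unsupported TAC op: {op}")
    return out
```

The TAC input is identical in both cases; only the code generator knows, or cares, where the parameters actually end up.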

So the code generator ends up translating the TAC code into machine code, and now you've got an executable chunk of data. But that's not the end of the story! The next step in the backend process is the linker. It's the linker's job to ultimately spit out a file (or series of files) that can be loaded up by a particular operating system. Up until this point, we haven't given a single hoot about the OS -- but now it's time to take the executable data and create a PE32 file, or a Mach-O file, etc. The linker takes a bunch of different pieces of data, such as the executable data, the resources, etc., and arranges the data in such a fashion that the end result is a valid executable file for the target OS.

The first thing the linker does is take the resource data (in the form of rendering attachments) and push it into the proper format for writing out to disk. Then it takes the executable data, and it performs a bunch of "fix ups" on it. These fix ups are things like creating a more concrete address for methods and data. It also takes a list of "imports" that were generated by the code generator for things like declares. After collecting all of the various pieces of data and massaging them around, the linker passes the data off to the "application builder" to actually write the data out to disk. The application builder is really part of the linker, but we've abstracted it slightly to account for multiple different ways to write the same application out to disk. For instance, on OS X, we could be writing out a bundle, or we could be writing out a console application.
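The idea of a fix up can be shown with a tiny Python sketch. It assumes the code generator left 4-byte placeholder slots in the code and handed the linker a list of (offset, symbol) pairs saying which slot needs which symbol's address. The function name, the little-endian 32-bit layout and the base address are all assumptions for illustration, not details of the real linker.

```python
import struct


def apply_fixups(code: bytes, fixups, symbol_offsets, base_address):
    """Patch placeholder slots in code with concrete symbol addresses.

    fixups         -- list of (patch_offset, symbol_name) pairs
    symbol_offsets -- symbol name -> offset of that symbol within the image
    base_address   -- assumed address the image is loaded at
    """
    image = bytearray(code)
    for patch_offset, symbol in fixups:
        address = base_address + symbol_offsets[symbol]
        # Write the address as an unsigned 32-bit little-endian value.
        struct.pack_into("<I", image, patch_offset, address)
    return bytes(image)
```

Before the fix ups run, the code only knows that it wants "the address of Foo.Run"; afterwards, the slot holds an actual number that the loader and CPU can use.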

Once the linker has completed its duties, you now have our desired end result: the application.

Hopefully this information will help you to better understand some of the terminology I may throw around at times. It should also help you to understand the reason behind some behaviors a bit better. For instance, you now understand why a method cannot declare a structure -- it's because the declaration parser skips over method bodies, and in order to change that behavior, we'd have to restructure a fair amount of the compiler to accommodate it. As usual, if you have any questions about how things work, just ask!

6 Comments

Fascinating! :D

Thank you for that, Aaron.

Good stuff.

I always thought incantations and cauldrons were involved somehow.

You know the old saying about technology appearing as magic to the primitives? This is like the magician explaining how the trick is performed. : - )

Russ

@Russ
It was Arthur C Clarke who said
Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke, "Profiles of The Future", 1961 (Clarke's third law)

I'm glad you guys found this interesting -- it's one of those topics that I wasn't quite sure would have any interest.

Bah it's all magic and mumbo jumbo anyways :P

This is really cool stuff. I've read about some of this stuff before, like symbol tables and semantic parsers and TAC. But it did confirm some of what I had read, as well as correct some of my misconceptions, like thinking that TAC was one step above assembly that could be translated directly. Also, I didn't know where the linker fit in. I still don't know how the linker fixes the addresses and knows where to put things, though. But great article! Thanks, Aaron.
