So I was asked the other day to impart some wisdom about what name mangling is, why it's needed, how it happens, and so on. I'm going to do my best to explain what happens when the compiler tries to grok a piece of code. There are really two questions to answer: why did name mangling come about historically, and why does it still happen in modern compilers? But first, let's talk about what it actually means.
Let's say you have a class named Foo that looks something like this (in our imaginary language):
Class Foo
    Function Bar( Integer baz, String whirlygig ) Returns Decimal
    Sub Wahoo( Reference Integer goober )
    Sub LaLa()
End Class
When the compiler comes across a declaration like this, it needs to convert it into something a little more compiler-friendly. So it turns the entire class into a data structure filled with all sorts of information about everything having to do with the class. One of the things it does is mangle each method into a unique string. The reason it does this is historical. In traditional C, every function must have a unique name. Basically, no two functions can share a name. So when the compiler passed the function names off to the linker, it just used the names themselves, since they were always unique. With the advent of C++ (whose first implementation was really a translator that turned C++ into C rather than a true compiler of its own) came function overloading. Once this feature came about, suddenly names were no longer going to be unique! So instead of rewriting every C linker out there, they just decided to make every name unique in place, and hence name mangling came about.
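To make the collision concrete, here's a minimal Python sketch. The dictionary stands in for a linker's symbol table, and the function name `area`, the addresses, and the mangled spellings are all made up for illustration. No real linker works exactly like this.

```python
# A toy "linker symbol table": symbol name -> address of the code.
# The name `area`, the addresses, and the mangled spellings are
# hypothetical; they only illustrate the collision.

def build_table(symbols):
    """Insert (name, address) pairs, refusing duplicate names."""
    table = {}
    for name, address in symbols:
        if name in table:
            raise ValueError(f"duplicate symbol: {name}")
        table[name] = address
    return table

# Two overloads that share a plain name collide...
plain = [("area", 0x1000), ("area", 0x2000)]
try:
    build_table(plain)
    plain_ok = True
except ValueError:
    plain_ok = False

# ...but once the parameter types are folded into the name, they don't.
mangled = [("area%$i", 0x1000), ("area%$i$i", 0x2000)]
mangled_table = build_table(mangled)
```

The table code never changes between the two cases. Only the names do, which is the whole trick.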
So what does it mean to mangle? Well, the compiler takes the human-readable form of the method declaration and compiles it down into a unique string that the compiler and linker can read, so that the linker can decide what to do with the method. In modern compilers, it takes things like the class name, the method name, the parameter list and return type and shoves it all together into a single string. But it doesn't have to. Some older C++ compilers reportedly just added an arbitrary string of data to the end of the method name to ensure that it was unique. But let's assume we're a modern compiler. So for our above example, the compiler would have a single entry for the function Bar, and it may look something like this:
$d%Foo%Bar%$i$s
You may be confused at this point as to what that gibberish means. Well, the first part ($d) is the return value of type Decimal, the % means "new field", and Foo is the name of the class, followed by another % for the next field. Then Bar is the name of the method itself, followed by another %. Then comes the parameter list: $i means an Integer and $s means a String. So the compiler can take a look at this string and infer a ton of information from it. It knows that the method returns a value, what class it comes from, what it's called, what parameters it takes, how many parameters there are, etc. Let's mangle the other example methods to get a better picture of why this is useful.
Foo%Wahoo%#r$i
Foo%LaLa
Here's where things get interesting. You'll notice that if there's no return type, the first character of the mangled string is anything but $. That makes it very easy for the compiler to tell that this method cannot be on the right-hand side of an = statement (i.e., the following is illegal: bar = Foo.LaLa()). Also, if there's no % after the method name, it takes no parameters (so it's easy to tell that this is illegal: Foo.LaLa( 12 )). You'll also notice with the Wahoo method that you can encode other pieces of information very easily. The #r shows that there's a Reference modifier for the integer ($i) parameter. This makes it very easy to extend the language since you don't have to parse massive amounts of information and change the compiler all over the place. You just mangle things.
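Here's a rough Python sketch of how cheaply that information can be pulled back out of a mangled string. The function name `demangle` and the dictionary layout are my own inventions, and it assumes every type or modifier code is exactly two characters ($i, $s, $d, #r), which holds for the examples in this post but is otherwise just an assumption.

```python
def demangle(mangled):
    """Decode a mangled name from our imaginary scheme into a dict.

    Assumes each type/modifier code is exactly two characters long.
    """
    fields = mangled.split("%")
    info = {"returns": None, "class": None, "method": None, "params": []}

    # A leading '$' means the first field is a return type.
    if fields[0].startswith("$"):
        info["returns"] = fields.pop(0)

    info["class"], info["method"] = fields[0], fields[1]

    # A third field, if present, is the parameter list.
    if len(fields) > 2:
        text = fields[2]
        modifiers = []
        for i in range(0, len(text), 2):
            code = text[i:i + 2]
            if code.startswith("#"):
                modifiers.append(code)      # applies to the next type
            else:
                info["params"].append({"type": code, "modifiers": modifiers})
                modifiers = []
    return info


bar = demangle("$d%Foo%Bar%$i$s")
wahoo = demangle("Foo%Wahoo%#r$i")
lala = demangle("Foo%LaLa")

# The two legality checks described above become one-liners.
lala_usable_in_assignment = lala["returns"] is not None
lala_accepts_arguments = len(lala["params"]) > 0
```

Both checks come out false for LaLa, which is exactly why `bar = Foo.LaLa()` and `Foo.LaLa( 12 )` can be rejected without re-reading the original declaration.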
So why is name mangling needed? Well, technically, it's not, if your language will always have unique names for every method. Also, you could store everything in human-readable form and just parse the declarations manually in the linker, but that is a very expensive operation, and when you have large projects with millions of lines of code, you definitely do not want to deal with that sort of overhead. It also keeps the memory footprint low, since you only end up storing a dozen or so characters instead of a very verbose declaration string in your compiler structures.
How does name mangling happen in a modern compiler? Well, it's basically just a mini-compiling operation. You're given the declaration as a set of tokens, and you just "compile" the tokens into an identifier for the mangled name. Let's take the method Bar as an example.
Assume that you know you're in Class Foo since you're already starting to compile the methods for Foo. You're given a stream of tokens. Basically, every time you see a space in the declaration, you're separating tokens, so each word in the declaration comes to you one at a time. The characters '(', ',' and ')' are tokens too, even when there's no space next to them. Here's the declaration again:
Function Bar( Integer baz, String whirlygig ) Returns Decimal
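That token-splitting rule fits in a couple of lines of Python. This is only a sketch: the function name `tokenize` is mine, and the regular expression assumes identifiers are plain words made of letters, digits and underscores.

```python
import re

def tokenize(declaration):
    """Split a declaration into words plus the punctuation tokens ( , )."""
    # Either an identifier-like word, or one of the three punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|[(),]", declaration)

tokens = tokenize("Function Bar( Integer baz, String whirlygig ) Returns Decimal")
```

Note that `Bar(` comes out as two tokens even though there's no space between them, because the punctuation is matched on its own.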
Assume that you have a string variable to hold the mangled name which is called "theMangledName".
theMangledName = Foo%
Token: Function
Seeing this token means you know to expect a Returns token after the parameter declaration. This helps you weed out errors in the declaration. Continue on.
Token: Bar
Must be the function name since it came after a Function or Sub token.
theMangledName = theMangledName + Bar
Token: (
Eat, but make sure there's a matching ) later on.
Token: Integer
A parameter type, so add on the proper $ specifier. Also, note that this is the first parameter in our list, so we need the field separator as well.
theMangledName = theMangledName + %$i
Token: baz
Eat, as far as the mangled name is concerned. In reality, the compiler will generate a temporary variable with method-level scope under this name, so it's not really eaten.
Token: ,
Eat.
Token: String
A parameter type, so add on the proper $ specifier.
theMangledName = theMangledName + $s
Token: whirlygig
Eat in the same way that we ate baz.
Token: )
Eat. We matched the parentheses properly, so there's no error.
At this point, because we saw Function and noted it, we expect to see a Returns token. If we had seen a Sub instead, then we would throw an error on seeing a Returns token.
Token: Returns
Eat, it's just language verbiage.
Token: Decimal
Now we know what to prepend to the mangled name for a return type. So put the proper type specifier and field separator in.
theMangledName = $d% + theMangledName
Now we're all done getting tokens for this line, and so we finally end up with the proper mangling: $d%Foo%Bar%$i$s. The compiler can then add this single, easy-to-parse string to the symbol table so that it knows all the information it cares about when dealing with the method call. It can then pass the information off to the linker so that the linker can decide which methods to dead-strip out of the finished executable, etc.
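The whole walk-through can be condensed into a short Python sketch. Treat it as an illustration of the steps above rather than how any real compiler does it: the names `mangle`, `TYPE_CODES` and `MODIFIER_CODES` are hypothetical, and the only codes it knows are the ones that appear in this post.

```python
# Codes for the imaginary language. Only $i, $s, $d and #r appear in the
# post; the table names themselves are assumptions made for this sketch.
TYPE_CODES = {"Integer": "$i", "String": "$s", "Decimal": "$d"}
MODIFIER_CODES = {"Reference": "#r"}

def mangle(class_name, tokens):
    """Turn the tokens of one method declaration into a mangled name."""
    tokens = list(tokens)                 # work on a copy
    kind = tokens.pop(0)                  # "Function" or "Sub"
    if kind not in ("Function", "Sub"):
        raise SyntaxError("expected Function or Sub")

    name = class_name + "%" + tokens.pop(0)   # the method name comes next

    if tokens.pop(0) != "(":
        raise SyntaxError("expected '('")

    params = ""
    while tokens and tokens[0] != ")":
        token = tokens.pop(0)
        if token in MODIFIER_CODES:
            params += MODIFIER_CODES[token]
        elif token in TYPE_CODES:
            params += TYPE_CODES[token]
            if tokens and tokens[0] not in (",", ")"):
                tokens.pop(0)             # eat the parameter name
        # Commas are simply eaten.
    if not tokens:
        raise SyntaxError("missing ')'")
    tokens.pop(0)                         # eat ')'

    if params:
        name += "%" + params              # separator only if there are params

    if kind == "Function":
        if len(tokens) != 2 or tokens[0] != "Returns":
            raise SyntaxError("Function needs 'Returns <type>'")
        name = TYPE_CODES[tokens[1]] + "%" + name   # prepend the return type
    elif tokens:
        raise SyntaxError("Sub cannot have a Returns clause")
    return name


bar = mangle("Foo", ["Function", "Bar", "(", "Integer", "baz", ",",
                     "String", "whirlygig", ")", "Returns", "Decimal"])
wahoo = mangle("Foo", ["Sub", "Wahoo", "(", "Reference", "Integer", "goober", ")"])
lala = mangle("Foo", ["Sub", "LaLa", "(", ")"])
```

The three results match the mangled strings shown earlier in the post. The Function/Sub bookkeeping is what lets the sketch reject a Sub that tries to sneak in a Returns clause.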
So, as you can see, name mangling really isn't anything mystical in terms of how you do it. As for the why, it simply boils down to being the most practical way to store information. So why do modern compilers decide to store information in a mangled name in the first place? Why don't they just continue to work the same way that many C++ pre-processors worked originally? Simple: people wanted the ability to know information about the method at runtime itself. This feature is called introspection, and it's basically a way for you to gain extra information about classes, methods, parameters, etc. at runtime without a huge amount of overhead.
On a side note, if you ever wonder why it's such a pain to dynamically load a C++ class from a library, you should now have a better understanding of why. The names are mangled, and because the mangling scheme is left up to each compiler rather than fixed by the language, you can't portably predict the string to look up for a given method. You can get away with dynamically loading C functions from a library simply because C does not mangle the function names, so you have a unique, predictable string that you can look up in the library and load.
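You can see the C half of that from Python's ctypes module. This sketch assumes a POSIX-style system whose C library exports strlen, and the mangled-looking string at the end is just the made-up name from this post, not anything a real library exports.

```python
import ctypes
import ctypes.util

# Assumes a POSIX system. If find_library returns None, CDLL(None) falls
# back to the symbols already loaded into the process, which include libc.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# C does not mangle, so the symbol is literally "strlen".
strlen = libc.strlen
strlen.argtypes = [ctypes.c_char_p]
strlen.restype = ctypes.c_size_t
length = strlen(b"mangle")

# Looking up a name the library does not export fails. The string below is
# the invented mangling from this post, used only as a stand-in.
try:
    getattr(libc, "$d%Foo%Bar%$i$s")
    found_mangled = True
except AttributeError:
    found_mangled = False
```

The lookup is nothing more than a string match against the library's symbol table, which is why a predictable name matters so much.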
So there! A very long, and very geeky post that I bet no one will bother to read. But if you happen to still be reading this, feel free to ask questions if you have them! Also, if there's a geek topic you'd like to see posted, let me know.