• Advertisement
Sign in to follow this  

GCC Name decoration question

This topic is 2154 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

After posting recently, I was reluctant to adding another post; fortunately this one is less about code and more about just getting information though. Bad part is, it possibly requires someone with intimate knowledge of the internals of GCC.

My question derived from a problem occurring by GCC's name decoration. When GCC compiles a program (Not at the linking stage yet), how does it determine that a function's return type can be coerced to the object that it is being set to?

Ok, that question right away to me, makes me see some ambiguities in what I mean (Some may interpret my question as: "Well a float can be truncated or rounded in value and stored in an int, or an int can be casted to a float...".). That's not what I'm asking smile.png Let me elaborate on my question. (It's going to sound a little "ramblish", but I think it may be necessary to demonstrate where I'm coming from and help anyone reading to understand where my confusion is. I could probably trim a line here or there, but I really think much of it is needed, unfortunately.)

I've put some code up on ideone.com located here: http://ideone.com/pKty2

The problem caused by the name decoration system utilized in GCC (That is somehow works around.):
My confusion stems from the fact that return types are not rolled into decorated names (They are with Microsoft's cpp compiler though; not sure about the .NET compilers...). Anyhow, ok so on line 31, we have an expression, more specifically a line that translates into an assignment operator call. According to the Annotated C++ Reference Manual, it's to catch certain errors. What it does not do is elaborate. What I get from its example is that the design choice was in reaction to a problem stemmed from rolling the return type into the decorated name. More significantly and specifically, because rolling a return type into the decorated name would now make identical function names that have identical function arguments have unique decorated names; since the return types are different and they (The return types) become part of the whole of the decorated name; making the entire decorated name different. (The book doesn't state the last few bits -- a major blunder imho since that's the problem it was trying to avoid; It only stated what was done and gave some code but you're left to logically legitimize the design choice by figuring out what the problem would have been.)

A worded example being played out that progresses in complexity:
Let's start out easy, and look at a normal simple function call with no parameters and just flat out function call (No assignments, etc.). (Rhetorical question) How does the compiler get the correct function? Through the decorating of the scope plus the name plus the arguments, then it goes down the list in the symbol table matching the signature and grabs the address, spins up a function call thunk (Might be stub; not sure on proper terminology here), and emits it as intermediate code representation. No problem, pretty straight forward, pretty simple and easy. Now let's, throw some arguments into the mix. Again, looks up the function matching the signature no problem.

However, in going back to the original problem, let's say there is a single function (In a class or not -- kind've irrelevant for the purposes of demonstrating my question.) just to make it easier. And let's say that function has no arguments, so it's argument list is "v" for void. Now we go back to assignment statement. Somehow the compiler knows whether or not that function (matching the scope, name, and argument list) is of an acceptable return type to the L-Value; despite the return type not being rolled into the decorated name. My question is how?! Please note: this question is irrespective to whether an int can be casted to a float, or a float truncated to an int, etc. My question is the compiler knowing the return is an int or a float -- or the more complex example, a compound object (a struct, class, etc...), which I suppose would be the same implementation in determining if it were a first-class/primitive type (int, char, bool, float, etc...).


My work towards trying to figure this out:
I started thinking about the Cartessian Product Algorithm (CPA), making a super set of smaller sub-sets, etc. But then I realized, wait a min. that doesn't answer the question, because if I'm able to use the CPA then that implicitly assumes I've already discerned the type returning by that function. Then I thought of the typical binary trees. Then I realized: "Well if that's the case, then the return type IS being referred to on some level at some point, somehow -- just not in the decorated name". Then I tried looking at the gcc code, that didn't go so well. heh. I can read code just fine, but the formatting, imho, was just about the most hideous format I had ever seen. I've done a bit of research on it, but I haven't found the answer to how the compiler knows to kick an error out on an assignment statement without knowing it's return type (Or maybe it does?). I know it's obviously not the decorated name, as per the Annotated C++ Reference Manual, and several sites I found online detailing the naming scheme. I've even tried checking the dragon book. This is what I found it says:

p. 391: (Continued from p. 390 on "6.5.3 Overloading of Functions and Operators")
"It is not always possible to resolve overloading by looking only at the arguments of a function. In Ada, instead of a single type, a subexpression standing alone may have a set of possible types for which the content must provide sufficient information to narrow the choice down to a single type"

p. 423:

  • "Function types. The type of a function must encode the return type and the types of the formal parameters"

    Problem 1) Unfortunately, the dragon book is heavily based on designing a java like language, which may have different internal semantics than say GCC.
    Problem 2) The first snippet references Ada. I'm not interested in Ada, I'm interested in C++ tongue.png

    Then I started thinking well maybe it can tell if the types are equivalent enough based on the memory size of the types, but then I realize, again: I'd have to first know the type being returned by the function. There are plenty of solutions I can come up with (Essentially all boiling down to throwing more information at the problem, or rolling the return type into the decorated name -- which is essentially throwing more information that the problem.), but I kinda, personally, consider GCC the defacto standard, not Microsoft, not Borland, etc (Though they are very respectable compilers, and I don't have a single thing against them.). There's just still a piece of the puzzle missing... Any thoughts on this would really be appreciated! I'm kinda at a standstill until I figure this out...

Share this post


Link to post
Share on other sites
Advertisement
I think you've got part of the process reversed. GCC doesn't determine the function type from the name decoration, it determines the name decoration from the function type. In order to call a function you need to provide a function declaration. GCC determines the arguments and return type from that function declaration. When it emits a call it generates a decorated symbol name from the function declaration. GCC won't go looking around in your generated object files for symbol names and extract functions from them, that's what header files are for.

Share this post


Link to post
Share on other sites
Right. I follow that (At least I think I do.). If I understand you correctly, the function declarations in the header files cause a decorated name to be generated. But also, by the sounds of it, you're saying that the symbol tables don't hold the decorated names -- i.e. the decorated names are generated on the fly each time the call is made? Even still, wouldn't return types still have to be part of it?

If I understand the design correctly, GCC should generate the decorated name for each function declaration that the parser comes across. Then upon definition of the function, the code is then inserted into the function. And any call to any function, triggers GCC to generate a decorated name string to be used against the symbol table in the effort to find the appropriate function whose decorated name (That was already entered via a declaration previously.) matches that string. If it does, a call stub is created, if it does not, it errors that an "overloaded function conversion" (If it's an assignment statement.) cannot be found (Or whatever the error is.). So the decorated names are created so they can then act as uniquely named "bookmarks" for any calls to those functions. Is that right? If it is, you mentioned the return type is is part of the generated decorated name; how does that square with the ARM, and various sites that say the return type is not part of the decorated name?

Edit: I think I know why you may be saying I'm getting it reversed.
GCC doesn't determine the function type from the name decoration[/quote]

Well normally yeah, it doesn't. However, (this is what I'm trying to figure out), when you have these decorated names in a symbol table, and a call, as part of the assignment statement, is made to a function, how does the compiler know that the selected function call has an appropriate return type? It must do some kind of checking, otherwise it wouldn't error. In my example, (The ideone.com link I posted in the opening post.) I envision GCC to undergo these steps:
1) See the statement (On line 31 I believe, if memory serves correct.)
2) Generate a decorated name for the function call Calculate (Based on the scope path, the name, and arguments)
3) It then sees a function in its internal symbol table list and selects it
4) It then returns back to where the call initiated (Line 31), and says "Wait. No that's not valid, I'm passing you back a cTest object, and the integer type does not have an assignment operator to handle this (cTest) type so I'm emitting an error instead.".

How does it know what it's returning, so that it can make the determination whether it should error or not? Edited by StakFallT

Share this post


Link to post
Share on other sites
(this is just guessing)

When the compiler finds a function declaration, it adds the unmangled name and it's arguments to a list. When it stumbles upon a call to some_func() it iterates through that list and tries to find a function called some_func() which takes the arguments passed at the call site. If no exact match is found, then it must take implicit conversions etc into consideration. If it finds an overload that can be called with the given arguments, then it can verify that the return type of that function/overload can be used at the call site (for the assigment in your case). The name mangling has nothing to do with this.

[Edit]
The mangled name is just a unique identifier for that function/overload. It's not used for anything except referring to a specific function/overload.

Share this post


Link to post
Share on other sites
hmmm...

[...] it can verify that the return type of that function/overload can be used at the call site [...][/quote] Right, but how does it know the selected function's return type without it being in the decorated name? I can't picture it executing through the code of the function to get the type of object that is going to be returned and then go with that, doing so would have complexities far beyond anything that can be imagined (Things like the order of the functions having their definitions filled out by having been fully parsed -- otherwise you wind up into a situation of one function call, calling another that also hasn't been finished and thus the return type isn't yet known, etc.)

By the sounds of it, I guess the call site is the answer-- upon declaration of a function a call site is created and has the return type stored there (While the return type is readily accessible) -- which means the return type IS being stored... But I thought call sites were only generated when a function call is made not when a function call is declared or defined? If it's being generated when a function call is hit, it's already too late to be storing information about the call it's trying to make since the return type is no longer able to be stored as the parser has moved on since it saw it. In the meantime I'm going to keep researching call sites, I kinda thought I read they were generated on the fly when a function is being called though. Plus, I thought call sites were mostly a construct in Run Time Environments, longer after the compiler stage?


On a side note, doing some quick searching on yahoo for call sites, I stumbled across a news article:
http://www.phoronix....item&px=MTA3NTE

Happy belated 25th birthday GCC! smile.png

Share this post


Link to post
Share on other sites
Forget all about name mangling. That's not relevant at all here.

The parser parses the preprocessed source file from top to bottom. For every declaration it comes across (function, member variable, method, global, local variable, ..), it stores it in some table/list/whatever. When it parses a function/method body and comes across a function call, it looks up the function name in the table. If there are multiple functions in that table with the same name, it'll try to find one that can be called with the given arguments. Once it knows which function/overload is called, it knows the return type of that function/overload since that is stored in the table. Now it sees that the return value is to be stored in a variable. Since the variable and its type is stored in the table, it's easy to verify that the return type can be stored in that variable.

Share this post


Link to post
Share on other sites
Well that was my thought exactly, was that the return type has to be stored -somewhere- if it's not in the decorated name. Because you said exactly what my thoughts were, if it's stored in the table it EASY to determine it's type -- follows my original thought was to just throw more information at the problem. The problem is, I haven't found anything that actually says it is stored in a table, list, call site, etc. Nothing. No mention of the return types ever, only to mention it's not decorated.

All the documentation I ever find on this issue is like it's saying: "I'm thinking of a number. I won't tell you what it is, but I'll give you one hint as to what it's not (In this example three hints), it's not a negative number, it's not a fractionalized number, and it's not an imaginary number." and then be left guessing what the number (In this case, the piece of the puzzle) is. Reminds me of how some of my previous math teachers taught: "I'm going to skip this step here here and here; and while your mind is spinning and is busy spending precious cpu cycles on trying to make sense of things I'll be long past the point I'm at now, which is already past where you are presently.". heh.

What you explained, is this referenced somewhere, or do you know this to be fact or is it a hunch? If it is a hunch, any ideas on how I can go about proving it, besides diving into the GCC code again?

Share this post


Link to post
Share on other sites
I just assume that's how GCC does it.

Exactly how this is done is an implementation detail and might differ from compiler to compiler, versions of a compiler or even depending on compilation options. Some compilers/versions/settings/whatever may use a hash table, others a binary tree, some might even use an SQL database.. What you should be worried about is not exactly how it works but rather how it's expected to behave (documented in the standards) and that behaviour could be achieved by the steps I outlined above.

Share this post


Link to post
Share on other sites
Yeah I know about how it can differ amongst compilers. Was just trying to follow GCC's way of doing it, since it's the defacto standard (To me that is). Then again as a con-gcc side, if GCC's source code is any indicator of how to do something, then it's most certainly in my best interest to not do it that way. Flipping back towards a pro-gcc side, 25 years is a long battle tested run. Though it makes me wonder how much of their code they would love to change and redo, but because it's so enormous it would take -WAY- too long and would be quicker to probably just do a rewrite -- which sides back towards not following their way. Either way, the ways you mentioned about hash tables, binary tree, or the potential of sql databases all implicitly state that the return type is being stored -somewhere- -- it's not just discerned automagically. I guess throwing more information at it is my answer. I just wish there was some more concrete answers as to how GCC does it was out there and then base whether I follow that route or make my own. Like you said, following the expected behavior should all that I need to do. Which is kinda a huge weight off my shoulders knowing I'm free to implement how I want. Thanks to you and SiCrane! :)

Share this post


Link to post
Share on other sites
Actually, it doesn't HAVE to be stored somewhere, in theory the compiled could reparse the entire file to find the declaration it needs whenever it stumbles upon a function call. It would work just as good as storing it in some table would, just be A LOT slower.

Share this post


Link to post
Share on other sites
This leads to interesting situations, because the compiler has thrown away some information.

For instance, we have a header file and two source files. The header file declares two unrelated types:

#ifndef HEADER_H
#define HEADER_H

struct A
{
int x;
int y;
};

struct B
{
int a;
float b;
};

#endif


The first source file contains a simple function implementation:

#include "header.h"

B frobnicate(const B &b)
{
return b;
}


The second source file includes an non-matching function declaration, and attempts to use that function:

#include "header.h"
#include <iostream>

A frobnicate(const B &);

int main() {
B b = { 1, 42.0f };
A a = frobnicate(b);
std::cout << a.x << ',' << a.y << '\n';
}


The following program compiles and links with no error messages.

Using GCC, if the function declaration is in the header file, the following occurs (in this case, two.cpp is the one with the implementation of frobnicate):


user@host:~$ g++ one.cpp two.cpp
two.cpp: In function ‘B frobnicate(const B&)’:
two.cpp:3:24: error: new declaration ‘B frobnicate(const B&)’
header.h:16:3: error: ambiguates old declaration ‘A frobnicate(const B&)’
[/quote]

I can't imagine this happening often, except maybe if there is a function signature collision in unrelated areas of the program and the programmer forgets to implement their new function. The program will link, and I imagine much fun will be had trying to track this down.

Share this post


Link to post
Share on other sites
You can not overload C++ names on return type. This is clearly and explicitly defined in the C++ standard, and relied upon by a number of idiomatic techniques (eg. factory functions aka "virtual constructors").

There is no point in including return type information in the mangled symbol name, since it does not participate in symbol resolution. Mangled symbol names are only of importance at link time. Their sole purpose is to effect C++ type safety in a C-based (type-unsafe) world.

The return type of a function is used in name resoltion at compile time, so the compiler will keep track of it during the first and second phase of name lookup. It is not possible to emit a mangled name until after the name lookup phases are complete, at which point the return type becomes irrelevant.

Share this post


Link to post
Share on other sites

There is no point in including return type information in the mangled symbol name, since it does not participate in symbol resolution.

Yes, there is. It would help catch at link time situations when a function is declared with one return type and defined with another.


The return type of a function is used in name resoltion at compile time, so the compiler will keep track of it during the first and second phase of name lookup.
[/quote]
No, return type is ignored during name resolution. Name resolution for functions includes the types of arguments, but what symbol is looked up is entirely independent of how it is used.

Share this post


Link to post
Share on other sites
Wow. A lot of posts since I've had a chance to reply. I'll quote what I'm replying to so as to avoid any confusion. (I hate quoting though, it always looks like I'm calling someone out, and that's not the case, it's just to make it more organized.)

Posted by DvDmanDT [size=2](Today, 04:45 AM)
Actually, it doesn't HAVE to be stored somewhere, in theory the compiled could reparse the entire file to find the declaration it needs whenever it stumbles upon a function call. It would work just as good as storing it in some table would, just be A LOT slower.[/quote]

That's my guess as well, just adding the darn thing to the table would simplify something that has HUGE potential to be a nightmare of epic proportions in terms of just how complex it can be made into.


Posted by rip-off [size=2]([size=2]Today, 05:32 AM)
This leads to interesting situations, because the compiler has thrown away some information.[/quote]

Exactly my point smile.png Once it's gone, it's gone. I can't see how it gets it back, which is specifically what I'm asking about tongue.png

Posted by rip-off [size=2]([size=2]Today, 05:32 AM)
I can't imagine this happening often, except maybe if there is a function signature collision in unrelated areas of the program and the programmer forgets to implement their new function. The program will link, and I imagine much fun will be had trying to track this down.[/quote]

I think the backend system that deals with clearing up ambiguity engages A LOT more than you might think. Essentially every time an assignment statement occurs, it has to kick in -- because of the return type being thrown away the compile now has to determine the type (BEFORE it can even decide how to coerce if it needs to). Which is a similar operation to dealing with two functions have different return types. It's all looking at the types (somehow). Now in the case of the example you provided, clearly it hit the first frobnicate function and operated under the condition that all other future declarations cannot match that one. As somewhat evidenced by the error. The VERY important problem (Not rolling the return type into the decorated name) that is trying to be avoided is best depicted in what I wrote in my opening post. I'll quote it here for convenience.

Posted by me [size=2](Yesterday[size=2], 06:50 PM)
According to the Annotated C++ Reference Manual, it's to catch certain errors. What it does not do is elaborate. What I get from its example is that the design choice was in reaction to a problem stemmed from rolling the return type into the decorated name. More significantly and specifically, because rolling a return type into the decorated name would now make identical function names that have identical function arguments have unique decorated names; since the return types are different and they (The return types) become part of the whole of the decorated name; making the entire decorated name now different. (The book doesn't state the last few bits -- a major blunder imho since that's the problem it was trying to avoid; It only stated what was done and gave some code but you're left to logically legitimize the design choice by figuring out what the problem would have been.)[/quote]

And yes, the amount of "fun" that could be had trying to disambiguate would be ridiculous. The amount of a mess that I (Or anyone else) could make out of that problem would be insane; a competition of just how nasty someone could make the problem would be easy. Which leads to what DVDmanDT said when he referenced how I mentioned MSVC rolling the return type into the decorated name. I can certainly see why they did. Rolling it into the decorated name, one just has to iterate through the list, stripping off the type. Though two problems with that approach:
1) Every iteration it has to finagle the string and each iteration would make it expensive.
2) You're locked into the name. Whereas, just storing the type in a table somewhere allows for far greater flexibility.
3) Security, the less data someone can figure out about a function, the better -- when it comes to security; for example, an online game and an unscrupulous player determining that a function is returning an int or a char, or some compound object type.

Posted by Bregma [size=2](Today[size=2], 09:35 AM)

There is no point in including return type information in the mangled symbol name, since it does not participate in symbol resolution. Mangled symbol names are only of importance at link time. Their sole purpose is to effect C++ type safety in a C-based (type-unsafe) world.[/quote]

There is if you're dealing with an assignment operator that needs to know it can be set to the return of a function -- i.e. type checking. Also isn't the purpose of a decorated name to locate it in the symbol table (Hence performing symbol resolution). Or by Symbol Resolution you mean removing scope or identical name ambiguity? By the time the linker comes around, it's already established the line number (In the intermediate code representation) to call, so I would say it's the opposite -- that is, at link time it's not important, but at compile time it is. The linker takes all the intermediate code tables and essentially appends (Or "links") them all together, possibly doing backpatching (If needed) due to relative line number destinations != absolute line number destinations. There's probably a sub-phase that occurs on the intermediate code tables that deals with this by backpatching the line numbers being referenced and adjusts them accordingly, but still leaves it in intermediate code form for the linker to work with. The type checking is something done during compile time as evidenced by COMPILE errors if an object is set to a function's return and their exists no assignment operator to handle that function's return type.

I like what SiCrane wrote: "[...] but what symbol is looked up is entirely independent of how it is used." There in lies the problem :/ Locating a symbol based on the name, and arguments is easy, but knowing that the return type of a function symbol selected (So that, if needed, an attempt at type coercion can occur and emit an error if it can't) is the difficulty.

I think I have two choices here, either throw more information at the problem which is the simplest, or check the gnu site for the gcc project page and email one of the maintainers and find out directly from the source (no pun intended.). I might just do both, the latter also because I'm actually interested in finding out how it's done now tongue.png

Share this post


Link to post
Share on other sites

Also isn't the purpose of a decorated name to locate it in the symbol table (Hence performing symbol resolution).

No, it's not. Name look up is performed with the undecorated name. Before name look up is performed you don't have enough type information to generate a decorated name. You get that type information from the name look up. The purpose of a decorated name is to create an identifier that incorporates name, scope and type of a given entity so that linkage can be done with tools that don't explicitly understand scope and type.

When an object file is generated it doesn't have the absolute location of the functions that need to be called. These are the external symbols that are noted as dependencies in your object files. When the linker comes by it looks for the external symbols each object file needs and tells it where those functions can be found.

Share this post


Link to post
Share on other sites
So during a function definition, upon hitting a function call, it starts "scope resolving" to the function call in question? hmm... I think I see the problem, I'm trying to take one system and adapt it to a problem that doesn't solve it that way. So far the closest problem I've encountered that has had a similiar operation has been building a decorated name upon the definition of a function to locate the internal structure (created upon declaration); though I guess technically that would be wrong since I'm using the decorated name not in a way it's meant for. Though, it would seem to me it's faster that way than "scope-diving" (For lack of a better term). But when it comes to function calls and assignments, that's why the decorated name is never mentioned I suppose.

The last part is similiar to what I wrote, which is that the addresses need to be adjusted due to the fact that relative addresses != absolute addresses; the difference being, instead of line numbers being adjusted, it uses the symbol name which then gets looked up and the address gets emitted.

Share this post


Link to post
Share on other sites

The last part is similiar to what I wrote, which is that the addresses need to be adjusted due to the fact that relative addresses != absolute addresses; the difference being, instead of line numbers being adjusted, it uses the symbol name which then gets looked up and the address gets emitted.

Yes, that's the purpose of a linker. Or of the runtime link interpreter for dynamic shared objects (.so files, .DLL files, .dylibs). By the time the linker gets run it's quite possible that information about the original programming language has been lost. As far as the linker is concerned, everything was written in assembly language. Hence the need for name mangling in C++.

Share this post


Link to post
Share on other sites
Well... I guess for me the good news is that my design isn't all that far off. I mean I was still off, but I don't think I was as far off as I initially though after reading that. I just need to go about getting the function inside of my definitions differently, and I do have my answer now.

I think I'm starting to see now why there were pages and pages and pages and pages of tree functions in the GCC code, it's starting to make a little sense now. The trees acted as a way to create the layout of what is under what (I.e. which classes were under namespaces, which functions under classes, etc...), and the functions can cut and paste a tree node (with it's subnodes) and attach them to other nodes -- (Sorta like an ultra-powerful B-Tree library.). And on the linker side the trees are collapsed into a linear single dimension, hence the structure layout is lost. What I'm still a little unclear about though is if the structure isn't needed in the linker phase, why the need to embed the structure? I suppose only because it serves to ensure the names are unique though for that matter sequential numbers could theoretically work to fill that role instead as well? (Not that I would want to, as I think numbers and letters and symbols make for a more flexible system)

Share this post


Link to post
Share on other sites
I'm not sure exactly which structures you're talking about now, but a lot of information generated at compile time is dumped into the end binary either for RTTI or exception handling purposes. Some of those data structures do duty for both.

Share this post


Link to post
Share on other sites
well with code there's a visual and design hierarchy of what is under what, but as far as pure text goes it's just all still text, so the code has to be placed into structures (typically tree nodes) to emulate the design, so that it can be traversed ala parent node, left node, right node, etc. That's all I meant smile.png

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement