C++ Compilation Process - C?

Started by
8 comments, last by Zahlman 17 years, 6 months ago
I've been programming for a couple of years now, and I've taken quite an interest in optimization, which led me to compiler design. Both are interesting, at least to me, and something recently caught my attention...

With all the complaining that C++ is NOT "C with classes" and that the two don't work together, I had trouble understanding the objection, because I could go into my code, look at a class, and see that I could implement inheritance, virtual functions, and whatnot in C. It just takes less time in C++. So for a while I thought of it as a superset of C, and part of me still does. But I do see a lot of functionality that's unique to C++...

This leads me to my question. While I've been studying compiler design, lexical parsers and all, I was wondering... Does C++ get converted into C, then assembler when compiled? Do some compilers take this approach, and others just go from C++ to assembler? It seems that it would be an easier conversion to go from C++ to C to assembler, IMO...

Anyways, I'm just wondering what ACTUALLY happens in the compiler process. Please, discuss! Thanks in advance!
We should do this the Microsoft way: "WAHOOOO!!! IT COMPILES! SHIP IT!"
Quote:Original post by dbzprogrammer

This leads me to my question. While I've been studying compiler design, lexical parsers and all, I was wondering... Does C++ get converted into C, then assembler when compiled? Do some compilers take this approach, and others just go from C++ to assembler? It seems that it would be an easier conversion to go from C++ to C to assembler, IMO...


I believe the first C++ compilers actually compiled to C code; at least the one Bjarne Stroustrup created did. These days, however, all the C++ compilers I know of compile directly to machine code. Also, neither C nor C++ compile(d) to ASM code either.

From http://www.research.att.com/~bs/bs_faq.html#bootstrapping

The first C++ compiler (Cfront) was written in C++. To build that, I first used C to write a "C with Classes"-to-C preprocessor. "C with Classes" was a C dialect that became the immediate ancestor to C++. That preprocessor translated "C with Classes" constructs (such as classes and constructors) into C. It was a traditional preprocessor in that it didn't understand all of the language, left most of the type checking for the C compiler to do, and translated individual constructs without complete knowledge. I then wrote the first version of Cfront in "C with Classes".

Cfront was a traditional compiler that did complete syntax and semantic checking of the C++ source. For that, it had a complete parser, built symbol tables, and built a complete internal tree representation of each class, function, etc. It also did some source level optimization on its internal tree representation of C++ constructs before outputting C. The version that generated C, did not rely on C for any type checking. It simply used C as an assembler. The resulting code was uncompromisingly fast. For more information, see D&E.
It depends on the compiler. The earliest C++ compilers generated C code (because, after all, C is only a glorified assembler, so it's a good target language). A modern compiler, such as g++, moves through half a dozen internal code representations used for checking, building and optimizing before it finally emits target code. Visual C++ compilers generate an ASM listing of their output if asked to. Many compilers create binary machine code directly, without passing through a human-readable text language like ASM; others (g++ among them) write assembly text and hand it to a separate assembler.
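To make the "C as a target language" idea a bit more concrete, here is a rough sketch of how a class with one virtual function could be lowered by hand into plain structs and function pointers. All the names (Shape, ShapeVTable, Square_area and so on) are made up for illustration, and this is not Cfront's actual output, just the general shape of the trick, written in C-style C++.

```cpp
// Hand-written stand-in for what a compiler might generate for:
//   struct Shape  { virtual int area() const = 0; };
//   struct Square : Shape { int side; int area() const override; };

struct Shape;  // forward declaration for the function pointer type

// One table of function pointers per class (the "vtable").
struct ShapeVTable {
    int (*area)(const Shape* self);
};

// Every object carries a hidden pointer to its class's table.
struct Shape {
    const ShapeVTable* vptr;
};

// "Derived" struct: the base sub-object comes first, so a pointer to
// the Square can be treated as a pointer to its Shape.
struct Square {
    Shape base;
    int side;
};

// The member function becomes a free function with an explicit "this".
int Square_area(const Shape* self) {
    const Square* sq = reinterpret_cast<const Square*>(self);
    return sq->side * sq->side;
}

const ShapeVTable Square_vtable = { Square_area };

// The constructor becomes an init function that also sets the vptr.
void Square_init(Square* sq, int side) {
    sq->base.vptr = &Square_vtable;
    sq->side = side;
}

// A virtual call "s->area()" becomes an indirect call through the table.
int Shape_area(const Shape* s) {
    return s->vptr->area(s);
}
```

Calling Square_init on a Square and then Shape_area on its base member goes through the table and ends up in Square_area, which is the same dispatch a C++ compiler arranges for you behind the scenes.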
Quote:Original post by Serapth

Also, neither C nor C++ compile(d) to ASM code either.


This statement, you mean they took C and directly went to the op codes? I'm a bit confused, because most compilers give you assembler code when you request it...

I mean, would it be faster to go straight to it, or digest it into something easier?

Pray tell.
Quote:Original post by dbzprogrammer
Quote:Original post by Serapth

Also, neither C nor C++ compile(d) to ASM code either.


This statement, you mean they took C and directly went to the op codes? I'm a bit confused, because most compilers give you assembler code when you request it...

I mean, would it be faster to go straight to it, or digest it into something easier?

Pray tell.


Yes. Most C++ compilers *can* generate ASM code, but generally they compile directly to object code.
Quote:Original post by dbzprogrammer
This statement, you mean they took C and directly went to the op codes? I'm a bit confused, because most compilers give you assembler code when you request it...

Assembly is just mnemonic opcodes. Writing an assembler is so much easier than writing a compiler.
Inside the compiler, the code first gets parsed into some kind of internal representation in memory. This typically has a tree structure (hence "abstract syntax tree") which reflects the tree structure of a program (think functions containing expressions, which contain factors and terms... now imagine a tree with a function node that has expression-node children, etc.). The compiler verifies the tree (semantic and type checking), then does tree traversals that transform that code into a sequence of opcodes. The opcode is also an abstract concept: each line of ASM code represents one opcode (more or less) in just the same way that each machine code instruction does (exactly). So the compiler normally just writes the opcodes out as their machine-code representation, but it is no trouble to get assembly code out instead: it just chooses that representation for writing the opcodes.

If that's difficult for you to digest, then I can offer only three suggestions: 1) take a university course on compiler theory; 2) [google]; 3) learn more about program *design*.
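For anyone who prefers code to prose, the tree-to-opcodes idea above can be sketched in a few dozen lines. This is a toy under loud assumptions: the opcodes belong to an invented stack machine (not any real instruction set), the tree is built by hand instead of by a parser, and the semantic/type checking step is skipped entirely. All names (Op, Instr, Node, emit, to_asm, to_bytes) are hypothetical.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Invented opcodes for a tiny stack machine.
enum class Op : std::uint8_t { Push = 0x01, Add = 0x02, Mul = 0x03 };

struct Instr {
    Op op;
    std::int32_t operand;  // only meaningful for Push
};

// AST node: a number literal, or a binary '+' / '*' with two children.
struct Node {
    char kind = 'n';                 // 'n' = number, '+' or '*' = binary op
    std::int32_t value = 0;          // used when kind == 'n'
    std::unique_ptr<Node> lhs, rhs;  // used for binary ops
};

std::unique_ptr<Node> num(std::int32_t v) {
    auto n = std::make_unique<Node>();
    n->value = v;
    return n;
}

std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>();
    n->kind = op;
    n->lhs = std::move(l);
    n->rhs = std::move(r);
    return n;
}

// Post-order traversal: children first, then the operator.
void emit(const Node& n, std::vector<Instr>& out) {
    if (n.kind == 'n') {
        out.push_back({Op::Push, n.value});
        return;
    }
    emit(*n.lhs, out);
    emit(*n.rhs, out);
    out.push_back({n.kind == '+' ? Op::Add : Op::Mul, 0});
}

// Representation 1: human-readable mnemonics, one opcode per line.
std::string to_asm(const std::vector<Instr>& code) {
    std::string text;
    for (const Instr& i : code) {
        switch (i.op) {
            case Op::Push: text += "push " + std::to_string(i.operand) + "\n"; break;
            case Op::Add:  text += "add\n"; break;
            case Op::Mul:  text += "mul\n"; break;
        }
    }
    return text;
}

// Representation 2: raw bytes (opcode byte, then a little-endian
// 32-bit operand for Push).
std::vector<std::uint8_t> to_bytes(const std::vector<Instr>& code) {
    std::vector<std::uint8_t> bytes;
    for (const Instr& i : code) {
        bytes.push_back(static_cast<std::uint8_t>(i.op));
        if (i.op == Op::Push) {
            const auto u = static_cast<std::uint32_t>(i.operand);
            for (int shift = 0; shift < 32; shift += 8)
                bytes.push_back(static_cast<std::uint8_t>((u >> shift) & 0xFF));
        }
    }
    return bytes;
}
```

Both writers consume the same instruction list. Building the tree for 1 + 2 * 3 with bin('+', num(1), bin('*', num(2), num(3))) and running emit gives one sequence that to_asm prints as mnemonics and to_bytes packs as binary, which is the "it just chooses that representation" point.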
To address some of the original questions: C++ uses an identical compilation model to C, for compatibility. C++ was successful largely because you can blend it with C and it'll work more or less out of the box (at least with much less work than porting between any other two languages).

However, while this is great for teams who need to migrate large source code bases from C to C++, it's a bad engineering practice.


There's a reason why people are against "C with classes" style programming in C++. Most of the good design practices of C are utterly incompatible with good design habits in C++. For instance, a very common C idiom:

#include <stdio.h>

/* Both structs start with the same fields, so a ComplexType can be
   passed around as a BaseType* and told apart by its typeID. */
typedef struct BaseType
{
  int typeID;
  int SomeVeryCommonField;
} BaseType;

typedef struct ComplexType
{
  int typeID;
  int SomeVeryCommonField;
  int SomeExtraField;
} ComplexType;

const int typeID_Base = 0;
const int typeID_Complex = 1;

void DoFoo(BaseType* p)
{
  if(p->typeID == typeID_Complex)
  {
    ComplexType* pc = (ComplexType*)(p);
    printf("Complex: %d %d\n", pc->SomeVeryCommonField, pc->SomeExtraField);
  }
  else
  {
    printf("Generic base: %d\n", p->SomeVeryCommonField);
  }
}



I've seen (and written) masses of C code using this idiomatic style. In C++, however, this should be reimplemented with inheritance and no explicit "typeID" constants. DoFoo should be replaced, preferably with a custom operator<< overload that lets you dump these objects to any std::ostream and that forwards to an internal DumpToStream virtual function.


That's just the tip of the iceberg. In C++, the entire IO model is different (stream classes vs. C stdio-style functions). The design principles are different. Techniques like RAII tie resource cleanup to object lifetime, which significantly alters how a system's design is laid out.
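Since RAII tends to be the least obvious of these to people coming from C, here is a small sketch of the idea. The File class and WriteGreeting function are hypothetical names invented for this example; the only real library calls are std::fopen, std::fputs and std::fclose.

```cpp
#include <cstdio>
#include <stdexcept>

// Owns a std::FILE*: acquired in the constructor, released in the
// destructor, so the handle cannot leak on any exit path.
class File {
public:
    File(const char* path, const char* mode)
        : handle_(std::fopen(path, mode)) {
        if (!handle_) throw std::runtime_error("could not open file");
    }

    ~File() { std::fclose(handle_); }

    // Copying would lead to a double fclose, so forbid it.
    File(const File&) = delete;
    File& operator=(const File&) = delete;

    std::FILE* get() const { return handle_; }

private:
    std::FILE* handle_;
};

// No fclose call and no cleanup label: the file is closed when "out"
// goes out of scope, on normal return and on exceptions alike.
void WriteGreeting(const char* path) {
    File out(path, "w");
    if (std::fputs("hello\n", out.get()) == EOF)
        throw std::runtime_error("write failed");
}
```

In the C version of WriteGreeting, every early exit would need its own fclose (or a goto to a shared cleanup block). Here the cleanup lives in one place, the destructor, and that shift is what changes how the surrounding code gets organized.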

When people make negative remarks about "C with classes", we're not referring to the fact that you can technically implement C++ in C. We're talking about the fact that good C++ code looks vanishingly little like good C code. They are different languages, and should be thought of (and handled) as such.


Keep in mind that because these languages are all Turing-complete, technically speaking they can be interchanged. That is, a program written in language A can also be written in language B (or C [wink]). Although Turing makes no promises that this will be pleasant in any way, it is still technically possible. What's important to remember is that some languages are much better for certain types of problems than others - and while they are all equivalent in a strict mathematical sense, it is never a good idea to come into one language and treat it like it's identical to some other language. Every language has its individual quirks, and "good" and "best" within the context of that language are almost guaranteed to be different from "good" and "best" in other languages.

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Quote:Original post by Zahlman
Inside the compiler, the code first gets parsed into some kind of internal representation in memory. This typically has a tree structure (hence "abstract syntax tree") which reflects the tree structure of a program (think functions containing expressions, which contain factors and terms... now imagine a tree with a function node that has expression-node children, etc.). The compiler verifies the tree (semantic and type checking), then does tree traversals that transform that code into a sequence of opcodes. The opcode is also an abstract concept: each line of ASM code represents one opcode (more or less) in just the same way that each machine code instruction does (exactly). So the compiler normally just writes the opcodes out as their machine-code representation, but it is no trouble to get assembly code out instead: it just chooses that representation for writing the opcodes.

If that's difficult for you to digest, then I can offer only three suggestions: 1) take a university course on compiler theory; 2) [google]; 3) learn more about program *design*.


Lol =) I like the google option.

I've looked through compiler source before and seen how it compiles a certain language; I was just wondering if there was a SPECIFIC way that most compilers compiled. It seems to vary a bit...

But, alright, that makes a lot of sense, thanks!

Quote:Original post by dbzprogrammer
I was just wondering if there was a SPECIFIC way that most compilers compiled.


There's a pretty standard pipeline, and a lot of standard theory for each step (especially for lexing and parsing), but the devil is in the details. For the generalities, you are referred back to options 1 and 2. For the specifics, 2 and 3. [smile]

This topic is closed to new replies.
