Compilation of a C++ source file produces an object file. This object file contains a number of 'sections.' There's usually one section that contains executable machine code; there's other sections that contain things like hardcoded strings, globals, the names of the functions/objects that this object file contains that can be used by other object files (exported symbols) and the names of the functions/objects that this object file is expecting other object files to define.
Given this definition, it is surely obvious that we can define an operation merge(A, B) on a pair of object files A and B, such that the result of merge(A, B) is an object file the sections of which are made up from the sections of A and B. The strings section of merge(A, B) is simply the strings section of A followed by the strings section of B. The code section of merge(A, B) is simply the code section of A followed by the code section of B.
Most importantly though, the names of objects that merge(A, B) expects other object files to define is not the section from A followed by the section from B - B may contain functions that A requires and A may contain functions that B requires, so the merge process can actually resolve those requirements and remove the object names from the table.
Let's define concatenation on object modules as an alias for merge, such that A.B = merge(A, B) where A and B are object modules; it's very close to regular concatenation, it just includes this symbol resolution step. The full link process is thus a fold of . over the list of object modules A1..AN.
Now, here's an important property of concatenation on object modules - it's associative. (A.B).C is the same as A.(B.C). So, we can say that the fold-left over a set of four modules A,B,C,D, (((A.B).C).D), is the same as ((A.B).(C.D)). Now, pretty obviously, this scales up into a tree structure: the leaves of the tree are the original object modules, and the root is the final, fully linked executable.
It should be fairly clear that this would parallelize extremely well. Given N object modules and N/2 machines, and assuming that the cost of merge() is constant, linking would complete in O(log n) time. That's significantly better than linking everything in a serial manner on a single machine, which is going to take O(n).
OK, time leap from the window of the ivory tower. It's obviously a little more complex than this. What are the complications we have to take into consideration?
- Some modules have sections with special requirements - e.g. requiring that a global variable be at a particular place in memory.
- The cost of merge() is not constant; because of the resolution step that it involves, it's a function of the combined sizes of the unresolved external tables. If we only consider the pathological case, where the imports for the first half of the modules are the exports of the second half and vice versa, then the unresolved external tables for each half will continually build until the final merge in which all resolution must be performed.
- Some objects may be unreferenced in the final product, and for efficiency's sake, should be removed.
- We may want to perform extra optimization steps, such as link-time code generation.
If we resign ourselves to not creating the most efficient end product possible - leave in the unreferenced symbols and forget about LTCG - then the other two may prove to be solvable problems. Is it valid to just ignore the last two issues? I think it is. It means you wouldn't use this stuff for a submission build, for something that you really need the best possible code size and performance, but a large number of the builds I was doing were only debug builds anyway - no optimizations, lots of redundant information, the focus being on testing behaviour rather than performance. For that kind of build, the last two requirements do not apply.
So we're left with two major complications: special positioning requirements for certain sections, and picking modules in such a way that symbol resolution is balanced across all merges to as great an extent as possible.
The first of these is sort of an instance of the scheduling problem, and I reckon it's not much of a problem at all. If we change our merge operation a little to produce lists of sections instead of actual executable images, then it becomes trivial to attach offsets to sections that require them, and to have a tool run after the final merge that simply "pads out" the image such that the sections are in the correct positions.
The second is a lot more interesting. We want to try and pick our modules such that the resolution performed by each merge is at least balanced and at best greater at the leaves of the tree - as that's where we have the greatest number of processors available. Yet, we can't really do this unless we know the cost of a merge, and we don't know the cost of a merge until we do the merge.
The solution lies in the observation that the majority of object files exhibit temporal coherence; if an object file had a particular list of imports/exports last build, then chances are that its imports/exports this build will be fairly similar. So, using the costs of merges from the previous build is likely to give us a reasonable result. What we get is a sort of heap that persists from build to build; each build gives us a new set of costs that we can use to tune the grouping of modules, swapping things here and there to get a better result. The result is that the more you build, the faster it can get.
I've based my knowledge of linking on this article which focuses on the ELF format, and it's possible if not likely that Windows PE is more complex. I've probably also failed to observe some of the complications of the process. What do you think? I'd like to hear some dealbreaking problems [grin]