Speculation on semantically-aware version control
Working on a massive code base comes with some interesting challenges.
Consider this scenario, loosely based on a real-world problem we recently encountered at work:
- Create a multi-million line code base
- Divide this into over two dozen branches for various independent development efforts
- On some given branch, find some popularly-edited code that needs reorganization
- Move some functions around, create a few new files, delete some old crufty files
Now somebody has to merge. Either the programmer who did the reorg is burdened with manually integrating all the smaller changes made on other branches into the reorganized code before shipping that branch upstream, or every programmer on every other branch must recreate the organizational changes on top of their own work while keeping their own changes intact.
Obviously part of this is just due to the underlying issue that one file is popular for edits across many branches. But the clear solutions to the problem are all annoying:
- Option One: Force one person to manage many changes. This is gross because it overburdens an individual.
- Option Two: Force all people to manage one change. This is even worse because it requires everyone to fiddle with code they may not even care about.
- Option Three: Never write code which becomes popular for edit. This is obviously kind of dumb.
- Option Four: Never reorganize code. Even more dumb.
What if we did not have to pick from that list at all?
Suppose we built some language that did not store code in flat text files. Instead, it divides code into a strict hierarchy:
Project -> Module -> Function -> Code
You could insert classes or other units of organization between "modules" and "functions" if the language is so designed, of course.
Now, suppose that instead of putting all the code for a module into a folder, with the module's functions spread across text files in that folder, we just treat a module as a single data unit on disk.
Within this unit, we have arbitrary freedom to stash code however we want. The IDE/editor/other tools would understand how to open this blob of data and break it up into classes, functions, and maybe even individual statements or expressions.
So here comes the interesting bit. Strap on your helmet, kiddies, we're going to go fast.
- Assign each atomic unit of code (say, a function, or maybe even a statement if you want to get crazy detailed, but that's probably a bad idea) a GUID.
- Store the data as a GUID followed by the textual code associated with that GUID.
- Store alongside the data a presentation metadata model which describes how to show these units of code to the programmer. This should be fully configurable and rearrangeable via the editor UI.
- Each of the smallest level-of-detail objects gets stored in a separate file, identified by its GUID.
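To make the list above a bit more concrete, here is a minimal sketch of what that on-disk layout might look like. Everything in it is an assumption made up for illustration: the `units/` directory, the `.code` extension, the `layout.json` file, and the `add_unit` helper are not part of any existing tool.

```python
import json
import uuid
from pathlib import Path


def add_unit(repo: Path, module: str, name: str, code: str) -> str:
    """Store one unit of code in its own GUID-named file and list it in the layout.

    Hypothetical layout: <repo>/units/<guid>.code holds the text of the unit,
    and <repo>/layout.json holds the presentation metadata.
    """
    guid = str(uuid.uuid4())
    units_dir = repo / "units"
    units_dir.mkdir(parents=True, exist_ok=True)
    (units_dir / f"{guid}.code").write_text(code)

    layout_path = repo / "layout.json"
    layout = json.loads(layout_path.read_text()) if layout_path.exists() else {}
    # The module only records which GUIDs it shows, in which order, under which name.
    layout.setdefault(module, []).append({"guid": guid, "name": name})
    # Indented, sorted JSON keeps the metadata diffable as plain text.
    layout_path.write_text(json.dumps(layout, indent=2, sort_keys=True))
    return guid
```

The point of the split is that the code lives in one place and the description of where it appears lives in another, so the two can change independently.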
This sidesteps our original problem in two interesting ways. First, if we just want to reorganize code in a module without changing its functionality, we can do so by modifying the presentation metadata alone. This allows even an old dumb text-based merge utility to preserve our changes across arbitrary branches.
Second, and more fascinating, what if we want to reorganize code across modules? All we have to do is record that the code from one list of GUIDs moved from one module to another. If we don't store modules as folders, but instead as another layer of metadata, we can make another simple textual change that records the GUIDs belonging to one module GUID in revision A, and another module GUID in revision B.
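A cross-module move could look something like the sketch below. It assumes a simplified, hypothetical model where the layout is just a dict mapping each module to an ordered list of GUIDs; `move_units` is an invented helper, not an existing API.

```python
def move_units(layout, guids, src, dst):
    """Return a new layout with the given GUIDs moved from module src to module dst.

    Only the layout (metadata) is touched; the code stored under each GUID
    is never read or rewritten.
    """
    moving = set(guids)
    new_layout = {module: list(members) for module, members in layout.items()}
    new_layout[src] = [g for g in layout[src] if g not in moving]
    # Preserve the original relative order of the units being moved.
    new_layout.setdefault(dst, []).extend(g for g in layout[src] if g in moving)
    return new_layout


revision_a = {"Foo": ["guid-1", "guid-2", "guid-3"], "Bar": ["guid-4"]}
revision_b = move_units(revision_a, ["guid-2"], "Foo", "Bar")
```

Diffing `revision_a` against `revision_b` shows one GUID leaving `Foo` and arriving in `Bar`, which is exactly the small textual change described above.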
Why GUIDs? Easy: it allows us to rename any atomic unit of code, or any larger chunk of code units, arbitrarily without breaking any of the system. Delete a function? No problem! Just delete its GUID file in a new revision, the same way you would have deleted a file in the old approach. Add a function? Also no problem; it just becomes a new file in source control. Move a function around to another file or module or even project? Who cares?! It just changes metadata, not the code itself.
So let's bring it all together. What if we want to have the exact original scenario? Programmer A makes several changes to organization of some module Foo. Meanwhile, programmers B through Z are making changes to the implementation of Foo, on different code branches.
Merging all of this becomes trivial even under existing integration tool paradigms, because of how we decided to store our code on disk. Even cooler, anyone can engage in reorganizations and/or functionality changes without stomping on anyone else when it comes time to merge a branch upstream.
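Here is a rough sketch of why the merge gets easy. It assumes each branch can be reduced to two dicts, one mapping GUID to code and one mapping GUID to its module, and it uses a plain per-key three-way merge. The `merge_three_way` function and the sample data are invented for illustration; a real tool would need to handle ordering and much more.

```python
def merge_three_way(base, ours, theirs):
    """Merge two dicts that both descend from base; return (merged, conflicts)."""
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            winner = o              # both sides agree, or neither changed it
        elif o == b:
            winner = t              # only their side changed it
        elif t == b:
            winner = o              # only our side changed it
        else:
            conflicts.append(key)   # both changed it differently
            winner = o              # arbitrary choice for this sketch: keep ours
        if winner is not None:      # None means the key was deleted
            merged[key] = winner
    return merged, conflicts


# Common ancestor: two functions, both living in module Foo.
base_code = {"guid-1": "return x + 1", "guid-2": "return y * 2"}
base_home = {"guid-1": "Foo", "guid-2": "Foo"}

# Programmer A reorganizes: guid-1 moves to module Bar, code untouched.
a_code = dict(base_code)
a_home = {"guid-1": "Bar", "guid-2": "Foo"}

# Programmer B changes the implementation of guid-1, layout untouched.
b_code = {"guid-1": "return x + 2", "guid-2": "return y * 2"}
b_home = dict(base_home)

merged_code, code_conflicts = merge_three_way(base_code, a_code, b_code)
merged_home, home_conflicts = merge_three_way(base_home, a_home, b_home)
```

In this sketch the reorg and the implementation change land in different places, so neither merge reports a conflict. Two branches that edit the body of the same function would still conflict, just as they do today; what goes away is the collision between reorganizing and editing.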
All this requires is a little lateral thinking and some tool support for the actual code editor/IDE. If you want to support things like browsing the code repo from the web, all you need to do is add a tool that flattens the current project -> module -> function -> code hierarchy into an arbitrary set of text files - again, trivial to do with a little metadata.
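A flattening tool along those lines might be as small as the sketch below. It again assumes the hypothetical in-memory model of a GUID-to-code dict plus a module-to-GUID-list layout, and the one-file-per-module naming is just one arbitrary choice.

```python
def flatten(units, layout):
    """Return {filename: text}, one flat text file per module.

    units maps GUID -> code text; layout maps module name -> ordered GUIDs.
    """
    files = {}
    for module, guids in layout.items():
        # Units appear in the order the presentation metadata lists them.
        files[f"{module}.txt"] = "\n\n".join(units[g] for g in guids) + "\n"
    return files


units = {"guid-1": "def a(): pass", "guid-2": "def b(): pass"}
layout = {"Foo": ["guid-2", "guid-1"]}
flat_files = flatten(units, layout)
```

Since the flat files are generated output, they never need to be merged; they can be regenerated from the GUID store whenever someone wants to browse the code as text.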
The more I think about this approach to version control, the more convinced I am that Epoch is going to try it out. Flat text files are dumb and outmoded; it's time we used computers to do our work for us, the way it was always meant to be.