Speculation on semantically-aware version control

Posted by ApochPiQ, 17 October 2013 · 521 views

Working on a massive code base comes with some interesting challenges.

Consider this scenario, loosely based on a real-world problem we recently encountered at work:
  • Create a multi-million line code base
  • Divide this into over two dozen branches for various independent development efforts
  • On some given branch, find some popularly-edited code that needs reorganization
  • Move some functions around, create a few new files, delete some old crufty files
Given these changes, how do you cleanly merge the reorganization into the rest of the code tree? Between the time of the reorg and the merge, other people have made edits to the very functions that are being moved around. This means that you can't just do a branch integration and call it good; everyone has to manually recreate the changes in one way or another.

Either the programmer who did the reorg is burdened with manually integrating all the smaller changes made from other branches into his code prior to shipping his branch upstream, or every programmer across every branch must recreate the organizational changes with their own changes intact.


Obviously part of this is just due to the underlying issue that one file is popular for edits across many branches. But the clear solutions to the problem are all annoying:
  • Option One: Force one person to manage many changes. This is gross because it overburdens an individual.
  • Option Two: Force all people to manage one change. This is even worse because it requires everyone to fiddle with code they may not even care about.
  • Option Three: Never write code which becomes popular for edit. This is obviously kind of dumb.
  • Option Four: Never reorganize code. Even more dumb.
At the root of the problem is the way version control on source code currently works. We track textual differences on a line-by-line (or token-by-token) basis. The tools for diffing, merging, history tracking, and integration are all fundamentally ignorant of what the code means.


What if this were not true?


Suppose we built some language that did not store code in flat text files. Instead, it divides code into a strict hierarchy:

Project -> Module -> Function -> Code

You could insert classes or other units of organization between "modules" and "functions" if the language is so designed, of course.
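To make the hierarchy concrete, here is one minimal way it could be modeled in memory. This is just an illustrative sketch in Python (the names `Project`, `Module`, and `Function` are my own, not part of any real tool):

```python
from dataclasses import dataclass, field

@dataclass
class Function:
    name: str
    body: str  # the code text itself

@dataclass
class Module:
    name: str
    functions: list[Function] = field(default_factory=list)

@dataclass
class Project:
    name: str
    modules: list[Module] = field(default_factory=list)

# A tiny project: one module containing one function.
proj = Project("demo", [Module("math", [Function("square", "return x * x")])])
```

An editor would walk this tree to present code; nothing about it dictates how the tree is laid out on disk, which is exactly the point.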

Now, suppose that instead of having all code in a module go into a folder, and all functions in that module go into text files in that folder, we just treat the module as a single data unit on disk.

Within this unit, we have arbitrary freedom to stash code however we want. The IDE/editor/other tools would understand how to open this blob of data and break it up into classes, functions, and maybe even individual statements or expressions.

So here comes the interesting bit. Strap on your helmet, kiddies, we're going to go fast.
  • Assign each atomic unit of code (say, a function, or maybe even a statement if you want to get crazy detailed, but that's probably a bad idea) a GUID.
  • Store the data as a GUID followed by the textual code associated with that GUID.
  • Store alongside the data a presentation metadata model which describes how to show these units of code to the programmer. This should be fully configurable and rearrangeable via the editor UI.
  • Each of the smallest level-of-detail objects gets stored in a separate file, identified by its GUID.
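The storage scheme in the list above can be sketched in a few lines. Assume (hypothetically) a repository directory holding one `.code` file per GUID plus a `presentation.json` describing display order; none of these names come from a real system:

```python
import json
import tempfile
import uuid
from pathlib import Path

repo = Path(tempfile.mkdtemp())  # stand-in for the on-disk module blob

def store_unit(code: str) -> str:
    """Write one atomic unit of code to its own file, keyed by a fresh GUID."""
    guid = str(uuid.uuid4())
    (repo / f"{guid}.code").write_text(code)
    return guid

# Two atomic units of code, each living in its own GUID-named file.
a = store_unit("int add(int x, int y) { return x + y; }")
b = store_unit("int sub(int x, int y) { return x - y; }")

# Presentation metadata: the order in which the editor shows the units.
(repo / "presentation.json").write_text(json.dumps([a, b]))

shown = json.loads((repo / "presentation.json").read_text())
```

Reordering the functions in the editor would rewrite only `presentation.json`; the `.code` files never move.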
Given a set of code attached to a GUID, we no longer show revisions in the version control system as edits to a file. Instead, we show them more granularly, following the organizational hierarchy defined by the language: project, module, function, code. The project as a whole is grouped as a tree, allowing easy visualization of the hierarchy. Open a single node and you can see all changes relevant to that node and its children, on any level of granularity you like.

This sidesteps our original problem in two interesting ways. First, if we just want to reorganize code in a module without changing its functionality, we can do so by modifying the presentation metadata alone. This allows even an old dumb text-based merge utility to preserve our changes across arbitrary branches.

Second, and more fascinating, what if we want to reorganize code across modules? All we have to do is record that the code from one list of GUIDs moved from one module to another. If we don't store modules as folders, but instead as another layer of metadata, we can make another simple textual change that records the GUIDs belonging to one module GUID in revision A, and another module GUID in revision B.
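As a sketch of that metadata layer: module membership is just a mapping from module GUID to the code-unit GUIDs it contains, and a cross-module move edits only that mapping. (The GUID strings below are placeholders for illustration.)

```python
# Module membership as metadata: module GUID -> GUIDs of its code units.
modules = {
    "module-foo": ["unit-1", "unit-2", "unit-3"],
    "module-bar": ["unit-4"],
}

def move_unit(modules: dict, unit: str, src: str, dst: str) -> None:
    """Move a code unit between modules by editing only the mapping.
    The files holding the actual code text are never touched."""
    modules[src].remove(unit)
    modules[dst].append(unit)

move_unit(modules, "unit-2", "module-foo", "module-bar")
```

The diff between revision A and revision B is a one-line textual change to this mapping, which any existing merge tool can handle.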

Why GUIDs? Easy: they allow us to rename any atomic unit of code, or any larger chunk of code units, arbitrarily without breaking anything in the system. Delete a function? No problem! Just remove its GUID from version control history, just as you would delete a file in the old approach. Add a function? Also no problem; it just becomes a new file in source control. Move a function to another file or module or even project? Who cares?! That just changes metadata, not the code itself.
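Renaming works the same way: the display name lives in metadata while the GUID is the stable identity, so a rename never touches the code file at all. Another hypothetical sketch:

```python
# Display names are metadata too; the GUID is the permanent identity.
names = {"unit-1": "parse_input"}

def rename(names: dict, guid: str, new_name: str) -> None:
    """Rename a code unit by editing only its metadata entry.
    The code file keyed by the GUID is untouched, so history survives."""
    names[guid] = new_name

rename(names, "unit-1", "read_config")
```

Contrast this with a text-based system, where renaming a function rewrites every line that mentions it and muddies the diff.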


So let's bring it all together. What if we want to replay the exact original scenario? Programmer A makes several changes to the organization of some module Foo. Meanwhile, programmers B through Z are making changes to the implementation of Foo, on different code branches.

Merging all of this becomes trivial even under existing integration tool paradigms, because of how we decided to store our code on disk. Even cooler, anyone can engage in reorganizations and/or functionality changes without stomping on anyone else when it comes time to merge a branch upstream.
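To see why the merge is trivial, consider a plain per-file three-way merge over snapshots (a `{path: content}` dict in this sketch). A's reorg touches only the metadata file; B's implementation edit touches only a code-unit file, so the two changes never collide:

```python
def three_way_merge(base: dict, ours: dict, theirs: dict) -> dict:
    """Per-file three-way merge: a file conflicts only if both sides
    changed it relative to the common base."""
    merged = {}
    for path in set(base) | set(ours) | set(theirs):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == b:
            pick = t                # only "theirs" changed (or neither did)
        elif t == b or o == t:
            pick = o                # only "ours" changed, or both agree
        else:
            raise ValueError(f"conflict in {path}")
        if pick is not None:
            merged[path] = pick
    return merged

base   = {"unit-1.code": "old body", "modules.json": '{"foo": ["unit-1"]}'}
ours   = {"unit-1.code": "old body", "modules.json": '{"bar": ["unit-1"]}'}  # A's reorg
theirs = {"unit-1.code": "new body", "modules.json": '{"foo": ["unit-1"]}'}  # B's edit
merged = three_way_merge(base, ours, theirs)
```

Both changes survive the merge with no manual intervention, because the storage layout keeps organizational edits and implementation edits in disjoint files.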


All this requires is a little lateral thinking and some tool support for the actual code editor/IDE. If you want to support things like browsing the code repo from the web, all you need to do is add a tool that flattens the current project -> module -> function -> code hierarchy into an arbitrary set of text files - again, trivial to do with a little metadata.
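That flattening step really is a small amount of glue. A sketch, assuming the same GUID-keyed units and module-membership metadata as above:

```python
def flatten(units: dict, modules: dict) -> dict:
    """Materialize GUID-keyed code units back into ordinary text files,
    one file per module, following the membership metadata."""
    files = {}
    for module, guids in modules.items():
        files[f"{module}.txt"] = "\n\n".join(units[g] for g in guids)
    return files

units = {"unit-1": "def a(): pass", "unit-2": "def b(): pass"}
modules = {"foo": ["unit-1", "unit-2"]}
files = flatten(units, modules)
```

A web repo browser would run this once per request (or cache it) and serve the result as if the code had always lived in flat files.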



The more I think about this approach to version control, the more convinced I am that Epoch is going to try it out. Flat text files are dumb and outmoded; it's time we used computers to do our work for us, the way it was always meant to be.




This is essentially what we did with the content streams system, and it's one of the primary reasons we did it.


We also discussed how this could be extended to code; essentially, you pretend the code is "content", in the form of the AST. Thus, you can do merges at the semantic level instead of the dumb textual level that version control systems typically support.

This has got to be one of the most common conversations I have at work. The strange thing is that with everyone talking about it, no one has actually done it yet...


I think it is worth mentioning that your described storage format is pretty much the structure of a git repository (file -> blob, metadata -> tree, etc). Might as well avoid reinventing the wheel, and just use git as your native storage format.

Check this out, not sure if it's 100% in line with what you describe: http://www.semanticmerge.com/

I wonder if very descriptive but concise commenting could make all the difference in version control. If the source control software is made to look for keywords, flags if you will, in the comments, then those could direct the software to highlight the things that matter most to the developer.
