The Bag of Holding

Speculation on semantically-aware version control

Posted 17 October 2013 · 846 views

Working on a massive code base comes with some interesting challenges.

Consider this scenario, loosely based on a real-world problem we recently encountered at work:
  • Create a multi-million line code base
  • Divide this into over two dozen branches for various independent development efforts
  • On some given branch, find some popularly-edited code that needs reorganization
  • Move some functions around, create a few new files, delete some old crufty files
Given these changes, how do you cleanly merge the reorganization into the rest of the code tree? Between the time of the reorg and the merge, other people have made edits to the very functions that are being moved around. This means that you can't just do a branch integration and call it good; everyone has to manually recreate the changes in one way or another.

Either the programmer who did the reorg is burdened with manually integrating all the smaller changes made from other branches into his code prior to shipping his branch upstream, or every programmer across every branch must recreate the organizational changes with their own changes intact.

Obviously part of this is just due to the underlying issue that one file is popular for edits across many branches. But the clear solutions to the problem are all annoying:
  • Option One: Force one person to manage many changes. This is gross because it overburdens an individual.
  • Option Two: Force all people to manage one change. This is even worse because it requires everyone to fiddle with code they may not even care about.
  • Option Three: Never write code which becomes popular for edit. This is obviously kind of dumb.
  • Option Four: Never reorganize code. Even more dumb.
At the root of the problem is the way version control on source code currently works. We track textual differences on a line-by-line (or token-by-token) basis. The tools for diffing, merging, history tracking, and integration are all fundamentally ignorant of what the code means.

What if this were not true?

Suppose we built some language that did not store code in flat text files. Instead, it divides code into a strict hierarchy:

Project -> Module -> Function -> Code

You could insert classes or other units of organization between "modules" and "functions" if the language is so designed, of course.

Now, suppose that instead of having all code in a module go into a folder, and all functions in the module go into text files in that folder, we just treat a module as a data unit on disk.

Within this unit, we have arbitrary freedom to stash code however we want. The IDE/editor/other tools would understand how to open this blob of data and break it up into classes, functions, and maybe even individual statements or expressions.

So here comes the interesting bit. Strap on your helmet, kiddies, we're going to go fast.
  • Assign each atomic unit of code (say, a function, or maybe even a statement if you want to get crazy detailed, but that's probably a bad idea) a GUID.
  • Store the data as a GUID followed by the textual code associated with that GUID.
  • Store alongside the data a presentation metadata model which describes how to show these units of code to the programmer. This should be fully configurable and rearrangeable via the editor UI.
  • Each of the smallest level-of-detail objects gets stored in a separate file, identified by its GUID.
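
Here is a minimal sketch of that storage scheme, using Python purely for illustration. Two in-memory dictionaries stand in for the files on disk, and the names `add_unit`, `units`, and `layout` are hypothetical, not part of any existing tool. The code text is just a placeholder string.

```python
import uuid

def add_unit(units, layout, module, code):
    """Store `code` under a fresh GUID and list it in the module's layout."""
    guid = str(uuid.uuid4())
    units[guid] = code                          # one "file" per atomic unit, keyed by GUID
    layout.setdefault(module, []).append(guid)  # presentation metadata: display order
    return guid

units = {}   # GUID -> code text; stands in for one file per GUID on disk
layout = {}  # module name -> ordered GUIDs; stands in for the presentation metadata

add_guid = add_unit(units, layout, "math", "add(a, b) = a + b")
sub_guid = add_unit(units, layout, "math", "sub(a, b) = a - b")

# What the editor would show for the module, in layout order.
module_view = [units[g] for g in layout["math"]]
```

The point of the split is that `units` never records where a piece of code is displayed, and `layout` never records what the code says.
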
Given a set of code attached to a GUID, we no longer show revisions in the version control system as edits to a file. Instead, we show them more granularly, following the organizational hierarchy defined by the language: project, module, function, code. The project as a whole is grouped as a tree, allowing easy visualization of the hierarchy. Open a single node and you can see all changes relevant to that node and its children, on any level of granularity you like.

This sidesteps our original problem in two interesting ways. First, if we just want to reorganize code in a module without changing its functionality, we can do so by modifying the presentation metadata alone. This allows even an old dumb text-based merge utility to preserve our changes across arbitrary branches.

Second, and more fascinating, what if we want to reorganize code across modules? All we have to do is record that the code from one list of GUIDs moved from one module to another. If we don't store modules as folders, but instead as another layer of metadata, we can make another simple textual change that records the GUIDs belonging to one module GUID in revision A, and another module GUID in revision B.
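
As a rough sketch of what that metadata change might look like, assume module membership is stored as a mapping from a module GUID to the unit GUIDs it owns. The function `move_units` and identifiers such as `mod-render` and `fn-2` are made up for the example.

```python
def move_units(membership, guids, src, dst):
    """Return a copy of `membership` with `guids` moved from `src` to `dst`."""
    moving = set(guids)
    missing = moving - set(membership[src])
    if missing:
        raise KeyError(f"not in {src}: {sorted(missing)}")
    updated = {module: list(members) for module, members in membership.items()}
    updated[src] = [g for g in membership[src] if g not in moving]
    # Keep the moved units in their original relative order.
    updated.setdefault(dst, []).extend(g for g in membership[src] if g in moving)
    return updated

# Revision A: hypothetical module GUIDs mapped to the unit GUIDs they own.
revision_a = {"mod-render": ["fn-1", "fn-2", "fn-3"], "mod-util": ["fn-4"]}

# Revision B: fn-2 and fn-3 now belong to mod-util; no code file was touched.
revision_b = move_units(revision_a, ["fn-2", "fn-3"], "mod-render", "mod-util")
```

The difference between the two revisions is a couple of lines of metadata, which is exactly the kind of change a line-based diff handles well.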

Why GUIDs? Easy: they allow us to rename any atomic unit of code, or any larger chunk of code units, arbitrarily without breaking the rest of the system. Delete a function? No problem! Just delete its GUID file from version control, the same way you would have deleted a file in the old approach. Add a function? Also no problem; it just becomes a new file in source control. Move a function around to another file or module or even project? Who cares?! It just changes metadata, not the code itself.

So let's bring it all together. What if we want to have the exact original scenario? Programmer A makes several changes to organization of some module Foo. Meanwhile, programmers B through Z are making changes to the implementation of Foo, on different code branches.

Merging all of this becomes trivial even under existing integration tool paradigms, because of how we decided to store our code on disk. Even cooler, anyone can engage in reorganizations and/or functionality changes without stomping on anyone else when it comes time to merge a branch upstream.
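
To make that concrete, here is a sketch of a plain file-by-file three-way merge applied to this layout. It assumes one file per unit GUID plus one layout file per module. The paths (`units/fn-1`, `layout/Foo`) and the function `merge_files` are invented for the example, and a real merge tool would also merge within a file rather than only choosing between whole versions.

```python
def merge_files(base, ours, theirs):
    """Merge two snapshots file by file against their common ancestor.

    Each snapshot maps a path to its contents. A path changed on only one
    side takes that side's version; a path changed differently on both
    sides is reported as a conflict and left out of the result.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o
        elif o == b:
            result = t
        elif t == b:
            result = o
        else:
            conflicts.append(path)
            continue
        if result is not None:   # None means the file was deleted
            merged[path] = result
    return merged, conflicts

base = {
    "units/fn-1": "area(w, h) = w * h",
    "units/fn-2": "perimeter(w, h) = 2 * w + 2 * h",
    "layout/Foo": "fn-1\nfn-2",
}
# Programmer A only reorganizes Foo: the layout file changes, the code does not.
branch_a = {**base, "layout/Foo": "fn-2\nfn-1"}
# Programmer B only edits an implementation: one unit file changes.
branch_b = {**base, "units/fn-2": "perimeter(w, h) = 2 * (w + h)"}

merged, conflicts = merge_files(base, branch_a, branch_b)
```

Because the reorganization and the implementation edit land in different files, `conflicts` comes back empty and `merged` contains both changes.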

All this requires is a little lateral thinking and some tool support for the actual code editor/IDE. If you want to support things like browsing the code repo from the web, all you need to do is add a tool that flattens the current project -> module -> function -> code hierarchy into an arbitrary set of text files - again, trivial to do with a little metadata.
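
A flattening tool along those lines could be as small as the sketch below. It assumes the same hypothetical shape as before (a `units` mapping from GUID to code text and a `layout` mapping from module name to ordered GUIDs), and the `.txt` naming is arbitrary.

```python
def flatten(units, layout):
    """Render each module as one text file, with units in layout order."""
    return {
        f"{module}.txt": "\n\n".join(units[guid] for guid in guids) + "\n"
        for module, guids in layout.items()
    }

units = {
    "fn-1": "area(w, h) = w * h",
    "fn-2": "perimeter(w, h) = 2 * (w + h)",
    "fn-3": "clamp(x, lo, hi) = max(lo, min(x, hi))",
}
layout = {"geometry": ["fn-2", "fn-1"], "util": ["fn-3"]}

files = flatten(units, layout)
```

The output is read-only by design: edits still go through the GUID-keyed units, and the flat files are regenerated whenever someone wants to browse them.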

The more I think about this approach to version control, the more convinced I am that Epoch is going to try it out. Flat text files are dumb and outmoded; it's time we used computers to do our work for us, the way it was always meant to be.

Advice to a Young Programmer

Posted 01 October 2013 · 9,942 views

One of the awesome things about ArenaNet is that we run a programming internship program that actually does a phenomenal job of preparing people to work in the games industry. This is accomplished by focusing on three primary principles:
  • Everything you do will matter. There is no pointless busy-work, no useless coffee-and-bagel fetching type nonsense, and every project contributes directly to something that impacts the studio and/or the game as a whole.
  • Everything you do will be reviewed. We have at least one senior programmer per intern dedicated to helping make sure that the work is top-notch. This entails exhaustive code reviews and extended design/analysis discussions before a line of code is even written.
  • Whether we end up hiring you or not, we're committed to making sure you leave the program as a good hire. The program is not a guaranteed-hire affair. However, with extremely few exceptions, we ensure that upon completion of the internship you're well prepared and ready to tackle game industry jobs.
One of the interns going through the program right now is assigned to me, and I've found it an awesome opportunity not just to mentor a more junior developer, but to force myself to crystallize and refine my own thinking so I can communicate it clearly to someone who doesn't have the benefit of many years of experience on the job.

Last week I had the chance to write down some thoughts about his performance so far in the program, and offer some feedback. After sitting down to re-read my letter, it struck me that there's a lot of stuff in there that might be useful for anyone who is just learning to work on large projects with large teams.

You may not be working on the next great MMO (or maybe you are!) but I think there's some value in this advice for anyone who is early on in their programming career.

Think Twice, Commit Once
This is my variant of the old carpenter's rule of "measure twice, cut once." In general, one of the biggest challenges of working on large-scale projects is keeping in mind all the ramifications of your decisions. Sometimes those implications are easier to see, and sometimes there's just no way to know ahead of time. But either way, it pays to take some time when writing code (and after writing code) to think very carefully about it.

One thing I personally like to do is let my non-trivial changelists sit for a day or so and then come back to them with a fresh mind, and re-read the code. I try to approach it as if I'd never seen the code before and was responsible for a code review on it. There are two directions that need to be considered: the tiny details, and the big-picture implications. Usually I will do two passes, one for each frame of mind.

I'll cover some of the small details later; the big things are generally harder anyway. It takes a lot of practice and experience to intuitively spot the consequences of design decisions; this has two important facets. First, it means that it won't be immediately obvious most of the time when you make a decision that has large-scale effects. Second, it means that it will take effort and conscious deliberation to train yourself to recognize those situations. The best suggestion I can offer is to pause often and try to envision the future - which I'll tackle next.

Be Nice to Future You
I could also say "be nice to everyone else who will ever read your code" but that doesn't make for as nice a section heading. The general idea here is that code is written once and then lives a very, very, very long time. The natural impact of this is that people will have to read the code many times while it survives. Again there are two directions this can go in: tiny details, and large-scale impacts, and again, the details are easier to spot - especially at first.

Some concrete examples are probably in order at this point. For details, one of the things that goes a long way is simple formatting. It may seem almost overbearingly anal-retentive to complain about whitespace and which line your braces go on, but it has an impact. After twenty-odd years of reading code, you get to a point where you can recognize common patterns very easily. Especially in a workplace like ArenaNet with a strict and consistently-followed formatting standard, this is a major time-saver; if everyone writes code that looks similar, it's easier to spot "weird" things. If my brain is distracted while reading a piece of code by the formatting, it gets harder to focus on the meaning and intent of the code itself.

On the larger scale, there are things like comments. Code commenting is a religious warfare issue in most of the world, and even ArenaNet has lots of diverse viewpoints on how it should be done. However, we have some basic philosophical common ground that is very helpful.

Sometimes comments need to go away, and sometimes they need to be added. Comments that need to go away are generally guilty of at least one of the following crimes:
  • Repeating what the code already says
  • Being out of date or in imminent danger of becoming out of date
  • Being outright false (often as a result of accidents with copy/paste)
Things that are good to comment are largely architectural: what modules do, how they fit together, what the intent of each major section of code is. The details ("this line of code adds this column to this CSV file") are already pretty obvious from the code itself - or at least, they should be, if the names used in the code are clear. Which leads to...

A Rose By Any Other Name Smells Like Shit
This is a lot more detail-oriented stuff. There are a number of conventions that any team will typically have that are important and help readability and clarity of intent.

There are many specific examples in ArenaNet's code standards documentation, but there are other conventions that are more general and likely to be in use in almost any environment. For example, pluralization is important. Iterator variables should be singular ("item" or "currency" or "character"). Containers should be plural ("items" or "currencies" or "characters"). The same goes for function names; if a function does one thing, name it singularly. If it does multiple things, or does one thing to multiple things, pluralize appropriately.
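
A small illustration of the pluralization convention, written in Python only for brevity. The item fields and function names are made up for the example and are not taken from ArenaNet's standards.

```python
def total_value(items):
    """Sum price * count over every item."""
    total = 0
    for item in items:          # singular: one element per iteration
        total += item["price"] * item["count"]
    return total

def is_affordable(item, budget):        # singular: acts on one item
    return item["price"] <= budget

def affordable_items(items, budget):    # plural: acts on many, returns many
    return [item for item in items if is_affordable(item, budget)]

items = [
    {"name": "potion", "price": 5, "count": 3},
    {"name": "sword", "price": 40, "count": 1},
]
```

Reading `for item in items` tells you at a glance which name is the container and which is the element, without looking at any declarations.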

In general names are hard to select and should be chosen with great care. If a function's purpose changes, change its name as well. Always make sure a name corresponds to the best description of intent that you can manage. (Just don't use extremelyVerboseVariableNamesWithLotsOfExtraneousDetails.)

Wastefulness Will Bite You Sooner Rather Than Later
Extra parameters that aren't used by a function should be removed. (There are compiler warnings for this, by the way - you should always build with warnings as errors and the maximum warning level that you can manage.) Similarly, make sure that all variables are actually used and do something. Make sure that function calls are useful. For example, initializing a variable and then immediately changing its value just creates extra noise that confuses the reader. Passing nothing but a string literal to sprintf() and then printf()ing the resulting buffer is confusing as well, and a little wasteful.
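
Here is a before-and-after sketch of that kind of noise. It is written in Python rather than C or C++, so `"{}".format(...)` stands in for the sprintf() call, and both functions are hypothetical.

```python
def describe_noisy(count, verbose):            # `verbose` is never read
    label = ""                                 # assigned, then immediately replaced
    label = "items"
    prefix = "{}".format("Count: ")            # formats a literal into itself
    return prefix + str(count) + " " + label

def describe(count):
    """Same result with the unused parameter and redundant steps removed."""
    return f"Count: {count} items"
```

Both versions return the same string, but a reader of the first one has to stop and check whether `verbose`, the empty `label`, and the format call are doing something subtle.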

This is partly about readability and partly about real efficiency. In large code bases like ours, the death isn't from a single massively wasteful piece of code - it's from thousands of tiny decisions that add up over time... both in terms of reading code, and in terms of how it performs at runtime. Both memory and processing power are something to keep in mind. They may seem cheap (or even free) at times, especially in environments with immense resources. But don't forget that everything has a cost and those costs accumulate a lot faster than we might wish sometimes.

A corollary to this is cleanup practices. If you remove a piece of functionality, make sure all vestiges of it are gone - comments, preparatory code, cleanup code, etc. It's easy to forget pieces of logic when removing things, and this just leads to more noise and confusion for the next reader. Once again, re-reading your own code with a critical eye helps a lot here.

Give It A Nice Home
Where things live in code as well as data is always an important consideration. Some things don't need to be wrapped into classes - if there's no state being carried around, or no shared interface that must be used, favor free functions instead of building classes that are just methods. On the data side, make sure to scope things as tightly as you can, and declare things as close as possible to their first point of use. Some things don't need to be file-level variables (let alone globals), and can just live as locals in some function. Some things don't even need to live the entire lifetime of a function. RAII is a big deal in C++, and should be used liberally.
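
The free-function point can be sketched as follows. This is Python, not C++, so it says nothing about RAII, and the class and function names are invented for the example.

```python
import math

class DistanceCalculator:
    """No state and no shared interface: the class is only a wrapper."""

    def distance(self, a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

def distance(a, b):
    """The same logic as a free function."""
    dx = a[0] - b[0]   # declared right where it is first needed
    dy = a[1] - b[1]
    return math.hypot(dx, dy)
```

A caller of the class version has to construct an object that holds nothing, while `distance((0, 0), (3, 4))` says everything in one call.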

File organization is also a big thing to keep in mind. Keeping related bits of code in self-contained files is a good habit; but it takes some careful thought to decide the taxonomy for what is "related." Think of programs like a pipeline. Each segment of pipe (module) should do one very specific and very contained thing. To build the whole program, you link together segments of pipe (modules/classes/etc.) and compose them into a more sophisticated machine.

Follow the Leader...
You should always try to emulate the code you're working in if someone else owns it. Follow the style, the naming patterns, the architectural decisions, and so on. Often you can make your life harder by departing from established convention, and you will definitely make the lives of everyone else harder at the same time.

Don't be shy about asking questions if any of those decisions or patterns are unclear. I know that a lot of this stuff seems a bit vague and mystical right now; much of the reasoning for why things are the way they are may not be immediately apparent. Asking questions is always good - if there are reasons for why things are a certain way, then you get to learn something; and if there are no reasons, or bad reasons, you open up an opportunity to make things better.

...But Clean Up What You Find
One of the highest aspirations we should have as team programmers is to leave code better than we found it. This ranges from fixing minor details to cleaning up design decisions and adding documentation. This is something you should try for even if you're not in your "own" code - often if there is objective room for improvement, the owner will actually appreciate you taking the time to make his area nicer. Obviously this doesn't mean you should go on a rampage to change every code file just for the sake of it, and there's always value in consulting with an owner before making changes - but you get the idea.

Learning Never Stops
It can be very tempting in life to plateau. Sometimes we just want to feel like we have "arrived" and now we "get it." And indeed there will be milestones in your career where you can notice a profound change in your skills and perspective.

The key is to never stop chasing the next boost. Even after more than twenty years of writing computer programs, I learn new things all the time. Your learning process is either constantly running, or you're effectively dead.

And that, of course, leads to the final piece of advice I could ever offer anyone on life: "don't be dead."
