Let me preface this by saying that this site might not be the best place for this discussion, since it is not at all related to game development. However, over the years of reading discussions here (and, a long time ago, even posting from an account long since lost), I have come to respect the opinions of the regulars and have learned a great deal about programming from them, so I believe a good discussion on any programming topic can be had here. Nevertheless, if you know of other places on the web where this topic might spark interesting discussion, suggest away.
TL;DR version:
I'm considering using code generation to produce a large part of the code of an application written in C++. The generated parts would mainly be in the data layer, where different objects, their data, and their relationships are encoded, persisted, accessed and manipulated. My goals are adhering to the DRY principle, high consistency (e.g. the same type of relationship between objects behaves the same everywhere), easy refactoring - basically what every program should strive for. Obviously I could just create the appropriate abstractions and build a nice framework that accomplishes all that - and indeed, parts of the system already work that way. However, I'm proposing that for certain cases code generation could be a better alternative, offering performance and ease-of-use benefits without much drawback.
A little background on how I came to this idea: a while ago someone in a thread on this site suggested reading The Pragmatic Programmer, and I did (and I wholeheartedly recommend it). This inspired me to think about the application I'm building at work, and how its architecture could be improved. I had long known that there is just too much groundwork involved in adding new types of objects to the application and that the process is error-prone, but I struggled to come up with a solution without considerable drawbacks. Much of the data layer was repetitious: object A has attributes a, b, c, d, some have to be persisted, if A.a changes you have to reevaluate A.c, etc. There were definite patterns there, but they were applied "by hand" to every class. For example, every class has a persist method that handles persistence of data to a database. Another method handles serialization to memory. Another handles copying of instances. All three of those typically operate on the same set of properties, and indeed, can be abstracted to a single method that iterates over all properties and applies an action to them.
However, this implies storing the properties in a collection that can be iterated (which in turn implies some inheritance hierarchy or the use of a variant type), and storing metadata for every property, to be able to tell whether and how certain actions should be applied to that property. These constructs add considerable overhead to the layer of the application that has to be the fastest. There can be millions of object instances, each with tens of properties, and while a single user interaction typically affects only a few of them, some use cases call for updates that can cascade through most of the instances.
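To make the trade-off concrete, here is a minimal sketch of the runtime approach described above. Every name in it (GenericObject, Property, PropertyFlags, persistedNames) is made up for illustration and is not taken from the actual application; the variant holds only three example types.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical names throughout; this only illustrates the idea.
using Value = std::variant<std::int64_t, double, std::string>;

enum PropertyFlags : unsigned {
    kNone = 0,
    kPersisted = 1u << 0,
    kCalculated = 1u << 1
};

// Metadata and value travel together, once per property per instance.
struct Property {
    std::string name;
    unsigned flags;
    Value value;
};

class GenericObject {
public:
    void add(std::string name, unsigned flags, Value value) {
        assert(!name.empty());
        props_.push_back({std::move(name), flags, std::move(value)});
    }

    // One generic traversal stands in for the hand-written
    // persist / serialize / copy methods.
    template <class Action>
    void forEach(unsigned requiredFlags, Action&& action) const {
        for (const Property& p : props_) {
            if ((p.flags & requiredFlags) == requiredFlags) action(p);
        }
    }

private:
    std::vector<Property> props_;
};

// Example action: list the properties a persist step would touch.
std::vector<std::string> persistedNames(const GenericObject& object) {
    std::vector<std::string> names;
    object.forEach(kPersisted,
                   [&](const Property& p) { names.push_back(p.name); });
    return names;
}
```

The traversal is written once, but each instance now owns a vector, a name string and a variant per property, and every access goes through a flag test. With millions of instances that is the kind of overhead in question.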
This is where code generation comes in. The basic idea is the following: build a metamodel of the data layer, composed of:
- DataObject - the main building blocks, the objects themselves that we are generating code for
- DataObjectProperty - properties of data objects - typically type, name and tags (persisted, calculated, etc.)
- DataObjectRelationship - relationships between data objects (e.g. defined-by, depends-on)
Note that this is not a UML class model (although it arguably could be modeled as one), as it is domain specific and cannot (and should not) express the needs of any application but this one (or this family of applications).
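As a rough sketch, the three metamodel elements listed above could be held in plain C++ structs like these. The struct names follow the list; the tag and relationship enumerators come from its examples; everything else (MetaModel, find, relationshipsResolve, and the Beam/Material sample data) is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class PropertyTag { Persisted, Calculated };
enum class RelationshipKind { DefinedBy, DependsOn };

struct DataObjectProperty {
    std::string type;               // C++ type to emit, e.g. "double"
    std::string name;
    std::vector<PropertyTag> tags;

    bool has(PropertyTag tag) const {
        return std::find(tags.begin(), tags.end(), tag) != tags.end();
    }
};

struct DataObject {
    std::string name;
    std::vector<DataObjectProperty> properties;
};

struct DataObjectRelationship {
    RelationshipKind kind;
    std::string from;   // name of the source DataObject
    std::string to;     // name of the target DataObject
};

struct MetaModel {
    std::vector<DataObject> objects;
    std::vector<DataObjectRelationship> relationships;

    const DataObject* find(const std::string& name) const {
        for (const DataObject& o : objects) {
            if (o.name == name) return &o;
        }
        return nullptr;
    }

    // True when every relationship refers to objects that exist.
    bool relationshipsResolve() const {
        for (const DataObjectRelationship& r : relationships) {
            if (!find(r.from) || !find(r.to)) return false;
        }
        return true;
    }
};

// Invented sample data, only to show how the pieces fit together.
MetaModel makeSampleModel() {
    DataObject material{"Material", {}};
    material.properties.push_back({"double", "density", {PropertyTag::Persisted}});

    DataObject beam{"Beam", {}};
    beam.properties.push_back({"double", "length", {PropertyTag::Persisted}});
    beam.properties.push_back({"double", "mass", {PropertyTag::Calculated}});

    MetaModel model;
    model.objects = {material, beam};
    model.relationships.push_back({RelationshipKind::DependsOn, "Beam", "Material"});
    assert(model.relationshipsResolve());
    return model;
}
```

Because the model is ordinary data, it does not matter whether it is filled in by a DSL parser, a config file reader, or code like makeSampleModel.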
This metamodel can be populated in any number of ways (indeed, I originally started with the idea of a DSL to describe the data layer, then read Fowler's excellent material on the topic, which points out that the key concept is not the DSL but the model the DSL is populating). The metamodel is then processed, and code is generated from it for all the functionality the objects need that can be expressed by rules driven by the metadata.
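The processing step can be very small. The following is a minimal sketch of a generator that emits one struct with a persist method, assuming a stripped-down model (ObjectModel, PropertyModel) and an emitted `sink.write(name, value)` call that are both invented for this example. A real generator would more likely use a template engine and cover many more rules.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Stripped-down, hypothetical model: just enough to drive one rule.
struct PropertyModel {
    std::string type;
    std::string name;
    bool persisted;
};

struct ObjectModel {
    std::string name;
    std::vector<PropertyModel> properties;
};

// Emits C++ source text for one data object.
std::string generateClass(const ObjectModel& object) {
    assert(!object.name.empty());
    std::ostringstream out;
    out << "// GENERATED FILE - do not edit by hand\n";
    out << "struct " << object.name << " {\n";

    // Rule 1: every property becomes a plain member.
    for (const PropertyModel& p : object.properties) {
        out << "    " << p.type << ' ' << p.name << "{};\n";
    }

    // Rule 2: persist() touches only the properties tagged as persisted.
    out << "\n    template <class Sink>\n";
    out << "    void persist(Sink& sink) const {\n";
    for (const PropertyModel& p : object.properties) {
        if (p.persisted) {
            out << "        sink.write(\"" << p.name << "\", " << p.name << ");\n";
        }
    }
    out << "    }\n";
    out << "};\n";
    return out.str();
}
```

The decision about which properties to persist is made once, while generating, so the emitted persist() contains no flag checks at all.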
Advantages of this method:
The resulting C++ code pays no runtime cost for the abstraction, since the generic machinery lives in the generator rather than in the running program. Abstraction is only added to the C++ code where it makes sense, and not solely for reuse. This also makes the code easier to understand and debug. Note that the generated code looks as if large blocks had been copy-pasted, but in fact the abstraction is still there - it's just on the side of the code generator.
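For illustration, this is roughly what the output for one object might look like. It is an imagined example: the Rectangle class, its area rule and the RecordingSink stand-in for a database layer are all assumptions, not code from the real system.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Imagined generator output for a hypothetical "Rectangle" data object.
class Rectangle {
public:
    double width() const { return width_; }
    double height() const { return height_; }
    double area() const { return area_; }

    // "area depends-on width, height" has become two direct calls;
    // there is no dependency table to consult at runtime.
    void setWidth(double v) { width_ = v; recalcArea(); }
    void setHeight(double v) { height_ = v; recalcArea(); }

    // Only the properties tagged as persisted appear here.
    template <class Sink>
    void persist(Sink& sink) const {
        sink.write("width", width_);
        sink.write("height", height_);
    }

private:
    void recalcArea() { area_ = width_ * height_; }

    double width_ = 0.0;
    double height_ = 0.0;
    double area_ = 0.0;   // calculated, never persisted
};

// Hand-written stand-in for a database layer; it records property names.
struct RecordingSink {
    std::vector<std::string> written;

    void write(const char* name, double /*value*/) {
        assert(name != nullptr);
        written.emplace_back(name);
    }
};
```

Stepping through setWidth in a debugger shows exactly what happens, with plain members and no property lookup or variant in between.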
Disadvantages of this method:
If you want to understand the rules that drive the application, you have to delve into the code generator and the DSL, which requires extra knowledge. Parts of the code still have to be hand-written, so there must be a mechanism for separating hand-written and generated code, as well as a mechanism that lets generated code call into hand-written code (for functionality that cannot be fully expressed by the basic rules). The build process also gets more complicated.
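One possible separation mechanism, shown here only as a sketch, is to generate a base class and keep the hand-written part in a derived class that the generator never touches. The version below uses CRTP so the call from generated code into hand-written code needs no virtual dispatch. The file names in the comments, the class names and the onWidthChanged hook are all hypothetical.

```cpp
#include <cassert>

// ---- rectangle.generated.h (hypothetical; overwritten on every build) ----
template <class Derived>
class RectangleGenerated {
public:
    double width() const { return width_; }

    void setWidth(double v) {
        width_ = v;
        // Generated code calls into hand-written code through a hook
        // whose name is fixed by the generator's rules.
        static_cast<Derived&>(*this).onWidthChanged(v);
    }

private:
    double width_ = 0.0;
};

// ---- rectangle.h (hand-written; never touched by the generator) ----
class Rectangle : public RectangleGenerated<Rectangle> {
public:
    int changeCount() const { return changes_; }

private:
    friend class RectangleGenerated<Rectangle>;

    // Custom behaviour that the basic rules cannot express.
    void onWidthChanged(double v) {
        assert(v == width());   // the hook runs after the new value is stored
        ++changes_;
    }

    int changes_ = 0;
};
```

Regenerating the base file cannot clobber the hand-written class, and forgetting to implement a required hook shows up as a compile error rather than a runtime surprise.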
There is still a lot of detail I have omitted here. If you are interested, I am happy to discuss parts of the current system and how I'd like to refactor it in more detail. Thank you for reading, and thanks in advance for any insight.