Code generation vs. Framework

13 comments, last by Alberth 8 years, 8 months ago

Let me preface this by saying that this site might not be the best location for this discussion, since it is not at all game development related. However, over the years of reading discussions here (and, a long time ago, posting from an account long since lost), I have come to respect the opinions of the regulars and have learned a great deal about programming from them, so I believe a good discussion on any programming topic can be had here. Nevertheless, if you have suggestions for where else on the web this topic might spark interesting discussion, suggest away.

TL;DR version:

I'm considering using code generation to generate a large amount of the code of an application written in C++. The generated parts would mainly be in the data layer, where different objects, their data, and their relationships are encoded, persisted, can be accessed and manipulated. My goals are adhering to the DRY principle, high consistency (e.g. same type of relationship between objects behaves the same everywhere), easy refactoring - basically what every program should strive for. Obviously I could just create the appropriate abstractions and build a nice framework that accomplishes all that - and indeed, parts of the system already work that way. However, I'm proposing that for certain cases code generation could be a better alternative, offering performance and ease-of-use benefits without much drawback.

A little background on how I came to this idea: a while ago someone on this site suggested reading The Pragmatic Programmer in a thread, and I did (and wholeheartedly recommend it). This inspired me to think about the application I'm building at work, and how its architecture could be improved. I have long known that there is just too much groundwork involved in adding new types of objects to the application and that the process is error prone, but I struggled to come up with a solution without considerable drawbacks. Much of the data layer was repetitious: object A has attributes a, b, c, d, some have to be persisted, if A.a changes you have to reevaluate A.c, etc. There were definite patterns there, but they were applied "by hand" to every class. For example, every class has a persist method that handles persistence of data to a database. Another method handles serialization to memory. Another handles copying of instances. All three of those typically operate on the same set of properties, and indeed, can be abstracted to a single method that iterates over all properties and applies an action to them.

However, this implies storing the properties in a collection that can be iterated (which in turn implies some inheritance hierarchy or usage of a variant type), and storing metadata for every property, to be able to tell how/if certain actions should be applied to that property. These constructs add considerable overhead to the layer of the application that has to have the most performance (there can be millions of object instances, each with tens of properties, and while typically only several of them are affected by a single user interaction, some use cases call for updates to the data that can cascade through most of the instances).
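To make the trade-off concrete, here is a minimal sketch of the framework-style approach described above, assuming C++17 and `std::variant` as the variant type. `Value`, `Property`, `GenericObject` and `persistedNames` are hypothetical names invented for illustration; they are not taken from the actual application.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical variant type covering the property types this sketch supports.
using Value = std::variant<int, double, std::string>;

// Every property carries its metadata at run time; with millions of
// instances this is the per-property overhead discussed above.
struct Property {
    std::string name;
    Value value;
    bool persisted;  // should a persistence action touch this property?
};

struct GenericObject {
    std::vector<Property> properties;

    // A single traversal that persistence, serialization and copying
    // could all share by passing in a different action.
    template <typename Action>
    void forEachProperty(Action&& action) const {
        for (const Property& p : properties) action(p);
    }
};

// Example action: collect the names of all properties tagged as persisted.
std::vector<std::string> persistedNames(const GenericObject& obj) {
    std::vector<std::string> names;
    obj.forEachProperty([&names](const Property& p) {
        if (p.persisted) names.push_back(p.name);
    });
    return names;
}
```

The single traversal removes the repetition, but each instance now pays for a vector, a name string and a variant per property, which is the cost the generated-code approach tries to avoid.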

This is where code generation comes in. The basic idea is the following: build a metamodel of the data layer, composed of:

- DataObject - the main building blocks, the objects themselves that we are generating code for

- DataObjectProperty - properties of data objects - typically type, name and tags (persisted, calculated, etc.)

- DataObjectRelationship - relationships between data objects (e.g. defined-by, depends-on)
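As a rough sketch, the three concepts could be represented in C++ along these lines. Only the names DataObject, DataObjectProperty and DataObjectRelationship come from the list above; the fields, the `Tag` and `RelationshipKind` enums, `MetaModel`, `hasTag` and `exampleModel` are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Tags mentioned in the post; a real model would likely have more.
enum class Tag { Persisted, Calculated };

struct DataObjectProperty {
    std::string type;  // C++ type to emit, e.g. "double"
    std::string name;
    std::vector<Tag> tags;
};

enum class RelationshipKind { DefinedBy, DependsOn };

struct DataObjectRelationship {
    RelationshipKind kind;
    std::string from;  // name of the source DataObject
    std::string to;    // name of the target DataObject
};

struct DataObject {
    std::string name;
    std::vector<DataObjectProperty> properties;
};

// Hypothetical container for the whole metamodel.
struct MetaModel {
    std::vector<DataObject> objects;
    std::vector<DataObjectRelationship> relationships;
};

bool hasTag(const DataObjectProperty& p, Tag t) {
    for (Tag x : p.tags)
        if (x == t) return true;
    return false;
}

// Populating the model directly in C++, without any DSL or parser.
MetaModel exampleModel() {
    MetaModel m;
    DataObject a{"A", {}};
    a.properties.push_back(DataObjectProperty{"double", "a", {Tag::Persisted}});
    a.properties.push_back(DataObjectProperty{"double", "c", {Tag::Calculated}});
    m.objects.push_back(a);
    m.objects.push_back(DataObject{"B", {}});
    m.relationships.push_back(
        DataObjectRelationship{RelationshipKind::DependsOn, "B", "A"});
    return m;
}
```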

Note that this is not a UML class model (although arguably could be modeled as such), as it is domain specific and cannot (and should not) express the needs of any but this one (family of) application.

This meta-model can be populated in any number of ways (indeed, I originally started with the idea of a DSL to describe the data layer, then read Fowler's excellent material on the topic, where he points out that the key concept is not the DSL but the model the DSL is populating). The meta-model is then processed and code is generated from it for all the functionality the objects need to have that can be expressed by rules driven by the metadata.
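A minimal sketch of the generation step might look like the following. It is only an illustration under stated assumptions: `PropertySpec`, `ObjectSpec` and `generateClass` are hypothetical, and `Database` and `db.write` are placeholder names that appear only inside the emitted text, not an existing API.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, stripped-down model: just enough to drive one rule.
struct PropertySpec {
    std::string type;
    std::string name;
    bool persisted;
};

struct ObjectSpec {
    std::string name;
    std::vector<PropertySpec> properties;
};

// Emits the source text of a plain C++ class. The rule "persist() writes
// exactly the properties tagged as persisted" lives here, once, instead
// of being repeated by hand in every class.
std::string generateClass(const ObjectSpec& obj) {
    std::ostringstream out;
    out << "class " << obj.name << " {\npublic:\n";
    for (const PropertySpec& p : obj.properties)
        out << "    " << p.type << " " << p.name << ";\n";
    out << "    void persist(Database& db) const {\n";
    for (const PropertySpec& p : obj.properties)
        if (p.persisted)
            out << "        db.write(\"" << p.name << "\", " << p.name << ");\n";
    out << "    }\n};\n";
    return out.str();
}
```

The emitted class has no property collection and no runtime metadata; the loop over properties has been unrolled by the generator.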

Advantages of this method:

The resulting C++ code has no performance drawbacks. Abstraction is only added where it makes sense, and not solely for reuse. This also makes understanding/debugging the code easier. Note that the code looks like it has large blocks copy-pasted, but in fact the abstraction is still there - it's just on the side of the code generator.

Disadvantages of this method:

If you want to understand the rules that drive the application, you have to delve into the code generator and the DSL, which requires extra knowledge. Parts of the code still have to be hand-written, and there must be a mechanism for separating hand written and generated code, as well as mechanisms to allow generated code to call into hand-written code (for functionality that cannot be expressed in the basic rules fully). The build process gets more complicated.
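One possible way to separate the two kinds of code, sketched here under assumptions rather than taken from the actual system: the generator declares a hook member function and emits the calls to it, while the definition lives in a hand-written file that the generator never touches. The class `A`, the hook `computeC` and the file names in the comments are hypothetical.

```cpp
#include <cassert>

// ---- generated part (e.g. A_generated.h; regenerated, never hand-edited) ----
class A {
public:
    // Generated rule: whenever a changes, c has to be reevaluated.
    void setA(int value) {
        a_ = value;
        c_ = computeC();
    }
    int a() const { return a_; }
    int c() const { return c_; }

private:
    // Declared by the generator, defined by hand. An ordinary member
    // function, so no virtual dispatch is involved.
    int computeC() const;

    int a_ = 0;
    int c_ = 0;
};

// ---- hand-written part (e.g. A_custom.cpp) ----
// Placeholder logic standing in for functionality the rules cannot express.
int A::computeC() const { return a_ * 2; }
```

If the hand-written definition is missing, the build fails at link time, which at least makes the omission hard to overlook.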

There is still a lot of detail I have omitted here; if you are interested, I am happy to discuss parts of the current system and how I'd like to refactor it in more detail. Thank you for reading, and thanks in advance for any insight.

Both, either one, or neither.

The route of your generated code isn't too terrible, that's what languages have been doing since the invention of programming languages. You take one source definition, process it, and come out with a useful output. That output may be a game level file, a zip file, a runnable executable, a movie, or another source code file, to name a few. The tool serves as an abstraction. It takes repetitive pieces of functionality and gives it a more accessible interface. If that abstraction helps you think about your process or build things faster or more efficiently, go for it.

The route of a code framework is a similar abstraction, you are taking pieces of functionality and giving it more accessible interface. The big difference is that it is less transformative. You are sticking within the same set of tools and languages rather than a tool that maps your scripting into generated source.

Generally I prefer to stick with a framework since I can do the work inside existing tools. But sometimes it does make sense to build your own tools, some custom grammars and interpreters and generators. Whatever works for you.

I created a framework for defining "nodes", which are objects that have input and output properties that can be connected together using a visual graph editor (similar to UE4's blueprint system). Defining a node type requires giving each property a name, a classification (input/output), a type, and "affects" relationships with other properties. Serialization is done automatically by the framework. Also, each node has a "Calculate" method which is invoked by the framework to update an output property when it's dirty (it becomes dirty when a property which "affects" it changes).
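For readers who want to picture such a framework, here is a small sketch of dirty-flag propagation along "affects" relationships. It is not the poster's actual code: `Node`, `addInput`, `addOutput`, `affects`, `raw`, `AddNode` and `makeAddNode` are hypothetical names, and properties are restricted to `double` to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

class Node {
public:
    using Calculate = std::function<double(const Node&)>;

private:
    struct Prop {
        double value = 0.0;
        bool dirty = false;
        Calculate calc;                     // empty for input properties
        std::vector<std::size_t> affected;  // outputs this property affects
    };
    std::vector<Prop> props_;

public:
    std::size_t addInput(double initial) {
        Prop p;
        p.value = initial;
        props_.push_back(std::move(p));
        return props_.size() - 1;
    }
    std::size_t addOutput(Calculate calc) {
        Prop p;
        p.dirty = true;  // not yet calculated
        p.calc = std::move(calc);
        props_.push_back(std::move(p));
        return props_.size() - 1;
    }
    void affects(std::size_t input, std::size_t output) {
        props_[input].affected.push_back(output);
    }
    // Setting an input marks every property it affects as dirty.
    void set(std::size_t input, double v) {
        assert(input < props_.size());
        props_[input].value = v;
        for (std::size_t o : props_[input].affected) props_[o].dirty = true;
    }
    // Reading a dirty output triggers its Calculate function first.
    double get(std::size_t i) {
        assert(i < props_.size());
        Prop& p = props_[i];
        if (p.dirty && p.calc) {
            p.value = p.calc(*this);
            p.dirty = false;
        }
        return p.value;
    }
    double raw(std::size_t i) const { return props_[i].value; }
};

// Defining even a trivial "Add" node takes this much ceremony.
struct AddNode {
    Node node;
    std::size_t a, b, sum;
};

AddNode makeAddNode() {
    AddNode n;
    n.a = n.node.addInput(0.0);
    n.b = n.node.addInput(0.0);
    const std::size_t a = n.a, b = n.b;
    n.sum = n.node.addOutput(
        [a, b](const Node& self) { return self.raw(a) + self.raw(b); });
    n.node.affects(n.a, n.sum);
    n.node.affects(n.b, n.sum);
    return n;
}
```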

As soon as the framework was more or less functional I started defining some node types. An "Add" node which adds its two input properties and outputs the sum, a "MeshLoader" node which has an input string designating the mesh file name and an output property which contains the loaded mesh... etc. This very quickly became too tedious to be used effectively for anything consequential without lots of mindless typing. This could be somewhat alleviated with the use of macros, but reading your post, I immediately thought that using code generation could be a more handy solution.

I don't know what I'm trying to say... just what I thought of when reading your post.

Obviously, code generation is an option. In fact, it's being done today, the entire Eclipse Modeling Framework is about meta-models (using the EMF meta-meta-model), and generating code from it.

However, it's not true that generated code has few drawbacks. The costs lie not so much in the generated code itself as in the meta-code, so to speak (that is, in the code generator application).


The biggest cost (if you have never done this) is complexity. Instead of writing data structures that solve the problem, you write data structures that represent data structures that solve the problem. In other words, you work a level higher in abstraction. That makes the application inherently more difficult (you are solving the problem for a BIG set of applications at the same time).

The second cost that you have is type checking. A user using your code generator (ie you) wants errors like "'foo' is missing at line 34", "field 'blah' has already been used in base class 'bleh' at line 45", "cannot initialize boolean variable 'blip' with real number at line 46", or "array length cannot be negative (found length -5)". To get such errors, you have to consider what the user can do wrong, how to detect it, and then report it.
If you want more fancy things like expressions ("x[1+2]" rather than only "x[3]"), or variables or parameters (that are used during code generation, not variables that exist in the generated code), the challenge increases.
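A sketch of what the simplest form of such checking could look like, assuming a hypothetical `FieldSpec` record that remembers the input line it came from. The two checks mirror two of the example messages above; a real checker would also have to cover base classes, initializer types and so on.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical parsed field from the generator's input.
struct FieldSpec {
    std::string type;
    std::string name;
    int arrayLength;  // 0 means "not an array"
    int line;         // input line number, kept for error reporting
};

// Returns human-readable errors instead of letting bad input reach the
// C++ compiler as confusing errors in generated code.
std::vector<std::string> validate(const std::vector<FieldSpec>& fields) {
    std::vector<std::string> errors;
    std::set<std::string> seen;
    for (const FieldSpec& f : fields) {
        if (f.arrayLength < 0)
            errors.push_back("array length cannot be negative (found length " +
                             std::to_string(f.arrayLength) + ") at line " +
                             std::to_string(f.line));
        // insert() reports via .second whether the name was new.
        if (!seen.insert(f.name).second)
            errors.push_back("field '" + f.name +
                             "' has already been used at line " +
                             std::to_string(f.line));
    }
    return errors;
}
```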


I have been doing code generation for a lot of years. In my experience, making a nice input language is a puzzle (perhaps less in your case, as you only have data structures to worry about), but by far the type checker takes the most time to build (but I do all the fancy stuff too).

If you use the code generator for several applications, it should become viable at some point (I always use 3 as the count, although it's non-scientifically found, based on my frustration level at having to do the same thing AGAIN). However, if you plan on using the code generator for this one application only, you're not going to be finished sooner. In that case, you're solving the problem at the meta level for a lot of possible applications, and then taking one instance of it.

Maybe the best indication of feasibility is the fact that I don't have a code generator like the one you propose (and mine would be even simpler: I don't need persistence, I'd be happy with nice and simple class definitions only). Whenever I start a new program and have to write a zillion classes, I always wish I had a generator. Pretty soon, however, reality sinks in. I need a week to write such a beast. I can write my classes in 2 days with a lot of copy/paste. I hardly ever have to change them again once they are done (no re-use of code generation). I'd need 1/2 a day to write an input specification for the code generator. So I lose 1.5 days by not having it, whereas making it would take 7 days, and then I still need 1/2 a day to write the input spec of the application I wanted to make :p
(Ah well, perhaps one day I'll be fed up enough, and just start.)
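The arithmetic above can be turned into a rough break-even estimate. The sketch below uses only the figures from this post (7 days to build the generator, 2 days to write the classes by hand, 1/2 a day for the input spec); `breakEvenProjects` is a hypothetical helper, and the model deliberately ignores generator maintenance and later changes to the classes.

```cpp
#include <cassert>
#include <cmath>

// Number of projects after which building the generator has paid off,
// assuming every project saves the same amount of time.
int breakEvenProjects(double generatorDays, double manualDays, double specDays) {
    const double savedPerProject = manualDays - specDays;
    assert(savedPerProject > 0.0);  // otherwise the generator never pays off
    return static_cast<int>(std::ceil(generatorDays / savedPerProject));
}
```

With these numbers, `breakEvenProjects(7.0, 2.0, 0.5)` divides 7 by 1.5, giving roughly 4.7, so the generator would pay for itself on the fifth project. If the classes change often, each regeneration counts as another "project" in this model, which shifts the balance quickly.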

Of course, if learning is one of the goals, the time spent on it is hardly interesting :)

i do all of my programming in a macro language which generates c++ code. the macro language supports both one to one and one to many associations between macro code and c++ code, so you aren't abstracting away anything and have full low level control when needed. the code gen does almost no error checking - that's left to the c++ compiler, so there are no issues with error checking increasing the complexity of the code gen. to date i've used the code gen on about 12 programs and am currently developing 3 large games using it. the syntax is designed so there are no keyword collisions between the macro language and c++, so macro script and c++ code can be freely mixed in the same macro script source file. i developed the macro language and code gen to cut down on typing. code entry is the biggest bottleneck in my code development. you might want to consider use of a codegen wizard. a small wizard app that generates a chunk of code that you then copy and paste into your existing source. use the wizard to do the gruntwork typing for you. the really slick trick is a macro language IDE + wizard codegens. basically an editor with buttons that launch codegen wizards that insert code at the cursor. you code in your macro script or c++, and use a wizard to insert code when you can, instead of typing it yourself. personally, i'd start with a small stand alone wizard to generate those similar chunks of code you referred to and just copy/paste them into your editor. add more functionality (more wizards, a macro language and translator, etc) as needed.

my macro language:

http://www.gamedev.net/gallery/image/4227-cscript-code-generator/

a continuous large open world 3D role playing simulation game made with my macro language (~110K lines of C++ source once translated):

http://rocklandsoftware.net/beta.php

Norm Barrows

Rockland Software Productions

"Building PC games since 1989"

rocklandsoftware.net

PLAY CAVEMAN NOW!

http://rocklandsoftware.net/beta.php

Thank you all for your responses.

The route of your generated code isn't too terrible

At least it's just a little terrible then ;) On a more serious note, I pretty much agree with your assessment - I also prefer internal frameworks when possible, that's why I was a little reluctant to start down this path. However, I'm trying to be pragmatic here and not discard any option if it makes life easier for me and my coworkers.

...

As soon as the framework was more or less functional I started defining some node types. An "Add" node which adds its two input properties and outputs the sum, a "MeshLoader" node which has an input string designating the mesh file name and an output property which contains the loaded mesh... etc. This very quickly became too tedious to be used effectively for anything consequential without lots of mindless typing. This could be somewhat alleviated with the use of macros, but reading your post, I immediately thought that using code generation could be a more handy solution.

Your problem seems a little different from mine; however, I think you could benefit from taking a look at Domain Specific Languages. For me, the key part is code generation from an abstract model; in your case the issue seems to be the verbosity of the general purpose language you are trying to use your framework with. That is, you already have your abstract model, but populating it with data is too tedious in the language it is written in. DSLs seem like a good fit for this job. Note that the use of a DSL doesn't even necessarily require code generation. I recommend you do a little research into DSLs - most of what I know about them comes from Martin Fowler's Domain Specific Languages book, which is a very good read IMO. I'm sure there are lots of other sources too.

Maybe the best indication of feasibility is the fact that I don't have a code generator like the one you propose (and mine would be even simpler: I don't need persistence, I'd be happy with nice and simple class definitions only). Whenever I start a new program and have to write a zillion classes, I always wish I had a generator. Pretty soon, however, reality sinks in. I need a week to write such a beast. I can write my classes in 2 days with a lot of copy/paste. I hardly ever have to change them again once they are done (no re-use of code generation). I'd need 1/2 a day to write an input specification for the code generator. So I lose 1.5 days by not having it, whereas making it would take 7 days, and then I still need 1/2 a day to write the input spec of the application I wanted to make :p
(Ah well, perhaps one day I'll be fed up enough, and just start.)

Yes, I very conveniently ignored the cost of building the code generator itself in the equation, for a couple of reasons:

  1. In my case the ratio of generated code / code generator code is way higher. I would generate hundreds of classes this way.
  2. Copy/paste is absolutely out of the question, as one of the existing problems is the amount of work needed to change behaviour consistently across all the classes, or to do large-scale refactoring. My classes absolutely have to change, and change often.
  3. I'm hoping to reduce the complexity of the generator by keeping things very simple. I might not even use a DSL to populate my metamodel, at least at first, so no parsing needed. Even if I use a DSL, I'm not going to try and express complicated rules in my language - essentially trying to keep it as close as possible to just annotated data. In essence, the user (programmer) will be able to choose a predefined set of behaviours for the objects/properties/relationships, but not define new behaviours in the DSL. That will only be possible by changing the generator itself.

Of course, if learning is one of the goals, the time spent on it is hardly interesting :)

Learning is always a goal, just not the goal in this case :)

i do all of my programming in a macro language which generates c++ code. the macro language supports both one to one and one to many associations between macro code and c++ code, so you aren't abstracting away anything and have full low level control when needed. the code gen does almost no error checking - that's left to the c++ compiler, so there are no issues with error checking increasing the complexity of the code gen. to date i've used the code gen on about 12 programs and am currently developing 3 large games using it. the syntax is designed so there are no keyword collisions between the macro language and c++, so macro script and c++ code can be freely mixed in the same macro script source file. i developed the macro language and code gen to cut down on typing. code entry is the biggest bottleneck in my code development. you might want to consider use of a codegen wizard. a small wizard app that generates a chunk of code that you then copy and paste into your existing source. use the wizard to do the gruntwork typing for you. the really slick trick is a macro language IDE + wizard codegens. basically an editor with buttons that launch codegen wizards that insert code at the cursor. you code in your macro script or c++, and use a wizard to insert code when you can, instead of typing it yourself. personally, i'd start with a small stand alone wizard to generate those similar chunks of code you referred to and just copy/paste them into your editor. add more functionality (more wizards, a macro language and translator, etc) as needed.

That is an interesting approach, however once again, your issue is a different one than mine: my goal isn't necessarily to have to type less, that is just a nice bonus. This approach would help with one of the issues - a different programmer implementing a concept incorrectly, which would be harder to do if he could use a wizard. However, I think it is crucial for the code generation process to not be a one-off action. The authoritative source on the generated code parts has to be the generator + the data it is fed, and the process has to be repeatable. Otherwise refactoring the concepts the generator uses to generate the code is impossible - e.g. a new issue requires you emit different code when the object is persisted (perhaps you realized that storing your object relationships a different way could bring you a performance benefit). If the code cannot be regenerated, you need to rewrite all of it by hand (technically, you use your modified wizard to copy-paste segments in, but if there are hundreds of classes, this is still too tedious and error prone). Indeed, if the code cannot be regenerated, you might not even get to try your wonderful new idea, because you won't have the time to make the change and test if it performs better. Of course YMMV.


If you use the code generator for several applications, it should become viable at some point (I always use 3 as the count, although it's non-scientifically found, based on my frustration level at having to do the same thing AGAIN). However, if you plan on using the code generator for this one application only, you're not going to be finished sooner. In that case, you're solving the problem at the meta level for a lot of possible applications, and then taking one instance of it.

I forgot to address this point in my previous reply, and I think it's an important one. Technically, what I am building is indeed one application. However, that application will go through hundreds of iterations until it is released, and likely thousands more after that (which is hard to imagine when you are a game developer, but it is the reality we less fortunate have to deal with :) ). So in a sense it is a myriad of applications, where the code generator would hopefully have a chance to pay for itself many times over.

I fully agree with your latter post. I have generalized the rule of 3 to 'working with computers' in general, and thus write an automagic solution first whenever I think I'll need to do something 3 times or more. So far, such solutions have always been used much more often than anticipated beforehand :)

If you are going to look at code generation you should always keep the following in mind :)

https://xkcd.com/1319/


Whenever I start a new program and have to write a zillion classes, I always wish I had a generator.

cscript (my macro language) will do that. i recently added keywords to implement c++ class definitions, stuff like class, public, and private keywords.

i've been thinking of releasing it for free. it's designed primarily to save keystrokes, something like it might help in such cases.

