What is Data Oriented Programming?


I'd just gotten my head around the concept of OOP when someone on this forum posted a link to DOP... I have no idea what DOP is; I'd never heard of it before. So I read the link, and I didn't understand it. I googled for more notes/articles on DOP but didn't understand any of them. They all barraged me with a mass of technical details, and the concept got lost.

What is Data Oriented Programming?

Googling found me this, which looks fairly comprehensive and easy to understand.

The information I gleaned from my 5 minute gloss-over is that Data Oriented Programming allows you to de-couple a tightly-bound architecture into one that is 'driven' by data rather than by explicit, static interface definitions. Think dependency injection with data rather than, e.g., static Spring files. It's not a replacement for OOP; it operates at a higher level than that.

On page 10 they have a diagram of a producer and a consumer with the interface between them being a data model rather than an explicit definition. This implies loose coupling between components, which would make customising the architecture far easier and put it within the reach of the end user rather than fixing it at the initial design stage.

It's an interesting concept from a business enterprise application perspective, but I'm not sure it would suit game development, unless perhaps for some kind of game-designer application.

It COULD also be related to optimizations like:
This

In a very similar fashion to optimizing code for SSE instructions, you try to bunch the data you actually use close together.
Large structures that conglomerate unrelated data result in loops with significantly low data usage rates.
Consider any class with a "bool needsUpdate" member. In a loop like:

for ( all creatures )
    if ( creature->needsUpdate )
        creature->Update();

You end up reading in a 128-byte cache line just to test a single bit, and if the flag is false, you discard all 128 bytes you just read in.
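
One common data-oriented fix for exactly this case (a minimal sketch with hypothetical names, not from the linked article) is to pull the flag out of the object entirely and keep a dense list of just the creatures that need updating, so every byte the loop loads is useful:

#include <vector>
#include <cstddef>

struct Creature
{
    // ... position, velocity, etc. ...
    void Update() { /* per-creature work */ }
};

std::vector<Creature> creatures;

// Instead of a 'bool needsUpdate' inside every Creature, keep the
// dirty set as its own dense array of indices: the update loop then
// only loads data it will actually use.
std::vector<std::size_t> needsUpdate;

void UpdateDirtyCreatures()
{
    for (std::size_t i : needsUpdate)
        creatures[i].Update();
    needsUpdate.clear();
}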

This guy goes through a lot more of it, focusing on a component architecture, and explains why breaking your data up into more focused components can make things much faster.

To quote Mike Acton:
Quote:
the point of any program is simply to transform data from one form into another and nothing else. And as one "solution" which ignores the real core problems of development is developed and others over time are built on top of that idea, and so on, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.


Check out pitfalls of OOP for an example where designing around data (instead of designing around code) can give big performance benefits.

I assume you read this article?

To me, it looks like the author ends up with a component-based design, where each component type is stored in and processed by an appropriate system. Renderables are processed en-masse by the rendering system, physical objects are stored in and processed by the physics engine, and so on. Game objects only hold references to the components they consist of: for example, an enemy would reference a renderable, physical and AI component, while a trigger would only have a physical and a script/action component.

I never looked at it from a performance angle, but it makes sense.
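
Something like this minimal sketch, with hypothetical names throughout: each system owns a contiguous array of its component type, and game objects are just bundles of handles into those arrays.

#include <vector>
#include <cstdint>

using Handle = std::uint32_t;            // index into a system's array

struct RenderComponent  { /* mesh, material ... */ };
struct PhysicsComponent { float x, y, vx, vy; };
struct AIComponent      { /* behaviour state ... */ };

// Each system stores its components contiguously and updates them en masse.
std::vector<RenderComponent>  renderables;
std::vector<PhysicsComponent> bodies;
std::vector<AIComponent>      brains;

// A game object is just a set of handles into those arrays.
struct Enemy   { Handle render, physics, ai; };
struct Trigger { Handle physics, action; };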

Do you still use classes in DOP?

Is DOP a different way of organising and constructing your classes with a view to making your code more efficient?

Quote:
Original post by N3wbie
Do you still use classes in DOP?

Is DOP a different way of organising and constructing your classes with a view to making your code more efficient?



In general your applications will mix OOP and DOP. DOP is more for dealing with large quantities of data and working with them efficiently, while OOP works well for everything else.

I only use DOP in specific classes or subsystems; later in an app's development cycle you may also find it appropriate for optimization passes.

KulSeran's example is the typical one most developers run into. The larger the class you are looping through, the worse your data access patterns become. Cache misses can be detrimental to apps trying to be efficient, and DOP takes into consideration the layout of your data and how the system actually accesses it.

Quote:
Original post by Hodgman
To quote Mike Acton:
Quote:
the point of any program is simply to transform data from one form into another and nothing else. And as one "solution" which ignores the real core problems of development is developed and others over time are built on top of that idea, and so on, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.


Check out pitfalls of OOP for an example where designing around data (instead of designing around code) can give big performance benefits.

My god! That is the actual article I was searching for and didn't find when I was posting. Thanks.

Hmmm, so would I be right in saying that two of the key concepts are:

1. Organize your code in terms of data structures.

2. Use vectors to store your data structures wherever possible.

So if I have a simulation of a solar system, I might have a vector of sizes, a vector of masses, a vector of planet types, a matrix of positions, etc?

What about if I wanted to model the orbits using a Newtonian system? I guess I would have a matrix representing the gravitational field acting on each massive item in the solar system?

And what about the relationship between the field and the position over time? Would this be a function or, as I think I read somewhere, a matrix transformation representing this relationship?

Would this be preferable to OOP? Isn't this just good procedural coding? I mean, with a bit of thought that's probably the approach I would have taken anyway if I weren't using OOP. In fact, though I can see the massive benefits of classes, I find their restrictions make coding more of a chore: you spend too much time worrying about them and the language syntax. I think I would likely prefer to code the way I just described.

[Edited by - N3wbie on June 25, 2010 10:27:24 AM]

It does sound a lot like a fusion of structured and OO techniques... Perhaps the OO guys could not stomach admitting that structured programming does in fact have its own merits, so they renamed it to avoid the problem.


:}

Quote:
Original post by N3wbie
Hmmm, so would I be right in saying that two of the key concepts are:

1. Organize your code in terms of data structures.

2. Use vectors to store your data structures wherever possible.

So if I have a simulation of a solar system, I might have a vector of sizes, a vector of masses, a vector of planet types, a matrix of positions, etc?

From what I've read so far of the T=Machine article posted earlier, I don't think so. I'm not finished yet, so I'll come back and update in more detail on whether I still think that.

Quote:

1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.


I would say that (1) is inaccurate: you organise your code in terms of data access patterns.
Also, (2) only really works if you know the maximum count up front. If you use a vector without preallocating, then any pointers into it will be invalidated at some point when it grows. I guess you could store indices in your objects instead of pointers.
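
A small sketch of that index idea (hypothetical types): an index into the vector stays valid when the vector reallocates, whereas a raw pointer taken before a push_back may not.

#include <vector>
#include <cstddef>

struct Planet { float mass, x, y; };

std::vector<Planet> planets;

// An index survives the vector reallocating its storage; a
// Planet* taken before a push_back might not.
struct Moon
{
    std::size_t parent;              // index into 'planets'
    float orbitRadius;
};

float ParentMass(const Moon& m)
{
    return planets[m.parent].mass;   // resolve the index on use
}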

I think people generally find OOP easier because it often fits the mental model of the problem they are trying to solve. Take the example you have just given: you might start by thinking about the solar system, then you see that it is made up of Planets, which all share certain identifiable properties: position, mass, etc. BANG! There you have your data structure!

The problem with this approach is that it doesn't fit so well with the memory access model we have in most modern machines. It takes a second step to convert your conceptual objects into the best format for the computer to deal with. It is a different way of thinking, particularly if you are used to following OOP principles.

Also I'm not sure the DOP paradigm outlined in the paper is the same as the approach outlined by Tony Albrecht and others for optimising memory access. There is crossover but the scope is quite different.

From my not very fruitful Google searching I found the following link to a paper which seems to present DOP for the first time: http://portal.acm.org/citation.cfm?id=947713 . Sadly I can't access the actual paper to read it, but by the looks of the publishing date, 1980, it's not a new concept at all. It seems strange that there are a fair few articles arguing its superiority over OOP but not much else than that. If it's so good and so old an idea, why isn't there more material on it? Perhaps because it never became the standard like OOP did? Why not?

[Edited by - N3wbie on June 26, 2010 2:43:52 AM]

Another term that turns up material in this area is dataflow programming.
Quote:
Original post by Noggs
Quote:
1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.
I would say that (1) is inaccurate: you organise your code in terms of data access patterns.
Yeah, I'd agree with your correction. A lot of the time, if you organise for data access, you might end up using vectors... But treating 'use vectors' as a rule is learning the symptom instead of the rationale ;)
Quote:
I think people generally find OOP easier because it often fits the mental model of the problem they are trying to solve. Take the example you have just given: you might start by thinking about the solar system, then you see that it is made up of Planets, which all share certain identifiable properties: position, mass, etc. BANG! There you have your data structure!
Yeah, object-modelling techniques (OOP) are very popular these days.
In a data-oriented view, you'd take your OO design, and inspect how related the members are concerning data access. Look at your program in terms of a collection of Input-Process-Ouput boxes, then look at which member variables different 'Proccess' parts of those boxes use.
e.g. Maybe your planet class has lots of static data, like atmosphere colour, earth-years-per-rotation, main elemental composition, etc... However, your 'Do Physics' process only needs to know velocity, position and mass. In this case, bundling all that other data into the same 'object' is just going to pollute the cache and hurt performance.
Maybe you'll make one array of PlanetPhysicsData, and another array of PlanetSupplementaryData. Another option would be to keep the physics data in the planet class, but move the supplementary data out into another class and have the planet keep a pointer to it (then you're only polluting the cache with a single pointer, instead of loads of unneeded data).
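
As a concrete sketch of that first option (field choices here are invented for illustration), the hot physics fields live in one tightly packed array and the cold supplementary data in another, so the physics loop never drags atmosphere colours through the cache:

#include <vector>

// Hot data: everything the physics update actually reads or writes.
struct PlanetPhysicsData
{
    float x, y;
    float vx, vy;
    float mass;
};

// Cold data: only touched occasionally (rendering, UI, tooltips...).
struct PlanetSupplementaryData
{
    float atmosphereColour[3];
    float yearsPerRotation;
    // ... elemental composition, etc. ...
};

std::vector<PlanetPhysicsData>       physics;        // index i is one planet
std::vector<PlanetSupplementaryData> supplementary;  // parallel array, same i

void StepPhysics(float dt)
{
    for (PlanetPhysicsData& p : physics)   // dense, cache-friendly walk
    {
        p.x += p.vx * dt;
        p.y += p.vy * dt;
    }
}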

Often, if you redesign for optimal data access patterns, you may even end up duplicating some fields.
e.g. if the physics and graphics 'processes' both need to know the position, you could make an array of just positions. Alternatively, you could give the PlanetGraphical and PlanetPhysical classes their own cached copy of the position. Once all the PlanetPhysical objects have been updated, their positions are copied over into the PlanetGraphical array. This duplication also increases your options for parallelism, because you've got less shared data between 'processes'.
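
And a sketch of that duplicated-position variant (again, the member layouts are invented): one tight copy loop publishes the simulated positions, after which physics and graphics own separate data and can run in parallel.

#include <vector>
#include <cstddef>

struct PlanetPhysical  { float x, y; /* velocity, mass ... */ };
struct PlanetGraphical { float x, y; /* mesh, material ... */ };

// Parallel arrays, one entry per planet.
std::vector<PlanetPhysical>  physical;
std::vector<PlanetGraphical> graphical;

// Sync point: copy the freshly simulated positions across. Afterwards
// the renderer reads only 'graphical' while the next physics step
// mutates only 'physical' - no shared data, so the two can overlap.
void PublishPositions()
{
    for (std::size_t i = 0; i < physical.size(); ++i)
    {
        graphical[i].x = physical[i].x;
        graphical[i].y = physical[i].y;
    }
}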

Quote:
Original post by N3wbie
...but by the looks of the publishing date, 1980, it's not a new concept at all. It seems strange that there are a fair few articles arguing its superiority over OOP but not much else than that. If it's so good and so old an idea, why isn't there more material on it? Perhaps because it never became the standard like OOP did? Why not?
You'll probably find a lot more threading articles that deal with mutual-exclusion than message-passing too, even though message-passing is almost as old and generally a better idea for high-level parallelism. People are resistant to change unless they're forced into it.

The current generation of consoles is providing the push towards thinking about data-access patterns because, as shown in the 'Pitfalls of OOP' slides earlier in this thread, memory has become 1000 times slower relative to CPU speed than it used to be.
If you profile nice, 'well designed' OO code on a modern in-order CPU (the CPUs in consoles don't rearrange instructions to hide memory latency for you), the profiler might not show any particular bottleneck, but you'll notice that it's just kind of slow across the board. Looking closer, you can see you've got tens of thousands of cache misses, each one idling the CPU for a substantial amount of time (anywhere from a dozen to a thousand cycles).
With consoles being 5 years old now, and new games trying to do more stuff on the same old hardware, this resistance to change is breaking down somewhat.

Also, the PS3's Cell is completely different to a desktop CPU. It's a NUMA design where you basically have to choose which tiny portion of main RAM you're going to move down to each core's "local RAM".
If you do something like call a virtual function, which has to go fetch the v-table from god-knows-where in RAM, it's not going to be very happy with you ;)
On the other hand, a data-flow approach, which is designed around piping bundles of data around the place, will work quite nicely on this kind of hardware.
Quote:
Original post by BrianLarsen
It does sound a lot like a fusion of structured and OO techniques... Perhaps the OO guys could not stomach admitting that structured programming does in fact have its own merits, so they renamed it to avoid the problem.
Hehe, yeah it's a bit like that - an admission that 'pure' OO with beautiful normalised data designs doesn't necessarily perform very well.
I would say though that data-oriented design doesn't necessarily rely on or conflict with the structured/procedural or object-oriented programming paradigms -- I think it can be fused with either of them.
You could implement a data-oriented design using an OO paradigm (e.g. a data-flow program written in C++), just like you can implement an object-oriented design using a procedural paradigm (e.g. an OO design written in C).

[Edited by - Hodgman on June 26, 2010 6:17:03 AM]

I see I see. Great post imo.

So it could be described as a design methodology that aims to maximise memory access efficiency and task parallelisability...

It seems, though, that this is a case of compromising on the elegance and ease of OO design in favour of the more efficient DO design. You might still incorporate OOP, but with DOP on top of that you will also have to think about the way the computer would prefer to access the memory, and about how well you can parallelise your tasks.

It does seem, though, that 100% pure DOP would be, for example, separate contiguous vectors for absolutely every type of data, along with specific functions for every task...

But newbies are discouraged from doing that (even though I think it comes naturally to them, as it did for me). The reason is that unless you make every vector a global, your code is going to be a nightmare to maintain and update.

So then, for me the question would be: what's so bad about making every item of data a global dynamic vector in organised headers, with respective functions, and your actual main() consisting only of function calls that manipulate the data as you see fit? Such a program would be easy to maintain, with an elegant and easy design, and extremely speed-efficient... no?

BTW, thank you every one for your replies! :)

Hmmm, I reread the articles and actually understood them this time. But they are very handwavy and I haven't been able to think of a proper way of actually implementing them yet. (After some testing I've realised the latter approach suggested in my last post certainly isn't a proper way.)

Perhaps this is a bit beyond me at the moment. I think I'll concentrate on mastering OOP first.

Quote:
Original post by N3wbie
Hmmm, I reread the articles and actually understood them this time. But they are very handwavy and I haven't been able to think of a proper way of actually implementing them yet.


Yes, I find them pretty vague too. I need to see a practical implementation to understand the paradigm. There isn't much about the subject for the search terms I used, though. Does anyone know of / have some example code that implements a data-driven design?

It looks like something I *might* implement; namely, I have a lot of loops like:

for (iterator over vector)
{
    if (iterator->active())
    {
        iterator->do();
    }
}

Quote:
Original post by Decrius
Yes, I find them pretty vague too. I need to see a practical implementation to understand the paradigm. There isn't much about the subject for the search terms I used, though. Does anyone know of / have some example code that implements a data-driven design?

I would like to third this sentiment.

I'm guessing for your code it would be more like:

for (iterator over vectorOfActiveStuff)
{
    iterator->do();
}


But that is completely a shot in the dark.
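
For what it's worth, here is one common way to maintain such a dense array of active things: swap-and-pop removal keeps the array contiguous, so the update loop is a straight walk through memory. This is only a sketch with made-up names, not code from the linked articles.

#include <vector>
#include <cstddef>
#include <utility>

struct Thing
{
    bool Step() { /* one update tick */ return true; }  // false = finished
};

std::vector<Thing> activeThings;   // only live objects, densely packed

void UpdateActive()
{
    for (std::size_t i = 0; i < activeThings.size(); /* no ++ here */)
    {
        if (activeThings[i].Step())
        {
            ++i;                                   // still active
        }
        else
        {
            // Swap-and-pop: overwrite the dead entry with the last one,
            // then re-check index i (it now holds a different object).
            std::swap(activeThings[i], activeThings.back());
            activeThings.pop_back();
        }
    }
}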

The problem is this kind of thing does depend very much on what you are trying to do.

The first thing to realise is that OOP and DOP are NOT mutually exclusive things to use in a program. By adopting OOP for parts you don't have to ignore DOP for other parts and vice-versa.

Part of the reason for DOP is to consider the flow of data: what you need to access at any given time versus the cache usage of that data. The idea is to make sure the CPU is doing as much useful work as possible instead of stalling while it fetches data.

The already linked Pitfalls of OOP slides show this best of all, in particular the latter stages, which talk about vtables and their effect on cache usage.

The key to thinking about this, however, is to consider what data you need to touch at what times, and what other data it pulls along with it that you aren't going to touch.

As an example I'm going to use a particle system, simply because it's something I've been looking at of late and a common thing people want to create.

A typical OOP particle system might have two principal classes:
- Emitter
- Particle

Now, the emitter class will have an array/vector of 'alive' particles which it uses to run the simulation.

Your particle class might be structured something like this:

struct Particle
{
    float x, y;
    float age;
    float maxAge; // so not all particles have the same lifetime
    float size;
    float u, v;
    unsigned char colour[4];
};



So, your simple particle structure is 7*4 + 4 bytes in size, or 32 bytes. For reference, the L1 data cache on a Core i7 920 has 64-byte cache lines and is 32 KB in total; this means we can fit two particles per cache line to work on.

Now it comes to your update: you plan on looping over all the particles, increasing their age, checking to see if it's greater than maxAge, and updating the locations if not.

So, you access the first particle; this loads that particle's data plus the next particle into the cache, as the CPU fetches whole cache lines.

Best case:
- You access the age, add a value to it, check it against maxAge, find it's alive still, write the new age back, and update the x,y location.

This operation touches 16 bytes of a 32-byte structure; this means half the data loaded wasn't required for this operation, and half of our cache was wasted right away.

At which point you can process the next particle; same deal, same wastage. Now, if you are lucky, the CPU's prefetcher will have decided to fetch the next couple of cache lines 'just in case', so you won't pay 100+ cycles waiting for the next two particles to load from RAM into the cache so you can work on them.

However, due to this data wastage you end up wasting up to 16 KB of cache by loading structures which contain, at that point, redundant data.

At this point you can look at your data and split it up into related but separate structures, each in its own memory space.
At the simplest level you could split the above data like this:

struct ParticleLifeData
{
    float age;
    float maxAge;
}; // 8 bytes

struct ParticleLocation
{
    float x, y;
}; // 8 bytes

struct ParticleMaterial
{
    float u, v;
    unsigned char colour[4];
}; // 12 bytes


This data would be stored in three separate arrays, so each is still cache-friendly but they won't overlap each other.

Then when you perform your update loop, the following happens (assuming the 'best case' as before):
- Load the first particle's life data, which causes the whole cache line to be loaded; this brings in 8 sets of particle life data.
- Adjust the age, check it against maxAge, and write back as required.
- Load the first particle's location, which loads another cache line bringing in 8 sets of (x,y) to work with.
- Update the data.
- Move on to the next particle.

You can now process 8 particles before you trigger a cache miss AND you are using 100% of each cache line while processing the data.

All of which could well improve performance.
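
As a rough illustration, the update loop over those split arrays might look something like this (the array names and the placeholder movement are invented; the structs are the ones defined above):

#include <vector>
#include <cstddef>

// Parallel arrays: element i of each array belongs to the same particle.
std::vector<ParticleLifeData> lifeData;
std::vector<ParticleLocation> locations;

void UpdateParticles(float dt)
{
    for (std::size_t i = 0; i < lifeData.size(); ++i)
    {
        lifeData[i].age += dt;
        if (lifeData[i].age > lifeData[i].maxAge)
            continue;   // expired; a real system would recycle this slot

        // Only live particles touch the location array, and the
        // material data is never loaded at all during the update.
        locations[i].x += 1.0f * dt;   // placeholder velocity
        locations[i].y += 1.0f * dt;
    }
}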

Note: this is by no means a complete investigation of the problem; you can decompose the data further and apply other optimisations. This was just designed to give you an idea of how to go about it.

It should also be noted that for many people making many games this simply isn't a problem: you won't produce enough data to tax the CPU anyway, you'll hit other bottlenecks first, or the out-of-order execution of modern desktop CPUs will hide the latency.

Don't get me wrong, understanding things at this level is worth doing; however, don't go so deep down the rabbit hole that you decompose everything, because in many cases it'll hurt code clarity more than it helps performance.

All things in moderation, right tool for the right job and all that.

OOP, DOP and Functional programming can all co-exist just fine, never forget that.

An important Data Oriented Design concept is the pre-processing of data. Don't do at game time the things that can be done at load time.

It's something to consider throughout the entire lifespan of any data object. I mean, it would be pretty crazy if, every time you wanted to use a texture, you started clear back at unpacking a compressed zip file sitting on a USB memory card, loaded the texture into your game, pre-processed the image, sent it to your graphics library, and only then finally used it. But there can be subtle little performance hits lurking in unexpected places along that chain of functionality. Someone interested in Data Oriented Design whips out a Sherlock Holmes pipe and looks closely at every detail.

And maybe it's already been mentioned, but pointers/dereferencing are also important when it comes to Data Oriented Design. For example, in a function call the CPU has less to do when you pass only a pointer to a block of data rather than the entire block.

Passing full structures into functions by value is a very common performance error, especially (and ironically) for tiny little structures and tiny little functions, such as vector normalization or scaling.
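
To illustrate (a generic sketch, not code from this thread): the by-value version copies the whole struct on every call, while the const-reference version passes only an address.

struct Vec3 { float x, y, z; };

// By value: the caller copies the whole struct into the call.
Vec3 ScaleByValue(Vec3 v, float s)
{
    return Vec3{ v.x * s, v.y * s, v.z * s };
}

// By const reference: only an address crosses the call boundary,
// and the callee promises not to modify the original.
Vec3 ScaleByRef(const Vec3& v, float s)
{
    return Vec3{ v.x * s, v.y * s, v.z * s };
}

(On some modern ABIs a struct this small is passed in registers anyway, so as always: measure.)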

