What is Data Oriented Programming?
#1 Members - Reputation: 100
Posted 24 June 2010 - 05:48 PM
What is Data Oriented Programming?
#2 Members - Reputation: 391
Posted 24 June 2010 - 07:41 PM
The information I gleaned from my 5 minute gloss-over is that Data Oriented Programming allows you to de-couple tightly-bound architecture into one that is 'driven' by data rather than explicit and static interfaces definitions. Think dependency injection with data rather than, e.g. static spring files. It's not a replacement for OOP, it's at a higher level than that.
On page 10 they have a diagram of a producer and a consumer with the interface between being a data model rather than an explicit definition. This implies loose coupling between components which would make bespoking the architecture far easier and within the reach of the end user rather at the initial design stage.
It's an interesting concept from a business enterprise application perspective but I'm not sure it would suit game development unless it was perhaps some kind of game designer application.
#3 Members - Reputation: 1519
Posted 24 June 2010 - 07:51 PM
This
In a very similar fashion to optimizing code for SSE instructions, you try to bunch used data close together.
Large structures that conglomerate data together result in processes that have signficantly low data usage rates.
Consider any class with a "bool needsUpdate" member. In a loop like:
for ( all creatures )
if ( creature->needsUpdate )
creature->Update();
You end up reading in 128bytes of cache line to read a single bit, and if it is false, you discard all 128bytes you just read in.
This guy goes through a lot more of it focusing on a component archetecture, and explains why breaking up your data into more focused components can make things much faster.
#4 Moderators - Reputation: 13602
Posted 24 June 2010 - 08:27 PM
Quote:
the point of any program is simply to transform data from one form into another and nothing else. And as one "solution" which ignores the real core problems of development is developed and others over time are built on top of that idea, and so on, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.
Check out pitfalls of OOP for an example where designing around data (instead of designing around code) can give big performance benefits.
#5 Members - Reputation: 1084
Posted 24 June 2010 - 08:47 PM
To me, it looks like the author ends up with a component-based design, where each component type is stored in and processed by an appropriate system. Renderables are processed en-masse by the rendering system, physical objects are stored in and processed by the physics engine, and so on. Game objects only hold references to the components they consist of: for example, an enemy would reference a renderable, physical and AI component, while a trigger would only have a physical and a script/action component.
I never looked at it from a performance angle, but it makes sense.
#7 Members - Reputation: 126
Posted 24 June 2010 - 09:48 PM
Quote:
Original post by N3wbie
Do you still use classes in DOP?
Is DOP a different way of organising and constructing your classes with a view to making your code more efficient?
In general your applications will mix OOP and DOP. DOP is more for dealing with large quantities of data and efficiently working with them, where OOP works well for everything else.
I only use DOP in specific classes or sub systems, later in the cycle of an app one may find it appropriate to use for optimization passes too.
KulSeran's example is the typical one most developers run into. The larger the class is you are looping through the worse your data access is. Cache misses can be detrimental to apps trying to be efficient and DOP takes into consideration the layout of your data and how the system actually accesses it.
#8 Members - Reputation: 1519
Posted 24 June 2010 - 11:07 PM
Quote:
Original post by Hodgman
To quote Mike Acton:Quote:
the point of any program is simply to transform data from one form into another and nothing else. And as one "solution" which ignores the real core problems of development is developed and others over time are built on top of that idea, and so on, we're left with systems that are over-designed, perform poorly and simply do not accomplish what they intended to in the first place - and certainly not well. I continue to suggest that we all take a step back from what we're doing and the methods we're using to solve problems and try to remember what the real issues are.
Check out pitfalls of OOP for an example where designing around data (instead of designing around code) can give big performance benefits.
My god! That is the actual article I was searching for and didn't find when I was posting. Thanks.
#9 Members - Reputation: 409
Posted 24 June 2010 - 11:56 PM
http://bitsquid.blogspot.com/2010/05/practical-examples-in-data-oriented.html
#10 Members - Reputation: 100
Posted 25 June 2010 - 02:27 AM
1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.
So if I have a simulation of a solar system, I might have a vector of sizes, a vector of masses, a vector of planet types, a matrix of positions, etc?
What about if I wanted to model the orbits using a Newtonian system? I guess I would have a matrix representing the graviational field acting on each massive item in the solar system?
And what about the relationship between the field and the position over time? Would this be a function or as I think I read somewhere a matrix transformation representing this relationship?
Would this be preferable to OOP? Isn't this just good procedural coding? I mean with a bit of thought that's probably the approach I would have taken anyway if I weren't using OOP. In fact, though I can see the massive benefits of classes, I find the restrictions they have too restrictive making coding more of a chore, and that you spend too much time worrying about them and the language syntax. I think I would likely prefer to code the way I just described.
[Edited by - N3wbie on June 25, 2010 10:27:24 AM]
#12 Members - Reputation: 633
Posted 25 June 2010 - 06:05 AM
Quote:
Original post by N3wbie
Hmmm, so would I be right in saying that 2 of the key concpets are:
1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.
So if I have a simulation of a solar system, I might have a vector of sizes, a vector of masses, a vector of planet types, a matrix of positions, etc?
from what I've read so far on the T=Machine thing posted earlier, I don't think so. I'm not finished yet, so I'll come back and update on whether I still think that with more detail.
#13 Members - Reputation: 145
Posted 25 June 2010 - 06:52 AM
Quote:
1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.
I would say that (1) is inaccurate, I would say you organise your code in terms of data access patterns.
Also (2) only really works if you know the maximum number up front. If you use a vector without preallocating then all your objects will be invalidated at some point. I guess you could store indices in your objects instead of pointers.
I think people generally find OOP easier because it often fits the mental model of the problem they are trying to solve. Take the example you have just given, you might start by thinking about the solar system, then you see that it is made up of Planets which all share certain identifiable properties, position, mass, etc. BANG! There you have your data structure!
The problem with this approach is that it doesn't fit so well with the memory access model we have in most modern machines. It takes a second step to take your conceptual objects and convert them into the best format for the computer to deal with. It is a different way of thinking particuarly if you are used to following OOP principles.
Also I'm not sure the DOP paradigm outlined in the paper is the same as the approach outlined by Tony Albrecht and others for optimising memory access. There is crossover but the scope is quite different.
#14 Members - Reputation: 100
Posted 25 June 2010 - 07:43 PM
[Edited by - N3wbie on June 26, 2010 2:43:52 AM]
#15 Moderators - Reputation: 13602
Posted 26 June 2010 - 12:17 AM
Quote:Yeah I'd agree with your correction. A lot of the time, if you organise for data-access, you might end up using vectors... But, treating 'use vectors' as a rule is learning the symptom instead of learning the rationale ;)
Original post by Noggs Quote:I would say that (1) is inaccurate, I would say you organise your code in terms of data access patterns.
1. Organize your code in terms of data structures.
2. Use vectors to store your data structures wherever possible.
Quote:Yeah, object-modelling techniques (OOP) are very popular these days.
I think people generally find OOP easier because it often fits the mental model of the problem they are trying to solve. Take the example you have just given, you might start by thinking about the solar system, then you see that it is made up of Planets which all share certain identifiable properties, position, mass, etc. BANG! There you have your data structure!
In a data-oriented view, you'd take your OO design, and inspect how related the members are concerning data access. Look at your program in terms of a collection of Input-Process-Ouput boxes, then look at which member variables different 'Proccess' parts of those boxes use.
e.g. Maybe your planet class has lots of static data, like atmosphere colour, earth-years-per-rotation, main elemental composition, etc... However, your 'Do Physics' process only needs to know velocity, position and mass. In this case, bundling all that other data into the same 'object' is just going to pollute the cache and hurt performance.
Maybe you'll make one array of PlanetPhysicsData, and another array of PlanetSupplementaryData. Another option would be to keep the physics data in the planet class, but move the supplementary data out into another class, and have the planet keep a pointer to it (then the you're only polluting the cache with a single pointer, instead of loads of unneeded data).
Often if you redesign for optimal data access patterns, you may even end up duplicating some fields.
e.g. if physics and grapics 'processes' both need to know the position, you could make an array of just positions. Alternatively, you could give the PlanetGraphical and PlanetPhysical classes their own cached copy of the position. Once all the PlanetPhysical objects have been updated, their positions are copied over into the PlanetGraphical array. This duplication also increases your options for parallelism, because you've got less shared data between 'processes'.
Quote:You'll probably find a lot more threading articles that deal with mutual-exclusion than message-passing too, even though message-passing is almost as old and generally a better idea for high-level parallelism. People are resistant to change unless they're forced into it.
Original post by N3wbie
...but by the looks of the publishing date, 1980, its not a new concept at all. It seems strange that there are a fair few articles arguing its superiority over OOP but not much else than that. If its so good and so old an idea, why isnt there more material on it? Perhaps because it never became the standard like OOP did? Why not?
The current generation of consoles are providing the push towards thinking about data-access patterns, because, as shown in the 'pitfalls of OOP' earlier in this thread, relative to CPU speed memory has become 1000 times slower than it used to be.
If you profile nice, 'well designed' OO code on a modern in-order CPU (the CPU's in consoles don't rearrange instructions to hide memory latency for you) then the profiler might not show any bottlenecks, but you notice that it's just kind of slow across the board. Looking closer, you can see you've got tens of thousands of cache misses across the board, each one idling the CPU for a substantial amount of time (anywhere from a dozen to a thousand cycles).
With consoles being 5 years old now, and new games trying to do more stuff on the same old hardware, this resistance-to-change is opening up somewhat.
Also, the PS3's Cell is completely different to a desktop CPU. It's a NUMA design where you basically have to choose which tiny portion of main RAM you're going to move down to each core's "local RAM".
If you do something like call a virtual function, which has to go fetch the v-table from god-knows-where in RAM, it's not going to be very happy with you ;)
On the other hand, a data-flow approach, which is designed around piping bundles of data around the place will work quite nicely on this kind of hardware.
Quote:Hehe, yeah it's a bit like that - an admission that 'pure' OO with beautiful normalised data designs don't necessarily perform very well.
Original post by BrianLarsen
It does sound a lot like a fusion of structural and oo techniques....Perhaps the oo guys could not stomach admitting that structured programming does in fact have its own merits and they renamed it to avoid the problem.
I would say though that data-oriented design doesn't necessarily rely on or conflict with the structured/procedural or object-oriented programming paradigms -- I think it can be fused with either of them.
You could implement a data-oriented design using an OO paradigm (e.g. a data-flow program written in C++), just like you can implement an object-oriented design using a procedural paradigm (e.g. an OO design written in C).
[Edited by - Hodgman on June 26, 2010 6:17:03 AM]
#16 Members - Reputation: 100
Posted 26 June 2010 - 01:32 AM
So it could be described as a design methodology that aims to maximise memory access efficiency and task parallelisability...
It seems though that this is a case of compromising on the elegance and ease of OO design in favour of the more efficient DO design. You might still incorprate OOP but with DOP ontop of that you will have to also think about the way the computer would prefer to access the memory, and about how well you can parallelise your tasks.
It does seem though that 100% pure DOP would be for example separate contiguous vectors for absolutely every type of data along with specific functions for every task...
But newbies are discouraged from doing that (even though it comes naturally to them I think as it did for me). The reason they are discouraged from doing that is because unless you make every vector a global, your code is going to be a nightmare to maintain and update.
So then, for me the question would be, what's so bad about making every item of data a global dynamic vector in organised headers, with respective functions. And your actual main() consisting of only function calls to manipulate the data as you see fit? Such a program would be easy to maintain, with elegant and easy design, extremely speed efficient... No?
BTW, thank you every one for your replies! :)
#17 Members - Reputation: 100
Posted 26 June 2010 - 08:16 AM
Perhaps this is a bit beyond me at the moment. I think I'll concentrate on mastering OOP first.
#18 Members - Reputation: 100
Posted 26 June 2010 - 09:23 AM
Quote:
Original post by N3wbie
Hmmm, I reread the articles and actually understood them this time. But they are very handwavy and I haven't been able to think of a proper way of actually implementing them yet.
Yes, I find them pretty vague too. I need to see a practical implementation to understand the paradigm. There isn't much about the subject for the search terms I used though, does anyone know / have some example code that implements a data-driven design?
It looks like something I *might* implement, I namely have a lot of loops like:
for (iterator over vector)
{
if (iterator->active())
{
iterator->do();
}
}
#19 Members - Reputation: 633
Posted 26 June 2010 - 11:12 AM
Quote:
Original post by Decrius
Yes, I find them pretty vague too. I need to see a practical implementation to understand the paradigm. There isn't much about the subject for the search terms I used though, does anyone know / have some example code that implements a data-driven design?
I would like to third this sentiment.
I'm guessing for your code it would be more like:
for (iterator over vectorOfActiveStuff)
{
iterator->do();
}
But that is completely a shot in the dark.
#20 Moderators - Reputation: 3974
Posted 26 June 2010 - 12:49 PM
The first thing to realise is that OOP and DOP are NOT mutually exclusive things to use in a program. By adopting OOP for parts you don't have to ignore DOP for other parts and vice-versa.
Part of the reason for DOP is to consider the flow of data, what you need to access at any given time vs cache usage of that data. The idea is to make sure you are doing as much useful work as possible without waiting for the CPU to fetch data.
The already linked Pitfalls of OOP slides show this best of all, in perticular the latter stages which talk about vtables and their effect on cache usage.
The key to thinking about this however is to think about what data you need to touch at what times and what other data it is pulling along with it which you aren't going to touch.
As an example I'm going to use a particle system simply because its something I've been looking at of late and a common thing people want to create.
A typical OOP particle system might have two principle classes;
- Emitter
- Particle
Now, the emitter class will have an array/vector of 'alive' particles which it uses to run the simulation.
Your particle class might be structured something like this;
struct Particle
{
float x, y;
float age;
float maxAge; // so not all particles have the same age
float size;
float u,v;
unsigned char colour[4];
}
So, your simple particle stucture is 4*7 + 4 bytes in size or 32bytes. For reference L1 cache on a Core i7 920 is 64bytes per cache line, 32KBytes in total; this means we can fit 2 particles per cache line to work on.
Now, it comes to your update and you plan on looping over all the particles, increasing their age, checking to see if its greater than the maxAge and updating the locations if not.
So, you access the first particle, this triggers a load of that particle data + next particle into the cache as the CPU loads in cache lines.
Best case;
- you access the age, add a value to it, check it against the maxAge, find its alive still, write the new age back and update the x,y location
This operation touches 16bytes of a 32byte structure; this means half the data loaded wasn't required for this operation and half of our cache was wasted right away.
At which point you can process the next particle, same deal, same wastage. Now, if you are lucky the CPUs prefetcher will have decided to fetch the next couple of cache lines 'just in case' so you won't pay a 100+ cycles waiting for the next two particles to load from ram, into cache/cpu so you can work on them.
However, due to data wastage you end up wasting upto 16KByte of cache due to loading structures which have, at that point, redundant data in them.
At this point you can look at your data and split it up into related but seperate structures in their own memory space.
At the simplest level you could split the above data like this;
struct ParticleLifeData
{
float Age;
float maxAge;
}
// 8 bytes
struct ParticleLocation
{
float x,y;
}
// 8 bytes
struct ParticleMaterial
{
float u,v;
unsigned char colour[4];
}
// 12 bytes
This data would be stored in 3 seperate arrays so they are still cache friendly but wont overlap each other.
Then when you perform your update loop the following happen (assuming 'best case' as before);
- Load the first particle, which causes the whole cache line to be loaded, this loads 8 sets of particle life detail in
- Adjust the age, check it against the maxAge and write back as required
- Load the first particles location, which loads another cache line which loads in 8 sets of (x,y) to work with
- update data
- move onto next particle
You can now process 8 particles before you trigger a cache miss AND you are using 100% of each cache line while processing the data.
All of which could well improve performance.
Note, this is by no means a complete investigation into the problem, you can decompose the data further and apply other optimisations, this was just designed to give you an idea of how to go about it.
It should also be noted that, for many many people doing many games this simply isn't a problem. You won't produce enough data to tax the CPU anyway, or you'll hit other bottle necks, or the out of order processing of modern desktop CPUs will hide latency issues.
Don't get me wrong, understanding at this level is worth doing, however don't go so deep into the rabbit hole that you decompose everything because in many cases it'll hurt code clarity more than it'll help performance.
All things in moderation, right tool for the right job and all that.
OOP, DOP and Functional programming can all co-exist just fine, never forget that.






