
Not dead...

More Particles

Posted 02 April 2010 · 238 views


Today was a bank holiday.
So I decided that as I'm not going to be about the rest of the weekend it would be a good time to do "bad things" [grin]

As I've mentioned before, I've been working (on and off) on a particle system, though more 'off' than 'on' of late. Today I decided to crack on and get some stuff done on it.

I already had a base of code copied from the previous test bed app, so the basic functions were outlined but lacking substance. I decided to start with a structure called ParticleInformation, which is key in that it owns the pointers to all the data the particle system runs on.

struct ParticleInformation
{
    float * x;
    float * y;
    float * scale;
    float * momentumX;
    float * momentumY;
    float * velocityX;
    float * velocityY;
    float * age;
    float * colourR;
    float * colourG;
    float * colourB;
    float * colourA;
    float * rotation;
};

This is how it started life today; a collection of pointers for each component. At this point you might be wondering why I'd do that and not just have something which looked more like this;

struct ParticleInformation
{
    vector2 position;
    vector2 momentum;
    vector2 velocity;
    vector4 colour;
    float scale;
    float rotation;
    float age;
};

The reason is simple; for particle processing that structure is about as bad as you can get, CPU and cache wise. On a Core i7 920 each cache line is 64 bytes, with each core having 32KB of L1 data cache and 256KB of L2 cache (shared between the two hardware threads on a core). The structure above, assuming it is packed properly, takes up 52 bytes on its own, so particles straddle cache lines and you can't even fit five of them in four cache lines (260 bytes); given that memory latency is a problem, the cache misses are going to add up and stall you nicely (a round trip to RAM can take 600-odd clock cycles, which is costly). So processing a single particle is going to start doing very bad things to your cache pretty quickly, and it rules out the use of SSE as you can't get decent alignment and you'd have to swizzle the values about to process them.

The structure I'm using, however, is set up so that each component has its own block of memory which can be aligned to 16 bytes, meaning we are nicely set up for SSE right away. When it comes to cache usage things are a bit nicer as well; the algorithm for updating the particles can work on separate sets of data at a time. So for updating the ages we can just blast through the age data in one chunk and process 4 ages at once: 16 bytes can be pulled in, worked on and then flushed back, and with the aggressive pre-fetching on modern CPUs this should pipeline nicely and we can do more work per cycle.

Of course, the structure started life like that, then I got into the issues of life times and copying around as well as data creation and release and it soon became this;

struct ParticleInformation
{
    float * x;
    float * y;
    float * scale;
    float * momentumX;
    float * momentumY;
    float * velocityX;
    float * velocityY;
    float * age;
    float * colourR;
    float * colourG;
    float * colourB;
    float * colourA;
    float * rotation;

    volatile long * refCount;
    void IncRefCount();
    void DecRefCount();
    ParticleInformation(const ParticleInformation &rhs);
    ParticleInformation(ParticleInformation &&rhs);
    ~ParticleInformation();

    void ReserveMemory(int size);
};

The particle information now maintains the data itself (OK, the pointers are still public, but no one besides the particle system ever sees them and direct access is needed, so that's fine), ensuring copying and moving (ParticleInformation(ParticleInformation &&rhs)) happen sanely.

Most functions work as expected so I'll not bother with them; I will touch briefly on my initial move support and dtor.

ParticleInformation::ParticleInformation(ParticleInformation &&rhs)
    : x(rhs.x), y(rhs.y), scale(rhs.scale), momentumX(rhs.momentumX), momentumY(rhs.momentumY), velocityX(rhs.velocityX), velocityY(rhs.velocityY), age(rhs.age),
      colourR(rhs.colourR), colourG(rhs.colourG), colourB(rhs.colourB), colourA(rhs.colourA), rotation(rhs.rotation), refCount(rhs.refCount)
{
    rhs.x = NULL;
    rhs.y = NULL;
    rhs.scale = NULL;
    rhs.momentumX = NULL;
    rhs.momentumY = NULL;
    rhs.velocityX = NULL;
    rhs.velocityY = NULL;
    rhs.age = NULL;
    rhs.colourR = NULL;
    rhs.colourG = NULL;
    rhs.colourB = NULL;
    rhs.colourA = NULL;
    rhs.rotation = NULL;
    rhs.refCount = NULL;
}

This is the 'move' constructor; it transfers ownership from one ParticleInformation instance to another. The main thing to note is that while it is set up like a normal copy-ctor with the initialiser list, the right hand side's values are then nulled out, including the pointer to the refCount, which points to an aligned chunk of memory shared between copied instances. (OK, so I've kind of recreated a smart pointer here, just a limited subset thereof.)

When the class is copied or constructed it increments a reference count, and when it's destroyed it decrements one, via these two functions;

void ParticleInformation::IncRefCount()
{
    _InterlockedIncrement(refCount);
}

void ParticleInformation::DecRefCount()
{
    _InterlockedDecrement(refCount);
}

However, as you can see from the above code, if ownership is transferred via a move operation the refCount pointer has been nulled, so the above is likely to explode; we have to be a little bit careful in the dtor;

ParticleInformation::~ParticleInformation()
{
    if(refCount != NULL)
    {
        DecRefCount();
        if(*refCount == 0)
        {
            // last owner: release the per-component blocks
        }
    }
}

It probably isn't completely 100% safe as yet, but it should do for now [grin]

With that out of the road I decided to move on to the first pass of the update logic.
I decided to split this over 3 functions;

- PreUpdate() - this would handle any new 'trigger' calls and setup blocks of particles and then figure out how many update 'blocks' we want to dispatch
- Update() - which is designed to be called from multiple threads to work on blocks of particles
- PostUpdate() - which will mainly remove any 'dead' particles and any other work required

At the time of writing PreUpdate is only partly done, as I've yet to decide on how I want to structure the work requests for the update loop. It'll be some interface but I need to write that first [grin]

It does however have some basic code for emitting based on trigger requests;

void ParticleEmitter::PreUpdate(float deltaTime)
{
    // Service any queued trigger calls
    while(!queuedTriggers.empty())
    {
        const EmitterPosition & position = queuedTriggers.front();
        if(details.maxParticleCount > usedParticles)
            usedParticles += EmitParticles(position);
        queuedTriggers.pop();
    }

    // then work out block sizes and schedule those blocks for updates
    //TODO: figure out what kind of task interface this is going to use
}


Pretty simple; for each request fire off some particles if we have the space. This is only a first pass because in the end I'd like two classes of particle systems;
- one shot, which is basically what this is
- constantly spawning, which needs slightly different PreUpdate logic however I figure it best to get the one shot working first and then refactor

The EmitParticles function does the main chunk of the work; again this is a first pass system as currently each particle spawns at a single point. Ideally this will be customisable via an affecter later on.

int ParticleEmitter::EmitParticles(const EmitterPosition &position)
{
    int realReleaseAmount = (usedParticles + details.releaseAmount > details.maxParticleCount) ? details.maxParticleCount - usedParticles : details.releaseAmount;

    SetParticleData(realReleaseAmount, 0.0f, particleData.momentumX + usedParticles);
    SetParticleData(realReleaseAmount, 0.0f, particleData.momentumY + usedParticles);

    SetParticleData(realReleaseAmount, position.x, particleData.x + usedParticles);
    SetParticleData(realReleaseAmount, position.y, particleData.y + usedParticles);
    SetParticleData(realReleaseAmount, details.maxLife, particleData.age + usedParticles);
    SetParticleData(realReleaseAmount, details.scale, particleData.scale + usedParticles);
    SetParticleData(realReleaseAmount, details.colour.r, particleData.colourR + usedParticles);
    SetParticleData(realReleaseAmount, details.colour.g, particleData.colourG + usedParticles);
    SetParticleData(realReleaseAmount, details.colour.b, particleData.colourB + usedParticles);
    SetParticleData(realReleaseAmount, details.colour.a, particleData.colourA + usedParticles);

    GenerateForces(realReleaseAmount, particleData.velocityX + usedParticles, particleData.velocityY + usedParticles);

    return realReleaseAmount;
}

void ParticleEmitter::SetParticleData(int amount, float value, float * location)
{
    std::fill_n(location, amount, value);
}

void ParticleEmitter::GenerateForces(int amount, float *locationX, float *locationY)
{
    std::generate_n(locationX, amount, [this]() { return sin((float)this->rndNumGen()) * this->details.releaseSpeed; });
    std::generate_n(locationY, amount, [this]() { return cos((float)this->rndNumGen()) * this->details.releaseSpeed; });
}

My love of lambda functions and the C++ standard library is on display here, mainly in the GenerateForces function, where a lambda is used to generate the release velocities.

The Update function follows next, and it looks deceptively simple...

void ParticleEmitter::Update(float deltaTime, int blockStart, int blockEnd)
{
    // Update a chunk of particles here
    int len = blockEnd - blockStart;
    int remain = len % 8;
    int offset = blockEnd - remain;

    // Deal with age subtraction first
    UpdateAges(deltaTime, offset, remain);
    // Next momentum += velocity; pos += momentum * deltaTime;
    UpdatePositionsAndMomentums(len, deltaTime);
    UpdateColours(len, deltaTime);
    UpdateRotations(len, deltaTime);
}

As mentioned, this is designed to work on blocks of particles, which are always a multiple of 4 in size but not necessarily a multiple of 8; remain (either 0 or 4) is only used in the UpdateAges function, which is simple enough that unrolling it 8-wide doesn't run me into SSE register pressure on x86. If I do x64 versions of the other functions they will probably end up working in much the same way.

UpdateAges itself is quite simple;

void ParticleEmitter::UpdateAges( float deltaTime, int offset, int remain )
{
    __m128 time = _mm_load1_ps(&deltaTime);

    for(int i = 0; i < offset; i += 8)
    {
        __m128 ages = _mm_load_ps(particleData.age + i);
        __m128 updatedAges = _mm_sub_ps(ages, time);
        __m128 ages2 = _mm_load_ps(particleData.age + i + 4);
        __m128 updatedAges2 = _mm_sub_ps(ages2, time);

        _mm_stream_ps(particleData.age + i, updatedAges);
        _mm_stream_ps(particleData.age + i + 4, updatedAges2);
    }

    if(remain > 0)
    {
        __m128 ages = _mm_load_ps(particleData.age + offset);
        __m128 updatedAges = _mm_sub_ps(ages, time);
        _mm_stream_ps(particleData.age + offset, updatedAges);
    }
}

For each age in the particle system we simply read it in, subtract the deltaTime and stream it back out again. It might be worth modifying this function so that each read is half the range away to let us use the cache a bit better as currently it would consume a whole cache line on each loop.

UpdatePositionsAndMomentums is by far the most complex of these functions;

void ParticleEmitter::UpdatePositionsAndMomentums( int len, float deltaTime )
{
    for(int i = 0; i < len; i += 4)
    {
        __m128 posX = _mm_load_ps(particleData.x + i);
        __m128 posY = _mm_load_ps(particleData.y + i);
        __m128 velX = _mm_load_ps(particleData.velocityX + i);
        __m128 velY = _mm_load_ps(particleData.velocityY + i);
        ParticlePosition position(posX, posY);
        std::for_each(details.positionModFuncs.begin(), details.positionModFuncs.end(), [&](particlePositionModifierFunc &func)
        {
            particleForce force = func(position, this->position, deltaTime);
            __m128 forceX = _mm_loadu_ps(force.x);
            velX = _mm_add_ps(velX, forceX);
            __m128 forceY = _mm_loadu_ps(force.y);
            velY = _mm_add_ps(velY, forceY);
        });

        __m128 momentumsX = _mm_load_ps(particleData.momentumX + i);
        __m128 momentumsY = _mm_load_ps(particleData.momentumY + i);

        momentumsX = _mm_add_ps(momentumsX, velX);
        momentumsY = _mm_add_ps(momentumsY, velY);
        velX = _mm_setzero_ps();
        velY = _mm_setzero_ps();

        // store velocity and momentum and reload position which should still live in cache

        __m128 time = _mm_load1_ps(&deltaTime);
        _mm_stream_ps(particleData.velocityX + i, velX);
        _mm_stream_ps(particleData.velocityY + i, velY);
        _mm_stream_ps(particleData.momentumX + i, momentumsX);
        _mm_stream_ps(particleData.momentumY + i, momentumsY);

        momentumsX = _mm_mul_ps(momentumsX, time);
        momentumsY = _mm_mul_ps(momentumsY, time);

        posX = _mm_add_ps(momentumsX, posX);
        posY = _mm_add_ps(momentumsY, posY);

        _mm_stream_ps(particleData.x + i, posX);
        _mm_stream_ps(particleData.y + i, posY);
    }
}

In theory we have enough registers here but it does somewhat assume the compiler can be smart about it.
The process is simple enough;
- read in positions and velocities
- give affecters a chance to adjust the velocities in some manner (more lambda trickery there)
- apply those velocities to the momentum of the particle
- update the position based on time

It only looks complex because we are doing 4 particles at a time.
I'm considering doing some benchmarking on this with regards to the 'std::for_each' placement. There is a chance that moving it out of the loop, updating the velocities for every particle first and then doing the rest, might be better; but without testing, this method will do for now.

The UpdateColours and UpdateRotations functions are also pretty simple;

void ParticleEmitter::UpdateColours( int len, float deltaTime )
{
    for(int i = 0; i < len; i += 4)
    {
        __m128 age = _mm_load_ps(particleData.age + i);
        __m128 red = _mm_load_ps(particleData.colourR + i);
        __m128 green = _mm_load_ps(particleData.colourG + i);
        __m128 blue = _mm_load_ps(particleData.colourB + i);
        __m128 alpha = _mm_load_ps(particleData.colourA + i);

        std::for_each(details.colourModFuncs.begin(), details.colourModFuncs.end(), [&](particleColourModifierFunc &func)
        {
            Age ages(age);
            ParticleColours colours(red, green, blue, alpha);
            func(colours, ages, deltaTime);
        });

        _mm_stream_ps(particleData.colourR + i, red);
        _mm_stream_ps(particleData.colourG + i, green);
        _mm_stream_ps(particleData.colourB + i, blue);
        _mm_stream_ps(particleData.colourA + i, alpha);
    }
}

void ParticleEmitter::UpdateRotations( int len, float deltaTime )
{
    for(int i = 0; i < len; i += 4)
    {
        __m128 rotation = _mm_load_ps(particleData.rotation + i);
        std::for_each(details.rotationModFuncs.begin(), details.rotationModFuncs.end(), [&rotation, &deltaTime](particleRotationModiferFunc &func)
        {
            particleRotations rot = func(deltaTime);
            __m128 accumRotations = _mm_loadu_ps(rot.rotation);
            rotation = _mm_add_ps(rotation, accumRotations);
        });

        _mm_stream_ps(particleData.rotation + i, rotation);
    }
}

Simple read in, modify and write out affairs with no shocks to be had.

Which brings us to the final function of the day; PostUpdate.
This little gem is needed to retire dead particles. In my original test bed the method used was naïve to say the least.

int idx = 0;
while(idx < usedParticles)
{
    if(particle_age[idx] >= particleLife)
    {
        particle_age[idx] = particle_age[usedParticles - 1];
        particle_x[idx] = particle_x[usedParticles - 1];
        particle_y[idx] = particle_y[usedParticles - 1];
        particle_scale[idx] = particle_scale[usedParticles - 1];
        particle_momentum_x[idx] = particle_momentum_x[usedParticles - 1];
        particle_momentum_y[idx] = particle_momentum_y[usedParticles - 1];
        particle_velocity_x[idx] = particle_velocity_x[usedParticles - 1];
        particle_velocity_y[idx] = particle_velocity_y[usedParticles - 1];
        --usedParticles;
    }
    else
    {
        ++idx;
    }
}

While simple enough, employing the old 'copy back' method of particle swapping, it simply isn't very fast or cache friendly.

We move forward through the particles and, when we find a dead one, a particle from the back is copied into its place, usedParticles is decreased and that particle is tested in turn. The problem is we end up wasting time and resources.

Firstly, we know we can copy in blocks of 4 because we enforce a policy of 'at least 4 particles are alive' to keep the SSE sane.
Secondly, this means that blocks of particles are going to retire together. At least 4 and maybe more if the particle system is a 'one shot' setup.
Thirdly, we are copying whole particle data chunks about and pulling data into cache lines when we might just be discarding it again anyway. That's a lot of memory waste right away.
Finally, data is accessed in cache lines, which means on an i7 you'll pull in 64 bytes every time you touch data that isn't already cached. So when you access data at the back you are pulling in that data plus more which won't get used. Also, modern CPUs try to guess what you'll access next and may read in the following cache line as well, just in case, which, as you are walking backwards, you aren't going to touch.

With this in mind I came up with a new plan;
- First find the first dead particle using blocks of 4; we are only accessing the age data at this point, going forward, so nice and cache friendly
-- If we don't find any then bail out because everyone is still alive
-- If we do find a dead particle then remember that block and go looking for the end of it (still only age data at this point)
--- If we hit the end of the active particles before we find an alive particle then everything from the start to the end is dead, so we reduce the used count and bail
--- If we find an end then we mark it and walk through the data trying to find either another dead particle or the end of the active particles
---- As we move we copy blocks of 4 ages back over the old dead particles. These are in cache anyway so this doesn't cost us (assuming std::move uses a non-cache-polluting 'stream' under the hood; if it doesn't then it'll need replacing as it'll be hurting the cache again)
--- Once we find the end point we mark it, figure out the length and then block copy the rest of the particle data down over the old data
--- Then we update the 'dead block' start so it is at the end of the data copied down, set the index to the next 'dead block' we found and loop back to step 2

While CPU wise it looks more complicated, we are touching a lot less data and the CPU will be happier doing block copies than it would be with the initial method.
The code however... is a bit more complex...

int idx = 0;
// Search for first dead block of particles
while(idx < usedParticles && particleData.age[idx] > 0.0f)
    idx += 4;

// - if not found then bail out
if(idx >= usedParticles)
    return;

int deadBlockStart = idx;
while(idx < usedParticles)
{
    // Now search for end of block
    while(idx < usedParticles && particleData.age[idx] <= 0.0f)
        idx += 4;

    // if not found then everything from start to end was dead so reduce count and bail
    if(idx >= usedParticles)
    {
        usedParticles = deadBlockStart;
        return;
    }

    // tag where we got to for the 'alive' block
    int aliveBlockStart = idx;
    int movedParticleDest = deadBlockStart;
    // now search for the end of this block of 'alive' particles
    // moving blocks of 'ages' down as we go
    while(idx < usedParticles && particleData.age[idx] > 0.0f)
    {
        float * start = particleData.age + idx;
        float * end = start + 4;
        float * dest = particleData.age + movedParticleDest;
        std::move(start, end, dest);
        idx += 4;
        movedParticleDest += 4;
    }

    // copy the rest of the particle data down over the dead block
    int aliveBlockLength = idx - aliveBlockStart;
    MoveParticleBlockWithoutAge(aliveBlockStart, aliveBlockLength, deadBlockStart);
    int oldUsedParticles = usedParticles;
    usedParticles -= (aliveBlockStart - deadBlockStart);
    // if we made it to the end then everything has been copied downwards
    if(idx >= oldUsedParticles)
        return;
    // Set idx to last block we looked at which could contain 'dead' particles
    idx = movedParticleDest;
    // and move the 'start' up the amount we have just copied so we are now
    // past the end of the data we have just copied down.
    deadBlockStart += aliveBlockLength;
}

The main wins will come when the whole particle system is dead or at least large chunks of it towards the end.

Having completed that I decided to call it a day; it needs testing and I still need to work on the rendering system (probably a Pre/Render/Post system as well) and generally get it hooked up to my multithreaded test bed to make sure the one shot system works before refactoring into a one shot and continuous system setup.

Still, a good day's work was had I feel; until the next update *salutes*

C++0x as per VS2010

Posted 13 March 2010 · 259 views

So, I've come to a conclusion about C++0x; I do love the lambda stuff [grin]

float multiplier[] = { -1, 1 };
int i = 0;
int j = 0;
std::for_each(dataPoints.begin(), dataPoints.end(), [&](PointAffinity dataPoint)
{
    pData[i].m_driveValue = bands[dataPoint.side][j] * multiplier[sourceSelect] * 7.0f;
    pData[i].position = dataPoint.position;
    ++i;
    if(i % 2 == 0)
        ++j;
});
Something about being able to write that inline is rather nice [smile]

Then there is always this little bit of fun;

int counter = 0;
float multiplier = 1.0f;
std::for_each(dataPoints.begin(), dataPoints.end(), [&](PointAffinity &point)
{
    point.incAmount *= multiplier;
    if(point.side == 0)
    {
        ++counter;
        if(counter % 2)
            multiplier += 0.2f;
    }
});

And having 'bind' as part of the std namespace is nice as well;

struct ID3D11CommandList;
struct ID3D11DeviceContext;

typedef std::function<void (ID3D11DeviceContext*)> deferredFunction_t;

struct RendererCommand
{
    RendererCommand() : cmdID(EDrawingCommand_NOP), cmd(NULL), time(0) {};
    RendererCommand(DrawingCommandType cmdID, ID3D11CommandList * cmd, DWORD time);
    RendererCommand(DrawingCommandType cmdID, const deferredFunction_t &function, DWORD time);
    RendererCommand(const RendererCommand &rhs);

    DrawingCommandType cmdID;
    ID3D11CommandList * cmd;
    deferredFunction_t deferredFunction;
    DWORD time;
};

RendererCommand DeferredComputeWrapper(const std::function<void (ID3D11DeviceContext*)> &func, ID3D11DeviceContext * context)
{
    func(context);
    ID3D11CommandList * command = NULL;
    context->FinishCommandList(FALSE, &command);
    return RendererCommand(EDrawingCommand_Render, command, 0);
}

RendererCommand computeCmd = DeferredComputeWrapper(std::bind(UpdateCompute, width, height, _1), g_pDeferredContext);
Concurrency::send(commandList, computeCmd);
// or even
Concurrency::send(commandList, RendererCommand(EDrawingCommand_Function, std::bind(UpdateCompute, width, height, _1), 2));

Yeah, I think me and VS2010 + C++0x are going to get along just fine... [grin]

More Threads.

Posted 14 February 2010 · 337 views

While I do have a plan to follow up on my last entry with some replies and corrections (I suggest reading the comments if you haven't already), my last attempt to do so hit 6 pages and 2500+ words, so I need to rethink it a bit [grin]

However, this week I have some very much 'work in progress' code which is related to my last entry on threading. This is very much an experiment (and a very very basic one right now) so keep that firmly in mind when reading the below [grin]

As my last entry mentioned, I was toying with the idea of using a single always-active rendering thread which is fed a command list and just dumbly executes it in order. While this is doable with current D3D9 and D3D10 tech (where your command list is effectively a bunch of functions to be called/states to be set), D3D11 makes this much easier with its deferred contexts.

In addition to this with the release of the VS2010 RC the Concurrency Runtime also moves towards final and provides us with some tools to try this out, namely Agents.

Agents, in the CR are defined as;

The Agents Library is a C++ template library that promotes an actor-based programming model and in-process message passing for fine-grained dataflow and pipelining tasks.

This allows you to set them off and then have them wait for data to appear before processing it and then passing on more data to another buffer and so on.

In this instance we are using an agent as a consumer of data for rendering, with a little feedback to the submitting thread to keep things sane.

Agents are really easy to use, which is an added bonus; simply inherit from the Concurrency::agent class, implement the 'run' method and you are good to go. At which point it's just a matter of calling 'start' on an instance and away it goes.

In this instance I have an Agent which sits in a tight loop, reading data from a buffer and then, post-present, writing a value back which tells the sending thread it is ready for more data. The latter is there to stop the data sender from throwing so much work at the render thread that it falls too far behind on frames (most likely in a v-synced setup where your update loop takes < 16ms to process; for example, if your loop took 4ms you could submit 4 frames before the renderer had processed one).

The main loop for the agent is, currently, very simple;

void RenderingAgent::run()
{
    Concurrency::asend(completionNotice, 1);
    bool shouldQuit = false;
    while(!shouldQuit)
    {
        RendererCommand command = Concurrency::receive(commandList);
        switch(command.cmdID)
        {
        case EDrawingCommand_Quit:
            shouldQuit = true;
            break;
        case EDrawingCommand_Present:
            g_pSwapChain->Present(1, 0);
            Concurrency::asend(completionNotice, 1);
            break;
        case EDrawingCommand_Render:
            g_pImmediateContext->ExecuteCommandList(command.cmd, FALSE);
            SAFE_RELEASE( command.cmd );
            break;
        }
    }
    done();
}

The first 'asend' is used to let the data submitter know the agent is alive and ready for data, at which point it enters the loop and blocks on the 'receive' function.

As soon as data is ready at the receive point the agent is woken up and can process it.

Right now we can only understand 3 messages;
- Quit: which terminates the renderer, calling 'done' to kill the agent
- Present: which performs a buffer swap and, once that is done, tells the data sender we are ready for more data
- Render: which uses a D3D11 command list to do 'something'

'Render' will be the key to this as a D3D11 Command list can deal with a whole chunk of rendering without us having to do anything besides call it and let the context process it.

The main loop itself is currently just as simple;

Concurrency::unbounded_buffer<RendererCommand> commandList;
Concurrency::overwrite_buffer<int> completionNotice;

RenderingAgent renderer(commandList, completionNotice);
renderer.start();
RendererCommand present(EDrawingCommand_Present, 0);

g_pd3dDevice->CreateDeferredContext(0, &g_pDeferredContext);

// Single threaded update type message loop
DWORD baseTime = timeGetTime();
while(WM_QUIT != msg.message)
{
    if(PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
    {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    else
    {
        DWORD newTime = timeGetTime();
        if(newTime - baseTime > 16)
        {
            Concurrency::receive(completionNotice); // wait until the renderer is ready for more
            float ClearColor[4] = { 0.0f, 0.125f, 0.3f, 1.0f };
            g_pDeferredContext->ClearRenderTargetView(g_pRenderTargetView, ClearColor);
            ID3D11CommandList * command = NULL;
            g_pDeferredContext->FinishCommandList(FALSE, &command);
            Concurrency::asend(commandList, RendererCommand(EDrawingCommand_Render, command));
            Concurrency::asend(commandList, present);
            baseTime = newTime;
        }
    }
}

RendererCommand quitCommand(EDrawingCommand_Quit, 0);
Concurrency::send(commandList, quitCommand);
Concurrency::agent::wait(&renderer);


The segment starts by creating two buffers for command sending and sync setup.

The 'unbounded_buffer' allows you to place data into a queue for agents to pull out later. The 'overwrite_buffer' can store one value only, with any new message overwriting the old one.

After that we create our agent and start it up, and create a 'present' command to save us constructing it in the loop. Next a deferred context is created and we go into the main loop.

In this case it's hard coded to only update at ~60fps, although changing the value from '16' to '30' does drop the framerate down to ~30fps (both values were checked with PIX).

After that we check to see if the renderer is done and, if so, clear the screen, save that command list off, construct a new RendererCommand containing the pointer and send it off to the renderer. After that it passes the present command to the renderer and goes back around.

The final section is the shut down which is simply a matter of sending the 'quit' command to the renderer and waiting for the agent to enter its 'done' state.

At which point we should be free to shut down D3D and exit the app.

The system works, at least for simple clear-screen setups anyway; I need to expand it a bit to allow for proper drawing [grin], although that's more a case of loading the shaders and doing it than anything else.

The RendererCommand itself is the key link;
- This needs to be 'light' as it is copied around currently. If this proves a problem then pulling commands from a pool and passing a pointer might be a better way to go in the future. Fortunately such a change is pretty much a template change and a couple of changes from '.' to '->' in the renderer.

- The RendererCommand is expandable as well; right now it is only one enum value and a pointer, however it could be expanded to include function pointers or other things which could be dealt with in the renderer itself. This would allow you to send functions to the renderer which execute D3D commands instead of just saved display lists.

One of the key things with this system will be the use of a bucket-sorted command list, where each bucket starts with a state setup for that bucket (saved as a command list) and then each item in the bucket just sets up its local state and does its rendering.

I'm not 100% sure on how I'm going to handle this as yet however.

I'm also currently toying with making the main game loop an agent itself, effectively pushing the message loop into its own startup/window thread and using that to supply the game agent with input data.

There is however another experiment I need to do first with groups of tasks, requeueing and joining them to control program flow as my biggest issues are;
- how to control the transition between the various stages of data processing, specifically when dealing with entity update and scene rendering.
- how to control when scene data submission occurs. This is more than likely going to end up as a task which runs during either the 'sync' or 'update' phase as at that point the data for the rendering segment should be queued up and sorted ready to go.

So, still a few experiments and problems to solve, but as this works I finish the weekend feeling good about some progress [grin]

Direct X vs OpenGL revisited... revisited.

Posted 04 February 2010 · 1,086 views

A short time ago David over at Wolfire posted a blog entry detailing why we should use OpenGL and not DirectX.

The internet, and indeed, his comments asploded a bit under that. I posted a few comments at the time but didn't revisit the site afterwards, mostly because it wouldn't keep me logged in and the comment system was god horrible.

The comments could be split up into the following groups;

- MS haters who want everything they touch to die, largely OSX and Linux users.
- OSX and Linux gamers who, despite not having any technical background, have decided that OpenGL is 'best', for mostly the same reasons as the above.
- Die hard OpenGL supporters, who believe that DX and MS are terrible but offer no technical details as to why.
- Die hard DX supporters who were just as bad as their OpenGL counterparts.
- A few people with both the technical opinion and experience to back up what they were saying - the only bit of signal in a lot of "fanboi" noise.

In a way it was sad to see this as this could have been a good chance to thrash out a few things.

David has since made a follow-up post to address some of the arguments made, and while a bit more balanced it still shows a slight anti-MS "I'm the only one who can see the truth" bias.

However, we'll come to that post in a bit; firstly I want to cover the initial post, which I meant to do at the time but never got around to.

Before I go on I'd like to make a point or two to clear up my 'position' if you will;
- For those who have been around the site for a few years you'll know that I was the OpenGL forum moderator for some time and was heavily involved in that subforum.
- I've also written two articles on OpenGL for using the (at the time much misunderstood) FBO extension and I've had a chapter published in More OpenGL Game Programming regarding GLSL.

In short, I'm not someone who grew up on DX and lived and breathed MS's every word about the API. I spent some years on the OpenGL side of the fence "defending" it from those who used D3D and attacked it, both on the forums and on IRC. It's only in the last year and a half or so that I've dropped OpenGL in favour of D3D, first D3D10 then D3D11, so I have over four times more years using OpenGL than I have D3D.

Why you should use OpenGL and not DirectX

The opening of the blog somewhat sets the tone, overly dramatising things and painting a picture that 'open is best!'.

It opens by saying they are met by 'stares of disbelief' and that 'the temperature of the room drops' when they mention they are using OpenGL. As I said, dramatic - and I feel the obvious first question is a valid one.

If someone said to me 'I plan to make a game, I plan to use OpenGL' I might well act surprised, and at the same time I would ask the important question: why?

Not 'why OpenGL?' but why this choice; what technical reason did you have to decide that, yes, for this project OpenGL is the best thing for you. If the answer comes back as 'we plan to target Windows, OSX and Linux', or mentions either of the latter OSes in a development sense, then fair play; carry on, you've made a good technical case for yourself.

Which is somewhat the key point here; the title suggests that everyone should use OpenGL over DirectX. Not 'use it if it's technically right' or 'because our development choice requires it', but because you just should. It then goes on to try to explain these reasons, with a bias I considered quite interesting.

The technical case is a good place to start, because in that initial article he starts off by attacking anyone who 'goes crazy' over MS's newest proprietary API.


What kind of bizarro world is this where engineers are not only going crazy over Microsoft's latest proprietary API, but actively denouncing its open-standard competitor?

To which I have to say this; what kind of bizarro world would it be if engineers ignored the best technical solution for them in favour of one which doesn't fit it just because the one which does is 'open'?

This, to a degree, is part of my problem with the whole initial piece; it paints those who use DX as mindless sheep; people who are wooed by shiny presentations and a few pressed hands at a meeting. Maybe some are, in much the same way that many people who use OpenGL do so because 'MS is evil'.

But any true self-respecting engineer or designer would look at things like this, look at the alternative and see how it fits their use case. MS reps could press hands all they want, but if someone goes away, tries it and it sucks... well, it won't catch on.

The History

There are only "minor" factual errors here, the main one painting a picture of OpenGL being on everything out there apart from the XBox. That last bit is true, the XBox doesn't expose OpenGL to the developer (nor is its D3D strictly D3D9), but everywhere?

Firstly, OpenGL|ES is not OpenGL. There are differences in both feature set and the way you would program the two APIs. As much as I'm sure people would love to be able to write precisely the same code for both your desktop and mobile device the reality is there are still differences. OpenGL and OpenGL|ES are working towards unity, but it isn't there yet.

OpenGL also isn't on the Wii or PS3. Well, the PS3 does have an OpenGL|ES layer, but it's slow and people just go to the metal. The Wii has an OpenGL-like fixed-function interface, however the functions are different, as is coding for it.

So, in reality, OpenGL has Windows, Linux and OSX.
And yes, I won't deny that is more than the Windows-only reach you get with just raw D3D (although XNA does tip things again).

I'm not going to argue numbers; I'm not a market analyst or a businessman, I'm just correcting some facts.

Why does everyone use DirectX?

The first section of note - because there is nothing really in the 'network effects' section worth talking about - is the 'FUD' section.

This, predictably, centres around Vista and the slide which, apparently, shows that OpenGL will only work via D3D. He even links to an image here and the HEC presentation to prove the point.

Amusingly, both show his point, and indeed the point of the OpenGL community, to be wrong; if there was any FUD then they caused it themselves.
Now, I was about at the time; when I first heard the news I too was outraged at the idea, and there may even be a few posts by myself condemning things, so I was taken in by some negative spin as well.
The problem is, let's really look at that image; see the 'OpenGL32/OGL->D3D' box? See the line going from that top 'OpenGL32' section to 'OpenGL ICD'? Yep, that's right: the OpenGL subsystem was always linked to the OpenGL ICD, an IHV-written segment of code.

So, if there was any FUD it was caused by an over-reaction to a non-event; an over-reaction caused NOT by MS but by the community itself. Talk about shooting yourself in the foot; if anyone hurt OpenGL that day it was, ironically enough, those who liked it the most.

The misleading marketing campaigns... well, I still feel this is a bit of 'he said, she said'. DX10 did bring some new things to the table, but it never really got much of a true run-out to decide either way due to how Vista was greeted. However, the one thing to come out of all of this was that D3D9 games stuck around; no one jumped on D3D10, but I feel this is less a problem with the API than with the OS it was tied to.

Finally, there is the old favourite in any OpenGL vs DX debate: the John Carmack quote.

In a way I feel sorry for him, because ANYTHING he says re: graphics gets jumped on as 'the one true way' and a big thing gets made out of it by any camp who can use it to their own ends.

The quote in question is;

“Personally, I wouldn’t jump at something like DX10 right now. I would let things settle out a little bit and wait until there’s a really strong need for it,”

The problem is, the quote is missing context;


John Carmack, the lead programmer of id Software and the man behind popular Doom and Quake titles, said he would not like to jump to DirectX 10 hardware, but would rather concentrate on his primary development platform – the Xbox 360 game console.

“Personally, I wouldn’t jump at something like DX10 right now. I would let things settle out a little bit and wait until there’s a really strong need for it,” Mr. Carmack said in an interview with Game Informer Magazine.

This is not the first time when Mr. Carmack takes Microsoft Xbox 360 side, as it is easier to develop new games for the consoles. Mr. Carmack said that graphics cards drivers have been a big headache for him and it became more complicated to determine real performance of application because of multiply “layers of abstraction on the PC”. The lead programmer of id Software called Xbox 360’s more direct approach “refreshing” and even praised Microsoft’s development environment “as easily the best of any of the consoles, thanks to the company's background as a software provider”.

“I especially like the work I’m doing on the [Xbox] 360, and it’s probably the best graphics API as far as a sensibly designed thing that I’ve worked with,” he said.

And thus the reason becomes clear; it's not because he loves OpenGL but because he prefers working on the XBox right now and finds the API, the one very much like D3D9 (although with some differences), "refreshing".

The usage of the original quote to try to prove a point is misguided at best, dishonest at worst.

At this point you could point at 'Rage' and say 'ah, but that is in OpenGL, therefore OpenGL MUST be better!'. However any such claims fail to take into account the years of work iD have done with OpenGL, the tools they have, the code base they have AND the technical requirements; as a proper engineer should.

Finally, we get down to the 'meat' of the main article, and this is where technical accuracy takes a bit of a dive;

So why do we use OpenGL?


... in reality, OpenGL is more powerful than DirectX, supports more platforms, and is essential for the future of games.

Now, more platforms is true, I've covered it already and it can't be argued with, the other two statements however are an issue...

OpenGL is more powerful than DirectX

Well, we'll ignore the poor phrasing (DirectX is much more than OpenGL from API weight alone) and focus on the "facts" presented.

So, D3D9 has slower draw calls than OpenGL; this is true on XP. No one is going to dispute this fact, and it is down to a poor design choice of having the driver transition to kernel mode for each draw call. This is practically the reason 'instancing' was invented, and while it was invented to get around the cost of small-object draw calls it did also open up some interesting techniques. OpenGL, while being faster for small-object draw calls, lacked this feature. People asked for it, but it didn't turn up until OpenGL 3.x in any official version (NV might well have had an extension for it before then, but that's hardly the same as cross-vendor support).

The thing is, that was XP; Vista changed the driver model to remove this problem and it no longer exists on a modern OS. It's still a consideration if you are doing D3D9/XP development, but going forward it simply isn't an issue. More important is the need to reduce your draw calls to stop burning CPU time on them anyway.

This brings us nicely to the issue of 'extensions'.

Even when I was developing with OpenGL I viewed these as a blessing and a curse; ignoring the need to access them via a trivial extension loader, there was the issue of cross-vendor support.

It's no secret that until recently ATI/AMD's OpenGL support was spotty. They didn't support as much as NV did and often lagged behind on newer versions. While I personally developed on an ATI machine this was still a source of annoyance at times (such as the slow appearance of async buffer copies from an FBO to a VBO via a PBO, which appeared first in their Vista driver some time after NV's own effort), even if I did manage to miss most of the bugs.

So while they do allow access to newer feature sets, there is the cross-vendor cost to pay; whether this is an advantage, or whether D3D's methods of cap bits (D3D9), fixed functionality (D3D10.x) or 'feature levels' (D3D11) suit you better, is a personal choice.

D3D also had a rudimentary extension system; ATI mostly included some 'hacks' you could perform to do special operations, such as hardware instancing on older cards and rendering to a vertex buffer directly. These few features have the same 'cross-vendor' cost as above, however.

At which point we get to some downright incorrect 'facts' surrounding tessellation.
The main one was;

The tesselation technology that Microsoft is heavily promoting for DirectX 11 has been an OpenGL extension for three years

Unfortunately this is pretty much all wrong. The extension in question, provided only by AMD due to NV currently having no hardware which can do it, didn't appear in the public domain until last year. I know this because I was watching for it to see when it would finally make an appearance.

The other problem is that it is NOT the same thing; D3D11's tessellation setup consists of 3 stages;
- hull shader
- fixed-function tessellator
- domain shader

This extension provides the fixed-function tessellation stage but not the two programmable ones around it. Right now, looking at the extension registry, I see nothing to indicate OpenGL supports these features, nor do I believe it will until NV get Fermi out of the door (May this year?) and have their own extension.

There is also this assertion;

I don't know what new technologies will be exposed in the next couple years, I know they will be available first in OpenGL.

This is nothing more than hand-wavy feel-good nonsense and is easy to disprove; where is OpenGL's hull and domain shader support? More importantly, on the subject of 'power', where is OpenGL's support for having 'N' deferred contexts onto which I can build "display lists" from 'N' threads and have them draw on the main thread? Or the other multi-threading things D3D11 brings to the table? Multi-core is the future; even if your final submit has to be on a single thread, the ability to build up your data in advance is very important right now.

Finally, based on recent history and the way things are going, MS are driving the tech now; if it continues as it is then the ONLY way OpenGL is going to get a feature before a D3D version does is if the ARB stop playing 'catch up' with the spec and get a GL version ahead of D3D, OR a vendor releases a card before a D3D release with an OpenGL extension ready to go.

That point also goes hand in hand with the comment about 'the future of games'.
There are two things at work here which go against that statement;
- Firstly, OpenGL hasn't been a threat to D3D for some time. D3D has been driving development of hardware forward while the ARB seemed to flounder around arguing internally. They have got better of late, but they are still behind.

- Secondly: consoles. They have the numbers and they are significantly easier to develop for. They are the driving force behind the big games at least, and to a degree influence what those coming behind want to do.

Don't get me wrong, I wouldn't declare PC gaming to be 'dead', but it very much plays second fiddle these days.

The final argument put forward in this section is that 50% of users still have XP systems; yep, can't fault that either. The Valve hardware survey which was linked to does indeed back this up, and this is a fine number to cling to if you are releasing a game now or maybe even in the next few months. However, there is a reality here which it ignores: Windows 7.

The uptake of Windows 7 has been nothing if not fantastic, certainly after Vista's panning, and this is a trend expected to continue as new PCs come with it and gamers, seeing a 'proper' update from XP, move across to it. If only half of those on XP with a DX10 card move across to Win7, then the share of people who can run D3D11-based games, albeit on the D3D10 feature level, rises to around 63%, and I suspect this will be true in reasonably short order. In short, if you are starting a game now to be released in a year or two then there isn't a good reason NOT to use D3D/DX11 based on those numbers (provided you are targeting only Windows, of course).

OpenGL is cross platform

Yep, agreed, although again we find a Carmack quote pulled out and then twisted to spin positive for OpenGL usage;

As John Carmack said when asked if Rage was a DirectX game, "It’s still OpenGL, although we obviously use a D3D-ish API [on the Xbox 360], and CG on the PS3. It’s interesting how little of the technology cares what API you’re using and what generation of the technology you’re on. You’ve got a small handful of files that care about what API they’re on, and millions of lines of code that are agnostic to the platform that they’re on." If you can hit every platform using OpenGL, why shoot yourself in the foot by relying on DirectX?

Again, the final assertion ignores the tools and existing tech iD have when dealing with OpenGL, which make it a viable choice for them; it also seems to ignore that they target a D3D-ish API for the X360 and the native lib for the PS3. Also, as pointed out earlier, OpenGL doesn't get you 'every platform'; more than D3D, yes, but the article itself then goes on to point out that XP users are the biggest single desktop gaming platform, and with the migration to Win7 well underway it becomes less cut and dried.

OpenGL is better for the future of games

This... well, it reads more as an attack on MS than anything else. Talk of a monopolistic attack and an 'industry too young to protect itself' is more a plea to the heart than a fact-based argument... so let's bring in some facts!

The 'attack' spoken of here would seem to refer to the FUD section earlier, but there are two problems with that.

Firstly, as I pointed out, the FUD over Vista was self inflicted. The community did it to themselves and yet somehow MS got the blame.

The second is the idea that programmers and engineers would use something just because someone showed them some pretty slides and said 'hey, use this' without taking the time to look into it. If D3D didn't deliver then no one would touch it outside of the XBox; indeed, if it hadn't delivered then something else would have come up, or the XBox wouldn't have existed in the first place.

Then there is the "industry too young to protect itself"; the industry isn't that young.
I was playing games back when I was 5 years old; that's 25 years ago. The Atari 2600 was released in 1977, 33 years ago. The industry has been here for over 30 years, over many platforms, so the idea that it is 'young' and undefended is strange. If anything, programmers are one group who are, traditionally, very resistant to change, doing things the way they did back in the old days because it was good enough then. So for them to switch to D3D from OpenGL means there must have been a good reason.

And there is something which is rarely pointed out; OpenGL was indeed there first. Granted, for 3D acceleration it was beaten out by GLIDE initially, with many games supporting it; however, as the ICDs appeared, games started to move across to OpenGL and away from GLIDE. Half-Life and Unreal Tournament stand out as two games which had GLIDE, OpenGL and D3D support, with D3D being the lesser choice in those days.

In short, it was OpenGL's position to lose and they lost it. MS might well have had a hand in this when they were on the ARB (I don't know for sure) but they left in 2003 and yet nothing happened.

Which brings us to, what was for me, the highlight of the blog in the final section;

Can OpenGL recover?

Firstly, I would say yes it can, but it will need the features, cross-vendor support, better tools on Windows and better driver support. It will need to give people a reason to switch away from D3D11 (or whatever follows).

However, this isn't the key bit, as among the tugs on the heart strings and the 'exists only to stop you getting games on XP, Mac or Linux' rant there was this little gem;


If there's anything about OpenGL that you don't like, then just ask the ARB to change it -- they exist to serve you!

This gave me a few minutes of laughter, for a good reason; experience has taught me that the ARB couldn't find its own arse with 4 tries and a detailed map.

In fact, this is a good entry point into the follow-up article as well...


OpenGL 3.0 sucked! It was delayed drastically and didn't deliver on its promises!
OpenGL 3.0 was not the revolutionary upgrade that it was hyped to be, but it was still a substantial improvement. OpenGL 3.1 and 3.2 addressed many of the concerns not addressed by 3.0, and it looks like it's on track to keep improving! If more game developers start using OpenGL again, the ARB will have more incentive and ability to keep improving OpenGL's gaming features.

And between the two here in lies part of the problem.

Developers, including game developers, were presented with a much improved and, above all, modern API by the ARB. They told the ARB they loved the direction, gave feedback and generally made a big noise about looking forward to it; after the mess which was OpenGL 2.0 and the amount of time it took to get VBOs, it finally looked like D3D10 had given them the kick they needed.

You see, D3D10 has been hailed as a great improvement to the API; its usage fell flat due to Vista, however D3D11 is very much a slight refinement of it. Longs Peak was in the same vein; indeed, it was better than D3D10 based on what little we had seen.

The ARB talked, we said 'awesome!' and then... well... I don't know if we'll ever truly know.

It went silent, people asked for updates, nothing happened and finally, after a wall of silence had descended, OpenGL 3.0 was released and the OpenGL.org forum asploded. Yep, they had done it again.

This is the problem with OpenGL and the ARB; they do it to themselves.

On the day OpenGL 3.0 was announced numerous people, myself included, made a noise and then walked off to D3D10 and D3D11 land. Much like the FUD problems before, the OpenGL community had crippled itself.

I know from a few PMs I had at the time that there were people who worked on the Longs Peak spec who were just as upset about this turn of events... well, more so... than the end users who walked away. As I said, I doubt we'll ever really know; all I do know is that, despite what went around at the time, I was told it wasn't the CAD developers who caused the problem.
(Personally, I think Blizzard and one of the IHVs sunk it... but again, we'll probably never know).

All of which brings me back to the two quotes above; the ARB have shown time and time again that they can't get things done. MS, on the other hand, deliver. Between that and the tools, docs and stability of the drivers, I know whose hands I'd rather put my future in.


Are you saying that AAA developers use DirectX just because they're too stupid to see through Microsoft's bullshot comparison ads? You're the only one who's smart enough to figure it out?
No, of course not. That kind of marketing primarily affects game developers via gamers. Since gamers and game journalists are not graphics programmers, they believe Microsoft's marketing. Then, when the gaming press and public are all talking about DirectX, it starts to make rational short-term business sense for developers to use DirectX and ride Microsoft's marketing wave, even if it doesn't make sense for other reasons.

On the other hand, game developers are directly targeted by DirectX evangelists and OpenGL FUD campaigns. At game developer conferences, the evangelists are paid to shake your hand and deliver painstakingly-crafted presentations and well-tested arguments about why your studio should use DirectX. Since nobody does this for OpenGL, it can be hard to make a fully informed decision. Also, not even the smartest developers could have known that the plans for dropping OpenGL support in Vista were false, or that the terrible Vista beta drivers were not representative of the real ones. It doesn't leave a bad taste in your mouth to be manipulated like this? It sure does for me.

Are these the only reasons why DirectX is so much more popular than OpenGL? No, but they're a significant factors. As I discussed at length in the previous post, there are many network effects which cause whichever API is more popular to keep becoming more popular, so small factors become very large in the long run.

I somewhat covered this earlier, but I feel it's worth addressing directly.

The first paragraph is certainly true now; gamers do talk a lot more about D3D. However, it wasn't always that way. At one point OpenGL was the big name - not as big, because it was a few years back now and the internet wasn't as connected as it is today, but it was still a major factor. For some years there was a belief that OpenGL games looked better, and I recall the asplosion which occurred when it turned out Half-Life 2 wouldn't support OpenGL.

Which brings up two points;
- OpenGL was popular, yet it lost that position before the "fanboys" and gamers had latched onto D3D
- Developers were already switching across to D3D-only at this point, before the marketing factor which exists today kicked in

That alone tells us something about the fight between OpenGL and D3D on a technical merit.

The second paragraph seems to, yet again, cast doubt on the ability of other engineers to make a technical choice. Again, we need to view this with some history attached; before D3D9, DX wasn't really a 'big deal'. DX7 sucked and DX8 wasn't much better. MS, while putting cash in, wouldn't have been putting anywhere near as much in, and, due to its popularity, OpenGL would have had a high share of the technical knowledge; many people back in 2000 wanted to work with OpenGL simply because they attached the name Carmack to it.

Again, this was OpenGL's position to lose.

The final section about the FUD and beta drivers seems to also continue this theme of engineers and developers being naïve.

The FUD is somewhat forgivable for the younger engineers and those who didn't look too closely to start with; someone misread something, a shit storm appeared and hurt OpenGL, but not from MS, as already mentioned. In fact, I dare say most developers who took the time to look (including myself, after a while) would have realised no such dropping of OpenGL was going to happen.

As for the beta drivers... well; read your release notes.
No one should expect 'beta' to be final quality, and ATI, at least, pointed out they didn't include any OpenGL drivers. I have a vague memory that NV made a point of saying theirs weren't final as well, but I wouldn't swear to it.

Given this situation I feel it is David who, in his reply, is trying to manipulate the reader over something which, when looked at more closely, never happened or was never really a problem if people had thought about it. So, yes, it does leave a bad taste in my mouth when someone tries to repaint the past and manipulate me.

As for the rest of the reply, well, he does do a decent job of being more honest and less evangelical than before;

- While he points out that you might not want OpenGL if you are doing a console exclusive or an XBox game with a cheap Windows port, I disagree with the assertion that in other cases OpenGL is the logical place to start. Technically speaking D3D11 is the better API; it is cleaner and offers more features. But beyond that there is the issue of what you want to do; even if you are just developing for a home computer that doesn't mean you'll want to support OSX or Linux, and with that consideration OpenGL isn't the logical choice, although it is still a choice.

For me it wouldn't be, which is an example of technical thinking; I want to push cores to the limit and OpenGL just doesn't have the multi-threaded support that D3D11 does. This is a technical barrier, and only D3D11 can support what I want to do.

In fact, this links into this statement;

If we take a larger view, the core functionality of Direct3D and OpenGL are so similar that they are essentially identical

To which I disagree; depending on the level of support and what constraints you are under (such as the above) there are large differences between core GL 3.2 and DX11.


The most important differences are that one is an open standard, and the other proprietary, and that one works on every desktop platform, and the other does not.

Frankly, the first statement is rubbish; open standard vs proprietary doesn't matter if one can do things the other can't. The second statement only matters if it dovetails with your plans, which may or may not be restricted by technical reasons.

After that it is mostly minor issues. There is a retraction on tessellation, albeit with a downplaying of the tech and a claim that OpenGL will be ready when it is important/popular (AvP would like a word with you about that one), which while nice doesn't really help developers get on and use it. He also says that older methods can produce the same visible result as tessellation, which might be true; however, speed is an issue here and I doubt it'll be as fast (less so on Fermi, as it can tessellate 4 triangles at once via a change in hardware). He questions the fixed-function stage, seemingly ignoring the two programmable stages around it, and also ignoring that there is no need for it to be programmable given its nature. It might well become more programmable in future, although I'm not sure how, but this is a sane stepping stone if that is the case. Finally, there is a comment about ATI having had tessellation for a decade but it not being used - for good reason;
- TruForm wasn't great
- The tessellator on the 360 wasn't great either
- The tessellator in the consumer cards wasn't exposed until last year

By contrast, D3D11's setup already has one game out using it (Dirt 2) and AvP coming soon, which makes heavy use of it, with other games sure to follow.

And I think that covers all the important points.

As I hope you can see, things aren't as cut and dried as some would like you to believe; D3D didn't muscle out the little guy via funding and FUD. The little guy was once the big guy and simply lost, because it didn't improve and because it generated its own FUD.

So, to reformat and ask the original blog's question again would, I think, be a good way to end;

Why should you use OpenGL?

You should use it when it meets your technical demands.

But please; try to leave emotion at the door. It's just an API after all; no need to try and tug at the heartstrings.


Posted by , 13 January 2010 · 316 views

As I've mentioned before, I've been working on a highly threaded particle system (not of late, but, you know, it's still in the pipeline as you'll see in a moment); however, this has got me thinking about threading in general and trying to make optimal use of the CPU.

Originally my particle system was going to use Intel's Threading Building Blocks; however, as I want to release the code, most likely under zlib, the 'GPL with runtime exception' license TBB is under finally freaked me out enough that I've decided to drop it in favour of MS's new Concurrency Runtime, which currently ships with the VS2010 beta.

One thing the CR lets you do is set up a scheduler which controls how many threads are working on things at any given time; whether it matches hardware threads, priority, over-subscription etc. are all options which can be set, granting you much more control over how the threads are used when compared to TBB.

Looking at this I got thinking about how to use threads in a game and more importantly how tasks can be applied to them.

If we consider the average single threaded single player game then the loop looks somewhat like this;

update world -> render

There might be variations on how/when the update happens, but it's basically a linear process.

When you enter the threaded world you can do something like this;

update \ sync
update ---> sync ---> render
update / sync /

Again, when and where the update/sync happens is a side point; the fact is that rendering again pulls us back to a single thread. You could run the update/sync threads totally apart from the render thread; however, that brings with it problems of scalability and synchronisation.

If you have 4 cores and you spawn 4 threads (one for each update plus a render thread) and run them all at once, then you need to sync between them, which will involve a lock of some sort on the world. Scalability also becomes a concern, more so if you assign each thread a fixed task to carry out, as when you throw more cores at it they will go unused.

You could still use a task-based system; however, a key thing is that you might not be rendering all the time. So you could use those 3 threads to update/sync based on tasks, but for some of the time the rendering thread will go idle, which is time you might be able to use.

For example, assuming your game can render/update at 60fps (~16ms a frame), your rendering might only take 4ms, which means that for ~12ms a frame a core could very well be idle and not doing useful work.

This is where over-subscription comes into play: creating more threads than we have hardware to deal with.

In a way, if you do a task-based system which uses all the cores and you use something like FMOD, then you'll already be doing this, as it will create at least one thread in the background, and other audio APIs do the same.

The key thought behind this is that a device, in D3D (and OGL) terms, is only ever owned by one thread, so unless you can force rendering tasks onto the same thread all the time issues start to come up. You might be able to grab the device to a thread and release it again; however, even if this is possible it would probably cause bad voodoo. For this reason you are pretty much stuck with the thread you render from.

As you are stuck with a thread anyway then why not create one specifically for the task of rendering? You could feed it work in the form of per-frame rendering data and let it do its thing while you get on and update the next frame of the game.

However, this would impact your performance, as you'd have more threads looking for resources to run on than you'd have hardware to run them. So the question becomes: would it be better to lose Xms of idle time, or would the fighting over cores cost you less in the long run?

The matter of cache also comes up; however, the guys who worked on the CR bring up an important point: during your thread's life you are more than likely to be pre-empted anyway, at which point, if you have affinity and masks set, you'll stall until the CPU has freed that core, or you bounce cores and lose your cache. Chances are, even if you stick around and cost yourself time, your cache is going to be messed with anyway, so it might not be worth the hassle. (The CR will bounce threads between cores as needed to keep things busy for this reason.)

The advent of D3D11 also makes this more practical, as you can set things up as follows;

update \ sync \ pre-render
update ---> sync ---> pre-render ---> next frame
update / sync / pre-render /

----- render ------------------------>

In this case the pre-render stage can use tasks and deferred contexts to create the data the render thread will ultimately punt down to the GPU. This could also improve framerate, as it allows more object setup and maybe more optimal data to be passed to the GPU.

There remain the matters of syncing the data to be rendered and what happens if you throw a fixed time step into the mix (although this is most likely solved by having the pre-render step run every loop regardless of update status and having it deal with interpolation); however, the idea seems workable to me.

If anyone can see any serious flaws in this idea feel free to comment on them. I probably won't get around to it for a few months as it stands, as I've a few things to do (not least of all the particle system [wink]), but it's certainly an idea I'd like to try out.
