# most efficient general rendering strategies for new GPUs

93 replies to this topic

### #41mhagain  Crossbones+   -  Reputation: 7595

Posted 14 June 2012 - 08:17 PM

Please clarify. You are talking about these variables incrementing once per object on utterly diverse objects. Not incrementing from tree #1 to tree #1000 as per normal instancing practice, but instance #1 is a tree, instance #2 is a car, instance #3 is your mother-in-law, etc? If the latter is true, exactly what causes this built-in gl_InstanceID to increment? And when it does increment, where is the array of transformation matrices if not in a uniform block?

I know about instancing, but if this applies in some convenient way to diverse objects, then you are correct --- I did not notice this, and it is very cool.

I'm not sure just where you're getting that idea from - what I am talking about here is standard instancing.

You seem to be missing the point that even with standard instancing you will have multiple instances of the same object. You won't just have 10,000 completely unique objects in your scene - you'll have many cars, many trees, many people. So you sort them by object type and draw them using standard instancing.
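
The sort-by-type step described above can be sketched roughly as follows. This is only an illustration of the grouping idea, not renderer code: the `build_instance_batches` name and the `(mesh_name, transform)` tuples are assumptions made for the example, and each resulting batch stands in for one instanced draw call.

```python
from collections import defaultdict

def build_instance_batches(objects):
    """Group (mesh_name, transform) pairs so each mesh type needs one instanced draw."""
    batches = defaultdict(list)
    for mesh_name, transform in objects:
        batches[mesh_name].append(transform)
    # One batch per mesh type; the per-instance transforms of a batch would be
    # packed into a single array (hypothetically a uniform block or instance
    # buffer) and indexed by the instance ID in the shader.
    return dict(batches)

# Placeholder strings stand in for real transformation matrices.
scene = [("tree", "T0"), ("car", "C0"), ("tree", "T1"), ("person", "P0"), ("tree", "T2")]
batches = build_instance_batches(scene)
draw_calls = len(batches)  # one instanced draw per mesh type
```

With this grouping, the number of draw calls scales with the number of distinct mesh types rather than with the number of objects.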

You also seem to have this odd idea that re-upping vertexes is going to be much faster than re-upping matrixes. Maybe you have a hypothetical best-case where it is (as you mention, only when objects change) but your worst-case (when everything needs to change) is going to be at least an order of magnitude slower. You're missing the point that this is important. Maintaining a stable and consistent framerate gives a much better player experience than having a framerate that's see-sawing all over the place.

That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects), and must call some kind of glDrawElements() function 10,000 times (once per object) instead of just once (or more commonly 10 to 100 times).

Again, no, no, no, no, no. The example of Battlefield 3 was given to you - it instances everything. It does not make one draw-call per object per frame. What part of that is not going into your brain?

You have a theoretical setup - BF3 is out there saying "Eppur si muove". This works. It's a solved problem.

It appears that the gentleman thought C++ was extremely difficult and he was overjoyed that the machine was absorbing it; he understood that good C++ is difficult but the best C++ is well-nigh unintelligible.

### #42maxgpgpu  Crossbones+   -  Reputation: 279

Posted 14 June 2012 - 08:23 PM

Please clarify. You are talking about these variables incrementing once per object on utterly diverse objects. Not incrementing from tree #1 to tree #1000 as per normal instancing practice, but instance #1 is a tree, instance #2 is a car, instance #3 is your mother-in-law, etc? If the latter is true, exactly what causes this built-in gl_InstanceID to increment? And when it does increment, where is the array of transformation matrices if not in a uniform block?

I know about instancing, but if this applies in some convenient way to diverse objects, then you are correct --- I did not notice this, and it is very cool.

I'm not sure just where you're getting that idea from - what I am talking about here is standard instancing.

You seem to be missing the point that even with standard instancing you will have multiple instances of the same object. You won't just have 10,000 completely unique objects in your scene - you'll have many cars, many trees, many people. So you sort them by object type and draw them using standard instancing.

You also seem to have this odd idea that re-upping vertexes is going to be much faster than re-upping matrixes. Maybe you have a hypothetical best-case where it is (as you mention, only when objects change) but your worst-case (when everything needs to change) is going to be at least an order of magnitude slower. You're missing the point that this is important. Maintaining a stable and consistent framerate gives a much better player experience than having a framerate that's see-sawing all over the place.

That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects), and must call some kind of glDrawElements() function 10,000 times (once per object) instead of just once (or more commonly 10 to 100 times).

Again, no, no, no, no, no. The example of Battlefield 3 was given to you - it instances everything. It does not make one draw-call per object per frame. What part of that is not going into your brain?

You have a theoretical setup - BF3 is out there saying "Eppur si muove". This works. It's a solved problem.

I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a GeneralPatton object, then you have several instances of GeneralPatton... only with the kind of variations from one displayed-instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.

### #43maxgpgpu  Crossbones+   -  Reputation: 279

Posted 14 June 2012 - 08:27 PM

If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.

It does - you calculate an initial bbox from the initially untransformed vertexes at load time, then you just transform the corners of the bbox. 2 transforms versus one per-vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago.

Yes, that works if you accept larger AABB than are actually required by the object in some to many orientations of the object. And for lots of purposes in lots of applications that's probably just fine.
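
A minimal sketch of the load-time-box idea discussed above, written in Python purely for illustration. The function names and the 3x4 row-major matrix layout are assumptions of this example. Note that it transforms all eight corners of the local box rather than only the min and max corners, since that is what keeps the result conservative when the transform includes a rotation; the resulting box can be looser than one computed from every vertex, which is the trade-off mentioned in the reply.

```python
from itertools import product

def transform_point(m, p):
    """Apply a 3x4 affine matrix (three rows of [a, b, c, t]) to point p."""
    return tuple(r[0] * p[0] + r[1] * p[1] + r[2] * p[2] + r[3] for r in m)

def transformed_aabb(m, lo, hi):
    """World-space AABB enclosing the local box (lo, hi) after transform m."""
    # product(*zip(lo, hi)) enumerates the 8 corners of the local box.
    corners = [transform_point(m, c) for c in product(*zip(lo, hi))]
    new_lo = tuple(min(c[i] for c in corners) for i in range(3))
    new_hi = tuple(max(c[i] for c in corners) for i in range(3))
    return new_lo, new_hi

# Hypothetical transform: rotate 90 degrees about Z, then shift 10 units along X.
rotate_z_90_then_shift_x = [
    [0, -1, 0, 10],
    [1,  0, 0,  0],
    [0,  0, 1,  0],
]
world_lo, world_hi = transformed_aabb(rotate_z_90_then_shift_x, (0, 0, 0), (2, 1, 1))
```

The per-object cost is eight point transforms regardless of how many vertices the mesh has.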

### #44maxgpgpu  Crossbones+   -  Reputation: 279

Posted 14 June 2012 - 08:30 PM

If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.

It does - you calculate an initial bbox from the initially untransformed vertexes at load time, then you just transform the corners of the bbox. 2 transforms versus one per-vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago.

Yes, that works if you accept slightly larger AABB than are actually required by the object in some to many orientations of the object. And for lots of purposes in lots of applications that's probably just fine.

And yes, the framerate must be constant.

Edited by maxgpgpu, 16 June 2012 - 02:59 AM.

### #45maxgpgpu  Crossbones+   -  Reputation: 279

Posted 14 June 2012 - 08:33 PM

If the CPU doesn't transform ALL object vertices to world-coordinates, it doesn't know the object AABB.

It does - you calculate an initial bbox from the initially untransformed vertexes at load time, then you just transform the corners of the bbox. 2 transforms versus one per-vertex is all that you need.

Even Quake 2 did that back in 1997. Really, you're coming across as if you've created an artificially complex solution to a problem that has already been solved in a much simpler, cleaner and more performant manner well over a decade ago.

Yes, that works if you accept larger AABB than are actually required by the object in some to many orientations of the object. And for lots of purposes in lots of applications that's probably just fine.

And yes, the framerate must remain constant --- always.

### #46dpadam450  Members   -  Reputation: 862

Posted 14 June 2012 - 09:47 PM

So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects),

Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, nobody wants to upload vertices each frame. You're trading the cost of sending a matrix against the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, or anisotropic filtering?

I'm pretty sure I had a discussion on here before and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because you might say stuff is fast, so theoretically you can do whatever you want, but if you are rendering in 1080p with some cranked-up effects, then you are missing performance. You also imply that calling glDraw.... for each object is bad. Well, it's not so bad if the GPU is queued up for work, because it's probably already so far behind in draw commands that your new ones are not even slowing it down.

So, if no objects rotate or move, no objects are uploaded to the GPU.

Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model, for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world, and outdoors you only see 1/2 of the world.

Clarify if you are or aren't the guy I'm thinking of, and if so I want to see pics. Because the trees in my game are like 20K polys each, I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K poly trees x say 100 trees.

### #47Hodgman  Moderators   -  Reputation: 28587

Posted 14 June 2012 - 09:50 PM

See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else has so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort), see how the change I propose impacts everything else.

No, it doesn't take that much effort to evaluate other approaches -- there is no "conventional" approach; we're always looking for alternatives. I've used methods like this on the Wii, where optimised CPU transform routines were better than its GPU transforms (and also on the PS3, which has way more CPU power than GPU power), but without knowing the details of the rest of those specific projects, we can't judge here the cost/value of those decisions.

The problem is that your alternative isn't as great or original as you think it is. However, if we consider two extremes -- complete CPU vertex processing vs entirely GPU driven animation/instancing of vertices -- then you just have to consider that there is a continuous and interesting middle ground between the two extremes at all times. Depending on your target platforms, you find different kinds of scenes/objects are faster one way or the other. Different games also have different needs for the CPU/GPU of their own -- some need all the CPU time they can get, and some leave the CPU idle as the GPU bottle-necks. Our job should be to always consider this entire spectrum, depending on the current situation/requirements.

I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates.

No, this is utter bullshit. You haven't even understood "conventional" methods yet, and you're justifying your clever avoidance of them for these straw-man reasons? Decent multi-key sorting plus the simple state-cache on page#1 removes the majority of state changes, and is easy to implement. I use a different approach, but it still just boils down to sorting and filtering, which can be optimised greatly if you care.
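
As a rough illustration of the sorting-plus-state-cache point above (a Python sketch under simplifying assumptions, not the renderer being described): each draw is reduced to a hypothetical `shader` and `texture` key, the cache skips any bind that would not change the current state, and sorting by a multi-part key groups draws that share state.

```python
def count_state_changes(draws):
    """Count binds issued when a cache skips re-binding the current state."""
    bound = {"shader": None, "texture": None}
    changes = 0
    for draw in draws:
        for key in bound:
            if bound[key] != draw[key]:
                bound[key] = draw[key]  # a real renderer would issue the bind here
                changes += 1
    return changes

# Hypothetical draw list in arbitrary submission order.
draws = [
    {"shader": "lit", "texture": "bark"},
    {"shader": "unlit", "texture": "sky"},
    {"shader": "lit", "texture": "leaf"},
    {"shader": "lit", "texture": "bark"},
    {"shader": "unlit", "texture": "sky"},
    {"shader": "lit", "texture": "leaf"},
]
unsorted_changes = count_state_changes(draws)

# Multi-key sort: most expensive state first, cheaper state as the tie-breaker.
sorted_draws = sorted(draws, key=lambda d: (d["shader"], d["texture"]))
sorted_changes = count_state_changes(sorted_draws)
```

Which state deserves the most significant position in the sort key depends on the platform and is something to measure rather than assume.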

Moreover, there's no reason that you have to have "the transform matrix" -- it's not the fixed function pipeline any more. Many static objects don't have one, and many dynamic objects have 100 of them. Objects that use a shared storage technique between them can be batched together. There's no reason I couldn't take a similar approach to your all-verts-together technique, but instead store all-transforms-together.

However, your argument more-or-less presumes lots of state-changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that HAS [almost] NO STATE CHANGES is still significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver!

I don't agree with how you've characterised that. I meant to imply that state-change costs are unintuitive, but that they are still something that can be measured. If it can be measured, it can be optimised. If it can't be measured, then your optimisations are just voodoo.

Also, just because you've got 0 state changes, that in no way grants automatically better performance. Perhaps the 1000 state-changes that you avoided wouldn't have been a bottle-neck anyway; they might have been free in the GPU pipeline and only taken 0.1ms of CPU time to submit? Maybe the optimisation you've made to avoid them is 1000x slower? Have you measured both techniques with a set of different scenes?

I really don't know what you mean when you say Battlefield 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a [Soldier0] object, then you have several instances of [Soldier0]... only with the kind of variations from one displayed-instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.

Maybe you don't want to waste VBO memory on storing a unique mesh for every "Soldier" in the game, so you only store a single (local space) "base mesh" that's kind of an average soldier. Your art team still models 8 unique soldiers that the game requires, as usual, but they are remapped as deltas from the base mesh. You then store your 1 base-mesh and 8 delta/morph-meshes together, and then instance 100 characters using the 1 base-mesh, each of which will be drawn as one of the 8 variations as required/indexed.

In your system, we not only need the original 9 data sets, but also enough memory for the 100 transformed data-sets. We then need enough GPU memory for another ~200 (n-hundred) transformed meshes, because it's dynamic data and the driver is going to want to ~double buffer (n-buffer) it.

I really don't know why I bothered with this, as you're being very dismissive of the real criticism that you solicited, but again, look at your memory trade-offs...

The "conventional" memory usage looks like:

= Models.Verts * sizeof(Vert) //local data
+ Instances * Model.Bones * sizeof(float4x4) //transforms

whereas yours looks like:

= Models.Verts * sizeof(Vert) //local data
+ Instances * Model.Bones * sizeof(float4x4) //transforms
+ Instances * Models.Verts * sizeof(Vert) //world data
+ Instances * Models.Verts * sizeof(Vert) * N //driver dynamic VBO buffering

If you're going next-gen, you're looking at huge object counts and huge vertex counts. In the first equation above, there is no relation between the two -- increasing instances does not impact vertex storage, and increasing vertices does not increase the per-instance storage cost. However, in the latter, they are connected -- increasing the amount of detail in a model also increases your per-instance memory costs.
Seeing as modern games are getting more and more dynamic (where more things are moving every frame) and more and more detailed (where both vertex and instance counts are increasing greatly), I'd give your users the option to keep their costs independent of each other.

e.g. the example I gave earlier -- of 100 animated characters, with 100 bones each, using a 1M vertex model -- clearly is much more efficient in the "conventional" method than in yours. So if you wanted to support games based around animated characters, you might want to support the option of a different rendering mode for them.
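
To put rough numbers on that example, here is a small Python sketch of the two memory formulas. The 32-byte vertex size and the driver copy count of 2 are assumptions chosen for illustration (the post leaves both unspecified), and the function names are hypothetical.

```python
MATRIX_BYTES = 64  # a 4x4 matrix of 32-bit floats

def conventional_bytes(verts, instances, bones, vert_bytes):
    """Local-space vertex data plus one matrix per bone per instance."""
    return verts * vert_bytes + instances * bones * MATRIX_BYTES

def cpu_transformed_bytes(verts, instances, bones, vert_bytes, driver_copies):
    """Same as above, plus world-space copies and the driver's dynamic buffering."""
    world_data = instances * verts * vert_bytes
    driver_buffers = instances * verts * vert_bytes * driver_copies
    return conventional_bytes(verts, instances, bones, vert_bytes) + world_data + driver_buffers

# 100 animated characters, 100 bones each, one 1M-vertex model.
conventional = conventional_bytes(1_000_000, 100, 100, vert_bytes=32)
cpu_transformed = cpu_transformed_bytes(1_000_000, 100, 100, vert_bytes=32, driver_copies=2)
```

Under these assumptions the conventional layout needs about 33 MB, while the world-space layout needs several gigabytes, because the vertex count is multiplied by the instance count.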

Edited by Hodgman, 14 June 2012 - 11:47 PM.

### #48zacaj  Members   -  Reputation: 643

Posted 14 June 2012 - 10:23 PM

So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 get uploaded to the GPU in my proposed technique. That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects),

Yeah, that makes no sense. First, if you want to maximize shaders and whatnot, nobody wants to upload vertices each frame. You're trading the cost of sending a matrix against the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, or anisotropic filtering?

I'm pretty sure I had a discussion on here before and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because you might say stuff is fast, so theoretically you can do whatever you want, but if you are rendering in 1080p with some cranked-up effects, then you are missing performance. You also imply that calling glDraw.... for each object is bad. Well, it's not so bad if the GPU is queued up for work, because it's probably already so far behind in draw commands that your new ones are not even slowing it down.

How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way, the only speed difference is in upload time and call time

### #49dpadam450  Members   -  Reputation: 862

Posted 14 June 2012 - 10:33 PM

Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforming and shading. What I'm getting at is that his stuff might work for a scene with low-poly objects, low resolution and low effects.

### #50Hodgman  Moderators   -  Reputation: 28587

Posted 14 June 2012 - 10:35 PM

How is the resolution you're rendering at affected AT ALL by the number of vertices you're uploading/saving? You're drawing the same amount either way, the only speed difference is in upload time and call time

This is completely off-topic, but... as you lower your resolution, you should also lower the LOD level of your detailed models (lower number of triangles, larger size each) -- the GPU performs pixel-shading in 2x2 pixel quads, triangles smaller than 2x2 pixels still effectively run the pixel shader 4 times for the whole 2x2 quad. So for example, if you are using single-pixel-sized triangles, then your pixel-shaders all suddenly become 4x more expensive. I've seen a huge speed-up in pixel-shading times before by simply adding LOD support to a game.
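
The 2x2 quad cost can be sketched with a few lines of Python. This is a simplified model assumed for illustration: it considers the pixels covered by a single triangle and charges four pixel-shader invocations for every 2x2 quad that triangle touches. The `shader_invocations` name is hypothetical.

```python
def shader_invocations(covered_pixels):
    """Pixel-shader invocations for one triangle when shading runs on whole 2x2 quads."""
    quads = {(x // 2, y // 2) for x, y in covered_pixels}
    return 4 * len(quads)

# A single-pixel triangle still pays for a full quad.
tiny_triangle = shader_invocations([(5, 3)])

# A triangle covering an aligned 4x4 block of pixels wastes nothing.
block = [(x, y) for x in range(4) for y in range(4)]
large_triangle = shader_invocations(block)
```

In this model the single-pixel triangle costs four invocations for one visible pixel, which is the 4x overhead described above.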

Your GPU is running pixel shaders on more pixels. Do you want it to spend more time shading those, or taking a piss while you upload a ton of vertices before it can even start to draw them? By leaving the data on the GPU, it is immediately able to start transforming and shading. What I'm getting at is that his stuff might work for a scene with low-poly objects, low resolution and low effects.

A high-spec PC won't block on uploads like this usually, it will instead just add latency by increasing the number of frames it buffers commands before execution (which results in input-lag, higher memory requirements).

Edited by Hodgman, 14 June 2012 - 10:36 PM.

### #51dpadam450  Members   -  Reputation: 862

Posted 14 June 2012 - 10:53 PM

Are you suggesting that if I have a single VBO and say triple buffered rendering, that if I update that VBO on all 3 of those frames that it would buffer that VBO into temporary arrays? I guess it would have to store the new frame-based VBO into temporary buffers. Or is it just going to overwrite the single VBO memory with the most recent buffered command. I would assume texture/vbo uploads are not put into the command buffer and take place immediately.

As for blocking, if I'm only using double buffering, then all my commands are being drawn to the current back buffer, so if I want to send 50K verts at the same time the GPU has finished its current commands (going idle), I'd rather just say "Draw 50K model" than send 50K verts and then, after all verts have made it, tell it to draw them. In that time the GPU again has a ton to do. What I was getting at with resolution is this: boost the resolution up to HD and you give the GPU even more hard work to do while at the same time stalling it, and your framerate goes down quickly. If you upgrade to next-gen quality models at HD, then you need to help your GPU do its job faster, not waste time sending thousands (maybe even 100K) vertices. It is a complete waste; in that time I could have processed full-scene SSAO or something.

I'm just pretty sure this guy is the one that says "throw the entire scene (static objects only) into 1 giant VBO and draw it all the time; you don't need to cull because it's faster to use 1 draw call than 100." Realistically, when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs 100 or even 1000 is negligible. 1000 draw calls is almost nothing. Collapsing that to 1 does not give much faster performance; in fact, I bet the difference is unmeasurable in most cases.

Edited by dpadam450, 14 June 2012 - 11:03 PM.

### #52Hodgman  Moderators   -  Reputation: 28587

Posted 14 June 2012 - 11:23 PM

Are you suggesting that if I have a single VBO and say triple buffered rendering, that if I update that VBO on all 3 of those frames that it would buffer that VBO into temporary arrays? I guess it would have to store the new frame-based VBO into temporary buffers. Or is it just going to overwrite the single VBO memory with the most recent buffered command. I would assume texture/vbo uploads are not put into the command buffer and take place immediately.

If you're updating and rendering from a VBO each frame, the driver will probably n-buffer it depending on how much latency is in the command stream.
In the general case, when using some bound data (like a VBO, via a command to the GPU), resource fences are inserted by the driver after the draw-call. These are commands in the command stream that quickly write back to the CPU driver, letting it know that the command using that VBO has been completed (so now the VBO's VRAM can be re-used). If you map/lock a resource, the driver will check its fences to see if that resource is still waiting to be consumed by the GPU, and if so, will have to allocate another copy to map/return to you.
For example, if you're locking a VBO every frame, and there's 2.5 frames of latency between the two processors, then data that's written on Frame#1 might not be consumed until the CPU is up to Frame#4! Uploading more data per frame will likely increase the latency, which increases the amount of buffering RAM required... In a good case, there's only 1 frame of latency so only double buffering is required for dynamic buffers. However, for all we know, the driver might be very optimised for dynamically giving out buffers from a pool like this...
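
The relationship between latency and buffer copies can be illustrated with a toy simulation in Python. Everything here is an assumption made for the sketch, not a description of any real driver: one buffer is mapped per CPU frame, and a buffer written during frame `f` is taken to be released by its fence at time `f + latency_frames + 1`. The `buffers_needed` name is hypothetical.

```python
def buffers_needed(latency_frames, frames=16):
    """Count the copies a pool ends up allocating when one buffer is mapped per frame."""
    pool = []  # for each allocated copy: the time its fence signals
    for frame in range(frames):
        for i, fence_time in enumerate(pool):
            if fence_time <= frame:
                # Fence has signalled: the GPU is done, so this copy can be reused.
                pool[i] = frame + latency_frames + 1
                break
        else:
            # Every existing copy is still pending on the GPU: allocate another.
            pool.append(frame + latency_frames + 1)
    return len(pool)

one_frame_latency = buffers_needed(1.0)
two_and_a_half_frames = buffers_needed(2.5)
```

In this model one frame of latency settles at two copies (double buffering), and more latency means more copies held in memory at once.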

As for blocking, if I'm only using double buffering, then all my commands are being drawn to the current back buffer...

The internal buffering of GPU commands is different to back/front buffering -- the mechanisms for single/double/triple-buffered flipping are implemented as commands in the command-stream just like everything else. The driver will determine the amount of latency to buffer commands for based on the conditions of your app. If you've got a lot of upload traffic, it's probably going to have to cover for that with latency automatically.

Realistically when does anyone ever draw more than 1,000 objects on screen at once anyway? If you cull, then 1 draw call vs 100 or even 1000 is so negligible. 1000 draw calls is almost nothing. Collapsing that to 1 is not very much faster performance, in fact I bet in most cases unmeasurable.

On my last game, we had about 2000 D3D9 draw-calls per frame (plus required state-changes per draw-call), which cost less than 1ms of CPU time in our optimised renderer.

Edited by Hodgman, 14 June 2012 - 11:41 PM.

### #53dpadam450  Members   -  Reputation: 862

Posted 14 June 2012 - 11:53 PM

The diagram was basically what I was mentioning. But if the VBO is set to STATIC_DRAW and never updated, I would have to assume that there wouldn't be those copies lying around. A simple test would be to fill a command buffer with glBindBuffer/glBufferData calls over and over again and see if VBO memory goes up.

What I was also getting at is: what if the command buffer hits the end? glSwapBuffers obviously takes care of some syncing, because the command buffer isn't going to just fill up with 800 frames to render while still working on frame 500. So the command buffer has to be limited by frame and, once it's too far ahead, wait for a previous frame to finish. But with double buffering, the commands it is receiving are the ones that it's getting ready to put up after glSwapBuffers, so at some point the GPU can catch up and stall with no commands left.

### #54Hodgman  Moderators   -  Reputation: 28587

Posted 15 June 2012 - 12:23 AM

Yes the driver will choose a different allocation strategy based on the flags/hints you give it. Static ones aren't likely to suffer this memory-bloat penalty, but might do something worse if used in a dynamic manner, like block if used by both processors in nearby/alternate frames.

The command buffer is probably implemented kind of like a ring buffer, so either the GPU can catch up to the CPU-flush marker, or the CPU can catch up to the GPU-read marker. N.B. the flush marker is somewhere between the read/write markers as below. The driver will periodically move the flush marker up to the write marker.

[][][][][][][]foobargarbage
^       ^     ^
|       |     CPU Write cursor
|       CPU Flush marker
GPU Read cursor
The CPU can fill up the command buffer during any API call, which will either block, allocate more memory for the ring, or send the command to a background thread for processing. During glSwapBuffers/etc, to keep the CPU from getting too far ahead (to the point where it will potentially use up all command memory), the driver can sync it to a particular frame/vblank by waiting for a fence object in the command stream again, which it placed after a particular flip command. After the GPU flips, it processes the fence, which lets the driver know that this particular frame has been flipped. By waiting on different fences, the driver can keep the CPU from getting further than 0,1,2,etc... frames ahead. If we then allocate enough ring memory to cover 1,2,3,etc... frames worth of commands, then the CPU running out of buffer memory becomes an exceptional event, which you can debug.
The opposite shouldn't happen, where the GPU catches up to the CPU. If it does, then either (1) you need to be more efficient - 3ms of CPU GL calls can easily produce 33ms of GPU work to keep it busy, or (2) some other part of your game is hogging the CPU too much and you need to make sure the renderer runs reliably once every 33ms/etc...
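
The three cursors in the diagram can be modelled with a small Python class. This is a toy sketch under stated assumptions, not how any particular driver works: the cursors are ever-increasing counters wrapped with a modulo, a full ring simply rejects the push (where a real driver might block or grow the ring), and the `CommandRing` name and its methods are hypothetical.

```python
class CommandRing:
    """Toy ring buffer with GPU read, CPU flush, and CPU write cursors."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.read = 0    # next command the GPU will consume
        self.flush = 0   # commands before this marker are visible to the GPU
        self.write = 0   # next slot the CPU will fill

    def push(self, command):
        """CPU side: returns False when the CPU has caught up to the GPU read cursor."""
        if self.write - self.read == len(self.slots):
            return False
        self.slots[self.write % len(self.slots)] = command
        self.write += 1
        return True

    def kick(self):
        """Driver side: move the flush marker up to the write cursor."""
        self.flush = self.write

    def consume(self):
        """GPU side: returns None when the GPU has caught up to the flush marker."""
        if self.read == self.flush:
            return None
        command = self.slots[self.read % len(self.slots)]
        self.read += 1
        return command
```

The two `None`/`False` results correspond to the two stall cases described above: the GPU starving for commands, and the CPU running out of ring memory.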

Edited by Hodgman, 15 June 2012 - 12:53 AM.

### #55zacaj  Members   -  Reputation: 643

Posted 15 June 2012 - 08:50 AM

What I'm getting at is that his stuff might work for a scene with low-poly objects, low resolution and low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"

### #56mhagain  Crossbones+   -  Reputation: 7595

Posted 15 June 2012 - 06:33 PM

So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - but yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, but yet you're also flat-out refusing to use them. You have a VBO solution that involves needing to re-up verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real, this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?

### #57kunos  Crossbones+   -  Reputation: 2203

Posted 15 June 2012 - 10:47 PM

Am I missing anything here?

What's missing is what people who are serious about their claims usually produce: proof.
I have exactly the same feeling that maxgpgpu is completely missing the point of having a GPU in the first place, and I am surprised that, in 3 pages, nobody has actually called on him to show what he has accomplished with his "peculiar" approach.

I don't understand how it is possible to be serious and propose uploading an entire CPU-transformed VB to avoid setting a cbuffer with some values. The guy is missing the entire point of why vertex shaders, geometry shaders and instancing exist in the first place. Looks like plain old trolling to me... at least until we see a moving demo of this ahemmm weird approach in action. But every dev with some real experience knows this isn't going to happen.
Stefano Casillo
AssettoCorsa - netKar PRO - Kunos Simulazioni

### #58maxgpgpu  Crossbones+   -  Reputation: 279

Posted 15 June 2012 - 11:43 PM

So on a typical frame, somewhere between 0 and 10~20 objects out of 10,000 objects get uploaded to the GPU in my proposed technique. That's clearly more overhead uploading than the conventional way, but then again the conventional way must perform massively more transfers of matrices to the GPU (one per object versus one for all objects),

Yea, that makes no sense. First, if you want to maximize shaders and whatnot, then nobody wants to upload vertices each frame. You're trading the cost of sending a matrix for the cost of sending, say, 1,000 vertices per object (for moving objects, which you say may be 10 per frame)? Really? Are you doing any AA, SSAO, anisotropic filtering?

I'm pretty sure I had a discussion on here before and it must have been you. Are you putting your whole scene into 1 VBO and calling draw on it? Do you perform any culling? If not, you are the dude I am thinking of, and again I will say you are completely wrong. I want to see some snapshots, because yes, stuff is fast, so theoretically you can do whatever you want, but if you are rendering in 1080p with some cranked-up effects, then you are missing performance. You also imply that calling glDraw... for each object is bad. Well, it's not so bad if the GPU is queued up with work, because it's probably already so far behind on draw commands that your new ones aren't even slowing it down.

No, I have many VBOs. Each contains the objects in a specific volume of 3D space. For the "conventional way" (excluding instancing for the moment), you set the transformation matrix and call glDrawRangeElements() or equivalent once per object. So it is completely convenient to test the object AABB (or OOBB) against the frustum, and draw only if they intersect. In my scheme, I usually draw the entire VBO, so I test the entire 3D volume assigned to the VBO against the frustum and draw --- or not draw --- ALL objects in the VBO.

So yeah, I cull. I just don't cull as fine grain as the conventional way. That does mean I render some objects outside the frustum... just not very many.
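The coarse per-VBO test described above can be sketched roughly as follows. This is only an illustration, not the poster's actual code: the `Vec3`, `Plane`, `AABB` types and the `aabbIntersectsFrustum` function are hypothetical names, and the frustum is assumed to be given as six planes with normals pointing inward. The same function works whether the box bounds one object (the conventional, fine-grained cull) or the whole volume assigned to a VBO (the coarse cull).

```cpp
#include <array>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with n pointing INTO the frustum
// (an assumption of this sketch).
struct Plane { Vec3 n; float d; };

// Axis-aligned box; for the coarse scheme this bounds the whole VBO volume.
struct AABB { Vec3 lo, hi; };

// Returns false only when the box lies completely outside at least one plane.
// The test is conservative: a box just outside a frustum corner may still be
// reported as visible, which costs some wasted drawing but never drops geometry.
bool aabbIntersectsFrustum(const AABB& box, const std::array<Plane, 6>& frustum) {
    for (const Plane& p : frustum) {
        // Pick the box corner that lies farthest along the plane normal.
        const Vec3 v{ p.n.x >= 0.0f ? box.hi.x : box.lo.x,
                      p.n.y >= 0.0f ? box.hi.y : box.lo.y,
                      p.n.z >= 0.0f ? box.hi.z : box.lo.z };
        // If even that corner is behind the plane, the whole box is outside.
        if (p.n.x * v.x + p.n.y * v.y + p.n.z * v.z + p.d < 0.0f) {
            return false;
        }
    }
    return true;
}
```

With one box per VBO, a single call decides whether every object in that VBO is drawn or skipped, which is where the "some objects outside the frustum still get rendered" cost comes from.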

So, if no objects rotate or move, no objects are uploaded to the GPU.

Furthermore, if you are the one without culling and 1 VBO for the whole scene, then you're telling me you agree with wasting the GPU's time drawing, say, a 10K-poly statue model (or any model for that matter) even when you can't see it, just to save a tiny draw call. Not to mention that for indoor scenes you see 2% of the world and outdoors you only see 1/2 of the world.

Clarify whether you are the guy I'm thinking of, and if so I want to see pics. The trees in my game are like 20K polys each. I could get away with your method if they were like 100 polys each, but I want my stuff to look good, which means tons more polys, and that means you will never achieve 1 VBO with 20K-poly trees x, say, 100 trees.

No, I do cull, just not as fine-grain as others. So yeah, I lose a little there, but not that much.

### #59maxgpgpu  Crossbones+   -  Reputation: 279

Posted 15 June 2012 - 11:55 PM

So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves re-upping verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real, this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?

Yes. The most important thing you're missing is this. I'm here for brainstorming. If that means I learn 100x more from others than they learn from me... that's great for me! And I'm happy. What I'm not here for is to prove anything to anyone. I couldn't care less about that. But apparently you and others think my purpose here is to convince you that I'm a genius and I've figured out the greatest idea since sliced bread. Hell no! My whole purpose here is to try to make sure I'm headed down the right path, learn other perspectives, hear new ideas (where "new" means "new to me"). Maybe I'll have to run my own benchmarks and find out for myself. Or maybe someone will say something that rings a bell, and I can tell what's better (for situations x or y or z) without implementing and benchmarking everything myself.

What else you seem to be missing... maybe... is that neither of us is [fully] understanding the other's points. In my case, I am aware of some of the points people raise. But often those points assume some variation of the conventional way, which doesn't apply given the alternative I propose. And often it seems like others are aware of NONE of my points, and are purposely ignoring them. I'll write that off to not wanting to read all my messages. I understand that. But also, sometimes I say "x is good" for something or in some cases, and people go apeshit and post in reply that "he says we should do everything the x way". I never said that, but that is, for some reason, what people take away. If I try to over-qualify everything, people will hassle me about "writing too much". Believe me, I've been beaten to death for that too. My favorite is when someone falsely claims I said something I never said, then others pile on without ever finding where I said that --- which I never did. That's a lot of fun, and a waste of everyone's time.

Edited by maxgpgpu, 16 June 2012 - 03:20 AM.

### #60maxgpgpu  Crossbones+   -  Reputation: 279

Posted 16 June 2012 - 12:08 AM

What I'm getting at is that his stuff might work for a scene with low-poly objects, low resolution and low effects.

"If you lower the amount of processing in the bottleneck, it will work better!"

That is a universal truth. In my approach, it is a bit more complex than the conventional approach.

In the normal case, where few objects rotate or move each frame, the bottleneck is inherently the GPU, because the CPU has little to do.

In one abnormal case, where most objects rotate or move each frame, the bottleneck can become the CPU, for the reasons people here are screaming bloody murder about. That is, the CPU must transform the vertices of all rotated/moved objects in each batch and transfer them to VBOs in the GPU before it draws the batch. That's why I went to the trouble of writing the fastest 64-bit (and 32-bit) vertex transformation routines I could in SIMD/AVX/FMA4 assembly language.
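For readers trying to picture the CPU-side step, here is a minimal scalar sketch of it, under stated assumptions. It is not the SIMD/AVX/FMA4 assembly mentioned above, and every name in it (`Vec3`, `Mat4`, `Object`, `transformPoint`, `updateBatch`, the `dirty` flag) is hypothetical. It assumes positions only, a column-major local-to-world matrix, and a CPU staging array that mirrors the batch's VBO; the actual upload of each changed range (for example with `glBufferSubData`) is left as a comment because it needs a live GL context.

```cpp
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Column-major 4x4 matrix, the layout OpenGL conventionally uses.
struct Mat4 { float m[16]; };

// Transform a point (w = 1) by an affine matrix.
Vec3 transformPoint(const Mat4& M, const Vec3& p) {
    return { M.m[0] * p.x + M.m[4] * p.y + M.m[8]  * p.z + M.m[12],
             M.m[1] * p.x + M.m[5] * p.y + M.m[9]  * p.z + M.m[13],
             M.m[2] * p.x + M.m[6] * p.y + M.m[10] * p.z + M.m[14] };
}

struct Object {
    std::vector<Vec3> localVerts;  // vertices in local coordinates
    Mat4 localToWorld;             // current transform
    std::size_t firstVertex;       // where this object starts in the batch
    bool dirty;                    // true if it rotated or moved this frame
};

// Re-transform only the objects that changed and write them into the
// staging copy of the batch. Returns how many vertices were touched,
// which is the per-frame CPU and upload cost of this scheme.
std::size_t updateBatch(std::vector<Object>& objects, std::vector<Vec3>& worldVerts) {
    std::size_t transformed = 0;
    for (Object& obj : objects) {
        if (!obj.dirty) continue;  // unchanged objects cost nothing
        for (std::size_t i = 0; i < obj.localVerts.size(); ++i) {
            worldVerts[obj.firstVertex + i] =
                transformPoint(obj.localToWorld, obj.localVerts[i]);
        }
        transformed += obj.localVerts.size();
        obj.dirty = false;
        // A real renderer would now upload this vertex range to the VBO.
    }
    return transformed;
}
```

The return value makes the trade-off in this thread concrete: when nothing moves it is zero, and when everything moves it equals the full vertex count of the batch, which is the worst case the other posters are objecting to.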
