most efficient general rendering strategies for new GPUs


So, maxgpgpu, let me get this straight. You have the most powerful hardware on the planet available to you - your GPU - yet you're flat-out refusing to use it for the kind of processing it's best at. You have well-known, tried-and-tested solutions for collision and bboxes that have been proven to work for over a decade, yet you're also flat-out refusing to use them. You have a VBO solution that involves re-uploading verts to arbitrary positions in the buffer in a non-predictable manner - congratulations, you've just re-invented the worst-case scenario for VBO usage - you really should profile for pipeline stalls sometime real soon.

None of this is theoretical fairyland stuff. This is all Real, this is all used in Real programs that Real people use every hour of every day of every week. You, on the other hand, have a whole bunch of theoretical fairyland stuff. It's not the case that you're a visionary whose ideas are too radical for the conservative majority to accept. It is the case that your ideas are insane.

Am I missing anything here?

Look, I don't know everything. I might be making mistakes, maybe serious ones. However, I do have reasons for every choice I make. Good reasons? Hopefully most of the time. But maybe not, and I'm totally open to that; my reason for posting this thread was to scare up ideas that help me brainstorm. I can say one thing: in normal cases the engine is plenty fast. As I explained in this thread, for those cases my approach is inherently GPU limited (or more exactly, limited by neither the CPU nor the GPU).

And hey, I love having the GPU do instancing. I never criticized instancing; I know that, because I never would. I also love having the GPU perform bumpmapping and parallax-mapping via relaxed cone-step mapping with self-shadowing to generate tons of fine detail without needing zillions of micro-triangles.

You talk about pipeline stalls as if I never thought about that issue (one of many issues that interact, unfortunately). Yet my approach draws many objects with unlimited vertices in a single glDrawElements() call - and *I* don't care about pipeline stalls? The conventional way only draws one object per draw call because it needs to change transformation matrices (and sometimes other state) between objects. To be honest, if I have one worry about my approach, it is that I pay too much attention to avoiding pipeline stalls.

with unlimited vertices


ah.. now I know who you really are :D

Stefano Casillo
TWITTER: [twitter]KunosStefano[/twitter]
AssettoCorsa - netKar PRO - Kunos Simulazioni


[quote name='maxgpgpu' timestamp='1339696164' post='4949215']
See, this is the problem. And frankly, I can't blame you, because --- as I'm sure you're aware --- once any of us adopts a significantly non-standard approach, that radically changes the nature of many other interactions and tradeoffs. Since you and virtually everyone else have so thoroughly internalized the tradeoffs and connections that exist in more conventional approaches, you don't (because you can't, without extraordinary time and effort) see how the change I propose impacts everything else.

No, it doesn't take that much effort to evaluate other approaches -- there is no "conventional" approach; we're always looking for alternatives. I've used methods like this on the Wii, where optimised CPU transform routines were better than its GPU transforms (and also on the PS3, which has way more CPU power than GPU power), but without knowing the details of the rest of those specific projects, we can't judge here the cost/value of those decisions.[/quote]
Wow! First of all, thanks for your fabulous post. And let me correct myself and say that some people responding, but not you, appear to not be taking the time to consider the consequences of the approach I propose.

The problem is that your alternative isn't as great or original as you think it is.[/quote]
If I thought it was, I'd probably decide to keep it proprietary for as long as I could. In fact, I always knew it adopted some "retrograde" elements. I think that's what sets some people off before they think everything through. I don't say that because "I'm necessarily right" about anything. I say that because of the rude and thoughtless way some people respond.

However, if we consider two extremes -- complete CPU vertex processing vs entirely GPU driven animation/instancing of vertices -- then you just have to consider that there is a continuous and interesting middle ground between the two extremes at all times. Depending on your target platforms, you find different kinds of scenes/objects are faster one way or the other. Different games also have different needs for the CPU/GPU of their own -- some need all the CPU time they can get, and some leave the CPU idle as the GPU bottle-necks. Our job should be to always consider this entire spectrum, depending on the current situation/requirements.[/quote]
I agree with you. I certainly hope nobody was thinking I want to return to the days of a frame buffer in CPU memory, with everything done by the CPU! Ouch! When I started this thread, I tried to limit the context to next-generation CPUs and next-generation GPUs on a PC, with the minimum bar set at an 8-core CPU (say, FX8150) and a GTX-580 ~ GTX-680 GPU. Of course this context tends to work against my proposed approach, given the spectrum of possibilities you mention. I also tried to limit the context to a general-purpose game/simulation engine, which means it is stuck trying to efficiently handle a huge spectrum of games and simulations with a huge spectrum of requirements.

One reason for the approach I mentioned is my observation that most games and simulations contain 99% to 99.99% fixed objects (meaning, they do not rotate or move during a given frame). Here I'm not counting instanced objects like leaves and blades-of-grass blowing in the wind, which clearly are best handled with GPU instancing support. On these non-moving objects, I believe my approach is faster. This forced me to include separate support for "large moving objects", which are handled in the more conventional way (local-coordinates in GPU). I've mentioned all these things, but it doesn't seem to matter. It seems I offended the mainstream religion.

I understand why this seems counterintuitive. Once you adopt the "conventional way" (meaning, you store local-coordinates in GPU memory), you are necessarily stuck changing state between every object. You cannot avoid it. At the very least you must change the transformation matrix to transform that object from its local coordinates to screen coordinates.[/quote]
No, this is utter bullshit.
You haven't even understood "conventional" methods yet, and you're justifying your clever avoidance of them for these straw-man reasons? Decent multi-key sorting plus the simple state-cache on page#1 removes the majority of state changes, and is easy to implement. I use a different approach, but it still just boils down to sorting and filtering, which can be optimised greatly if you care.[/quote]
Okay, maybe I don't correctly understand what is conventional to you or others here. I don't entirely understand what you're saying here, but does what you describe eliminate the need to supply a separate transformation matrix to the GPU for every object? Or are you putting a whole slew of transformation matrices in a uniform buffer object (or texture, or something)... and somehow arranging things so the shader programs know which matrix to access for each vertex? If so, that's one way to address the problem my approach also addresses.
Moreover, there's no reason that you have to have "the transform matrix" -- it's not the fixed function pipeline any more. Many static objects don't have one, and many dynamic objects have 100 of them. Objects that use a shared storage technique between them can be batched together. There's no reason I couldn't take a similar approach to your all-verts-together technique, but instead store all-transforms-together.[/quote]
That seems wise. Is that conventional practice now? Of course that still leaves some of us with a difference of opinion about where collision-detection and collision-response/physics should happen, since many/most/all people here insist physics must be done without vertices (much less in a consistent inertial coordinate-system like world-coordinates).
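
For concreteness, here is a minimal sketch of what "all-transforms-together" can look like on the GL side: pack one world matrix per object into a uniform buffer, bind it once for the batch, and let the vertex shader pick its matrix by index (gl_InstanceID, or a per-vertex/per-instance attribute). This is purely illustrative and not code from anyone's engine here; the binding point and array size are assumptions.

[code]
#include <GL/glew.h>
#include <vector>

struct Mat4 { float m[16]; };   // column-major 4x4, matches a std140 mat4

// Upload one world matrix per object; the shader declares something like
//   layout(std140) uniform Transforms { mat4 world[256]; };
// and indexes it with gl_InstanceID or a per-vertex "transformIndex".
GLuint CreateTransformUBO(const std::vector<Mat4>& transforms)
{
    GLuint ubo = 0;
    glGenBuffers(1, &ubo);
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferData(GL_UNIFORM_BUFFER,
                 transforms.size() * sizeof(Mat4),
                 transforms.data(),
                 GL_DYNAMIC_DRAW);               // refreshed when objects move
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo); // binding point 0 (assumed)
    return ubo;
}
[/code]

With that in place, a whole batch of independently-transformed objects can go out in one draw call without touching per-object uniforms between them.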

However, your argument more-or-less presumes lots of state-changes, and correctly states that with modern GPUs and modern drivers, it is a fool's game to attempt to predict much about how to optimize for state changes.

However, it is still true that a scheme that HAS [almost] NO STATE CHANGES is significantly ahead of those that do. Unfortunately, it is a fool's game to attempt to predict how far ahead for every combination of GPU and driver!

I don't agree with how you've characterised that. I meant to imply that state-change costs are unintuitive, but that they are still something that can be measured. If it can be measured, it can be optimised. If it can't be measured, then your optimisations are just voodoo.

Also, just because you've got 0 state changes, that in no way grants automatically better performance. Perhaps the 1000 state-changes that you avoided wouldn't have been a bottle-neck anyway; they might have been free in the GPU pipeline and only taken 0.1ms of CPU time to submit? Maybe the optimisation you've made to avoid them is 1000x slower? Have you measured both techniques with a set of different scenes?[/quote]
Yeah, I did benchmarks to test some alternatives, though I'm sure you understand there are many cases and "the answer" never applies to all of them. Furthermore, I hadn't implemented or handled every feature at the start, so those benchmarks could only be guidelines. One benchmark I performed was to figure out at what point to consider objects "large moving objects" (how large, and how often they have moved recently), which makes them get handled differently (local-coordinates in GPU). Will that point change over time? Most certainly. Will it vanish? That's more an architecture question, which is what I was trying to stimulate.
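
For what it's worth, the GPU half of such benchmarks can be measured directly with timer queries rather than inferred from frame times; a minimal sketch (GL 3.3+, no error handling, single query reused each run):

[code]
#include <GL/glew.h>
#include <cstdio>

static GLuint gTimerQuery = 0;

void BeginGpuTimer()
{
    if (gTimerQuery == 0)
        glGenQueries(1, &gTimerQuery);
    glBeginQuery(GL_TIME_ELAPSED, gTimerQuery);
}

void EndGpuTimer()
{
    glEndQuery(GL_TIME_ELAPSED);
    GLuint64 ns = 0;
    // Blocks until the GPU reaches the query; fine for offline experiments.
    glGetQueryObjectui64v(gTimerQuery, GL_QUERY_RESULT, &ns);
    printf("GPU time: %.3f ms\n", ns / 1.0e6);
}
[/code]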

Certain facts simply exist. A modern GPU has about 512~1024 cores. If the engine is rendering a bunch of small objects, or far-away objects (meaning "few vertices" rendered per object), and a new transformation matrix gets uploaded between each object, then each object might not even keep all 512 cores busy for an instant. That's gotta put a crimp in bandwidth. The GPU finishes the object in "no time", then all 512 cores are stalled while the CPU uploads a new transformation matrix [and maybe other state]. That's precisely what doesn't happen in the alternate approach, where dozens to hundreds of objects get rendered by a single glDrawElements(). However, if you have alternate ways to assure the GPU can keep on trucking without [many] breaks, that's great. Maybe I haven't given enough attention to figuring out those alternative ways, but I don't see why that should stimulate so much vitriol.


I really don't know what you mean when you say BattleField 3 instances everything. The only thing I can figure is... there are ZERO objects that only exist once. If you have a [Soldier0] object, then you have several instances of [Soldier0]... only with the kind of variations from one displayed-instance to another that instancing provides. If that's what you mean, then I understand what you mean. But that sounds like a mighty strange game! Having said that, I can certainly imagine lots of games where a large majority of displayed objects are instanced... blades of grass, trees, leaves, bricks, stepping stones, and so forth. Maybe you don't mean "everything" literally.

Maybe you don't want to waste VBO memory on storing a unique mesh for every "Soldier" in the game, so you only store a single (local space) "base mesh" that's kind of an average soldier. Your art team still models 8 unique soldiers that the game requires, as usual, but they are remapped as deltas from the base mesh. You then store your 1 base-mesh and 8 delta/morph-meshes together, and then instance 100 characters using the 1 base-mesh, each of which will be drawn as one of the 8 variations as required/indexed.[/quote]
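
As an illustration of the base-mesh-plus-deltas idea, the per-instance plumbing on the GL side might look roughly like the following; the attribute slot, the idea of the shader indexing a separate delta buffer by morphIndex, and the function name are all assumptions for the sketch:

[code]
#include <GL/glew.h>
#include <vector>
#include <cstdint>

// Give every instance a "which of the 8 delta meshes am I" index.
// The vertex shader adds delta[morphIndex] to the shared base-mesh vertex.
void SetupInstanceVariations(GLuint vao,
                             const std::vector<uint32_t>& morphIndexPerInstance)
{
    glBindVertexArray(vao);

    GLuint instanceVbo = 0;
    glGenBuffers(1, &instanceVbo);
    glBindBuffer(GL_ARRAY_BUFFER, instanceVbo);
    glBufferData(GL_ARRAY_BUFFER,
                 morphIndexPerInstance.size() * sizeof(uint32_t),
                 morphIndexPerInstance.data(), GL_STATIC_DRAW);

    glEnableVertexAttribArray(5);                       // attribute 5: morphIndex
    glVertexAttribIPointer(5, 1, GL_UNSIGNED_INT, 0, nullptr);
    glVertexAttribDivisor(5, 1);                        // advance per instance

    glBindVertexArray(0);
}

// Then: glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
//                               nullptr, instanceCount);
[/code]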

I never said "no instancing". Where does anyone get that idea? I love instancing! I just never heard of a game or simulation where there were literally no unique objects. But hey, that's just little old me!

In your system, we not only need the original 9 data sets, but also enough memory for the 100 transformed data-sets. We then need enough GPU memory for another ~200 (n-hundred) transformed meshes, because it's dynamic data and the driver is going to want to ~double buffer (n-buffer) it.

I really don't know why I bothered with this, as you're being very dismissive of the real criticism that you solicited, but again, look at your memory trade-offs... The "conventional" requirements look like:
[attachment=9490:1.png]
whereas yours looks like:
[attachment=9491:2.png]

i.e. conventional memory usage
= Models.Verts * sizeof(Vert) //local data
+ Instances * Model.Bones * sizeof(float4x4) //transforms

your memory usage
= Models.Verts * sizeof(vert) //local data
+ Instances * Model.Bones * sizeof(float4x4) //transforms
+ Instances * Models.Verts * sizeof(vert) //world data
+ Instances * Models.Verts * sizeof(vert) * N // Driver dynamic VBO buffering.[/quote]
I'm sorry you read someone else's false claim that my approach prohibits instancing. If you read my posts, you shouldn't get that impression. You would also find out that I render "large moving objects" by putting local-coordinates in the VBOs, because benchmarks (and common sense) show doing so is more efficient. Nonetheless, I'm glad you ran the numbers, because they are useful to stare at and consider.
If you're going next-gen, you're looking at huge object counts and huge vertex counts. In the first equation above, there is no relation between the two -- increasing instances does not impact vertex storage, and increasing vertices does not increase the per-instance storage cost. However, in the latter, they are connected -- increasing the amount of detail in a model also increases your per-instance memory costs. Seeing as modern games are getting more and more dynamic (where more things are moving every frame) and more and more detailed (where both vertex and instance counts are increasing greatly), I'd give your users the option to keep their costs independent of each other.[/quote]
This is indeed the kind of consideration that I think about, especially how a general-purpose engine can (automatically if possible) do the efficient thing with every object. To be sure, objects that exist many times are more efficient with GPU instancing. Hell, even two instances is justification if they contain lots of vertices.

e.g. the example I gave earlier -- of 100 animated characters, with 100 bones each, using a 1M vertex model -- clearly is much more efficient in the "conventional" method than in yours. So if you wanted to support games based around animated characters, you might want to support the option of a different rendering mode for them.
[/quote]
Yes, I need to keep my eyes on these situations. If they're "large moving objects" or "obviously instanceable objects", then I already have special cases for them. What I don't have, but you made me think more about, is in-between cases like making all (or most) of your soldiers (or other objects) from a single template, and performing [potentially tricky] instancing on them. If the environment won't fit in GPU memory, then clearly some forms of instancing approaches must be dreamed up to squeeze more into less GPU memory space. I haven't run into that yet, but that doesn't mean I won't.
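
For anyone wanting to plug concrete numbers into the two memory formulas quoted earlier, a throwaway calculation using the 100-character, 100-bone, 1M-vertex example above (with an assumed 48-byte vertex and double-buffered dynamic VBOs) makes the trade-off concrete:

[code]
#include <cstdio>
#include <cstddef>

int main()
{
    const size_t verts        = 1000000;  // 1M-vertex model from the example
    const size_t instances    = 100;
    const size_t bones        = 100;
    const size_t vertSize     = 48;       // assumed: pos + normal + uv + tangent
    const size_t mat4Size     = 64;
    const size_t driverCopies = 2;        // assumed N for dynamic VBO buffering

    size_t conventional = verts * vertSize                    // local data
                        + instances * bones * mat4Size;       // transforms

    size_t cpuTransformed = conventional
                          + instances * verts * vertSize                  // world data
                          + instances * verts * vertSize * driverCopies;  // driver copies

    printf("conventional   : %zu MB\n", conventional   / (1024 * 1024));
    printf("CPU-transformed: %zu MB\n", cpuTransformed / (1024 * 1024));
    return 0;
}
[/code]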

One useful thing I learned in this thread is how much of an outlier I am when it comes to what I perform collision-detection and collision-response AKA physics upon. In most cases, I perform both processes on the full set of vertices with unique positions. Some people don't perform collision-detection and/or collision-response at all (because their game doesn't require it). Others perform collision-detection and/or collision-response on an AABB or OOBB, because (I guess) that's good enough for their game/application - or because they need their game/application to run on slower hardware and have no choice. Others define a separate vertex cloud strictly for collision-detection and/or collision-response, which they process on the CPU only "as needed" (when their AABB [approximations] overlap in broad-phase tests). Others (not here, but in my reading) have the GPU feed world-coordinates of all objects back to CPU memory (or specify a feedback shader for only those objects that moved).

That's quite a wide range of approaches. Clearly my approach is an outlier compared to the people who posted in this thread, in that I perform collision-detection and collision-response on all the unique vertices in every object --- when called for. I mean, my broad-phase looks for overlap of object AABBs and doesn't look at individual vertices (though 3 lines of assembly-language update the AABB of the object each vertex belongs to, every time the object is transformed because it rotated or moved). But the highly optimized GJK algorithm that powers my narrow-phase collision-detection routine does walk the full set of unique vertices in both objects to detect intersections (or find closest features when no intersection is detected). Making this decision then became an important part of deciding to transform rotated/moved objects on the CPU... so I'd have world-coordinates in CPU memory to perform collision-detection and also collision-response. Clearly I agree that IF I were satisfied with CPU-transformed AABB tests, or CPU-transformed OOBB tests, or CPU-transformed simplified-object tests... I might have made different choices.
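
For readers following along, the broad phase described above boils down to a standard axis-aligned box overlap test per object pair. A minimal C++ sketch of the logic (not the hand-written assembly mentioned in the post):

[code]
struct AABB { float min[3]; float max[3]; };

// Boxes overlap only if their intervals overlap on all three world axes.
bool Overlaps(const AABB& a, const AABB& b)
{
    for (int i = 0; i < 3; ++i)
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i])
            return false;
    return true;
}

// Only pairs whose AABBs overlap get passed to the GJK narrow phase,
// which walks the actual world-space vertices of both objects.
[/code]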

Maybe I'm looking too far ahead. Or maybe the criticism of my choice to perform more detailed/accurate collision-detection [and physics] is correct. I mean, hey! I certainly understand that. I created two commercial video game engines on contract several years ago, and I was forced by the realities of throughput to adopt crude collision-detection strategies. In one the artists had to define a few extremity points that were most likely to intersect other objects. The purist in me hated that... but I could find no way out consistent with the performance I had available. The other engine was a bit better, but nowhere near what I'm planning now.

Who knows, maybe I overshot the mark. I put a lot of effort into writing the fastest possible broad-phase, GJK narrow-phase, and even a final phase for arbitrarily irregular "concave" objects (that only executes if the GJK hulls overlap, and only on objects known to be concave (or not known to be convex)). Maybe all that effort was wasted, because that won't be practical on top-end CPU/GPU combinations for another 10 years.

I hope that's not the case. So far, it doesn't appear to be the case, even in tests with many simultaneously rotating and moving moderately high-vertex-count objects. I don't have physics done yet, so I just paint the individual intersecting triangles green as they pass through each other. The frame rate doesn't slow down until my test gets "rather absurd", though a room chock full of bouncing soccer balls might slow things down eventually when I'm forcing it to always run the concave-object phase too (whether the object is concave or not). Even so, I have more room to improve the concave algorithm based on "closest features" hints from the GJK routine on the same object pair the previous frame. All I need is more time!

[quote name='maxgpgpu' timestamp='1339828421' post='4949721']
with unlimited vertices


ah.. now I know who you really are :D
[/quote]
Right. I sell RAM to GPU makers! :-)

Okay, unlimited except by available GPU memory and 32-bit indices... though I admit to implementing all batches with 16-bit indices except when a single object has more than 65536 vertices.

Okay, unlimited except by available GPU memory and 32-bit indices... though I admit to implementing all batches with 16-bit indices except when a single object has more than 65536 vertices.


whatever mate.
My suggestion to you would be to stop writing walls of text on the forums and use the energy to actually profile your code; that'll show you how wrong you are... or, if you really feel you're right, I reiterate the request for a video that showcases what this silly engine of yours is capable of doing... until then, it's all air.

Stefano Casillo
TWITTER: [twitter]KunosStefano[/twitter]
AssettoCorsa - netKar PRO - Kunos Simulazioni


[quote name='maxgpgpu' timestamp='1339836940' post='4949737']
Okay, unlimited except by available GPU memory and 32-bit indices... though I admit to implementing all batches with 16-bit indices except when a single object has more than 65536 vertices.


whatever mate.
My suggestion to you would be to stop writing walls of text on the forums and use the energy to actually profile your code; that'll show you how wrong you are... or, if you really feel you're right, I reiterate the request for a video that showcases what this silly engine of yours is capable of doing... until then, it's all air.
[/quote]
Look, I'm not here to impress anyone. That's not my purpose. But if you want to propose a specific simple situation, I'll try to generate it and see what happens. The engine is optimized for "procedurally generated content", which means objects assembled from full or partial primitive shapes like [partial] disks, [tapered] tubes, [truncated] cones, etc. So if you specify something that's made up of objects that can be built out of such primitives, that's fine. I can import 3DS too, though I haven't tested it thoroughly, so I don't know what percentage of 3DS objects it will puke on. As I also said, physics is not done, but collision-detection is (broad-phase, convex narrow-phase (GJK) and concave narrow-phase). If you want this in order to offer honest opinions, I'm fine with it. If you just want another reason to trash someone, go find someone else.
Just do whatever you think this software of yours is good at. If it's good, fine; if it isn't, we'll lol... as simple as that. Some stats would be nice too, e.g. how many static triangles? How many dynamic triangles? How many skinned triangles? How many materials? How many shaders? How does it compare to a "traditional" approach? If you don't know these answers you really aren't entitled to answer a thread called "most efficient"... at all.

Stefano Casillo
TWITTER: [twitter]KunosStefano[/twitter]
AssettoCorsa - netKar PRO - Kunos Simulazioni

One reason for the approach I mentioned is my observation that most games and simulations contain 99% to 99.99% fixed objects (meaning, they do not rotate or move during a given frame).
You can only make this observation with regard to a specific game/project.

On these non-moving objects, I believe my approach is faster[/quote]No, it depends on the scene. Your set-up would be great for certain types of scenes/games -- the ability for easy dynamic geometry with matching polygon collision/physics could probably be used for some neat game-play ideas...


Let's say we've got 1 unique 1M vert mesh, so instancing/etc are of no use. Your CPU transforming version still allocates 3M+ verts of memory (local, world, VBO[buffers]). That could be a deal-breaker right there. The GPU local version allocates 1M verts of memory.
I only pay the increased RAM cost if I really need complex/unpredictable vertex manipulation, such as: glBegin/glEnd-like APIs (e.g. for script-driven UI code), deformation/displacement effects, complex particle systems, advanced animation, etc...

In my experience, the "conventional approach" for static objects is to pre-transform them to world-space at compile-time, and then use a static VBO. The state-change/batching problems are solved in this compilation step, similar to your version. These particular models wouldn't have a transform matrix.
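
A sketch of that bake step, assuming a simple position-only vertex and a column-major matrix; names are illustrative, not from any particular engine:

[code]
#include <vector>

struct Vec3 { float x, y, z; };
struct Mat4 { float m[16]; };   // column-major, OpenGL convention

Vec3 TransformPoint(const Mat4& m, const Vec3& p)
{
    return {
        m.m[0] * p.x + m.m[4] * p.y + m.m[8]  * p.z + m.m[12],
        m.m[1] * p.x + m.m[5] * p.y + m.m[9]  * p.z + m.m[13],
        m.m[2] * p.x + m.m[6] * p.y + m.m[10] * p.z + m.m[14],
    };
}

// Append one placed object's verts, pre-transformed to world space, into the
// combined vertex array that is later uploaded once with GL_STATIC_DRAW.
void BakeStaticObject(const std::vector<Vec3>& localVerts,
                      const Mat4& localToWorld,
                      std::vector<Vec3>& combinedWorldVerts)
{
    for (const Vec3& v : localVerts)
        combinedWorldVerts.push_back(TransformPoint(localToWorld, v));
}
[/code]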

Anyway, the point is that an engine should allow each game's graphics programmer to use all of these different vertex management techniques in different situations. Real (AAA commercial) game engines don't do everything for you, and leave room for the game team to implement techniques themselves.

For example, your technique *is* a part of the "conventional toolbox"! Many PS3 titles use an implementation just like you've described and I've diagrammed -- where the CPU computes vertex transforms and uploads a dynamic stream of unique verts for each object together -- because in this particular situation, from the Wikipedia specs on the PS3, you can see it's got a boatload of CPU power, but a GPU that sucks at vertex processing (8 vertex pipes @ 550MHz). N.B. not necessarily all objects would use CPU processing - it's only used where the RAM cost can be afforded.

Another example: let's say you were making a game set inside football stadiums. Seeing as the player is always inside the playing-field, from above, the world is like a circle around you, and you're always looking at a pie-wedge of that circle from the inside.
A sports game might tell its engine to pre-transform the stadium/world model into world-space (at compile time), but also to tri-list/strip-ify the model, and also to re-arrange the index buffer so that the indices of strips inside the same 2D pie-wedge appear next to each other. You could then record the indices of where each wedge begins and ends.
To illustrate: at runtime, the game might determine that wedges #2-#5 are visible, so you look up that wedge #2 begins at index #1234 and wedge #5 ends at index #8765, so you issue a single draw-call for indices 1234 to 8765, and you've instantly drawn a good approximation of a frustum-culled version of the whole stadium (the 2D wedges visible).
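
In code, the runtime half of that wedge trick is just one ranged glDrawElements over the pre-sorted index buffer; wedgeStart[] here is an assumed lookup table (one entry per wedge plus a final sentinel) built at compile time:

[code]
#include <GL/glew.h>

// Draw wedges [firstWedge, lastWedge] in a single call. Assumes the stadium's
// index buffer is bound and was sorted so each wedge's indices are contiguous.
void DrawVisibleWedges(const GLuint* wedgeStart, int firstWedge, int lastWedge)
{
    GLuint first = wedgeStart[firstWedge];
    GLuint count = wedgeStart[lastWedge + 1] - first;    // sentinel ends the last wedge

    glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_INT,
                   (const void*)(first * sizeof(GLuint)));  // byte offset into the IBO
}
[/code]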

Different games will take different approaches like this, because optimisations depend on situations.

I've mentioned all these things, but it doesn't seem to matter. It seems I offended the mainstream religion.[/quote]People are snapping at you when you're making dubious or false claims, that's about it.


Certain facts simply exist. A modern GPU has about 512~1024 cores. If the engine is rendering a bunch of small objects, or far-away objects (meaning "few vertices" rendered per object), and a new transformation matrix gets uploaded between each object, then each object might not even keep all 512 cores busy for even an instant. That's gotta put a crimp in bandwidth.[/quote]If you have small batches with fence-inducing state-changes between them, then yes, large parts of your GPU hardware will be idle. Changing a parameter isn't likely to be fence-inducing -- not since the GeForce 7, where changing a parameter could be as bad as changing the entire shader program. So a lot of these concerns are on very shaky ground -- a modern GPU won't stall between draw-calls when there are only trivial state differences between them, and the CPU costs are in fractions of a millisecond.
N.B. the GPU's idea of batches is different to our draw-call commands! Different parts of the GPU pipeline will determine different begin/end markers based on different changes in state or situation (such as buffer status, etc). Where the GPU does decide to begin/end a batch section, it's only a performance issue if you're spamming it with these changes more rapidly than its pipeline can deal with. There are profiling tools to diagnose this.

Also, GPU execution is quite dynamic. E.g. your pixel shader texture fetches take quite a long time to arrive vs the ALU instructions around them, which means many of the ALU ops can stall waiting for a texture input. The GPU can basically break this processing up into "passes" of a shader program, where it can save its progress and switch to some other task (such as other pixels/vertices). So, e.g., when using very texture-fetch-heavy pixel-shaders, you could get some vertex ALU processing "for free" where the GPU fills gaps.

In the case where you are making large state changes, and also rendering many small objects (so that those large changes can't be hidden in the pipeline), then yes, you have to look at ways of merging batches. There's a wide continuum of techniques to choose from, many of which maintain local-space static VBOs.

The GPU finishes the object in "no time", then all 512 cores are stalled while the CPU uploads a new transformation matrix[/quote]No, the GPU and CPU aren't tightly synchronised like that; the draw commands and shader constant bindings are both just bytes that the CPU writes into the GPU's command buffer. Also, as above, the GPU has plenty of other tasks/passes buffered up to throw cores at.
The CPU is usually a whole frame ahead, so when the GPU needs that transformation matrix, the GPU's front-end (the bit reading the command buffer) has already read out the transform and put it in the right cache. These state-changes that you're concerned about are optimised for in the GPU pipeline (as long as you use some general optimisations and keep your artists/tech-artists in line).

That's precisely what doesn't happen in the alternate approach, where dozens to hundreds of objects get rendered by a single glDrawElements(). [/quote]You're right though in that if the GPU stalls because the CPU is stuck uploading data, then that's a really bad thing -- if the GPU is usually a frame behind, then if it's stalled, your performance is out by a whole frame (e.g. 33ms lag behind schedule).

Going back to the PS3, if you wanted to implement software vertex processing in order to make use of that CPU processing power, you'd have to dedicate enough spare VRAM to buffer all your world's verts. On the PS3, this could be an issue, given its tiny amount of RAM -- you might think the PC can ignore this, but as GPUs have gotten more RAM, scene vertex counts have also gone up.
Anyway, so we allocate a ring-buffer to store all the CPU-processed co-ordinates before the GPU consumes them. There's (at least) a frame of latency sending commands to the GPU, so our ring-buffer has to be big enough to store a whole frame's worth of vertices, so we need to specify a worst-case, maximum vertex count. If this count is exceeded, then the ring fills up and the CPU has an occasional 8ms stall for seemingly no reason while you wait for your VBOs to flush through to the GPU! Pushing large amounts of dynamic data not only requires a lot of RAM, but also has a good chance to cause the very stalls you're trying to avoid.
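
A rough sketch of such a ring buffer, using unsynchronised maps so the CPU writes never wait on the GPU; the capacity policy and wrap handling are deliberately simplified assumptions:

[code]
#include <GL/glew.h>
#include <cstring>
#include <cstddef>

struct VertexRing
{
    GLuint vbo      = 0;
    size_t capacity = 0;   // bytes >= worst-case verts/frame * frames in flight
    size_t head     = 0;   // current write offset
};

// Copy CPU-transformed verts into the ring and return the offset to draw from.
size_t RingWrite(VertexRing& ring, const void* data, size_t bytes)
{
    if (ring.head + bytes > ring.capacity)
        ring.head = 0;     // wrap; assumes the GPU is done with this region

    glBindBuffer(GL_ARRAY_BUFFER, ring.vbo);
    void* dst = glMapBufferRange(GL_ARRAY_BUFFER, ring.head, bytes,
                                 GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
    std::memcpy(dst, data, bytes);
    glUnmapBuffer(GL_ARRAY_BUFFER);

    size_t offset = ring.head;
    ring.head += bytes;
    return offset;         // point glVertexAttribPointer at this offset
}
[/code]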

Certain facts simply exist. A modern GPU has about 512~1024 cores. If the engine is rendering a bunch of small objects, or far-away objects (meaning "few vertices" rendered per object), and a new transformation matrix gets uploaded between each object, then each object might not even keep all 512 cores busy for even an instant. That's gotta put a crimp in bandwidth. The GPU finishes the object in "no time", then all 512 cores are stalled while the CPU uploads a new transformation matrix [and maybe other state]. That's precisely what doesn't happen in the alternate approach, where dozens to hundreds of objects get rendered by a single glDrawElements(). However, if you have alternate ways to assure the GPU can keep on trucking without [many] breaks, that's great. Maybe I haven't given enough attention to figuring out those alternative ways, but I don't see what that should stimulate so much vitriol.


The problem is your 'facts' are wrong.
Utterly.

As we are talking about next-generation GPUs, let's take one I know about: the AMD Southern Islands, aka the Radeon HD7000 series.
A 7970, the card I have, has 2048 'cores'. However, when you issue a draw call the GPU doesn't go 'all 2048 threads, do this!'; instead it operates on wavefronts (NV calls them 'warps'; the general term is 'work groups').

AMD breaks the SI core up into 'compute units'.
Each compute unit has four SIMD units in it.
Each SIMD unit can deal with up to 10 wavefronts.
Wavefronts are made up of 64 work items.

So each compute unit can have up to 2560 work items in flight. (And if I'm reading my clinfo output correctly, the HD7970 has 32 such units.)

But each of these doesn't have to come from the same draw call, nor does it have to be the same shader, as each wavefront can have its own instruction buffer and data.

Now add in what Hodgman said about the async nature of the GPU and CPU (by default I believe the drivers can buffer up to 3 frames ahead) and you can see that your outline above simply isn't true - you don't issue a call which takes up the whole device and then wait; you build a command stream which can take up sections of the device, with threads being swapped in and out as resources demand it.
The GPU finishes the object in "no time", then all 512 cores are stalled while the CPU uploads a new transformation matrix

This is the fundamental flaw from which everything else proceeds. Uploading a new transformation matrix doesn't stall the GPU - the CPU just puts a tiny amount of new data into a command buffer, and the GPU gets around to reading it back out in its own sweet time. There is no stall.

Updating a VBO, on the other hand, has the potential to be very, very different - that's why keeping VBOs static is important. You have considerably more data going across, and if the VBO is currently in use for drawing (owing to the 3 frames the CPU is allowed to get ahead) then it will stall. That's why there are approaches for this - such as the well-known discard/no-overwrite pattern - that are designed to work around these stalls and keep both processors moving.
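
For reference, the "discard" half of that pattern in GL terms is usually buffer orphaning; a minimal sketch, not tied to anyone's engine here:

[code]
#include <GL/glew.h>
#include <cstddef>

// Orphan the old storage so queued draws keep reading it, then fill a fresh
// allocation immediately; this avoids the CPU/GPU stall described above.
void DiscardAndRefill(GLuint vbo, const void* verts, size_t bytes)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, bytes, nullptr, GL_DYNAMIC_DRAW);  // discard
    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts);               // refill
}
[/code]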

Which feeds into the final point. There's a reason why conventional approaches are conventional, and that's because they are the approaches that are proven to work.

Direct3D has need of instancing, but we do not. We have plenty of glVertexAttrib calls.

