FantasyVII

Rendering 1,000,000 sprites. I'm stuck profiling

Recommended Posts

9 hours ago, zhangdoa said:

CPU-related questions: What is the cost of MapBuffer() in a Debug build? What difference do the compiler optimization levels make? Do you need to optimize the O(m*n) for-loop? Do you have to submit the data of every single sprite every frame? Could you identify and optimize out any unnecessary temporary variables, like the return value of std::vector<T>::size(), or any unnecessary and expensive copy constructions?

GPU-related questions: What buffer usage pattern (GL_STATIC_DRAW/etc.) did you specify when creating and uploading the vertex buffer? What mapping flags (GL_MAP_PERSISTENT_BIT/etc.) did you specify when mapping the vertex buffer? How do you handle CPU-GPU synchronization between the CPU writes and the GPU reads? Have you implemented any double or triple buffering?

And could you share the blog post you're referencing?


MapBuffer in a Debug build with 1 million sprites is much worse: I get around 4 frames, whereas in Release I get 16 frames.

I have to modify the sprites every single frame, yes. All 1 million of them.

I did optimize out std::vector<T>::size() and a few little things, but in the grand scheme of things they only improved my frame rate by 1 frame.

For my index buffer, I use GL_STATIC_DRAW since I don't need to modify the index buffer after creation at all. I allocate the buffer and fill it up once.

For my vertex buffer, the creation flags are

GLCall(glBufferStorage(GL_ARRAY_BUFFER, size, NULL, GL_DYNAMIC_STORAGE_BIT | GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT));

 

For mapping I use

GLCall(return glMapBufferRange(GL_ARRAY_BUFFER, 0, size, GL_MAP_WRITE_BIT | GL_MAP_READ_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT));

I don't handle any synchronization between the CPU and GPU since from my understanding GL_MAP_PERSISTENT_BIT takes care of that.

 

I haven't done any double or triple buffering. I do want to do that, but I am not sure how to implement it yet; I need to do a bit more research on it.
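For reference, here is a minimal sketch of one common triple-buffering scheme over a persistently mapped buffer, using glFenceSync/glClientWaitSync so the CPU never overwrites a region the GPU is still reading. All names are illustrative, not taken from the engine in this thread, and it assumes a loader such as glad provides the GL entry points.

#include <glad/glad.h> // assumption: any GL loader works

struct TripleBuffer
{
	GLuint buffer = 0;
	char* mapped = nullptr;   // persistent pointer, valid while the buffer lives
	GLsync fences[3] = {};    // one fence per in-flight region
	GLsizeiptr regionSize = 0;
	int frame = 0;            // which third we write this frame
};

void Create(TripleBuffer& tb, GLsizeiptr regionSize)
{
	tb.regionSize = regionSize;
	glGenBuffers(1, &tb.buffer);
	glBindBuffer(GL_ARRAY_BUFFER, tb.buffer);
	// Write-only persistent, coherent storage, three frame-regions long.
	GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
	glBufferStorage(GL_ARRAY_BUFFER, regionSize * 3, nullptr, flags);
	tb.mapped = (char*)glMapBufferRange(GL_ARRAY_BUFFER, 0, regionSize * 3, flags);
}

char* BeginFrame(TripleBuffer& tb)
{
	// Wait until the GPU has finished reading the region we are about to overwrite.
	if (tb.fences[tb.frame])
	{
		while (glClientWaitSync(tb.fences[tb.frame], GL_SYNC_FLUSH_COMMANDS_BIT, 1000000) == GL_TIMEOUT_EXPIRED) {}
		glDeleteSync(tb.fences[tb.frame]);
		tb.fences[tb.frame] = nullptr;
	}
	return tb.mapped + tb.frame * tb.regionSize; // write this frame's sprites here
}

void EndFrame(TripleBuffer& tb)
{
	// Issue the draw call(s) sourcing this region first, then fence it.
	tb.fences[tb.frame] = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
	tb.frame = (tb.frame + 1) % 3;
}

Each frame then writes into a different third of the buffer, so the draw call reading one region can overlap the CPU writes for the next.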

 

Here is the blog post

http://voidptr.io/blog/2016/04/28/ldEngine-Part-1.html

 

9 hours ago, Green_Baron said:

I haven't done this myself yet (apart from one or two tutorials), but I'd look into particle systems (instanced rendering to the rescue). I think you can easily get a much higher frame rate than you have now, with 4 static vertices / 6 indices.

In instanced rendering you have the instance ID to vary each instance, e.g. to look up individual data from a uniform buffer object.

Also, one million quads are difficult to display on a 2K or even 4K monitor. Would there be a possibility to do some preselection?

 

Edit:

The Superbible has an example of rendering 1 million grass blades at a high frame rate. They write the individual data (changes in geometry, orientation, colour, texture) of each grass blade into textures instead of a large buffer, and use the instance ID in the shader to look up the individual data from the textures. One can even make it wave and sway in the wind. Maybe that's where you can read up on it ...
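A rough sketch of that idea, assuming a single static unit quad and a buffer texture holding one RGBA32F texel per sprite (all names here are hypothetical, not the Superbible's actual code):

#include <glad/glad.h> // assumption: any GL loader works

// Upload per-sprite data (xy = position, zw = size) and draw every sprite
// with a single instanced call; the quad's 6 indices are reused per instance.
void DrawSpritesInstanced(const float* data, GLsizei spriteCount)
{
	static GLuint tbo = 0, tex = 0;
	if (!tbo)
	{
		glGenBuffers(1, &tbo);
		glGenTextures(1, &tex);
	}
	glBindBuffer(GL_TEXTURE_BUFFER, tbo);
	glBufferData(GL_TEXTURE_BUFFER, spriteCount * 4 * sizeof(float), data, GL_DYNAMIC_DRAW);
	glBindTexture(GL_TEXTURE_BUFFER, tex);
	glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, tbo);

	// Assumes a VAO with the static quad (4 vertices / 6 indices) is bound and
	// a shader like the one sketched below is active.
	glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, nullptr, spriteCount);
}

// Matching vertex shader, where gl_InstanceID selects this sprite's texel:
//   #version 330 core
//   layout(location = 0) in vec2 corner;   // static unit-quad corner
//   uniform samplerBuffer instanceData;
//   void main() {
//       vec4 d = texelFetch(instanceData, gl_InstanceID);
//       gl_Position = vec4(d.xy + corner * d.zw, 0.0, 1.0);
//   }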


Resolution is not the issue here. I know for a fact that I can get 1 million quads rendered without using instancing, although I will implement instancing later on. For now I need to solve the CPU bottleneck that I have.

 

7 hours ago, Eternal said:

A large number of profiling samples on an opening brace like that points to a stall of some kind. In this case there seems to be a lot of pointer chasing going on here:


MapRectangleShapeBuffer((RectangleShape*)renderLayerManager.renderLayers[i]->renderables[j]);

From the cast it looks like renderables is something like std::vector<Renderable*>, so on each iteration you're probably loading from main memory instead of from cache. And then it will also re-load the RenderLayer and the renderables vector/array again and again, unless the compiler can prove that nothing could possibly change them between iterations.

To fix it: store your Renderables by value. If you have different types of Renderables, keep them in separate vectors. Your for loop should look something like this:


for(const auto& renderLayer : renderLayerManager.renderLayers)
{
	for(const auto& sprite : renderLayer.sprites) //sprites is std::vector<RectangleShape>
	{
		MapRectangleShapeBuffer(sprite);
	}
}

 

I suspected it would be a cache miss issue. I thought a vector of pointers would keep the data contiguous in memory. I was wrong; that only happens with a plain vector of objects. I verified that. After changing this vector from this

std::vector<IRenderable*> renderables;

to this

std::vector<RectangleShape> rectangleShapes;

 

my frame rate went from 16 frames to 23 frames, so I gained 7 frames. That is great. Thank you for the tip!

Still, my CPU is bottlenecked. I suspect it is still stalling because I have a pointer to my transform component in my RectangleShape class. I will move the position data from my transform component class into the sprite itself, test it, and see how many frames I gain.
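To illustrate the layout change being described (field and type names here are assumptions, not the engine's actual API):

struct Vector2 { float x, y; };
struct Color { float r, g, b, a; };
struct Transform { Vector2 position; };

// Before: the per-frame hot data lives in a separately allocated component,
// so every sprite update chases a pointer (a likely cache miss).
struct RectangleShapeBefore
{
	Transform* transform;
	Color color;
};

// After: the hot data lives inside the sprite itself, so iterating a
// std::vector<RectangleShape> touches one contiguous stream of memory.
struct RectangleShape
{
	Vector2 position; // owned here instead of behind a pointer
	Vector2 size;
	Color color;
};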

My usage on the curly brace went from 69% to 56%.

[profiler screenshot]

 

But there is still an issue.

Currently, my renderer can render the following things:

-LineShape

-RectangleShape

-Sprite

-Text

They all inherit from the base class Renderable. I need to have a vector of pointers; I can't do it as a vector of objects. The reason is that I also sort all my renderables based on their zSortingOrder, so I can't have 4 different vectors of objects for LineShape, RectangleShape, Sprite, and Text.

So that is one issue. However, I do need to keep my data contiguous in memory. As far as I know, a std::vector of pointers has no way of making this happen (correct me if I am wrong), so that leaves me having to create my own custom allocator. Which is going to be a pain, because I am also using std::sort(); if I create my own custom allocator, I'd need my own quicksort implementation to sort my renderables, unless I can somehow still use std::sort().
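One possible way out that keeps both std::sort() and contiguous storage, sketched here with assumed type names: keep each renderable type in its own vector and sort a lightweight array of draw keys instead of the objects themselves.

#include <algorithm>
#include <cstdint>
#include <vector>

enum class RenderableType : uint8_t { Line, Rectangle, Sprite, Text };

struct DrawKey
{
	float zSortingOrder;
	RenderableType type; // which per-type vector to look in
	uint32_t index;      // index into that type's contiguous vector
};

void BuildDrawOrder(std::vector<DrawKey>& keys)
{
	// std::sort works on the keys directly; no custom allocator or
	// hand-written quicksort is needed.
	std::sort(keys.begin(), keys.end(),
		[](const DrawKey& a, const DrawKey& b) { return a.zSortingOrder < b.zSortingOrder; });
	// Then iterate the sorted keys and dispatch: key.type selects the vector,
	// key.index selects the element inside it.
}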

Anyway, so far I know that one of my performance issues is that my data is not contiguous in memory, and that is killing my performance. I will reimplement my engine API to account for that.

 

5 hours ago, ProfL said:

It's good that you've started your optimization by profiling; there are so many who just guess.

But I'd suggest you use a better profiler; you can try CodeXL or VTune. Both will tell you the cost of every module and every assembly line, and both offer sample-based or hierarchical profiling, so you can find expensive lines (e.g. some cache misses that might cause all the trouble) or get a global overview.

I will have a look at them. Thank you!

 

4 hours ago, Green_Baron said:

 

Well, if a different algorithm/technique is 10 times faster by itself, then the best profiler won't help ... it says the mapped buffer costs time, which is not surprising ... just sayin', no offence.

I only map the buffer once and then use it, so mapping is not the thing that is killing my performance.

Edited by FantasyVII


I too render 2 million triangles at 3,000 fps with 9 texture lookups per vertex, full debug info, and a debug context; that is not the point. You have pulled out all the stops OpenGL offers to make the buffer slow, possibly convincing the driver to let it reside in CPU memory or to copy data around most of the time.

The example I am talking about can be found here. It only needs a small uniform buffer and offers quite some flexibility through the use of textures for the parameters. I don't have the time to integrate it myself right now because my framework is undergoing a major evolution. Maybe, hopefully, the link helps.

Edited by Green_Baron

On 9/2/2019 at 10:30 PM, FantasyVII said:

So that is one issue. However, I do need to keep my data contiguous in memory. As far as I know, std::vector<pointer*> has no way of making this happen (correct me if I am wrong) so that leaves me to having to create my own custom allocator.

The vector itself is contiguous, but if the data you point at is not, you're still in trouble. You could make vectors of the things you point at as well, to keep that data contiguous (and different vectors for different sub-classes can be done, at the cost of adding another data stream that the CPU must read). Also, for optimal results you should read that data as sequentially as you can, so you benefit from prefetching.

As said somewhere above already: if you do that, do you still need the pointer vector? Wouldn't it be feasible to push the data vectors to the GPU directly instead of going through the pointers?


With map persistent and map coherent, OpenGL is instructed to make every single write to the buffer known to the driver immediately, so that a subsequent draw call has the new data. The buffer is mapped only once, but it seems to me that it is being written to continuously, in a nested loop with multiple writes per sprite. The resulting copy operations from CPU to GPU memory (or the lookups directly in GPU memory, in case the buffer resides there, as if the driver had drawn the client-buffer-storage option) are imo what is causing the drag.

tl;dr: imo it is not the mapping alone, but every individual write into the mapped storage that brings the performance down. Solution: prepare the data and send it once, e.g. with textures for the parameters.
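The "prepare, then send once" part could look roughly like this (SpriteVertex and the mapped pointer are placeholders, not the thread's actual code): build the frame in an ordinary CPU-side staging vector, then publish it with one bulk copy.

#include <cstring>
#include <vector>

struct SpriteVertex { float x, y, u, v; };

void SubmitSprites(const std::vector<SpriteVertex>& staging, void* mapped)
{
	// All per-sprite writes happen in plain CPU memory (the staging vector);
	// the persistent/coherent mapping only sees one bulk copy per frame.
	std::memcpy(mapped, staging.data(), staging.size() * sizeof(SpriteVertex));
}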

Edited by Green_Baron


It does not work the way you imagine it does. If that were the case, persistent buffers would make no sense.

In THEORY: you can transfer around 5 GB/s via persistently mapped buffers; at 60 fps and 1M sprites, that's around 89 bytes/sprite (5 × 1024³ bytes ÷ 60 frames ÷ 1,000,000 sprites ≈ 89).

Without proper profiling, it's just reading crystal balls as to what "probably might be" the reason for the slowdown.


I have it from the documentation of glMapBufferRange ...

GL_MAP_PERSISTENT_BIT indicates that the mapping is to be made in a persistent fashion and that the client intends to hold and use the returned pointer during subsequent GL operation. It is not an error to call drawing commands (render) while buffers are mapped using this flag. It is an error to specify this flag if the buffer's data store was not allocated through a call to the glBufferStorage command in which the GL_MAP_PERSISTENT_BIT was also set.

GL_MAP_COHERENT_BIT indicates that a persistent mapping is also to be coherent. Coherent maps guarantee that the effect of writes to a buffer's data store by either the client or server will eventually become visible to the other without further intervention from the application. In the absence of this bit, persistent mappings are not coherent and modified ranges of the buffer store must be explicitly communicated to the GL, either by unmapping the buffer, or through a call to glFlushMappedBufferRange or glMemoryBarrier.

... and glBufferStorage:

GL_MAP_PERSISTENT_BIT

    The client may request that the server read from or write to the buffer while it is mapped. The client's pointer to the data store remains valid so long as the data store is mapped, even during execution of drawing or dispatch commands.
GL_MAP_COHERENT_BIT

    Shared access to buffers that are simultaneously mapped for client access and are used by the server will be coherent, so long as that mapping is performed using glMapBufferRange. That is, data written to the store by either the client or server will be immediately visible to the other with no further action taken by the application. In particular,

        If GL_MAP_COHERENT_BIT is not set and the client performs a write followed by a call to the glMemoryBarrier command with the GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT set, then in subsequent commands the server will see the writes.

        If GL_MAP_COHERENT_BIT is set and the client performs a write, then in subsequent commands the server will see the writes.

        If GL_MAP_COHERENT_BIT is not set and the server performs a write, the application must call glMemoryBarrier with the GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT set and then call glFenceSync with GL_SYNC_GPU_COMMANDS_COMPLETE (or glFinish). Then the CPU will see the writes after the sync is complete.

        If GL_MAP_COHERENT_BIT is set and the server does a write, the app must call glFenceSync with GL_SYNC_GPU_COMMANDS_COMPLETE (or glFinish). Then the CPU will see the writes after the sync is complete.
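As a concrete illustration of the first case above (coherent bit not set, client writes), a hedged sketch with placeholder names:

#include <glad/glad.h> // assumption: any GL loader works
#include <cstring>

void PublishClientWrites(void* mapped, const void* vertexData, std::size_t dataSize)
{
	std::memcpy(mapped, vertexData, dataSize);            // client-side write
	glMemoryBarrier(GL_CLIENT_MAPPED_BUFFER_BARRIER_BIT); // make it visible to the server
	// Subsequent draw commands sourcing this buffer now see the new data.
}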

 

On 9/2/2019 at 3:30 PM, FantasyVII said:

void setValueToFiveWithReference(int& x)
{
    x = 5;
}

I'm pretty new to C++ (I had programmed in Java before). Does that make a difference vs. the other way around, like you did with an int* pointer?

 

^ wasn't supposed to be inside the quote, and it won't let me change the quote
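For comparison, a minimal sketch of the pointer version of that function; both forms let the callee modify the caller's int, the pointer form just needs an explicit dereference and an address-of at the call site:

void setValueToFiveWithPointer(int* x)
{
	*x = 5; // explicit dereference; the caller passes an address
}

int main()
{
	int value = 0;
	setValueToFiveWithPointer(&value); // note the &
	// With the reference version it would just be setValueToFiveWithReference(value);
	return 0;
}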

Edited by tolazytbh


I'm playing around with an example for a particle system and recalled this problem.

There, a dynamic buffer is mapped persistently but flushed explicitly. Instead of creating (and storing into) the buffer range with the coherent bit set (which imo generates a lot of traffic), maybe you could try setting GL_MAP_FLUSH_EXPLICIT_BIT besides GL_MAP_WRITE_BIT and GL_MAP_PERSISTENT_BIT (and nothing more, if you don't read from the buffer). After having finished all buffer operations and before drawing, call glFlushMappedBufferRange().
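A minimal sketch of that mapping, assuming the storage itself was created with GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT (buffer binding and size names are placeholders):

#include <glad/glad.h> // assumption: any GL loader works

void* MapForExplicitFlush(GLsizeiptr size)
{
	// Persistent write-only mapping; visibility is handled by explicit flushes
	// instead of the coherent bit.
	GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_FLUSH_EXPLICIT_BIT;
	return glMapBufferRange(GL_ARRAY_BUFFER, 0, size, flags);
}

void FlushBeforeDraw(GLintptr offset, GLsizeiptr bytesWritten)
{
	// After all writes for the frame, publish just the range that was touched.
	glFlushMappedBufferRange(GL_ARRAY_BUFFER, offset, bytesWritten);
}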

Maybe it helps a bit ...

