OpenGL Bindless Textures and multiple GL contexts => problem?


Hi!

There's something strange going on with my OpenGL bindless textures test performance-wise.

(My system is Windows 7 Embedded, 16 GB RAM, and a 10 GB NVIDIA GeForce 1080 Ti with driver version 388.00.)

I've been testing OpenGL bindless textures as part of our 3D engine. My app creates 512 1024x1024 RGBA8 textures with a single mip level and fills them with constant data. It then gets the bindless texture handle for each texture and calls glMakeTextureHandleResidentARB on it. After the operation, each of my 512 textures returns true for the glIsTextureHandleResidentARB query, and the 2 gigabytes of textures are reflected in the GL_GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX query as expected. This process is performed once; I never touch the textures after this init phase is done. My test doesn't even draw anything using them. The engine in which this test is incorporated does draw a lot of other stuff, but not using those textures.
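For reference, the init phase amounts to something like the following. This is a minimal sketch, not the engine's actual code; it assumes a GL 4.5 context with ARB_bindless_texture available (loaded here via GLEW), and names like kTexCount are illustrative:

```cpp
#include <GL/glew.h>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int kTexCount = 512;   // 512 textures...
constexpr int kTexSize  = 1024;  // ...of 1024x1024 RGBA8 (4 MB each, ~2 GB total)

std::vector<GLuint>   gTextures(kTexCount);
std::vector<GLuint64> gHandles(kTexCount);

void initBindlessTextures()
{
    // Constant fill data, as in the test.
    const std::vector<uint32_t> pixels(kTexSize * kTexSize, 0xFF808080u);

    glCreateTextures(GL_TEXTURE_2D, kTexCount, gTextures.data());
    for (int i = 0; i < kTexCount; ++i)
    {
        // Immutable storage with a single mip level.
        glTextureStorage2D(gTextures[i], 1, GL_RGBA8, kTexSize, kTexSize);
        glTextureSubImage2D(gTextures[i], 0, 0, 0, kTexSize, kTexSize,
                            GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());

        // Fetch the bindless handle and make it resident; the texture's
        // parameters become immutable once the handle exists.
        gHandles[i] = glGetTextureHandleARB(gTextures[i]);
        glMakeTextureHandleResidentARB(gHandles[i]);
    }

    // Sanity check: every handle should now report resident.
    for (int i = 0; i < kTexCount; ++i)
        assert(glIsTextureHandleResidentARB(gHandles[i]));
}
```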

So now I have 2 gigabytes of bindless textures resident.

The engine I'm working on uses two GL contexts: one for the window on the desktop and one for a hidden window. Basically, the hidden one is used when rendering to offscreen targets, and the window context is used when displaying the final result to the user.

When my 2 gigabytes of textures are resident, wglMakeCurrent switching between those two contexts costs a lot: around 0.4-0.5 milliseconds per switch. With all the necessary switches during a frame, that can total 1-1.5 milliseconds, which is a lot for a 60 Hz app. Without bindless textures resident, a GL context switch costs at most 0.1 ms.

The GL context switch cost depends on the total size of the textures, so the driver must be doing something at the texel level. I profiled the app using AMD CodeAnalyst. When my textures are resident, the module "dxgmms1.sys" lights up, taking 3% of profiler samples. The module contains symbols named "VidMmInterface" and is used both by my app and the kernel (PID 4). So I guess it's doing some video memory management on the context switch. But why? There's plenty of video memory available, with gigabytes to spare, when the textures are resident.

My test makes them resident in only one of the contexts. If I make them resident in both contexts, the cost doubles.

So my bindless textures incur a cost that's paid on every switch between the two contexts after they have been made resident. If they are resident in a context that's never made current, then there's no cost.

EDIT: The two contexts are on the same share group (wglShareLists)
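For context, the setup amounts to something like this. A rough WGL sketch of the architecture as described, not the engine's actual code; it assumes both windows already have matching pixel formats set, and error handling is omitted:

```cpp
#include <windows.h>
#include <GL/gl.h>

struct EngineContexts { HGLRC windowCtx; HGLRC hiddenCtx; };

EngineContexts createSharedContexts(HDC windowDC, HDC hiddenDC)
{
    HGLRC windowCtx = wglCreateContext(windowDC);
    HGLRC hiddenCtx = wglCreateContext(hiddenDC);

    // Put both contexts in one share group, so texture objects created
    // on either context are usable from both (bindless residency, as
    // noted above, is still per-context state).
    wglShareLists(windowCtx, hiddenCtx);
    return { windowCtx, hiddenCtx };
}

// Per frame the engine then alternates roughly like this:
//   wglMakeCurrent(hiddenDC, hiddenCtx);  // render offscreen targets
//   wglMakeCurrent(windowDC, windowCtx);  // display the final result
```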

Has anyone ever used bindless textures with multiple GL contexts? Have you noticed performance problems?

Best regards,
  Jani


I'm not sure I understand why you need two contexts in order to render offscreen. Are these contexts rendering at the same time on different threads?

My (wild) speculation here would be that the driver is forcibly synchronising every one of your textures whenever you switch contexts, because it's worried you might be modifying them from two threads at the same time...

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

As swiftcoder said before, I don't really know why you need another context just for offscreen rendering. Multiple contexts are useful for doing some basic multithreading in OpenGL, but in that case you'd just have one context per thread, and each context would always be current on its particular thread, so no context switching would be needed.

Thanks! Yeah, obvious question: why 2 contexts? Let's say it's an old architectural decision that hasn't been a performance problem before, but seems to be one with bindless textures. The original idea was to be able to support N windows, with a single hidden window taking care of rendering shared textures. Converting the N=1 special case to a single GL context is an option that is being entertained, but first I'd like to understand what's wrong with my current setup. So, back to the original question.

The app is doing multithreading, but the two GL contexts are created and used from the same thread so the GL only sees commands coming from one thread.

As for swiftcoder's idea about the driver forcibly synchronizing all of the textures: why doesn't it do that for the several gigabytes of non-bindless textures that are being used by the frame loop? It's only the bindless test textures that are causing this. The idea of bindless was to lower GL driver overhead by letting the user control textures more directly. Now it seems that with two contexts the driver overhead is greatly increased.

7 hours ago, The Scytheman said:

It's only the bindless test textures that are causing this.

Even money says it's a driver bug. Which isn't terribly surprising: you are at the intersection of two things that don't see heavy use, bindless textures and shared contexts.

It's worth mentioning that in a Windows-only product, you could likely get away with a lot more under DirectX than OpenGL. The drivers tend to be a little more robust :/

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

If you have shared resources (implied, given that you are using shared contexts), then the driver has to ensure a consistent view of each resource when each context becomes active. The only way to do this is through some form of synchronization, which I think others have pointed out. This goes for all shareable resources; IIRC the specification points this out too. Without this automatic synchronization the driver cannot ensure coherency, since, as you mentioned, it's possible that one context is modifying a resource while another is reading it (which implies a multithreaded setup).

If you are not using multiple threads, then having multiple contexts really makes no sense, as a single context would work fine: each window supplies its own device context, which is all that *MakeCurrent cares about. Even in this case there is still a price to pay for calling *MakeCurrent when switching windows. If your application is multithreaded, then there is no need to call *MakeCurrent more than once to initialize the context on each thread, as once that association is made it never changes unless you manually call *MakeCurrent on the SAME thread with a different context.
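To illustrate the two patterns described above, here is a sketch under those assumptions; the function and parameter names are hypothetical, not from the engine in question:

```cpp
#include <windows.h>
#include <GL/gl.h>

// Single-threaded case: one context can serve several windows, because
// wglMakeCurrent simply pairs the context with whatever HDC you pass.
void renderBothWindows(HDC windowA, HDC windowB, HGLRC ctx)
{
    wglMakeCurrent(windowA, ctx);  // draw into window A
    // ... GL commands for window A ...
    wglMakeCurrent(windowB, ctx);  // same context, now window B
    // ... GL commands for window B ...
}

// Multithreaded case: one context per thread, made current exactly once.
// The thread/context association then persists for the thread's lifetime.
void renderThreadMain(HDC dc, HGLRC threadCtx)
{
    wglMakeCurrent(dc, threadCtx); // called once at thread startup
    for (;;)
    {
        // ... issue this thread's GL commands, swap buffers, etc. ...
    }
}
```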

cgrant, in this case no context is modifying the texture data in those textures. After the textures have been made resident in one of the contexts, no draw commands are ever issued that use those textures, let alone modify them. The test is just to make them resident and see what happens. They are made resident exactly once and left alone after that, but their cost is paid every frame from then on. Performance problems happened. :) Of course, if the GL is being overly conservative with the synchronization (i.e. assuming multithreaded use and not tracking whether data is actually being changed by a context), then that could be an explanation. But again, that makes bindless textures the odd one out, because this two-context scheme has been in place in our engine for years and it hasn't been a performance issue before.

On 2/17/2018 at 11:30 AM, The Scytheman said:

in this case no context is modifying the texture data in those textures. After the textures have been made resident in one of the contexts, no draw commands are ever issued which use those textures let alone modify them.

The driver may not be smart enough, or it may just make the lazy assumption that all shared resources must be synced on a context switch whether or not they are dirty. If this worked previously without issue, then it's most likely a driver change that brought about the issue.

It's not the driver version. My problem has been in existence since "bindless" became a thing. 

I've now removed the wglMakeCurrent calls from my frame loop, and there's only a single GL context; none are performed after init. Those calls were costly when my test textures were resident in the bindless sense.

Still, the bottom line is that if you have a lot of texture data persistently resident, you pay a price for it on virtually every GL call you make for as long as it's resident. I can see this because we have built-in CPU and GPU profilers inside our engine. At the moment it seems to be about 0.4 ms extra for my 512 textures. This doesn't seem to affect the time things take on the GPU side, just the GL calls on the CPU side.

Maybe the issue here is that the rest of the engine isn't leveraging bindless textures. Maybe the cost of residency would be more than offset by actually leveraging them in the draw passes, using multidraw etc. and avoiding binds. That would mean my simple test's effect on performance isn't really an issue. But this is a huge maybe.
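For what it's worth, the draw-pass pattern I mean would look roughly like this. A sketch, not our engine's code; the SSBO layout and names are illustrative, and it assumes GL 4.5 with ARB_bindless_texture:

```cpp
#include <GL/glew.h>

// Fragment shader (GLSL) embedded as a raw string. With
// ARB_bindless_texture, sampler handles can live directly in an SSBO,
// so no glBindTexture is needed per draw.
static const char* kFragmentSrc = R"glsl(
    #version 450
    #extension GL_ARB_bindless_texture : require

    layout(std430, binding = 0) readonly buffer TextureHandles {
        sampler2D textures[];   // bindless handles, one per material
    };

    flat in int materialIndex;  // e.g. derived from gl_DrawID in the VS
    in vec2 uv;
    out vec4 color;

    void main() { color = texture(textures[materialIndex], uv); }
)glsl";

// Upload the resident GLuint64 handles once; multidraw calls can then
// index into the array without any per-draw texture binds.
GLuint uploadHandleBuffer(const GLuint64* handles, int count)
{
    GLuint ssbo = 0;
    glCreateBuffers(1, &ssbo);
    glNamedBufferStorage(ssbo, count * sizeof(GLuint64), handles, 0);
    glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
    return ssbo;
}
```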

1 hour ago, The Scytheman said:

It's not the driver version. My problem has been in existence since "bindless" became a thing.

Just because your driver works fine for normal OpenGL features doesn't mean there isn't a bug in its bindless texture implementation. I don't see much evidence of bindless textures seeing a huge amount of use in the wild.

Tristam MacDonald. Ex-BigTech Software Engineer. Future farmer. [https://trist.am]

