Installing a second "development" video card?

Started by
14 comments, last by L. Spiro 11 years, 10 months ago
Yes, it is possible to have AMD and NVidia cards installed in the same Windows 7 PC. Though I don't have any video-out connected to the second card; I just use it to run OpenCL without stalling the main display.

Regarding performance, there is less difference than you'd expect, even though the hardware works in a completely different way. That's because:
1. Modern GPUs also reject occluded surfaces. This doesn't work perfectly, as it depends on draw order, but:
a) it still rejects a lot of faces
b) you can do a Z-prepass on PC to simulate the perfect culling you get on iOS devices
c) even if the PowerVR is more efficient at figuring out what needs to be done, doing it is still slow; you'll be memory-bound most of the time (at least in my experience)
2. Most optimizations that help on PC also help on iOS, like:
a) optimizing draw order, shaders, and state changes
b) reducing data size with compression, e.g. 16-bit floats for vertex buffers and texture compression to reduce bandwidth
c) reducing clears and FBO switches, and using optimal formats (e.g. 16-bit for shadows).
3. iOS devices also come in different performance levels: four generations of phones, three of iPads, and a dozen iPods. You will want to keep things scalable, especially as an indie dev who cannot afford to test on (and optimize for) all devices. So if you can get smooth performance on PC by e.g. dynamically adjusting LOD distances, limiting particle overdraw, and balancing the amount of texture/VBO updates per frame, you will benefit across the line on all other platforms as well.
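A minimal sketch of what "dynamically adjust LOD distances" could look like; the class, thresholds, and step sizes here are illustrative assumptions, not a fixed recipe:

```cpp
#include <algorithm>

// Hypothetical sketch: scale LOD switch distances by a quality factor
// that adapts to the measured frame time, so the same code lowers detail
// on a slow device (low-end PC or older iPhone) and raises it on a fast one.
class LodScaler {
public:
    // targetMs: frame budget, e.g. 16.6 for 60 FPS.
    explicit LodScaler(float targetMs) : m_targetMs(targetMs) {}

    // Call once per frame with the last frame's measured time.
    void Update(float frameMs) {
        // Nudge the quality factor toward the budget; small steps
        // avoid oscillation when the load changes abruptly.
        if (frameMs > m_targetMs * 1.1f)      m_quality -= 0.02f;
        else if (frameMs < m_targetMs * 0.9f) m_quality += 0.01f;
        m_quality = std::min(1.0f, std::max(0.25f, m_quality));
    }

    // Distance at which an object drops to a lower LOD, scaled by quality.
    float LodDistance(float baseDistance) const {
        return baseDistance * m_quality;
    }

private:
    float m_targetMs;
    float m_quality = 1.0f;
};
```

The same quality factor could also cap particle counts or per-frame buffer updates, so one knob scales the whole frame.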

There are just a few very specific optimizations that are tied to the hardware; most are generic.

There are just a few very specific optimizations that are tied to the hardware; most are generic.

While many of these points are true for both platforms, not all are. One thing worth pointing out is that OpenGL drivers for PC are often quite shoddy since vendors are focusing more on DirectX, so it is quite hit-and-miss when saying how much of an impact certain things make. For example, sorting by shader first and textures second could be faster on one PC and slower on another.
Luckily iOS devices are more consistent, and it is always better to sort by textures first, then shaders.
This is true for my PC as well, but on PC I add a 3rd condition when both textures and shaders are the same: depth. This extra condition actually slows down iOS so it is worth mentioning.
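As a concrete illustration of that "textures first, then shaders" ordering, here is a minimal sketch; the DrawCall layout, ID widths, and bit packing are my own assumptions, not from any engine mentioned here:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative sketch: pack render state into a single sort key so one
// std::sort orders draw calls by texture first, then shader.
struct DrawCall {
    uint16_t textureId;
    uint16_t shaderId;
    uint32_t SortKey() const {
        // Texture in the high bits => primary sort criterion.
        return (uint32_t(textureId) << 16) | shaderId;
    }
};

inline void SortForIos(std::vector<DrawCall>& calls) {
    std::sort(calls.begin(), calls.end(),
              [](const DrawCall& a, const DrawCall& b) {
                  return a.SortKey() < b.SortKey();
              });
}
```

On a PC where the extra depth criterion pays off, you would widen the key and put depth in the low bits; on iOS you would leave it out, since it is wasted work there.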

The biggest difference that pops out to me is in regards to clearing.
Clearing buffers on iOS devices just sets a flag. It doesn’t actually copy memory over the whole buffer etc.
It is instantaneous as long as it does not cause a resolve. You can avoid resolves by calling glDiscardFramebufferEXT() before glClear().

This means that what makes for a long operation on PC is virtually free on iOS.



One of the most important things to avoid at all costs when working on iOS is dynamic/streaming VBO’s. It takes only a handful of updates per frame (around 6) to bring your game to a crawl (around 10 FPS). The same number of updates on the same vertex buffers without using a VBO allows you around 200 updates per frame before it even drops below 60 FPS. Without GL_DYNAMIC_DRAW VBO’s, I can update about 1,000 buffers per frame before it starts to slow to 10 FPS.

This applies even if you are double-buffering the VBO’s that you update, although to a slightly smaller extent. In one of our games, changing from GL_DYNAMIC_DRAW VBO’s to having no VBO’s increased the FPS from 21 to 32. That is a 16.369-millisecond difference, and there were only 4 or 5 VBO’s being updated.

This is not so hard on the PC side, although PC’s still benefit a bit from not using VBO’s for GL_DYNAMIC_DRAW or GL_STREAM_DRAW.
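One hedged sketch of the double-buffering idea mentioned above: rotate through a small pool of buffers so that a buffer filled this frame is not touched again until the GPU has (presumably) finished with it. The pool depth and the handle type are assumptions; in a real GL build the indices would map to VBO names, or to plain client-side arrays on iOS:

```cpp
#include <cstddef>

// Sketch: a round-robin pool so each frame writes a buffer the GPU
// finished with several frames ago, avoiding mid-frame reallocation
// or stalls in the driver.
class DynamicBufferRing {
public:
    static constexpr size_t kFrames = 3;  // assumed GPU latency in frames

    // Returns which of the kFrames buffers to fill this frame.
    size_t CurrentIndex() const { return m_frame % kFrames; }

    void EndFrame() { ++m_frame; }

private:
    size_t m_frame = 0;
};
```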



On the one hand you can argue that low-end PC hardware can still give you a rough idea of what to expect on iOS devices due to many optimizations being “a good idea” for both platforms.
On the other hand you might still get hit by something that hits iOS harder than it does PC (dynamic/streaming VBO’s, dependent texture reads) or you might entirely miss opportunities to optimize for iOS specifically (discarding before clearing) simply because your only experience with iOS was a port. You didn’t get into the meat of the system and really learn how it works and how to exploit it for best performance.



There is no substitute for a real device.


L. Spiro

I restore Nintendo 64 video-game OST’s into HD! https://www.youtube.com/channel/UCCtX_wedtZ5BoyQBXEhnVZw/playlists?view=1&sort=lad&flow=grid


[quote name='Krypt0n' timestamp='1340994631' post='4954030']
There are just a few very specific optimizations that are down the hardware, most are generic.

While many of these points are true for both platforms, not all are. One thing worth pointing out is that OpenGL drivers for PC are often quite shoddy since vendors are focusing more on DirectX, so it is quite hit-and-miss when saying how much of an impact certain things make. For example, sorting by shader first and textures second could be faster on one PC and slower on another.
Luckily iOS devices are more consistent, and it is always better to sort by textures first, then shaders.[/quote]
Actually, on all platforms you should sort by shader first (at least if overdraw is not an issue, e.g. due to a Z-pass). It's not related to drivers; it's how the hardware works.
e.g. the Unity optimization report:
[quote ]
Rendering order of opaque geometry
• Tegra: big occluders first (front to back), rest by shader
• iOS: sort by shader[/quote]
http://blogs.unity3d...n_Unite2011.pdf

This is true for my PC as well, but on PC I add a 3rd condition when both textures and shaders are the same: depth. This extra condition actually slows down iOS so it is worth mentioning.
changing depth order is completely irrelevant on iOS.


The biggest difference that pops out to me is in regards to clearing.
Clearing buffers on iOS devices just sets a flag. It doesn’t actually copy memory over the whole buffer etc.
It is instantaneous as long as it does not cause a resolve.
on PowerVR hardware, a clear is simply a big triangle covering the whole screen. I think you are confusing it with a resolve; while it affects resolve behavior, it still has some cost.


This means that what makes for a long operation on PC is virtually free on iOS.
please don't teach people that; it's simply not true. It has exactly the same cost as drawing a full-screen triangle on iOS.

on PC it depends on what GPU you have; if you use one that has HiZ, it will just invalidate some areas.



One of the most important things to avoid at all costs when working on iOS is dynamic/streaming VBO’s. It takes only a handful of updates per frame (around 6) to bring your game to a crawl (around 10 FPS). The same number of updates on the same vertex buffers without using a VBO allows you around 200 updates per frame before it even drops below 60 FPS. Without GL_DYNAMIC_DRAW VBO’s, I can update about 1,000 buffers per frame before it starts to slow to 10 FPS.[/quote]I think the PowerVR whitepaper says otherwise.
You should never update buffers once you start drawing. As the PowerVR GPU is deferred, it keeps all buffers until the end of the frame, so when you lock one, the driver has to allocate new memory, etc. But you can update all buffers before you issue the first draw call without affecting the GPU/driver.


This applies even if you are double-buffering the VBO’s that you update, although to a slightly smaller extent. In one of our games, changing from GL_DYNAMIC_DRAW VBO’s to having no VBO’s increased the FPS from 21 to 32. That is a 16.369-millisecond difference, and there were only 4 or 5 VBO’s being updated.[/quote]It doesn't matter how many buffers you cycle through: if you try to modify a buffer that has already been used for drawing this frame, the driver will have to allocate a temporary new buffer, etc.

This performance behavior is true for nearly all mobile GPUs: PowerVR, Adreno, Mali... To my knowledge only Tegra has no deferred rendering and might be fine with updates in the middle of the frame, but even there you could stall until the hardware is done with the last draw call that used this particular buffer.


On the one hand you can argue that low-end PC hardware can still give you a rough idea of what to expect on iOS devices due to many optimizations being “a good idea” for both platforms.
On the other hand you might still get hit by something that hits iOS harder than it does PC (dynamic/streaming VBO’s, dependent texture reads) or you might entirely miss opportunities to optimize for iOS specifically (discarding before clearing) simply because your only experience with iOS was a port. You didn’t get into the meat of the system and really learn how it works and how to exploit it for best performance.
[/quote]
You can still get a rough idea of how it will run on low-end devices. Optimizing states, shaders, textures, and meshes, improving culling, etc. will help on all platforms.

I suggest getting the Imagination SDK; it has an 'emulator' for OpenGL ES. (If you work on Android, you could also check out the Adreno and Tegra SDKs; they have quite a few nice tools/libs to optimize textures, index buffers, etc. for those platforms.)

Actually, on all platforms you should sort by shader first (at least if overdraw is not an issue, e.g. due to a Z-pass). It's not related to drivers; it's how the hardware works.

No, you shouldn’t, and no, it isn’t.
http://lspiroengine.com/?p=96
Swapping shaders is no longer such a big deal on DirectX 10 and DirectX 11 due to vertex buffers being forced to precompute an input layout—something that was done on each shader swap on DirectX 9 and still is done in OpenGL.
What happens under the hood is very much dependent on the drivers. Swapping textures and shaders is not the same cost in OpenGL vs. DirectX 9 on the same hardware.
On my machine, texture swaps are more inefficient in OpenGL, so sorting by textures first and shaders second is faster. Neither the hardware nor the scene changed.

You should always do your own benchmarks, as I did, when it comes to different hardware. However, iOS devices are not different enough from each other that the results would ever favor “shader first, textures second” on one device and not on another.



changing depth order is completely irrelevant on iOS.

Which is why adding a sort slows it down. Extra work for nothing. Hence it is worth mentioning. That was my point.



on PowerVR hardware, a clear is simply a big triangle covering the whole screen.

No, it isn’t.
http://www.unrealeng...AA_Graphics.pdf



[quote name='Krypt0n' timestamp='1341076636' post='4954346']
please don't teach people that; it's simply not true. It has exactly the same cost as drawing a full-screen triangle on iOS.


No, it doesn’t.
http://www.unrealeng...AA_Graphics.pdf



[quote name='Krypt0n' timestamp='1341076636' post='4954346']
You should never update buffers once you start drawing

If you can avoid it, it is best. True on all hardware.



as the PowerVR GPU is deferred, it keeps all buffers until the end of the frame

This is not an exactly accurate description, since flushes can happen mid-frame if the command buffer is full etc., but approximately correct.
Hodgman correctly points out below that this part of the pipeline is the same for all hardware, and the special extra stages added for deferred tile-based rendering are completely separate. I had a brief absence of mind in my original reply.



It doesn't matter how many buffers you cycle through: if you try to modify a buffer that has already been used for drawing this frame, the driver will have to allocate a temporary new buffer, etc.

Again, approximately correct.
But without getting into the details of what else the driver could be doing (flushing etc.), my profiling revealed one function as the most costly: memcpy().
Internally, glBufferSubData() calls memcpy() (regardless of whatever else it is doing such as allocating, flushing, etc.) and this call was, in all of our samples, games, and test cases, the second most time-consuming call after glDrawElements().
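If the memcpy() inside glBufferSubData() is the hot spot, one mitigation (my own illustration, not something claimed in this thread) is to merge overlapping or adjacent dirty byte ranges before uploading, so fewer and larger copies are issued:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Sketch: coalesce dirty [begin, end) byte ranges so the upload loop
// issues one glBufferSubData per merged range instead of one per edit.
using Range = std::pair<size_t, size_t>;

inline std::vector<Range> MergeDirtyRanges(std::vector<Range> ranges) {
    if (ranges.empty()) return ranges;
    std::sort(ranges.begin(), ranges.end());
    std::vector<Range> merged{ranges[0]};
    for (size_t i = 1; i < ranges.size(); ++i) {
        if (ranges[i].first <= merged.back().second)
            // Overlapping or touching: extend the previous range.
            merged.back().second = std::max(merged.back().second,
                                            ranges[i].second);
        else
            merged.push_back(ranges[i]);
    }
    // Caller would issue, per merged range r:
    //   glBufferSubData(target, r.first, r.second - r.first, data + r.first);
    return merged;
}
```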



you can still roughly get an idea of how it will run on low end devices. optimizing states, shaders, textures, meshes, improving culling etc. will help on all platforms.

That’s not the point. As I said, certain things are just a good idea in general, and you are going to do them regardless of what hardware you are actually using as your testbed.
The point is what you don’t do as a result of not using the real device. You end up just porting your code over and calling it good without investigating what else you can do to make things faster specifically for that hardware. And if you care about performance at all then you must use a real device because that is the only way to gain access to “Instruments” which actually help you find time-consuming functions, redundant OpenGL ES 2 calls, and various other performance problems.



I suggest getting the Imagination SDK; it has an 'emulator' for OpenGL ES.

We tried working with it at work for a few months until finally abandoning it due to the number of differences between it and a real device, plus several blatant bugs.


As I have been saying: there is no substitute for a real device. Not “similar” hardware, not SDK emulators, not even the iOS Simulator.


L. Spiro


You guys have already derailed the topic, so...

[quote name='L. Spiro' timestamp='1341021673' post='4954174']The biggest difference that pops out to me is in regards to clearing.
Clearing buffers on iOS devices just sets a flag. It doesn’t actually copy memory over the whole buffer etc.
It is instantaneous as long as it does not cause a resolve. You can avoid resolves by calling glDiscardFramebufferEXT() before glClear().
This means that what makes for a long operation on PC is virtually free on iOS.
please don't teach people that; it's simply not true. It has exactly the same cost as drawing a full-screen triangle on iOS.

on PC it depends on what GPU you have, if you use one that has HiZ, it will just invalidate some areas[/quote]@L. Spiro - I would expect PC GPUs to still have a "fast clear" optimisation in hardware. This was present in PC hardware 5 years ago, so it should still be around. It will likely depend on the texture format as to whether it's supported or not (more likely supported on depth and 8-bit channel formats).
@Krypt0n - HiZ can't possibly help when clearing a non-depth texture, and not every depth texture will necessarily be assigned a corresponding hierarchical representation (which yes, should support fast clear) -- there might only be a single HiZ "buffer" which the current depth target can make use of (but isn't permanently assigned to that target).

Actually on all platform you should first sort by shader (at least if overdraw is not an issue e.g. due to a z-pass). it's not related to drivers, it's how hardware works
On several widely popular (read: older) GPUs, a change to shader constants causes the same performance impact as changing the shader program itself (which may or may not be a bottleneck for your scene; only sensible profiling would tell). So sorting by shader program isn't going to do anything on these GPUs if you're also changing shader constants between draw calls, as these cause internal program switches anyway (unless grouping by shader helps you reduce changes to shader constants).
You should never update buffers once you start drawing, as the PowerVR gpu is deferred, this means, it keeps all buffers until the end of the frame, so when you lock, it will have to allocate new memory etc.
it doesn't matter how many times you buffer, if you try to modify a buffer that was used already in this frame for drawing, the driver will have to allocate a temporal new buffer etc.
This performance behavior is true for nearly all mobile GPUs, PowerVR, Adreno, Mali... to my knowledge only Tegra has no deferred rendering and it could be fine with updates in the middle of the frame, but even there, you could stall until the HW is done with the last drawcall that was using this particular buffer.
All regular PC GPUs do that (buffer your data/commands for at least one frame) -- this has nothing to do with mobile/deferred -- deferred/PowerVR style buffering is a completely different concept implemented between the primitive rasteriser and the pixel shader.

On several widely popular GPUs, changing shader constants causes the same performance impact as changing the shader program itself

Not just the GPU hardware but drivers and API are also a factor here.
Because my engine uses a custom shader language, it is able to fully parse shader files and generate buffers that represent all of the shader constants, so when I set a shader constant it can perform automatic redundancy checks.

For small constants such as floats and vectors, this always improves performance.

But I tested larger constants—vec4 arrays and matrices. The results are a bit surprising.

Between DirectX 9, OpenGL, and OpenGL ES 2 for iOS, none shared the same results.

In DirectX 9 it is best not to redundancy-check matrices and larger types. Always calling SetMatrixTranspose() (or what-have-you), even when the same matrix is being set, seems to be faster.

In OpenGL for the same hardware, it appears that redundancy checking is faster. Remember that this is the same scene with the same number of redundant matrices occurring every frame. The overall pipeline is identical.

In OpenGL ES 2 for iOS, I was not able to measure any difference, whether checking for redundancies or not. This might ultimately end up true, and it makes sense for the hardware, but I plan to test a few times more in the future under different circumstances.
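A sketch of the kind of redundancy check being described, with the matrix policy left as a flag since the results differed per API. All names here are hypothetical, not from the engine in question:

```cpp
#include <array>
#include <unordered_map>

// Sketch: cache the last value set per uniform location and skip
// redundant sets. Small constants are always checked; whether to check
// matrices is a per-API policy flag (it helped in GL, hurt in D3D9).
class UniformCache {
public:
    explicit UniformCache(bool checkMatrices)
        : m_checkMatrices(checkMatrices) {}

    // Returns true if the value must actually be sent to the driver.
    bool SetVec4(int location, const std::array<float, 4>& v) {
        auto it = m_vec4s.find(location);
        if (it != m_vec4s.end() && it->second == v) return false;  // redundant
        m_vec4s[location] = v;
        return true;  // caller issues glUniform4fv here
    }

    bool SetMat4(int location, const std::array<float, 16>& m) {
        if (!m_checkMatrices) return true;  // policy: always upload
        auto it = m_mat4s.find(location);
        if (it != m_mat4s.end() && it->second == m) return false;
        m_mat4s[location] = m;
        return true;  // caller issues glUniformMatrix4fv here
    }

private:
    bool m_checkMatrices;
    std::unordered_map<int, std::array<float, 4>> m_vec4s;
    std::unordered_map<int, std::array<float, 16>> m_mat4s;
};
```

Because my engine parses the shaders itself, it knows every constant's type and can pick the policy per type and per API at build time.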


Your results may vary, especially if you have one of those cards mentioned by Hodgman.
But this again shows that there are too many differences between drivers and API’s for “similar” hardware to have any real meaning.


L. Spiro


This topic is closed to new replies.
