CPU vertex shaders

Started by
3 comments, last by JonnyQuest 21 years, 8 months ago
I'm in the process of designing the way I'll implement vertex shaders into the engine I'll be developing soon. As we all know, vertex shaders are not supported by every piece of hardware, but Microsoft implemented a CPU shader unit emulator into DirectX8. I've read loads of stuff about it in various MS articles, but I'm still not entirely sure if I got this right.

- On shader-supporting cards, it's all clear, just go ahead and use them. Vertex streams are typically placed in video or AGP (nonlocal video) memory.
- On cards that don't support T&L in hardware, the software shader just replaces the software FVF T&L engine, so there are no problems there, either. Vertex streams are in system memory.
- Now the (to me) tricky bit: On cards that support T&L in hardware, but no shaders, I'm basically running the shader on the CPU, and sending the transformed vertices to the card, bypassing the hardware T&L unit, right? By default, vertex buffers are in video memory, so do I need to tell DX to put them in system memory if I'm using a shader?

Performance is also an issue, as far as I understand:

- with hardware shaders, there should be no difference in performance as long as I'm performing equivalent operations.
- with software shaders on non hw-tnl cards, the same should apply, as I'm doing the same thing in software, and the AGP bandwidth consumed is the same either way.
- with software shaders on non hw-tnl cards, the FVF version should be significantly faster, as the AGP bandwidth used is minimal, and I'm using hardware for transformation. If I use shaders, I use the (somewhat slower) transforming routines on the CPU, AND submit the result over the AGP bus.

So:

- if the hardware supports shaders, I can use them freely
- if the hardware doesn't even support T&L, I can use them freely as well
- if the hardware has hardware T&L support, but no shader support, I should use shaders sparingly, and use the FVF hardware t&l unit as much as possible.
Is this correct or am I missing something?

Thanks

- JQ
Full Speed Games. Coming soon.
~phil
quote:
- Now the (to me) tricky bit:
On cards that support T&L in hardware, but no shaders, I'm basically running the shader on the CPU, and sending the transformed vertices to the card, bypassing the hardware T&L unit, right? By default, vertex buffers are in video memory, so do I need to tell DX to put them in system memory if I'm using a shader?


Yes, yes.

quote:
- with software shaders on non hw-tnl cards, the same should apply, as I'm doing the same thing in software, and the AGP bandwidth consumed is the same either way.


Yes.

quote:
- with software shaders on non hw-tnl cards, the FVF version should be significantly faster, as the AGP bandwidth used is minimal, and I'm using hardware for transformation. If I use shaders, I use the (somewhat slower) transforming routines on the CPU, AND submit the result over the AGP bus.


I assume you made a typo there and mean software shaders on hw T&L cards are slower than FVFs on HW T&L cards. If so, then yep, that's true.

You should be aware that AGP bandwidth has nothing to do with it in most cases! Most HW T&L drivers put FVF vertex buffers in AGP memory anyway, reserving true VRAM for textures and internal stuff.

You should ALSO be aware that SW CPU transformation is NOT much slower than HW T&L transformation, and in some cases is FASTER.

The main point of why HW T&L is good is you can run it in PARALLEL with the CPU, i.e. you set the CPU to work doing other stuff. If your engine design doesn't exploit this parallelism well, then you won't see much difference in frame rate terms between HW and SW T&L, and may even see SW T&L as faster.


quote:
if the hardware has hardware T&L support, but no shader support, I should use shaders sparingly, and use the FVF hardware t&l unit as much as possible


Almost ...

D3D can only be in one mode at a time: SOFTWARE or HARDWARE. So you have to create your D3D device for MIXED vertex processing. To toggle between software (i.e. your vertex shaders) and hardware (i.e. your FVF buffers) you have to toggle the D3DRS_SOFTWAREVERTEXPROCESSING state.

ISTR someone from MS saying that it was a pretty major state which did a lot of stuff - in parts akin to releasing the device and recreating it for the other type of processing!
So if you do design an engine where you intend on switching between software and hardware processing, make sure you keep the number of switches per frame to an absolute minimum (i.e. 2-3 max if possible).
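
One way to keep the switch count down is to group the frame's draw batches by processing mode before submitting them. The following is only a sketch under assumptions: `Batch`, `countModeSwitches` and `groupByProcessingMode` are hypothetical names, no Direct3D calls are made, and the frame is assumed to start in hardware mode. It also ignores any ordering a real renderer needs for other reasons (alpha blending, for instance).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical description of one draw batch.
struct Batch {
    int  id;               // illustrative identifier only
    bool needsSoftwareVP;  // true if the batch uses a CPU-emulated vertex shader
};

// Counts how often the vertex processing mode would have to change if the
// batches were drawn in the given order. Each change corresponds to one
// toggle of D3DRS_SOFTWAREVERTEXPROCESSING on a MIXED device.
std::size_t countModeSwitches(const std::vector<Batch>& batches,
                              bool startInSoftware = false) {
    std::size_t switches = 0;
    bool current = startInSoftware;
    for (const Batch& b : batches) {
        if (b.needsSoftwareVP != current) {
            ++switches;
            current = b.needsSoftwareVP;
        }
    }
    return switches;
}

// Moves all hardware batches to the front and all software batches to the
// back, preserving the relative order inside each group.
void groupByProcessingMode(std::vector<Batch>& batches) {
    std::stable_partition(batches.begin(), batches.end(),
                          [](const Batch& b) { return !b.needsSoftwareVP; });
}
```

With five batches that alternate software/hardware, drawing in submission order needs five mode changes; after `groupByProcessingMode` the same frame needs one.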


FWIW we've taken an initial decision here to go for only 2 paths when we **need** to use shaders:

1) software VP if the chip doesn't do shaders
2) hardware VP if the chip does do shaders


using the MIXED device requires too much work to justify the time it'd take to get totally efficient and for the extra performance we'd gain.
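
The two-path rule can be written down as a tiny selection function. This is a sketch, not anyone's actual engine code: `GpuCaps` and `chooseVertexProcessing` are hypothetical names, and the two booleans are assumed to have been filled in from the device caps beforehand (in Direct3D 8, from fields such as `VertexShaderVersion` and the `D3DDEVCAPS_HWTRANSFORMANDLIGHT` bit of `D3DCAPS8`).

```cpp
// Hypothetical summary of the capabilities the engine cares about.
struct GpuCaps {
    bool hardwareTnL;            // fixed-function T&L in hardware
    bool hardwareVertexShaders;  // vertex shaders in hardware
};

enum class VertexProcessing { Software, Hardware };

// Two-path rule for the case where the engine needs shaders:
// hardware VP if the chip does shaders, software VP otherwise.
// hardwareTnL is deliberately ignored, so a T&L-only card takes
// the software path and no MIXED device is needed.
VertexProcessing chooseVertexProcessing(const GpuCaps& caps) {
    return caps.hardwareVertexShaders ? VertexProcessing::Hardware
                                      : VertexProcessing::Software;
}
```

The cost of this simplification is that a card with hardware T&L but no shaders never uses its T&L unit.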


One strategy which may work and help to preserve that 1-2 D3DRS_SOFTWAREVERTEXPROCESSING limit is to use ProcessVertices() for all your shaders and then pass the post transform buffers to Draw*Primitive() for final processing. Particularly handy if you make the shader output in model space or change the transform the HW does...
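
The shape of such a frame might look roughly like the plan below. It is a sketch only: the steps are plain strings rather than Direct3D calls, `planFrame` is a hypothetical helper, and it assumes that ProcessVertices() is called while software vertex processing is switched on and that the resulting buffers can then be drawn with it switched off.

```cpp
#include <string>
#include <vector>

// Builds an ordered list of frame steps. All shader batches are run through
// the software path in one block, so the mode is toggled at most twice per
// frame regardless of how many shader batches there are.
std::vector<std::string> planFrame(
        const std::vector<std::string>& shaderBatches,
        const std::vector<std::string>& fixedFunctionBatches) {
    std::vector<std::string> steps;
    if (!shaderBatches.empty()) {
        steps.push_back("software VP on");
        for (const std::string& name : shaderBatches) {
            steps.push_back("process " + name);  // stands in for ProcessVertices()
        }
        steps.push_back("software VP off");
    }
    for (const std::string& name : shaderBatches) {
        steps.push_back("draw processed " + name);  // stands in for Draw*Primitive()
    }
    for (const std::string& name : fixedFunctionBatches) {
        steps.push_back("draw " + name);
    }
    return steps;
}
```

A frame with no shader batches produces no mode toggles at all.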


--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com

Simon O'Connor | Technical Director (Newcastle) Lockwood Publishing | LinkedIn | Personal site

quote:Original post by S1CA

I assume you made a typo there and mean software shaders on hw T&L cards are slower than FVFs on HW T&L cards. If so, then yep, that's true.

Yep, that's what I meant. The wording was a bit off, sorry.
One question: do I have to init the Vertex Buffer explicitly as being in system memory?

quote:
You should be aware that AGP bandwidth has nothing to do with it in most cases! Most HW T&L drivers put FVF vertex buffers in AGP memory anyway, reserving true VRAM for textures and internal stuff.

Ah right, but nonlocal->graphics card IS faster than sysmem->card as far as I've understood it, right?

quote:
You should ALSO be aware that SW CPU transformation is NOT much slower than HW T&L transformation, and in some cases is FASTER.
The main point of why HW T&L is good is you can run it in PARALLEL with the CPU, i.e. you set the CPU to work doing other stuff. If your engine design doesn't exploit this parallelism well, then you won't see much difference in frame rate terms between HW and SW T&L, and may even see SW T&L as faster.

Yeah - I'm aware of that fact - and yeah, I'll try to make it run in parallel as much as I possibly can.

quote:
Almost ...

D3D can only be in one mode at a time: SOFTWARE or HARDWARE. So you have to create your D3D device for MIXED vertex processing. To toggle between software (i.e. your vertex shaders) and hardware (i.e. your FVF buffers) you have to toggle the D3DRS_SOFTWAREVERTEXPROCESSING state.

Right, I did know about the mixed thing, but I hadn't thought about this:
quote:
ISTR someone from MS saying that it was a pretty major state which did a lot of stuff - in parts akin to releasing the device and recreating it for the other type of processing!
So if you do design an engine where you intend on switching between software and hardware processing, make sure you keep the number of switches per frame to an absolute minimum (i.e. 2-3 max if possible).

It really makes sense when you come to think of it, and it's pretty obvious as well. Still hadn't thought of it, damn.

quote:
FWIW we've taken an initial decision here to go for only 2 paths when we **need** to use shaders:
1) software VP if the chip doesn't do shaders
2) hardware VP if the chip does do shaders
using the MIXED device requires too much work to justify the time it'd take to get totally efficient and for the extra performance we'd gain.

Right. It probably depends on the amount of shaders I'll be using in the end - if I use lots, I might be better off using ONLY them, but if they're used only occasionally, the hardware TnL might be advantage enough to make up for the time lost with the state switching.

quote:
One strategy which may work and help to preserve that 1-2 D3DRS_SOFTWAREVERTEXPROCESSING limit is to use ProcessVertices() for all your shaders and then pass the post transform buffers to Draw*Primitive() for final processing. Particularly handy if you make the shader output in model space or change the transform the HW does...

That sounds like a fantastic idea. I'll look into ProcessVertices().

You've been a great help - thanks!

- JQ
quote:
One question: do I have to init the Vertex Buffer explicitly as being in system memory?


Sort of, you specify D3DUSAGE_SOFTWAREPROCESSING as one of the Usage flags on the create call.
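
In code, that amounts to OR-ing one extra bit into the Usage argument when the buffer will be fed through software vertex processing. The sketch below is self-contained and does not include the Direct3D headers: `kUsageSoftwareProcessing` is a local stand-in whose value is a placeholder, and `withSoftwareProcessing` is a hypothetical helper. In a real build the flag would be `D3DUSAGE_SOFTWAREPROCESSING` from the SDK headers, and the result would go into the Usage parameter of the vertex buffer create call.

```cpp
#include <cstdint>

// Placeholder bit standing in for D3DUSAGE_SOFTWAREPROCESSING.
// The value is illustrative; use the SDK's definition in real code.
constexpr std::uint32_t kUsageSoftwareProcessing = 1u << 4;

// Returns the usage flags for a vertex buffer, adding the software
// processing bit only when the buffer will be consumed by the CPU path.
std::uint32_t withSoftwareProcessing(std::uint32_t baseUsage,
                                     bool softwareVertexProcessing) {
    return softwareVertexProcessing ? (baseUsage | kUsageSoftwareProcessing)
                                    : baseUsage;
}
```

Keeping the decision in one helper means the hardware and software paths can share the rest of the buffer creation code.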


quote:
Ah right, but nonlocal->graphics card IS faster than sysmem->card as far as I've understood it, right?


Yes. Some newer chips can DMA directly from system memory, though AGP memory is specifically designed for that so should be better for the graphics chip to read.

--
Simon O'Connor
Creative Asylum Ltd
www.creative-asylum.com


Thanks!

- JQ

This topic is closed to new replies.
