Odd texturing performance hit (DirectX 8)

After seeing some odd performance scaling in the rendering engine I'm working on, I decided to run some serious tests to figure out what's going on. The test conditions are as follows:

- Only one call to SetTexture is made per frame, regardless of the number of objects.
- SetStreamSource and SetVertexShader are each called only once per test series (i.e. I might run 10 tests with different object counts at 1000 frames each, but I still only ever call SetStreamSource or SetVertexShader once).
- All SetRenderState calls go through a render-state object that buffers them, so no state is updated unnecessarily.
- Textures are explicitly loaded into the managed pool (someone raised this in a reply, so I've added it here).

All times are in seconds. Here's the data from my first series of tests:

             No Texture      Default Texture     Forced 256*256
                             (512*512)           A4R4G4B4
             (SELECTARG2)    (SELECTARG1)        (SELECTARG1)
             1000 frames     1000 frames         1000 frames
             ------------    ---------------     --------------
10 cubes     1.406           2.938               2.125
20 cubes     1.500           4.531               2.985
30 cubes     2.047           6.125               3.828
40 cubes     1.844           7.765               4.672
50 cubes     1.938           9.438               5.531
60 cubes     2.062           11.031              6.406
70 cubes     2.172           12.641              7.235
80 cubes     2.312           14.312              8.109
90 cubes     2.454           15.985              9.000
100 cubes    2.758           17.609              9.844

There's a HUGE hit for adding textures. Why? I'd expect some hit, but nothing of this magnitude, especially considering how consistently small the untextured cube times are. So, to continue the investigation, I set the z-buffer test to NEVER. The results are as follows:

             No Texture      Default Texture     Forced 256*256
                             (512*512)           A4R4G4B4
             (SELECTARG2)    (SELECTARG1)        (SELECTARG1)
             1000 frames     1000 frames         1000 frames
             ------------    ---------------     --------------
10 cubes     1.438           2.234               1.625
20 cubes     1.531           3.156               1.921
30 cubes     2.125           4.141               2.250
40 cubes     1.922           5.078               2.625
50 cubes     2.078           6.047               2.938
60 cubes     2.188           7.000               3.266
70 cubes     2.359           7.969               3.609
80 cubes     2.516           8.937               2.953
90 cubes     2.671           9.891               4.281
100 cubes    2.844           10.859              4.625

Assuming I typed those all in right, if you do the math you'll find an overhead of about 0.08 s per object per 1000-frame test for the default texture (roughly 80 microseconds per object per frame), and about 0.0178 s per object per test when the texture is forced to 256*256 at A4R4G4B4. That's roughly a factor of 4 between the two overheads (0.08 / 0.0178 ≈ 4.5), and each one stays consistent across object counts.

Yet in this last test set NO writes are ever sent to the color buffer, and I would assume that bypasses the texture stages, since the z test has already failed by the time they would be needed. The only difference between the untextured and textured runs is the COLOROP: SELECTARG1 (the texture) versus SELECTARG2 (there are literally no other changes made before compiling).

The upshot of all this testing is that I have an overhead for every DrawIndexedPrimitive call that MIGHT draw a texture (if the z test ever passed). That overhead is consistent, it scales with the number of calls, and it is directly related to the amount of memory required to hold the texture. Other tests I ran indicate that it does NOT scale with the amount of geometry rendered per DrawIndexedPrimitive call.

So, does anyone know what gives? Is there some render state I'm ignorant of that would stop useless internal texture loads on my video card (a GF2)? My engine purposely sorts objects so that those with the same texture are drawn in groups, which is a useless optimization if the texture gets reloaded regardless.

Thanks in advance if you got through that whole thing. I showed the results to a friend of mine who's into game programming, and he had exactly the same reaction as me: this isn't right.

Sorrow

Edited by - Sorrow on February 22, 2002 11:11:03 PM
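The per-object overheads quoted above can be reproduced from the z-NEVER table. The post doesn't say how they were derived; this sketch assumes an endpoint slope (change in time from 10 to 100 cubes, divided by 90), which also sidesteps the suspicious 80-cube entry in the forced column, and then subtracts the untextured slope.

```python
# Z test = NEVER table: seconds per 1000 frames at 10 and 100 cubes.
# Using only the endpoints is an assumption about how the post's figures were computed.
TIMES = {
    "no_texture": (1.438, 2.844),
    "default_512": (2.234, 10.859),
    "forced_256_a4r4g4b4": (1.625, 4.625),
}
N_LOW, N_HIGH = 10, 100
FRAMES = 1000


def slope(column):
    """Extra seconds per added cube, per 1000-frame test, from the two endpoints."""
    t_low, t_high = TIMES[column]
    return (t_high - t_low) / (N_HIGH - N_LOW)


def texture_overhead(column):
    """Extra seconds per cube per test compared with the untextured run."""
    return slope(column) - slope("no_texture")


overhead_default = texture_overhead("default_512")
overhead_forced = texture_overhead("forced_256_a4r4g4b4")

for name, value in (("default 512*512", overhead_default),
                    ("forced 256*256 A4R4G4B4", overhead_forced)):
    per_frame_us = value / FRAMES * 1e6  # seconds per test -> microseconds per frame
    print(f"{name}: {value:.4f} s per cube per test, {per_frame_us:.1f} us per cube per frame")

print(f"ratio: {overhead_default / overhead_forced:.2f}")
```

A least-squares fit over all ten rows would be more robust in general, but here it would be pulled around by the 80-cube outlier, which is why the endpoints were used.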

what card/hardware etc?
and: why not turn color-writes on, maybe you get more realistic results then, hm?


--- foobar
We push more polygons before breakfast than most people do in a day

:: looks at foobar and tries to figure out if he was joking ::

Umm... I set the z-buffer test to NEVER (color writes WERE happening in the first test set) to show that the performance difference isn't due to actually drawing with the texture, but to the act of making the texture available to be drawn from. That's a very important distinction. The texture is never read for pixel writes because there ARE no pixel writes, yet the drop in performance remains.

Also, explicitly turning color buffer writes off might have an entirely different result than just setting the z buffer test to fail.

The hardware is a GeForce2 Pro with 64 MB of memory. The SDK I'm using is DX 8.0. I'm using VC 5.0, but passing my code to a friend who's using VC 6.0 doesn't seem to make any difference in how it performs. The operating system is XP. The processor is an Athlon XP 1800+ with 750 MB of memory.

Oh, and the DX runtime is 8.1, of course.



Edited by - Sorrow on February 22, 2002 9:48:45 PM

Maybe your textures are stored in system memory?
Same texture format for both?
Running in a window or not?
Or it could be a bandwidth thing. Perhaps the driver is still doing calculations on the pixel even though the z-cmp failed. Then it will discard the pixel later in the pipeline.

>>>Maybe your textures are stored in system memory?

I'm using the managed pool. With 64 MB of video memory I can't think of a single reason the driver would store it anywhere other than on the video card. If someone can think of one, I'm all ears.

>>>>Same texture format for both?
I'm not sure how the first texture is stored in memory. I'm using the vanilla texture load function instead of the Ex version. Paint reports the file as a 24-bit bitmap, but for all I know an alpha channel gets added when it's loaded.

That said, when I force the pixel format to A8R8G8B8 it goes even slower, which suggests to me that it normally gets shrunk to 16 bits rather than expanded to 32 (odd, but that's the indication).

With a forced texture format of A8R8G8B8, 100 cubes, and the z test set to LESSEQUAL (i.e. drawing enabled), 1000 frames take 23 seconds, 6 seconds worse than when no format is forced.
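If the default 512*512 texture really does end up at 16 bits, its size lines up with the roughly 4x overhead ratio from the z-NEVER table, and A8R8G8B8 doubles it again. A quick size calculation; the on-card format of the default texture is unknown, so R5G6B5 below is a hypothetical 16-bit stand-in, and driver padding is ignored.

```python
# Bytes per texel for the formats discussed. R5G6B5 is an assumed 16-bit format
# for the "default" texture; the thread doesn't say what the driver actually chose.
BYTES_PER_TEXEL = {"R5G6B5": 2, "A4R4G4B4": 2, "A8R8G8B8": 4}


def texture_bytes(width, height, fmt, full_mip_chain=False):
    """Bytes for the top level only, or for the whole mip chain down to 1x1."""
    total = 0
    w, h = width, height
    while True:
        total += w * h * BYTES_PER_TEXEL[fmt]
        if not full_mip_chain or (w == 1 and h == 1):
            return total
        w, h = max(1, w // 2), max(1, h // 2)


small = texture_bytes(256, 256, "A4R4G4B4")
default16 = texture_bytes(512, 512, "R5G6B5")
default32 = texture_bytes(512, 512, "A8R8G8B8")
print(small, default16, default32)
print(default16 / small, default32 / small)
```

Adding a full mip chain grows each total by about a third, so the 4x and 8x ratios barely move either way.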

>>>>>Running in a window or not?

Full Screen

>>>>Or it could be a bandwidth thing. Perhaps the driver
>>>>is still doing calculations on the pixel even though
>>>>the z-cmp failed. Then it will discard the pixel later
>>>>in the pipeline.

Isn't the advantage of the z-buffer that it lets you skip those calculations when the z test fails? It would seem odd to me to make a video card that does unnecessary pixel calculations. That aside, even that would only explain why the problem persists when I set the z test to NEVER. It wouldn't explain why the problem exists in the first place. (Remember, even when drawing, at 100 cubes it takes up to 9 times longer to draw the cubes with the texture than without.)

Edited by - Sorrow on February 22, 2002 10:20:28 PM

Well, I have no idea really. I'd have to see some code to know for sure. Sounds weird. Hopefully you'll find the solution. Keep us updated.

I'd actually be curious to see someone else run a similar test and see whether they get similar results. It MIGHT be a legitimate hit for using small textured objects.

Something to add from another test I just did: the overhead apparently runs concurrently with other work, and it gets hidden when a large enough amount of geometry is rendered from a single buffer. If I run the same tests on a 2600-polygon ship I get the following results (again, everything is for 1000 frames of the given case):

100 ships -- No texture: 19.125
100 ships -- 256*256 A4R4G4B4 texture: 26.813
100 ships -- No texture, z test set to NEVER: 18.953
100 ships -- 256*256 A4R4G4B4 texture, z test set to NEVER: 18.953
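Subtracting the untextured ship times from the textured ones makes the masking effect concrete: with the z test enabled the texture still costs several seconds per 1000 frames, but with the z test set to NEVER the difference disappears entirely, unlike the cube runs. The figures are the ones listed above; the dictionary layout is just for illustration.

```python
# Seconds per 1000 frames for 100 ships of 2600 polygons each (from the post).
ships = {
    ("untextured", "z enabled"): 19.125,
    ("textured", "z enabled"): 26.813,
    ("untextured", "z NEVER"): 18.953,
    ("textured", "z NEVER"): 18.953,
}
OBJECTS, FRAMES = 100, 1000

# Extra time attributable to the texture in each z mode.
texture_cost = {
    z_mode: ships[("textured", z_mode)] - ships[("untextured", z_mode)]
    for z_mode in ("z enabled", "z NEVER")
}

for z_mode, extra in texture_cost.items():
    per_ship_frame_us = extra / (OBJECTS * FRAMES) * 1e6
    print(f"{z_mode}: texture adds {extra:.3f} s ({per_ship_frame_us:.1f} us per ship per frame)")
```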

And before someone says "there you go then, just don't render useless vertex buffers with only 12 polys in them", let me do the math. The 100 cubes take 17.609 seconds to render as soon as you add the texture. If you assume that the polygon-to-horsepower ratio stays constant (which I know it doesn't, but bear with me), then:

100 ships, no texture: 19.125 - screen draw (let's say about 2 seconds, since that's the floor for every single test) = 17.125 seconds of polygon rendering

2600 / 17.125 ≈ 151.8 polygons per object per second of test time

With the z test set to NEVER, we had 10.859 seconds for the default-texture cubes and 4.625 seconds for the 256*256 A4R4G4B4 cubes. The difference from the untextured version is about 8 seconds in the first case and about 2 seconds in the second.

This means the overhead in the first case takes as long as rendering about 1,215 polygons per object (8 × 151.8), and in the second case about 304 polygons per object (2 × 151.8). This suggests you can never get better textured performance than rendering that many polygons per call, simply because it takes that many to mask this concurrent texture "overhead" of unknown nature. And God help you if you render fewer than a thousand polygons with a 512*512 texture set =O
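The same polygon-equivalent estimate as a short script, keeping the post's own rough assumptions (a fixed ~2 s per test and constant polygon throughput) but using the unrounded z-NEVER differences (8.015 s and 1.781 s) instead of 8 s and 2 s. With the unrounded numbers the forced-texture figure comes out nearer 270 polygons than 300.

```python
# Inputs from the post. FIXED_COST is the poster's rough "screen draw" floor, not a measurement.
SHIP_POLYS = 2600
SHIP_UNTEXTURED = 19.125   # s per 1000 frames, 100 ships, z enabled
FIXED_COST = 2.0           # assumed per-test floor

CUBE_NO_TEX = 2.844        # z NEVER, 100 cubes
CUBE_DEFAULT_TEX = 10.859
CUBE_FORCED_TEX = 4.625

# Polygons per object that one second of test time corresponds to (constant throughput assumed).
polys_per_second = SHIP_POLYS / (SHIP_UNTEXTURED - FIXED_COST)


def polygon_equivalent(textured_time):
    """Express the texturing overhead as an equivalent per-object polygon count."""
    return (textured_time - CUBE_NO_TEX) * polys_per_second


equiv_default = polygon_equivalent(CUBE_DEFAULT_TEX)
equiv_forced = polygon_equivalent(CUBE_FORCED_TEX)
print(f"{polys_per_second:.1f} polys/s, default ~{equiv_default:.0f}, forced ~{equiv_forced:.0f}")
```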

Now maybe this really is the case, and maybe anything significantly small can't be rendered textured from its own vertex buffer without incurring that kind of huge overhead (even when you're reusing the same texture over and over for each buffer, and even though the untextured speeds are blistering). I find it hard to believe, but maybe that's just the way it is and these results are legitimate. If so, I need to know that too.

Right now I'd probably get more performance if I manually took all 100 cubes, disabled the world matrix, applied each cube's world transformation myself, and copied each transformed cube into a new vertex buffer, then rendered THAT vertex buffer of around 1200 polygons. It'd be horribly painful, but the overhead would go away, and that would probably outweigh the cost of the software transform and the vertex buffer update needed to build the transformed copy (which would probably be hidden by concurrency if I rendered something else first). What's more, since I'm writing a multipass rendering engine and the buffer of transformed cubes would only have to be built once per frame rather than once per pass, it becomes an even more feasible approach (4 out of 5 passes require textures).
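A minimal sketch of that pre-transform-and-batch idea, in plain Python rather than Direct3D code: each cube's positions are multiplied by its world matrix (row-vector convention, as Direct3D uses, with the translation in the fourth row) and appended to one shared vertex list, with indices offset by the running vertex count, so the world transform could stay at identity for a single draw. The cube mesh and the `translation`, `transform_point` and `batch_meshes` helpers are hypothetical; normals and texture coordinates are left out (normals would need the inverse-transpose of each matrix).

```python
# Hypothetical unit cube: 8 shared corner positions, 12 triangles (36 indices).
CUBE_VERTICES = [
    (-1, -1, -1), (1, -1, -1), (1, 1, -1), (-1, 1, -1),
    (-1, -1, 1), (1, -1, 1), (1, 1, 1), (-1, 1, 1),
]
CUBE_INDICES = [
    0, 1, 2, 0, 2, 3,  # z = -1 face
    5, 4, 7, 5, 7, 6,  # z = +1 face
    4, 0, 3, 4, 3, 7,  # x = -1 face
    1, 5, 6, 1, 6, 2,  # x = +1 face
    3, 2, 6, 3, 6, 7,  # y = +1 face
    4, 5, 1, 4, 1, 0,  # y = -1 face
]


def translation(tx, ty, tz):
    """4x4 translation in row-vector layout (offset stored in the fourth row)."""
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [tx, ty, tz, 1]]


def transform_point(p, m):
    """Row vector [x, y, z, 1] times m; assumes an affine matrix, so w stays 1."""
    x, y, z = p
    return tuple(x * m[0][c] + y * m[1][c] + z * m[2][c] + m[3][c] for c in range(3))


def batch_meshes(vertices, indices, world_matrices):
    """Copy one mesh per world matrix into a single pre-transformed vertex/index list."""
    out_vertices, out_indices = [], []
    for world in world_matrices:
        base = len(out_vertices)  # offset this copy's indices past earlier copies
        out_vertices.extend(transform_point(v, world) for v in vertices)
        out_indices.extend(base + i for i in indices)
    return out_vertices, out_indices


# 100 cubes in a row, 3 units apart.
worlds = [translation(3 * i, 0, 0) for i in range(100)]
verts, idx = batch_meshes(CUBE_VERTICES, CUBE_INDICES, worlds)
print(len(verts), len(idx) // 3)
```

With 8 shared vertices per cube this gives 800 vertices and 1200 triangles; a cube with per-face texture coordinates would need 24 vertices each instead, which still fits easily within 16-bit indices.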

Actually, what originally got me running these tests was the massive LACK of difference between the frame rates I saw for equal numbers of 2600-poly ships and 12-poly cubes.

Michael


Edited by - Sorrow on February 23, 2002 12:56:39 PM
