Fastest way to clear buffer?



Hi! I'm making a game with my own 3D engine. The game is at a stage where I shouldn't be thinking about optimization yet, but it is already clear that a great deal of time is spent clearing my screen and z-buffers, so I need a very fast way to do this. Currently I'm using memset(). Is there a way to speed up the clearing? I was thinking that because memset works on bytes instead of dwords, and my buffers are aligned to four bytes, I could write my own function that works with dwords (or qwords).
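Something like this is what I have in mind - a minimal sketch of the dword-at-a-time idea, assuming the buffer really is 4-byte aligned and a multiple of 4 bytes long (the function name and parameters are just for illustration); whether it actually beats the compiler's memset is something only a profiler can tell:

#include <cstddef>
#include <cstdint>

// Sketch: fill a 4-byte-aligned buffer one dword at a time.
// 'value' is the 32-bit pattern to store (0 for a black colour buffer,
// the encoded far-plane value for a z-buffer).
void clear_dwords(void* buf, std::size_t size_in_bytes, std::uint32_t value)
{
    std::uint32_t* p = static_cast<std::uint32_t*>(buf);
    std::size_t n = size_in_bytes / 4;   // assumes size is a multiple of 4
    for (std::size_t i = 0; i < n; ++i)
        p[i] = value;
}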

Maybe you could try it with assembler instructions like

rep stosw

or something. Take a look at the asm code of the memcpy function; it's done using dword-aligned copying.

Well, I don't know if that's much help.

quote:
Original post by jamessharpe
If you know that you will draw every pixel in your buffer on every frame, then just overwrite the screen buffer. You don't need to clear it then. Just clear the Z buffer.

James


The game will be a 3D space shooter, so the background is going to be mostly black. I'm probably going to make some kind of background image system so that I don't need to clear the whole buffer, but the clearing and blitting will still take some time.
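The background system would basically amount to overwriting the colour buffer with a pre-rendered image instead of clearing it - a rough sketch, assuming a 32-bit frame buffer of the same size as the background (the names are made up):

#include <cstdint>
#include <cstring>

// Sketch: instead of clearing the colour buffer to black, overwrite it
// each frame with a pre-rendered starfield/background of the same size.
void restore_background(std::uint32_t* screen, const std::uint32_t* background,
                        int width, int height)
{
    std::memcpy(screen, background,
                std::size_t(width) * height * sizeof(std::uint32_t));
}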

I actually found links to fast memcpy articles and I believe I can now make faster routines than the standard memset and memcpy.

[edited by - Atm97fin on December 16, 2003 3:07:23 PM]

quote:
I actually found links to fast memcpy articles and I believe I can now make faster routines than the standard memset and memcpy.

Make sure your compiler and profiler agree with you before blindly using it.
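Even a minimal harness like this (a sketch; the buffer size and iteration count are arbitrary placeholders) will tell you whether the custom routine actually wins on your machine:

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

// Sketch: time memset over many frames' worth of clears; run the same
// loop again with the custom routine and compare the two numbers.
int main()
{
    std::vector<unsigned char> buffer(1024 * 768 * 4);  // pretend 1024x768x32 frame buffer
    const int iterations = 1000;

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        std::memset(&buffer[0], 0, buffer.size());
    const auto t1 = std::chrono::steady_clock::now();

    volatile unsigned char sink = buffer[0];  // keep the work observable
    (void)sink;

    const long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("memset: %lld us for %d clears\n", us, iterations);
    // ...same loop again with the custom clear, then compare.
    return 0;
}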

quote:
Original post by Atm97fin
I actually found links to fast memcpy articles and I believe I can now make faster routines than the standard memset and memcpy.

It's probably not worth it. While you're still debugging it, there's not much point in trying to squeeze out little bits of performance. When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc. with assembly instructions (e.g. rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

If you use custom memset() or memcpy() routines, you're not going to get much of a speed gain, and you prevent the compiler from using its intrinsic functions.
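In other words, a plain call like the one below is usually the best bet - a sketch, assuming VC++ with intrinsics enabled (/O2 /Oi); the compiler is then free to expand it inline instead of calling into the CRT (the function name and z-buffer format are hypothetical):

#include <cstddef>
#include <cstring>

// Sketch: clearing a hypothetical float z-buffer with plain memset.
// With intrinsics enabled, the compiler may expand this inline
// (e.g. as rep stosd) rather than calling the library function.
void clear_zbuffer(float* zbuf, std::size_t count)
{
    std::memset(zbuf, 0, count * sizeof(float));
}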

Have you considered just marking chunks of the z-buffer 'invalid' and testing for that when accessing them? I believe ATI's z-buffer compression scheme does something like this. Clearing would be quite fast.
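Roughly like this - just a sketch of the lazy-clear idea, not ATI's actual scheme; the tile size, far-plane value and names are invented:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: split the z-buffer into tiles and keep one flag per tile.
// "Clearing" the whole buffer is then just a small memset over the flags;
// a tile is lazily filled with the far-plane value the first time it is touched.
struct TiledZBuffer
{
    static const int TILE = 64;              // 64x64 depth values per tile (assumed)
    int width, height, tilesX, tilesY;
    std::vector<float> depth;
    std::vector<std::uint8_t> tileDirty;     // 1 = tile still holds last frame's data

    TiledZBuffer(int w, int h)
        : width(w), height(h),
          tilesX((w + TILE - 1) / TILE), tilesY((h + TILE - 1) / TILE),
          depth(std::size_t(w) * h), tileDirty(std::size_t(tilesX) * tilesY, 1) {}

    // The per-frame "clear" only touches the flags.
    void clear() { std::memset(&tileDirty[0], 1, tileDirty.size()); }

    // Called before the rasterizer reads/writes depth values inside a tile.
    void touchTile(int tx, int ty)
    {
        std::uint8_t& dirty = tileDirty[std::size_t(ty) * tilesX + tx];
        if (!dirty)
            return;
        int x0 = tx * TILE, x1 = std::min(x0 + TILE, width);
        int y0 = ty * TILE, y1 = std::min(y0 + TILE, height);
        for (int y = y0; y < y1; ++y)
            for (int x = x0; x < x1; ++x)
                depth[std::size_t(y) * width + x] = 1.0f;  // far-plane depth (assumed)
        dirty = 0;
    }
};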

quote:
It's probably not worth it. [..]
If you use custom memset() or memcpy() routines, you're not going to get much of a speed gain, and you prevent the compiler from using its intrinsic functions.

Harr! Things have changed since the 286 days. A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.

quote:
When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc. with assembly instructions (e.g. rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

That's some optimization: rep stosd is actually slower than memset (which VC 7.1 implements as a plain loop). The string instructions just suck nowadays (well, one exception: copying around 64-byte-aligned, cached blocks of memory).

Guest Anonymous Poster
quote:
Original post by Jan Wassenberg
Harr! Things have changed since the 286 days. A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.



Mind sharing, for those of us not so familiar with assembly but still interested?

Sure!
With 8088 .. 286 processors, a large part of optimizing was reducing code size; the string instructions (lods, stos, movs, scas) were popular because they were smaller than a corresponding loop. Nowadays they're microcoded and so slow that they're generally avoided.
Faster in this case is:

pxor mm0, mm0          ; mm0 = 0, the 8-byte fill pattern
mov edx, [dst]         ; void* dst; must be 8-byte aligned
mov eax, size_div_64   ; buffer size / 64 (each iteration clears 64 bytes)
l:
movntq [edx], mm0      ; non-temporal stores: bypass the cache and
movntq [edx+8], mm0    ; gather in the write-combine buffer
movntq [edx+16], mm0
movntq [edx+24], mm0
movntq [edx+32], mm0
movntq [edx+40], mm0
movntq [edx+48], mm0
movntq [edx+56], mm0
add edx, 64
dec eax
jnz l
sfence                 ; flush the weakly-ordered write-combine buffer
emms                   ; clear MMX state so FPU code works again

The writes are combined (gathered in a buffer and written out all at once) due to the non-temporal movntq instruction; also, the data is not kept in the cache - doing so would trash other useful data while providing no benefit.
Unrolling 8x makes sense because the Athlon's write-combine buffer is 64 bytes, IIRC. I haven't measured the effects of more or less unrolling.
Note that the writes are weakly ordered (later instructions accessing this memory might be serviced before our transfer has actually been written out), so we need to flush the write-combine buffer via sfence.
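If you'd rather not write inline asm, roughly the same thing can be expressed with compiler intrinsics - a sketch only, assuming an SSE-capable CPU, an 8-byte-aligned destination and a size that is a multiple of 64 (on 64-bit targets, where MMX's __m64 is awkward, SSE2's _mm_stream_si128 is the usual substitute):

#include <mmintrin.h>    // __m64, _mm_setzero_si64, _mm_empty
#include <xmmintrin.h>   // _mm_stream_pi, _mm_sfence
#include <cstddef>

// Sketch: non-temporal clear via intrinsics instead of inline asm.
void clear_nt(void* dst, std::size_t size)
{
    __m64 zero = _mm_setzero_si64();
    __m64* p = static_cast<__m64*>(dst);
    for (std::size_t i = 0; i < size / 64; ++i, p += 8)
    {
        _mm_stream_pi(p + 0, zero);  // movntq: bypass the cache,
        _mm_stream_pi(p + 1, zero);  // gather in the write-combine buffer
        _mm_stream_pi(p + 2, zero);
        _mm_stream_pi(p + 3, zero);
        _mm_stream_pi(p + 4, zero);
        _mm_stream_pi(p + 5, zero);
        _mm_stream_pi(p + 6, zero);
        _mm_stream_pi(p + 7, zero);
    }
    _mm_sfence();   // flush the weakly-ordered stores
    _mm_empty();    // emms: leave MMX state
}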

Atm97fin: when copying large, uncached arrays, block prefetching is significantly faster than those methods: it sustains transfer rates close to the theoretical maximum (which will never be reached because of memory access latency).
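For the copy case, the block-prefetch idea looks roughly like this - a sketch only, assuming SSE, 16-byte-aligned buffers and a size that is a multiple of the block size; the 4 KB block and 64-byte line are tuning guesses, not fixed rules:

#include <xmmintrin.h>   // _mm_prefetch, _mm_load_ps, _mm_stream_ps, _mm_sfence
#include <cstddef>

// Sketch: pull one block of the source into cache with prefetches, then
// stream it to the destination with non-temporal stores.
void copy_block_prefetch(float* dst, const float* src, std::size_t bytes)
{
    const std::size_t BLOCK = 4096;                        // assumed block size in bytes
    for (std::size_t b = 0; b < bytes; b += BLOCK)
    {
        // Pass 1: prefetch the whole block so the copy loop hits the cache.
        for (std::size_t off = 0; off < BLOCK; off += 64)  // 64 = cache line (assumed)
            _mm_prefetch(reinterpret_cast<const char*>(src) + b + off, _MM_HINT_T0);

        // Pass 2: copy the block, bypassing the cache on the store side.
        const float* s = src + b / sizeof(float);
        float* d = dst + b / sizeof(float);
        for (std::size_t off = 0; off < BLOCK / sizeof(float); off += 4)
            _mm_stream_ps(d + off, _mm_load_ps(s + off));
    }
    _mm_sfence();   // make the non-temporal stores globally visible
}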
