Fastest way to clear buffer?

8 comments, last by Atm97fin 20 years, 4 months ago
Hi! I'm making a game with my own 3D engine. The game is in a state where I shouldn't be thinking about optimizing yet, but it is very clear that a great deal of time is spent clearing my screen and z-buffers, so I need a very fast method for doing this. Currently I'm using memset(). Is there a way of speeding up the clearing? I was thinking that because memset works with bytes instead of dwords, and my buffers are aligned to four bytes, I could write my own function that works with dwords (or qwords).
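For reference, the dword idea can be sketched in C roughly like this. It is only a sketch: fill32 is a made-up name, the buffer is assumed to be 4-byte aligned as described above, and an optimizing compiler may well turn the loop into the same code it would emit for memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 32-bit words starting at `dst` with `value`.
   fill32 is a hypothetical helper name, not a standard function. */
void fill32(uint32_t *dst, uint32_t value, size_t count)
{
    /* Assumption from the post: the buffer is 4-byte aligned. */
    assert(((uintptr_t)dst & 3u) == 0);

    for (size_t i = 0; i < count; ++i)
        dst[i] = value;
}
```

One thing this gives that memset does not: the fill value can be any 32-bit pattern (a far-plane depth value, say), not just a repeated byte.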
EasyGL - easy to use graphics library.
Maybe you could try it with assembler instructions such as

rep stosw

or something similar. Take a look at the asm code of memcpy; it's done using dword-aligned copying.

Well, I don't know if that's much help.
If you know that you will draw every pixel in your buffer on every frame, then just overwrite the screen buffer. You don't need to clear it then. Just clear the Z buffer.

James
quote:Original post by jamessharpe
If you know that you will draw every pixel in your buffer on every frame, then just overwrite the screen buffer. You don't need to clear it then. Just clear the Z buffer.

James


The game will be a 3D space shooter, so the background is going to be mostly black. I'm probably going to make some kind of background image system, so that I don't need to clear the whole buffer. But the clearing and blitting will still take some time.

I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

[edited by - Atm97fin on December 16, 2003 3:07:23 PM]
quote:I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

Make sure your compiler and profiler agree with you before blindly using it.
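A rough way to check is to time both versions on a buffer of the real size. The harness below is only a sketch: time_clear, clear_fn and clear_memset are made-up names, and clock() measures CPU time at coarse resolution, so use many repetitions and treat the numbers as a hint rather than a replacement for a profiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Signature shared by every clear routine under test (hypothetical). */
typedef void (*clear_fn)(void *buf, size_t size);

/* Baseline: the standard library routine. */
void clear_memset(void *buf, size_t size)
{
    memset(buf, 0, size);
}

/* Returns the average CPU seconds per call of `fn` over `reps` calls. */
double time_clear(clear_fn fn, void *buf, size_t size, int reps)
{
    volatile unsigned char sink = 0;
    assert(size > 0 && reps > 0);

    clock_t start = clock();
    for (int i = 0; i < reps; ++i) {
        fn(buf, size);
        /* Read one byte back so the stores stay observable. */
        sink = ((unsigned char *)buf)[size - 1];
    }
    clock_t end = clock();

    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Pass each candidate routine through time_clear with the same buffer and repetition count, in an optimized build, and compare.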
--God has paid us the intolerable compliment of loving us, in the deepest, most tragic, most inexorable sense.- C.S. Lewis
quote:Original post by Atm97fin
I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

It's probably not worth it. While you're still debugging it, there's not much point in trying to squeeze out little bits of performance. When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc., with assembly instructions (e.g. rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

If you use custom memset() or memcpy() routines, you're not going to get much speed gain, and you prevent the compiler from creating intrinsic functions.
Have you considered just marking chunks of the z-buffer 'invalid', and testing for this when accessing them? I believe ATI's Z-buffer compression scheme does this. Clearing would be quite fast.
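One way the 'invalid chunk' idea could look in a software z-buffer is sketched below. This is not ATI's scheme, just an illustration: the ZBuffer layout, the zb_* names, the tile size and the buffer dimensions are all assumptions. Each tile of rows carries a frame stamp; clearing only bumps the frame counter, and a tile is actually filled the first time it is touched in a frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dimensions, chosen small for illustration. */
enum {
    ZB_WIDTH     = 64,
    ZB_HEIGHT    = 64,
    ZB_TILE_ROWS = 8,
    ZB_TILES     = ZB_HEIGHT / ZB_TILE_ROWS
};
#define ZB_FAR 0xFFFFFFFFu  /* "infinitely far" depth value */

typedef struct {
    uint32_t depth[ZB_WIDTH * ZB_HEIGHT];
    uint32_t stamp[ZB_TILES]; /* frame in which each tile was last filled */
    uint32_t frame;           /* current frame number */
} ZBuffer;

void zb_init(ZBuffer *zb)
{
    for (int t = 0; t < ZB_TILES; ++t)
        zb->stamp[t] = 0;
    zb->frame = 1;            /* differs from every stamp: all tiles invalid */
}

/* "Clear" in O(1): moving to a new frame invalidates every tile.
   Caveat: after 2^32 frames the counter wraps and a stale stamp could match. */
void zb_clear(ZBuffer *zb)
{
    zb->frame++;
}

/* Fill the tile containing row y the first time it is touched this frame. */
static void zb_touch(ZBuffer *zb, int y)
{
    int t = y / ZB_TILE_ROWS;
    if (zb->stamp[t] != zb->frame) {
        uint32_t *p = zb->depth + (size_t)t * ZB_TILE_ROWS * ZB_WIDTH;
        for (size_t i = 0; i < (size_t)ZB_TILE_ROWS * ZB_WIDTH; ++i)
            p[i] = ZB_FAR;
        zb->stamp[t] = zb->frame;
    }
}

/* Depth test and write. Returns 1 if z is nearer than the stored depth. */
int zb_test_and_set(ZBuffer *zb, int x, int y, uint32_t z)
{
    assert(x >= 0 && x < ZB_WIDTH && y >= 0 && y < ZB_HEIGHT);
    zb_touch(zb, y);

    uint32_t *p = &zb->depth[(size_t)y * ZB_WIDTH + x];
    if (z < *p) {
        *p = z;
        return 1;
    }
    return 0;
}
```

The validity check costs something on every access, so in a real rasterizer it would more likely be done once per span or per tile than per pixel. Tiles that are never drawn into are never cleared at all, which is where the saving comes from.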

quote:It's probably not worth it. [..]
If you use custom memset() or memcpy() routines, you're not going to get much speed gain, and you prevent the compiler from creating intrinsic functions.

Harr! Things have changed since the 286 days. A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.

quote:When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc., with assembly instructions (eg rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

That's some optimization: rep stosd is actually slower than memset (implemented as a loop by VC 7.1). The string instructions just suck nowadays (well, one exception: copying around 64-byte-aligned, cached blocks of memory).
E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3
quote:Original post by Jan Wassenberg
Harr! Things have changed since the 286 days. A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.


Mind sharing for those of us not so familiar with assembly, but still interested?
movntq
Fast memcopy

[edited by - Atm97fin on December 16, 2003 6:09:16 PM]
Sure!
With 8088 .. 286 processors, a large part of optimizing was reducing code size; the string instructions (lods, stos, movs, scas) were popular because they were smaller than a corresponding loop. Nowadays, they're microcoded and so slow that they're generally avoided.
Faster in this case is
	pxor	mm0, mm0
	mov	edx, [dst]		; void* dst;
	mov	eax, size_div_64
l:
	movntq	[edx], mm0
	movntq	[edx+8], mm0
	movntq	[edx+16], mm0
	movntq	[edx+24], mm0
	movntq	[edx+32], mm0
	movntq	[edx+40], mm0
	movntq	[edx+48], mm0
	movntq	[edx+56], mm0
	add	edx, 64
	dec	eax
	jnz	l
	sfence
	emms

The writes are combined (gathered in a buffer and written out all at once) because movntq is a non-temporal move; also, the buffer is not kept in cache, since doing so would evict other useful data while providing no benefit.
Unrolling 8x makes sense, because the Athlon's write-combine buffer is 64 bytes, IIRC. I haven't measured the effects of more or less unrolling.
Note that the writes are weakly ordered (later instructions accessing this memory might be serviced before our transfer has actually been written out), so we need to flush the write-combine buffer via sfence.
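For those who would rather stay in C, roughly the same thing can be written with compiler intrinsics. This sketch is not the loop above: it swaps the MMX movntq for its SSE2 counterpart (_mm_stream_si128, i.e. movntdq), so each store covers 16 bytes, four stores cover 64 bytes, and no emms is needed. clear_stream and HAVE_SSE2_STREAM are made-up names; the destination is assumed to be 16-byte aligned and the size a multiple of 64. Where SSE2 is not available at compile time it simply falls back to memset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_setzero_si128 */
#define HAVE_SSE2_STREAM 1
#endif

/* Zero `size` bytes at `dst` using non-temporal stores where possible.
   Assumptions: dst is 16-byte aligned, size is a multiple of 64. */
void clear_stream(void *dst, size_t size)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert((size & 63u) == 0);

#ifdef HAVE_SSE2_STREAM
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;

    /* 4 x 16 bytes = 64 bytes per iteration. */
    for (size_t i = 0; i < size / 64; ++i, p += 4) {
        _mm_stream_si128(p + 0, zero);
        _mm_stream_si128(p + 1, zero);
        _mm_stream_si128(p + 2, zero);
        _mm_stream_si128(p + 3, zero);
    }

    /* Non-temporal stores are weakly ordered; fence before anyone reads. */
    _mm_sfence();
#else
    memset(dst, 0, size);  /* portable fallback */
#endif
}
```

Whether this beats memset on a given machine and compiler is something to measure; modern library memset implementations may already use non-temporal stores for large sizes.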

Atm97fin: when copying large, uncached arrays, block prefetching is significantly faster than those methods: it sustains transfer rates close to the theoretical maximum (which will never be reached because of memory access latency).
