Fastest way to clear buffer?

Started by Atm97fin December 16, 2003 08:57 AM

8 comments, last by Atm97fin 20 years, 4 months ago

Atm97fin

136

Author

December 16, 2003 08:57 AM

Hi! I'm making a game with my own 3D engine. The game is at a stage where I shouldn't be thinking about optimizing yet, but it is very clear that a great deal of time is spent clearing my screen and z-buffers, so I need a very fast method for doing this. Currently I'm using memset(). Is there a way of speeding up the clearing? I was thinking that because memset works with bytes instead of dwords and my buffers are aligned to four bytes, I could write my own function that works with dwords (or qwords).
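The dword idea from the question can be sketched in C as below. This is only an illustration: `fill32` is a hypothetical name, and the code assumes the buffer is 4-byte aligned, as the post states. Its main practical advantage over memset() is that it can write a clear value that is not a repeated byte (for example a "far" depth value). Many memset() implementations already write in word-sized or larger chunks internally, so any speed difference has to be measured rather than assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `value` into `count` consecutive 32-bit words.
   Assumes `dst` is 4-byte aligned, as the buffers in the post are said to be. */
static void fill32(uint32_t *dst, uint32_t value, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = value;
}
```

A call such as `fill32(zbuffer, 0xFFFFFFFFu, width * height)` would reset every depth entry to the largest value, assuming a 32-bit integer z-buffer.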

EasyGL - easy to use graphics library.

Madhed

4,095

December 16, 2003 09:07 AM

Maybe you could try it with assembler instructions

rep stosw

or something. Take a look at the asm code of the memcpy function; it's done using dword-aligned copying.

Well, I don't know if that's much help

jamessharpe

497

December 16, 2003 09:40 AM

If you know that you will draw every pixel in your buffer on every frame, then just overwrite the screen buffer. You don't need to clear it then. Just clear the Z buffer.

James

Atm97fin

136

Author

December 16, 2003 02:06 PM

quote:Original post by jamessharpe
If you know that you will draw every pixel in your buffer on every frame, then just overwrite the screen buffer. You don't need to clear it then. Just clear the Z buffer.

James

The game will be a 3D space shooter, so the background is going to be mostly black. I'm probably going to make some kind of background image system so that I don't need to clear the whole buffer. But the clearing and blitting will still take some time.

I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

[edited by - Atm97fin on December 16, 2003 3:07:23 PM]

antareus

576

December 16, 2003 03:03 PM

quote:I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

Make sure your compiler and profiler agree with you before blindly using it.
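One rough way to check this is a small timing harness like the sketch below. The names `clear_fn`, `clear_with_memset`, and `seconds_per_clear` are made up for illustration, and `clock()` is only a coarse timer, so treat the numbers as a first indication and confirm them with a real profiler in an optimized build.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every clear routine under test. */
typedef void (*clear_fn)(void *dst, size_t bytes);

/* Baseline: the standard library routine. */
static void clear_with_memset(void *dst, size_t bytes)
{
    memset(dst, 0, bytes);
}

/* Returns the average CPU time in seconds for one call of `fn`
   over `repeats` calls on the same buffer. */
static double seconds_per_clear(clear_fn fn, void *buf, size_t bytes, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; ++i)
        fn(buf, bytes);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / repeats;
}
```

Time each candidate on a buffer the size of the real screen or z-buffer, and read something back from the buffer afterwards so the compiler cannot discard the stores.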

--God has paid us the intolerable compliment of loving us, in the deepest, most tragic, most inexorable sense.- C.S. Lewis

sbennett

124

December 16, 2003 03:27 PM

quote:Original post by Atm97fin
I actually found links to fast memcopy articles and I believe I can now make faster methods than the standard memset and memcopy.

It's probably not worth it. While you're still debugging it, there's not much point in trying to squeeze out little bits of performance. When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc., with assembly instructions (e.g. rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

If you use custom memset() or memcpy() routines, you're not going to get much speed gain, and you prevent the compiler from creating intrinsic functions.

Jan Wassenberg

1,000

December 16, 2003 04:43 PM

Have you considered just marking chunks of the z-buffer 'invalid', and testing for this when accessing them? I believe ATI's Z-buffer compression scheme does this. Clearing would be quite fast
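A minimal sketch of that "mark chunks invalid" idea, under assumptions of my own: the buffer is split into 16x16 tiles, each tile remembers the frame in which it was last reset, and "clearing" just bumps a frame counter. All names (`ZBuffer`, `zb_clear`, `zb_at`, `zb_test_and_set`) and the sizes are hypothetical, and this says nothing about how any particular hardware implements it.

```c
#include <stddef.h>
#include <stdint.h>

enum { ZB_WIDTH = 320, ZB_HEIGHT = 240, TILE = 16,
       TILES_X = ZB_WIDTH / TILE, TILES_Y = ZB_HEIGHT / TILE };

#define ZB_FAR 0xFFFFFFFFu   /* assumed "infinitely far" depth value */

typedef struct {
    uint32_t depth[ZB_HEIGHT][ZB_WIDTH];
    uint32_t stamp[TILES_Y][TILES_X];  /* frame in which each tile was last reset */
    uint32_t frame;                    /* current frame number */
} ZBuffer;

static void zb_init(ZBuffer *zb)
{
    for (int ty = 0; ty < TILES_Y; ++ty)
        for (int tx = 0; tx < TILES_X; ++tx)
            zb->stamp[ty][tx] = 0;     /* 0 never equals a live frame number */
    zb->frame = 1;
}

/* "Clearing" is O(1): every tile's stamp becomes stale at once.
   (A real version would have to handle counter wrap-around.) */
static void zb_clear(ZBuffer *zb)
{
    ++zb->frame;
}

/* Returns a pointer to the depth entry, resetting its tile first if stale. */
static uint32_t *zb_at(ZBuffer *zb, int x, int y)
{
    int tx = x / TILE, ty = y / TILE;
    if (zb->stamp[ty][tx] != zb->frame) {
        for (int row = ty * TILE; row < (ty + 1) * TILE; ++row)
            for (int col = tx * TILE; col < (tx + 1) * TILE; ++col)
                zb->depth[row][col] = ZB_FAR;
        zb->stamp[ty][tx] = zb->frame;
    }
    return &zb->depth[y][x];
}

/* Standard "nearer wins" depth test; returns 1 if the pixel should be drawn. */
static int zb_test_and_set(ZBuffer *zb, int x, int y, uint32_t z)
{
    uint32_t *p = zb_at(zb, x, y);
    if (z < *p) { *p = z; return 1; }
    return 0;
}
```

The trade-off is an extra comparison on every access. A rasterizer would more likely do the stamp check once per span or per tile, and tiles that are never touched in a frame are never written at all.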

quote:It's probably not worth it. [..]
If you use custom memset() or memcpy() routines, you're not going to get much speed gain, and you prevent the compiler from creating intrinsic functions.

Harr! Things have changed since the 286 days

A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.

quote:When you're not debugging, most optimising compilers will replace memset(), memcpy(), etc., with assembly instructions (eg rep stosd, rep movsd on x86). I know VC++ 6 uses those two; I assume most compilers will do something similar.

That's some optimization: rep stosd is actually slower than memset (implemented as a loop by VC 7.1)

The string instructions just suck nowadays (well, one exception: copying around 64 byte aligned, cached blocks of memory).

E8 17 00 42 CE DC D2 DC E4 EA C4 40 CA DA C2 D8 CC 40 CA D0 E8 40E0 CA CA 96 5B B0 16 50 D7 D4 02 B2 02 86 E2 CD 21 58 48 79 F2 C3

Anonymous

December 16, 2003 04:48 PM

quote:Original post by Jan Wassenberg
Harr! Things have changed since the 286 days A quick test shows an 8-way unrolled movntq loop to be > 3x as fast (!) as memset.

Mind sharing for those of us not so familiar with assembly, but still interested?

Atm97fin

136

Author

December 16, 2003 05:08 PM

movntq
Fast memcopy

[edited by - Atm97fin on December 16, 2003 6:09:16 PM]

Jan Wassenberg

1,000

December 16, 2003 05:47 PM

Sure!
With 8088 .. 286 processors, a large part of optimizing was reducing code size; the string instructions (lods, stos, movs, scas) were popular because they were smaller than a corresponding loop. Nowadays, they're microcoded and so slow that they're generally avoided.
Faster in this case is

	pxor	mm0, mm0
	mov	edx, [dst]		; void* dst;
	mov	eax, size_div_64
l:
	movntq	[edx], mm0
	movntq	[edx+8], mm0
	movntq	[edx+16], mm0
	movntq	[edx+24], mm0
	movntq	[edx+32], mm0
	movntq	[edx+40], mm0
	movntq	[edx+48], mm0
	movntq	[edx+56], mm0
	add	edx, 64
	dec	eax
	jnz	l
	sfence
	emms

The writes are combined (gathered in a buffer, and written out all at once) due to the movNonTemporal instruction; also, the buffer is not kept in cache - doing so would trash other useful data, while providing no benefit.
Unrolling 8x makes sense, because the Athlon's write combine buffer is 64 bytes, IIRC. I haven't measured the effects of more or less unrolling.
Note that the writes are weakly ordered (later instructions accessing this memory might be serviced before our transfer was actually written out), so we need to flush the write-combine buffer via sfence.
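For readers who prefer C to assembly, the same idea can be sketched with compiler intrinsics. This is an illustration under assumptions, not the code from the post: it uses 128-bit SSE2 `_mm_stream_si128` stores instead of the MMX `movntq`, so no `emms` is needed; `clear_nt` is a hypothetical name; the buffer must be 16-byte aligned and its size a multiple of 64 bytes; and on targets without SSE2 it simply falls back to memset().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_STREAM 1
#endif

/* Hypothetical helper: zero `bytes` bytes at `dst` using non-temporal stores.
   Returns 0 on success, -1 if `dst` is not 16-byte aligned or `bytes`
   is not a multiple of 64. */
static int clear_nt(void *dst, size_t bytes)
{
    if (((uintptr_t)dst & 15u) != 0 || (bytes & 63u) != 0)
        return -1;
#ifdef HAVE_SSE2_STREAM
    __m128i zero = _mm_setzero_si128();
    __m128i *p = (__m128i *)dst;
    /* 4 x 16 bytes = 64 bytes per iteration, matching the unrolled loop above. */
    for (size_t i = 0; i < bytes / 64; ++i, p += 4) {
        _mm_stream_si128(p + 0, zero);
        _mm_stream_si128(p + 1, zero);
        _mm_stream_si128(p + 2, zero);
        _mm_stream_si128(p + 3, zero);
    }
    _mm_sfence();  /* make the weakly ordered stores visible before returning */
#else
    memset(dst, 0, bytes);  /* fallback when SSE2 is not available */
#endif
    return 0;
}
```

Whether this beats the library memset() on a given machine depends on the CPU, the buffer size, and whether the data is needed in cache soon afterwards, so it is worth timing both.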

Atm97fin: when copying large, uncached arrays, block prefetching is significantly faster than those methods: it sustains transfer rates close to the theoretical maximum (which will never be reached because of memory access latency).
