
BeanDog

How to do fast ASM blt? -no asm xp


You know, I am always reading things like, "Yeah, BltFast is the best unless you do the fast ASM blit." and such. Well, HOW DO YOU DO A FAST ASM BLIT? I've tried and it is REALLY slow! (Believe me, you don't want to see my code.)

Please help me out, I need a really fast full-screen regular blit, no effects or anything.

~BenDilts( void );

Guest Anonymous Poster
Seriously, what's the point? What could you possibly be trying to do that BltFast isn't fast enough for? It's fast enough for 99% of game developers.

An ASM blit means that you copy pixels to the screen 32 bits at a time using only CPU registers, with no references to memory except the bitmap's address and the video memory address (0xA0000 for VGA modes under DOS). I'm guessing you're copying to the screen ONE byte at a time, using something like REP MOVSB. You can copy four bytes at a time using REP MOVSD: the B is for Byte, and the D is for Doubleword. On processors these days each step costs about the same either way, so MOVSD can be up to four times faster for the same amount of data.

-Ender Wiggin
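To make the byte-versus-doubleword point concrete, here is a minimal portable C++ sketch rather than real assembly. The function names are made up for illustration, and the second routine assumes the byte count is a multiple of 4. A compiler will typically turn the fixed-size memcpy calls into single 32-bit loads and stores, which is roughly what one MOVSD step does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// One byte per iteration, roughly what REP MOVSB does.
void copy_bytes(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];
    }
}

// Four bytes per iteration, roughly what REP MOVSD does.
// Assumption: n is a multiple of 4 (checked by the assert).
void copy_dwords(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        std::uint32_t v;
        std::memcpy(&v, src + i, 4);   // one 32-bit load
        std::memcpy(dst + i, &v, 4);   // one 32-bit store
    }
}
```

Both routines produce the same bytes; the second simply runs a quarter as many iterations.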

I've found the MMX instructions to be slightly faster (they copy 8 bytes at a time instead of 4), though it depends a bit on the processor. Also, in DirectX your video memory won't be at a fixed address like that. You can only access the primary surface's memory directly if you're in exclusive mode (as far as I know). Otherwise you'll have to blt to a client surface or some other secondary surface in video memory. If you need a pointer to something in video memory, use Lock.
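As a rough sketch of what copying through a locked pointer looks like, here is a portable C++ version that moves 8 bytes per step (the width of an MMX movq) and respects the pitch of each surface. LockedSurface and copy_rows_qword are hypothetical names standing in for the pointer and pitch that Lock reports, not DirectDraw types, and the row size is assumed to be a multiple of 8 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the pointer and pitch that Lock reports.
struct LockedSurface {
    std::uint8_t* bits;   // first byte of the top row
    std::size_t   pitch;  // bytes from one row to the next (may exceed the visible width)
};

// Copy `rows` rows of `row_bytes` bytes each, 8 bytes per step.
void copy_rows_qword(const LockedSurface& dst, const LockedSurface& src,
                     std::size_t row_bytes, std::size_t rows) {
    assert(row_bytes % 8 == 0);
    assert(row_bytes <= dst.pitch && row_bytes <= src.pitch);
    for (std::size_t y = 0; y < rows; ++y) {
        const std::uint8_t* s = src.bits + y * src.pitch;
        std::uint8_t*       d = dst.bits + y * dst.pitch;
        for (std::size_t x = 0; x < row_bytes; x += 8) {
            std::uint64_t q;
            std::memcpy(&q, s + x, 8);   // 64-bit load
            std::memcpy(d + x, &q, 8);   // 64-bit store
        }
    }
}
```

Stepping by the pitch instead of the row width matters because the two surfaces may pad their rows differently.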

Guest Anonymous Poster
quote:
"Yeah, BltFast is the best unless you do the fast ASM blit."

BeanDog,

Nothing is faster than a hardware blt (BltFast), and asm IS software, no matter how you slice it.

Most cards don't support hardware blits from system memory to video memory. So if the surface you want to copy from is in system memory, it's faster to use asm.

Edited by - blue-lightning on June 17, 2000 10:47:59 AM

Anonymous Poster: BltFast doesn't always run in hardware, by the way, and it certainly won't if, as already mentioned, the surface the guy wants to copy is in system memory. Furthermore, there's nothing DirectX does that you can't do with ASM anyway, even BltFast.

My personal advice:
Use BltFast if you are copying hardware->hardware memory.
Use ASM if you are copying system->hardware memory, preferably with the MMX instructions (though they aren't that much faster, because of the 32-bit bus or something like that).

BltFast is damn slow for RAM->VRAM and RAM->RAM.
I've done some benchmarking here: blitting a 128x128 image 1000 times, RAM->RAM. Look at the results:
BltFast : 270 ms
My MMX blit routine : 32 ms

The best thing to do is write your own asm blitters and blit all sprites to a back surface in RAM. Once that's all done, blit this back surface to the DirectDraw primary surface in VRAM.
John Carmack used this approach on Quake.
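The compose-in-RAM approach can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: Image, blit and present are hypothetical names, the "primary surface" is simulated by a plain byte buffer with its own pitch, and memcpy stands in for a hand-written asm blitter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// A plain 16 bpp image in system memory (hypothetical type).
struct Image {
    int width, height;
    std::vector<std::uint16_t> pixels;  // row-major, no padding
    Image(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h, 0) {}
};

// Opaque blit of a whole sprite to (dx, dy); the sprite must fit inside dst.
void blit(Image& dst, const Image& sprite, int dx, int dy) {
    assert(dx >= 0 && dy >= 0);
    assert(dx + sprite.width <= dst.width && dy + sprite.height <= dst.height);
    for (int y = 0; y < sprite.height; ++y) {
        std::memcpy(&dst.pixels[static_cast<std::size_t>(dy + y) * dst.width + dx],
                    &sprite.pixels[static_cast<std::size_t>(y) * sprite.width],
                    static_cast<std::size_t>(sprite.width) * sizeof(std::uint16_t));
    }
}

// One full-frame copy from the system back buffer to the simulated primary surface.
// `pitch` is in bytes, as a locked surface would report it.
void present(std::uint8_t* primary, std::size_t pitch, const Image& back) {
    const std::size_t row_bytes =
        static_cast<std::size_t>(back.width) * sizeof(std::uint16_t);
    assert(row_bytes <= pitch);
    for (int y = 0; y < back.height; ++y) {
        std::memcpy(primary + y * pitch,
                    &back.pixels[static_cast<std::size_t>(y) * back.width],
                    row_bytes);
    }
}
```

All the small sprite blits stay in system RAM, and video memory is touched only once per frame, in present.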

I'm using an MMX-enhanced ASM blitting routine and it seems to work great. I've never gotten lower than 60 fps (that's what I'm limiting it to).

The bigger reason I'm using this is that if you want your game to work on older computers (200 MHz with 4 MB of video memory), you can't possibly fit all of your graphics into VRAM. I'm making an RTS game, and once you've got 60 different units, all the tiles, UI, etc., you've got much more than 4 MB of graphics, keeping in mind that the primary and back buffer live in that 4 MB of VRAM too. So I have a primary and back buffer in VRAM and a system back buffer in system RAM. I compose my system back buffer entirely sysram to sysram with my MMX ASM blit routine, then blit the system back buffer to the real back buffer and flip. (Remember, flipping is almost always just switching the pointers to the two surfaces, so it's quick.)

I am running in 800x600 at 16 bpp with an axonometric (isometric) perspective, which requires a lot of transparent blits, and those take more time. I fit about 360 tiles on the screen, all transparent blits of a 33x64 rectangle with transparent corners, as well as UI and such, and I'm easily maintaining 60 fps with only my primary and back buffer in VRAM. Nothing except flipping is done in hardware. So the benefit here is that I can load as many graphics as there is system RAM. Most people should have at least 32 MB, and when Windows isn't doing anything it doesn't hog that much, so I can still load in excess of 22 MB of graphics, which you normally can't do on a 200 MHz Pentium with a 4 MB graphics card.
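For readers who haven't written one, a transparent (color-keyed) blit at 16 bpp boils down to the loop below. This is a plain C++ sketch, not the MMX routine described above: blit_colorkey is a hypothetical name, the key value is whatever color the artwork reserves for transparency, and pitches are given in pixels for brevity (a locked DirectDraw surface reports them in bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy a w x h block of 16 bpp pixels, skipping every source pixel equal to `key`.
// Pitches are in pixels here, not bytes.
void blit_colorkey(std::uint16_t* dst, std::size_t dst_pitch,
                   const std::uint16_t* src, std::size_t src_pitch,
                   std::size_t w, std::size_t h, std::uint16_t key) {
    assert(w <= dst_pitch && w <= src_pitch);
    for (std::size_t y = 0; y < h; ++y) {
        for (std::size_t x = 0; x < w; ++x) {
            const std::uint16_t p = src[y * src_pitch + x];
            if (p != key) {
                dst[y * dst_pitch + x] = p;  // opaque pixel: overwrite the destination
            }
        }
    }
}
```

The per-pixel branch is the reason transparent blits cost more than straight copies.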

If you only want to do one full-screen blit in your application, even if it's sysram to VRAM, you could still hold a fairly high frame rate. But if you want many blits, like tiles and units in a game, which also means many graphics, then you should go for an assembler MMX blit routine.

Oh yah, the reason it's nice to use MMX (it can really boost the speed on PIIs and PIIIs) is not just plain blitting. For alpha blitting you can execute the multiplications on all of the loaded pixels at once with a single instruction, whereas normally you would have to do one mul per pixel, which takes something like 12 CPU cycles each (don't quote me on this, I don't remember exactly, but it's way above 1 for sure). It's just a newer technology which does beat the older stuff.
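The "several pixels per operation" idea can be shown without MMX. The sketch below is not the multiply-based alpha routine described above; it is a simpler, well-known 50% blend that handles two RGB565 pixels packed in one 32-bit word, and blend_half_2px is a hypothetical name. An MMX version would apply the same masking idea to four pixels per 64-bit register.

```cpp
#include <cstdint>

// 50% blend of two RGB565 pixels per 32-bit word (two pixels per operation).
// Masking with 0xF7DE clears the lowest bit of each colour field, so the
// right shift cannot leak a bit into the neighbouring field or pixel.
std::uint32_t blend_half_2px(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t mask = 0xF7DEF7DEu;
    return ((a & mask) >> 1) + ((b & mask) >> 1);
}
```

For example, blending a white pixel with a black one yields 0x7BEF, a mid grey, in each half of the word.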

Hope this helps!
See ya,
Ben

Oh yah, if you're interested I'll send you my blitting code. However, I use pure MASM (Microsoft's assembler, it's available free :-) for my game, so if you want to turn it into inline assembler for C or C++ you'll be on your own!
Otherwise e-mail me, and I'll send it to you.
See ya,
Ben

P.S. I've got two versions: a straight plain blit and a translucent color blit.

Guest Anonymous Poster
quote:
Oh yah, the reason it's nice to use MMX (it can really boost the speed on PIIs and PIIIs)

Why would one even use MMX ASM routines if not for extreme optimization?

It's not like the MMX registers just sit there and do nothing if you don't use them. They still help out the CPU, whether you use them or not, so I don't see the point in using MMX ASM routines other than for optimization.

Anonymous Poster: The MMX registers are aliased onto the floating-point registers. If you don't use floats (like me, pure fixed point), the registers are going to waste.

Cyberben: Try it for fading; the saturating instructions work great!
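For anyone unfamiliar with saturating arithmetic, here is a scalar C++ sketch of a fade step. It assumes 8-bit colour channels stored as bytes (not the 16 bpp format used elsewhere in this thread), and fade_saturating is a hypothetical name. The MMX instruction psubusb performs this clamp-at-zero subtraction on 8 bytes at once.

```cpp
#include <cstddef>
#include <cstdint>

// Subtract `amount` from every channel byte, clamping at 0 instead of wrapping.
void fade_saturating(std::uint8_t* channels, std::size_t count, std::uint8_t amount) {
    for (std::size_t i = 0; i < count; ++i) {
        channels[i] = channels[i] > amount
                          ? static_cast<std::uint8_t>(channels[i] - amount)
                          : std::uint8_t{0};
    }
}
```

Without saturation, a dark channel would wrap around to a bright value partway through the fade.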

Blazter: I did that in my game. It actually ends up being a lot faster if you do an asm blt from system memory to VRAM, then a hardware blt to get that to the front buffer (if in windowed mode). The reason is that the CPU doesn't need the bus that much, so sending out all the data is fine, and if it is written back to RAM it just needs to be copied to VRAM eventually anyway. It mostly works because I don't have much that overlaps. By the way, I even coded rotation, skew, and color-key effects into my game. They work unbelievably fast!

Thanx. I just wanna say that even though I haven't contributed to this thread, I've learned a lot just from reading it. Thanks to everyone who gave helpful insight here!

Guest Anonymous Poster
Ok, I'm a believer now.

I checked my caps, and my card only supports video-to-video hardware blitting. Everything else goes to HELL (err, HEL).

Just a couple of questions. If the DDCAPS_CANBLTSYSMEM flag is set in dwCaps, does it mean my card can do all three of sys->vid, vid->sys, and sys->sys blts in hardware?

Or do I have to individually check the DDCAPS_BLT flag in dwSVBCaps, dwVSBCaps, and dwSSBCaps for those three types of hardware-supported blitting?

OK, I've put the actual ASM code into inline. However, umm, it doesn't work. Here we go:

HRESULT Pcopy(LPDIRECTDRAWSURFACE7 &Source, LPDIRECTDRAWSURFACE7 &Dest, int xs, int ys, bool trans)
{
    DDSURFACEDESC2 ddsd;
    DWORD sourcesurface;
    DWORD destsurface;

    Dest->Lock(NULL, &ddsd, 0, NULL);
    long DestPitch = ddsd.lPitch;
    destsurface = (DWORD)ddsd.lpSurface;

    Source->Lock(NULL, &ddsd, 0, NULL);
    long SourcePitch = ddsd.lPitch;
    sourcesurface = (DWORD)ddsd.lpSurface;

    RECT SourceRect;
    SourceRect.top = 0;
    SourceRect.bottom = ys;
    SourceRect.left = 0;
    SourceRect.right = xs;
    DWORD DestX = 0;
    DWORD DestY = 0;
    WORD bpp = 16;
    DWORD YLoop;
    //DDSysBlit(16, spitch, dpitch, rect, 0, 0, ptrSource, ptrDest);//(DWORD)ptrDest, (DWORD)ptrSource, 0, 0, rect, dpitch, spitch, 16);
    _asm
    {
        ;#######################;
        ;ORDER OF EVENTS        ;
        ;#######################;
        ;1-Calculate Source Addr
        ;2-Calculate Dest Addr
        ;3-Calculate Source Pitch Additive
        ;4-Calculate Dest Pitch Additive
        ;5-Place in ecx, the width of the blit in bytes
        ;6-Begin Bliting loop
        ;======================================
        ; Ok Let's calc esi
        ;======================================
        mov eax, SourceRect.top
        ;sub eax, 1
        mov ebx, SourcePitch
        mul ebx
        add eax, SourceRect.left
        add eax, SourceRect.left
        add eax, sourcesurface
        mov esi, eax

        ;======================================
        ; Ok Let's calc edi
        ;======================================
        mov eax, DestY
        ;sub eax, 1
        mov ebx, DestPitch
        mul ebx
        add eax, DestX
        add eax, DestX
        add eax, destsurface
        mov edi, eax

        ;So let's use ecx and edx as SourcePitch Additive and DestPitch Additive
        ;ecx = SourcePitch - (SourceRect.right*2)
        mov ebx, SourcePitch
        sub ebx, SourceRect.right
        sub ebx, SourceRect.right

        ;edx = DestPitch - (SourceRect.right*2)
        mov edx, DestPitch
        sub edx, SourceRect.right
        sub edx, SourceRect.right

        mov ecx, SourceRect.right ;Take the right of the blit and
        add ecx, SourceRect.right ;Add it again because it's 16bpp

        mov eax, SourceRect.bottom
        ;sub eax, SourceRect.top
        mov YLoop, eax
    DDBlitSysLoop:

        push ecx ;Save Blit width on stack
    DDBlitSysInternalLoop:
        ;In here we need to copy esi to edi as many bytes as specified in ecx
        movq MM0,[esi] ;Move in the first 4 pixels
        movq [edi],MM0 ;Put the 4 pixels in the destination

        add edi, 8 ;add 8 bytes to the dest addr
        add esi, 8 ;add 8 bytes to the source addr
        sub ecx, 8 ;subtract 8 bytes from the total width
        cmp ecx, 0 ;See if we've moved all of the line
        jne DDBlitSysInternalLoop ;If we're not done go back
        pop ecx ;Restore the blit width

        add esi, ebx ;SourceAdditive
        add edi, edx ;DestAdditive

        dec YLoop ;One more row done
        cmp YLoop, 0 ;Check if we're done with all the rows
        jnz DDBlitSysLoop

        emms
    }

    Dest->Unlock(NULL);
    Source->Unlock(NULL);
    return DD_OK;
}

This should work, right? Well, it doesn't.

First, let me say that that is a good try. Have you ever coded in asm before?

Some of your comments are wrong, but the code is right. It just makes it confusing. For example, the comment says 'So let's use ecx and edx as SourcePitch Additive and DestPitch Additive,' but then you use ebx and edx.

First problem: you are supposed to use emms before and after you use MMX instructions. If you are using an AMD CPU, use femms. It's a lot faster.

Next, and this is probably what's stopping it: you don't take into account widths that aren't a multiple of 4 pixels. In that case ecx will never equal 0; it will skip past it and go negative. Also, in loops like this you usually don't subtract 8 each time. You precalculate the total number of iterations so you can use loop. You also precalculate the number of leftover bytes (in case the width isn't evenly divisible by 4). Then you can use rep movsw for the extra words (don't forget to divide by 2).
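The precalculated-count idea can be sketched in portable C++ like this. It is only an illustration of the structure, with copy_row_16bpp as a hypothetical name: the 8-byte loop plays the role of the movq loop, and the short tail loop plays the role of rep movsw for the leftover pixels.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copy one 16 bpp row: whole 8-byte chunks first, then the leftover pixels.
void copy_row_16bpp(std::uint16_t* dst, const std::uint16_t* src,
                    std::size_t width_pixels) {
    const std::size_t bytes  = width_pixels * sizeof(std::uint16_t);
    const std::size_t chunks = bytes / 8;        // precalculated loop count
    const std::size_t extra  = (bytes % 8) / 2;  // leftover pixels, 0 to 3

    auto*       d = reinterpret_cast<unsigned char*>(dst);
    const auto* s = reinterpret_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < chunks; ++i) {
        std::uint64_t q;
        std::memcpy(&q, s + i * 8, 8);   // 4 pixels in
        std::memcpy(d + i * 8, &q, 8);   // 4 pixels out
    }
    for (std::size_t i = 0; i < extra; ++i) {    // the "rep movsw" tail
        dst[chunks * 4 + i] = src[chunks * 4 + i];
    }
}
```

Because the loop counts are computed up front, a width such as 7 pixels terminates cleanly instead of stepping past zero.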

Next, avoid push and pop in a C function. Use a preallocated variable instead, like you did with YLoop.

Also, you assume in a lot of places that your width is sourcerect.right. For now it is, but if you change that, it won't work.

If you need more help, give me the whole project and I can run it through my debugger.

For a good time hit Alt-F4! Go ahead, try it, all the cool people are doing it.

Thanks for the compliment, but it should go to Cyberben. He just lent me his code, and I did a little rearrangement and cut and paste.

However, I see what you're saying.

My error is an access violation on the line:
movq MM0,[esi] ;Move in the first 4 pixels
(duh)

I think esi wrapped around - debugger shows it as being 2576980376.

Ideas for a fix?
Source code a plus. Smart remarks need not apply
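One thing worth ruling out, as an assumption since the thread never confirms the cause: DirectDraw expects the dwSize field of the surface description to be set before Lock is called, and Pcopy never sets it or checks Lock's return value. If Lock fails, lpSurface is left as garbage, which would explain a wild esi. The sketch below shows the check-before-use pattern with a mock. SurfaceDesc, FakeSurface and lock_checked are hypothetical names, not the DirectDraw API.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical miniature of the descriptor pattern; not the real DirectDraw types.
struct SurfaceDesc {
    std::uint32_t size;   // must equal sizeof(SurfaceDesc) before the call, like dwSize
    long          pitch;
    void*         bits;
};

struct FakeSurface {
    std::uint8_t storage[64];
    // Fails, leaving desc untouched, when the size field was not filled in.
    bool lock(SurfaceDesc& desc) {
        if (desc.size != sizeof(SurfaceDesc)) return false;
        desc.pitch = 16;
        desc.bits = storage;
        return true;
    }
};

// Zero the descriptor, set its size, and only hand back a pointer if the lock succeeded.
std::uint8_t* lock_checked(FakeSurface& s, long& pitch) {
    SurfaceDesc desc;
    std::memset(&desc, 0, sizeof desc);
    desc.size = sizeof desc;
    if (!s.lock(desc)) return nullptr;
    pitch = desc.pitch;
    return static_cast<std::uint8_t*>(desc.bits);
}
```

In the real function the equivalent steps would be zeroing ddsd, setting ddsd.dwSize to sizeof(ddsd), and bailing out if either Lock call fails.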

PS Thanks for the Alt-F4 trick, it worked really well. Everyone reading this should try it.

hehehe you think they'll fall for it? I don't know, Ben, maybe. hehehe

Edited by - BeanDog on June 18, 2000 10:05:04 PM

Edited by - BeanDog on June 18, 2000 10:07:17 PM

blue-lightning:

1.) I know my commenting is horrible; I just put this together a while ago! You're right, I should comment better.

2.) I don't think you need to use emms before the function, for a couple of reasons: the Intel docs don't say so, and you're going to be replacing the values of the registers anyhow. I didn't know about femms either...

3.) I warned him about the width. It actually slows down the blits when you account for widths that are not multiples of 4, because you have more decisions to make and possibly act on. Too costly. As you'll notice, 640x480, 800x600 and 1024x768 are all multiples of 4 pixels wide, so for his purpose of doing full-screen blits it should be fine...

4.) I couldn't get the rep/loop thing to work properly, and it still seemed to be incredibly fast, so I didn't bother.

5.) I've already been told about not using push and pop, because allegedly these instructions are quite slow on certain computers and fast on others. When you count it up with Intel's timing info, recalculating the width of the blit on each loop iteration would take the "same" number of CPU cycles on an Intel system, however it should be much faster on other systems.

6.) I don't think I made any assumptions; I think I made the width sourcerect.right for a reason. Ahhh... I think I see why it's not working, or one of the reasons: it used to be (sourcerect.right - sourcerect.left), but that takes more time, so I treated the right value as the width and the bottom as the height. I just didn't say so... :-)

Anyhow, hope that helps some!
See ya,
Ben

P.S. Can you declare a variable in the C portion of the code and use it in the assembler portion under the same name?

Ummm, I think there's another way of writing that. Try:
movq MM0, QWORD PTR [esi]

Well, maybe some of the other assembler gurus should give you some more advice as well...

And what do you mean esi has wrapped around?

See ya,
Ben

P.S. Oh yah, Alt+F4 does wonders. Have you guys ever tried closing your eyes and pressing Ctrl+Alt+Delete two or three times? Wow, it's amazing!

BeanDog: I need the project to try to do anything else with it. You're probably using it in the wrong context.

Cyberben: Yeah, I tried it. The firework effects in the pong easter egg are really amazing!

Cyberben: I mean, with unsigned values, if you subtract past 0, it wraps around to the highest value. Like this:

unsigned char x = 5;
x--; //x=4
x--; //x=3
x--; //x=2
x--; //x=1
x--; //x=0
x--; //x=255

You see?

blue-lightning: I use the Pcopy routine you see, and call it with this line of code:

Pcopy(TerrainLayer, Back, 800, 600, false);
(previously, the last bool value was whether or not to use color keying)

Hope this helps clear it up,



~BenDilts( void );

BeanDog: Could you send me all of the source files so I can run it through my debugger? I don't want to make a shell project for it.

For a good time hit Alt-F4! Go ahead, try it, all the cool people are doing it.

Guest Anonymous Poster
How do you check your DD caps for sys->vid hardware blitting? You'd think you would at least check first before assuming your user's card doesn't support it.
