This topic is now archived and is closed to further replies.


newbie dirx7 Dilemma - my 640x480x32 is slow

Recommended Posts

brewski    122
Hello! This is my first post in this great forum :-)

I've started out doing some Windows code with DirectX 7, and I've written a wrapper class with the common setup code for the front/back buffers etc. I create a bitmap array where I manipulate pixels. When each frame is done, I lock the back buffer, copy the array into it, unlock, wait for vertical retrace and flip, like I've seen in many tutorials and docs.

Now the problem I'm experiencing is this: in 320x200x32 and 320x240x32 I get 50 fps like I want, but when switching to 640x480x32 I suddenly get something like 4-5 fps instead. I've tried switching to 16 bits and using 2 back buffers, with the same result: major lag.

The only difference when switching modes in my routine is clearing a larger bitmap and copying a larger bitmap array to the back buffer. Apart from that, the pixel processing I do is the same as at the lower resolution, where I get good fps. I cannot understand how this would slow everything down so much, or what I could have done wrong in my DX7 code. (My CPU is a 1200 MHz AMD, with a GeForce2.)

If you think you have a clue about my mistake, or have had the same problem, please respond. I would be very happy for any responses.

Brewski

Anarchi    122
Are you using ->Blt() with the fill flag to clear the large surface? If so, that's your problem. I have a P4/1600 w/GeForce and even at 640x480x24 the FPS drops to about 4 when using the fill flag, even though I know my card is capable of >400 FPS (no vert-sync). Seems M$ didn't optimize that routine at all...


brewski    122
Hmm... No, I'm using a pixel-array[scrWidth*scrHeight].
(No BMPs.)

I clear it with an unrolled loop.
Set lots of pixels.
Copy it to backbuffer with an unrolled loop.
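
For reference, the copy step described above can be sketched as follows. This is only a sketch under assumptions: the destination pointer and row pitch are plain parameters here (in DirectDraw they would come from the locked surface), the function name copyFrame is made up for illustration, and it uses memcpy rather than a hand-unrolled loop. The point it shows is that the locked surface's rows may be further apart than width*4 bytes, so the copy has to honour the pitch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: copies a tightly packed 32-bit frame into a
// destination whose rows start `pitchBytes` apart.
void copyFrame(std::uint8_t* dst, std::size_t pitchBytes,
               const std::uint32_t* src, int width, int height)
{
    const std::size_t rowBytes =
        static_cast<std::size_t>(width) * sizeof(std::uint32_t);

    // If the rows are contiguous, one big copy is enough.
    if (pitchBytes == rowBytes) {
        std::memcpy(dst, src, rowBytes * static_cast<std::size_t>(height));
        return;
    }

    // Otherwise copy row by row, skipping the padding at the end of each row.
    for (int y = 0; y < height; ++y) {
        std::memcpy(dst + static_cast<std::size_t>(y) * pitchBytes,
                    src + static_cast<std::size_t>(y) * width,
                    rowBytes);
    }
}
```

Whether this beats an unrolled DWORD loop depends on the compiler and runtime, so it is worth timing both.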


a person    118
Tips for fast pixel manipulation in DirectDraw

1. DON'T sync to refresh. It will utterly destroy your framerate when you can't draw fast enough to keep up.

2. All pixel-manipulated buffers that you read from MUST be in system memory. The only exception to this rule is a screenshot function. If you feel reading from VRAM for any other reason is worth it for a game, you might as well quit graphics programming now.

3. Only use 16-bit or 32-bit color depths; 24-bit is slow. Also, only access the buffer using a DWORD or USHORT, never read one byte at a time.

4. Most cards can't do much hardware acceleration on sysmem surfaces. This goes for fills too. On the other hand, most AGP cards can accelerate blits from sysmem to VRAM.

5. Anything in sysmem should be completely drawn and then copied (lock() or blt()) all at once from a single buffer. This will reduce bandwidth.

6. Dirty rects are your friend and reduce bandwidth.

7. There is usually no reason to clear the screen or any back buffers. If you copy pixels over the whole buffer, it doesn't need to be cleared first.

8. DON'T split your sysmem to VRAM copies into small sections unless you are doing dirty rects. Multiple lock()/blt() calls can cause slowness.

9. It doesn't matter what video card you have when copying data from sysmem to VRAM. The only thing that matters is whether it's AGP and how much it can transfer (i.e. AGP 1x, AGP 2x, AGP 4x). A 4 MB ATI Rage Pro is just as good as a 32 MB GeForce2 for this.

10. 640x480x32 uses 4x as much bandwidth as 320x240x32. This means that at best you should see a decrease of only 25% when moving up the res. It's obvious you're doing something weird; turn off vsync.
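
To put numbers on the bandwidth in tip 10, here is a small arithmetic sketch. The helper names bytesPerFrame and bytesPerSecond are made up for illustration, and it only counts the bytes of one full-frame copy, nothing else that happens in a frame.

```cpp
#include <cstdint>

// Bytes needed for one full frame at the given mode.
constexpr std::uint64_t bytesPerFrame(int width, int height, int bitsPerPixel)
{
    return static_cast<std::uint64_t>(width) * height * (bitsPerPixel / 8);
}

// Bytes per second if one full frame is copied `fps` times a second.
constexpr std::uint64_t bytesPerSecond(int width, int height,
                                       int bitsPerPixel, int fps)
{
    return bytesPerFrame(width, height, bitsPerPixel) * fps;
}

// 320x240x32 is 307,200 bytes per frame, 640x480x32 is 1,228,800 (4x).
static_assert(bytesPerFrame(320, 240, 32) == 307200, "quarter-size frame");
static_assert(bytesPerFrame(640, 480, 32) == 1228800, "full-size frame");
```

At 50 fps the 640x480x32 copy alone works out to 61,440,000 bytes per second, which is the figure to compare against what the bus can actually sustain.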

Anarchi, your system CAN'T do 400 frames per second when moving that data from sysmem to VRAM, no matter how optimized you make your blit routine. Seems like you don't know what you're talking about. 400 fps sounds more like everything is in VRAM and you only blt VRAM to VRAM.

Just for comparison purposes: I do a fullscreen fade, a fullscreen blur and draw 512 particles, running in 640x480x32. I get an average of 47 frames per second without vsync and about 36 with vsync (refresh of 100 Hz). This is on a P4 1500 MHz, so it's definitely your code. This test also had Winamp running in the background; I get slightly better speed when it's not running. BTW, when blitting to the back buffer, I use a single lock() and move the entire buffer from sysmem. Also, I don't allocate my sysmem buffer using DX; it's allocated using malloc(). It should not make a difference though.

So in the end I am moving over an entire screen 2 times (the blur does 9 reads per write, so it's pretty high quality; MMX helps greatly with this). THEN I am moving the entire buffer to VRAM and doing the flip() (flips are basically free when not using vsync).

brewski    122
Thanks to you both for your posts =)

Here are more details:

.All my pixel manipulated buffers are in system memory
.I use 32bit color depth. ( 00 RR GG BB ) accessed with DWORDS
.I copy my buffer all at once:
lock(), big unrolled loop copies buffer, unlock()
.My system memory is allocated via a pointer (C++)
.Backbuffer is not cleared since it's overwritten by array

Actually, the fault may be in my pixel code, as A person suggests.
What I do, once per frame, is:
.Clear the whole pixel array to zero.
.Set a z-buffer array (WORDS) to max
.Calculate a torus: 512 verts + normals
.Draw bumpmapped triangle polys to the array

I realize there's gotta be lots of cache thrashing in my code because of so many table/mem lookups. I don't know how to reduce that though.

However, it gets good fps at 320x240 (around 25 I would guess),
but when switching to 640x480 it drops to about 4.
A person suggests that it should drop by 25%, which would mean about 20 fps, and that is not the case.


I've now coded something else to test if the fault is in my code (and it is I believe)
I did a standard fireFilter routine in 640x480x32 with a lookuptable for rgb colors.
This routine used exactly the same ddraw code as the routine I have problems with.
I removed the waitRetrace in flip(),
and this runs at some great fps!!! :-)

->Still wonder why my 3D proggy loses so much speed when doubling the resolution...

->Using rects assumes that I do a blt()?
Can I use blt() to transfer my own sysmem pixel arrays to the back buffer?
I've only seen blt()s between video memory.
(And would blt() be faster than my unrolled loop?)

->I can't use MMX cause I currently use Dev-C++.
But I wonder how you (A Person) use it in the filter.
Do you read several pixels from your array into one register?
If you use the kind of blur I use (add up the intensity of the pixels surrounding the current pixel -> divide by the number of pixels added -> store in the current pixel), how do you do the adding if several values are in one register?


[edited by - brewski on April 30, 2002 7:13:56 AM]

a person    118
Yep, same type of blur. I read all 8 surrounding pixels plus the current pixel. I do this by reading the first pixel, then adding each subsequent pixel I read to the running total, so after reading them all I have already added them all up. I also unpack the values into a 16-bit format (i.e. each color component gets a full 16 bits for calculations while in the MMX register). This prevents overflows and allows better accuracy. I then MULTIPLY by a constant value which is in fixed point and represents 1/9, and take the top 16 bits of the result. Thankfully this also takes care of the shift. Then I repack the values (an MMX thing) to the normal 8 bits per color component. This allows me to work on all the color components simultaneously instead of splitting them and doing the calculations separately.
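
The arithmetic described above can be sketched in plain C++ without MMX. This is not the poster's actual routine: it handles one channel at a time instead of all of them in one register, the function name boxBlur3x3 is made up, border pixels are simply copied, and the constant 0x1C72 (about 65536/9) is the one mentioned elsewhere in this thread for separated components.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: 3x3 box blur on a width x height XRGB image.
// Each channel is summed in a wide integer, multiplied by a 0.16 fixed
// point 1/9 and the top 16 bits are kept, which also does the shift.
std::vector<std::uint32_t> boxBlur3x3(const std::vector<std::uint32_t>& src,
                                      int width, int height)
{
    const std::uint32_t kOneNinth = 0x1C72;  // ~65536/9
    std::vector<std::uint32_t> dst(src);     // borders stay as in the source

    for (int y = 1; y + 1 < height; ++y) {
        for (int x = 1; x + 1 < width; ++x) {
            std::uint32_t r = 0, g = 0, b = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const std::size_t i =
                        static_cast<std::size_t>(y + dy) * width + (x + dx);
                    const std::uint32_t p = src[i];
                    r += (p >> 16) & 0xFF;   // sums stay below 9 * 255 = 2295
                    g += (p >> 8) & 0xFF;
                    b += p & 0xFF;
                }
            }
            r = (r * kOneNinth) >> 16;       // fixed point multiply by ~1/9
            g = (g * kOneNinth) >> 16;
            b = (b * kOneNinth) >> 16;
            dst[static_cast<std::size_t>(y) * width + x] =
                (r << 16) | (g << 8) | b;
        }
    }
    return dst;
}
```

With this constant a uniformly colored area comes out unchanged, which is a quick sanity check for any variant of the routine.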

Without MMX, you can still use some clever tricks. First off, instead of separating color components into r, g, b, try

rb = color & 0x00FF00FF;
g = (color & 0x0000FF00) >> 8;

This allows you to add all the rb pairs using a single add. You don't need to worry about overflow, since nine 8-bit values will never fill a 16-bit value.

Then you multiply by a constant in fixed point. With MMX this was easy, since I could use 65535/9 and take the top 16 bits and skip the shift (the different muls in MMX let you decide which half of the 32-bit answer to take). In your case you have to use 0x1C. It's less accuracy but should be sufficient. If you separate all the color components completely you can use 0x1C72 for much better accuracy.

rb *= 0x1C;
g *= 0x1C;

Finish the fixed point mul up (shift down by 8 and mask; rb and g here hold the sums of the nine pixels, and the masked g is already back in the green position):

rb = (rb >> 8) & 0x00FF00FF;
g = g & 0x0000FF00;

color = rb | g;
If you don't understand this, read up on fixed point math, it's very helpful.
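
Put together as one function, the packed red/blue trick might look like the sketch below. The name average9 and the fixed nine-pixel array are assumptions for illustration; the constant 0x1C with a shift by 8 is the approximation of 1/9 from the post.

```cpp
#include <cstdint>

// Hypothetical sketch: averages nine XRGB pixels. Red and blue are summed
// together in one 32-bit value, green is summed separately.
std::uint32_t average9(const std::uint32_t (&pixels)[9])
{
    std::uint32_t rb = 0;
    std::uint32_t g = 0;
    for (std::uint32_t c : pixels) {
        rb += c & 0x00FF00FF;          // red in bits 16-27, blue in bits 0-11
        g += (c & 0x0000FF00) >> 8;    // green sum, at most 9 * 255 = 2295
    }

    // 0x1C / 256 is roughly 1/9. Each 16-bit half of rb holds at most
    // 2295 * 28 = 64260 after the multiply, so blue cannot spill into red.
    rb = ((rb * 0x1C) >> 8) & 0x00FF00FF;
    g = ((g * 0x1C) >> 8) & 0xFF;

    return rb | (g << 8);
}
```

Because 0x1C/256 is a little under 1/9, the result is slightly darker than a true average: nine pure white pixels come out as 0x00FBFBFB rather than 0x00FFFFFF.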

When doing bumpmapping and other effects you actually create workloads that are not linear with resolution. Also, the z-buffer is probably slowing you down as well. Consider rendering at half res and interpolating values; this may be faster. Also, anything that gets converted to int for array access on a per-pixel basis should be done in fixed point. Conversion from float to int is slow.
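
As a sketch of the fixed point advice above: convert the float endpoints once per span, then step an integer per pixel and shift to get the array index. The function name sampleIndices and the 16.16 format are assumptions for illustration, and it assumes non-negative coordinates.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: produces `count` integer indices evenly spaced from
// u0 to u1, using 16.16 fixed point so the inner loop has no float-to-int
// conversions. Assumes u0 and u1 are non-negative.
std::vector<int> sampleIndices(float u0, float u1, int count)
{
    std::vector<int> out;
    if (count <= 0) {
        return out;
    }
    out.reserve(static_cast<std::size_t>(count));

    // The only float-to-int conversions happen here, once per span.
    std::int32_t u = static_cast<std::int32_t>(u0 * 65536.0f);
    const std::int32_t step = count > 1
        ? static_cast<std::int32_t>((u1 - u0) * 65536.0f / (count - 1))
        : 0;

    for (int i = 0; i < count; ++i) {
        out.push_back(u >> 16);  // integer part is the array index
        u += step;               // per-pixel work is one add and one shift
    }
    return out;
}
```

For example, sampleIndices(0.0f, 8.0f, 5) steps by exactly 2.0 in fixed point and visits indices 0, 2, 4, 6, 8.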

Also, side note: the MMX register looks like this:
aarrggbb, since MMX registers are 64 bits long and have special instructions to group bytes (in 8-bit, 16-bit, 32-bit or 64-bit blocks) and work on them. You can also dictate how over/underflow is handled: either the value is capped to the highest/lowest, or natural wraparound occurs. You can do adds, subs, muls and shifts in this manner. I don't think MMX has a div instruction. You do have to use a few instructions to convert from the packed 32-bit ARGB format, and then at the end some more work to convert back to the 32-bit packed format. I am no MMX expert, but I know it well enough to optimize some things even if I am not using MMX to its full potential. Take a look at the 16-bit MMX alpha blend article on this site. It's quite informative and should get you started if you decide to go the MMX route, provided you can get MMX working on your compiler through inline asm.

brewski    122
Hey, this MMX stuff, blur and RGB optimizations seem like really great stuff. I've never heard of
those before or used MMX, gotta look into that. I'm using fixed point a lot though, so that shouldn't be a problem.
(I will post a real reply tomorrow ->
it all seems to kinda slip my mind when reading it right now, I'm new to MMX and
I'm also loaded like hell :-)

I've "acquired" MSVC++ just now too, to try out some inline asm with MMX,
cause Dev-C++ only accepts asm in AT&T syntax, which is something I won't waste time
on learning.

Anyhow, I've actually got something of interest to post at this time, something I checked
out in response to you stating int=(int)float is slow.
It doesn't really belong to
this forum though.
( Please don't cut, moderator :-)

float -> int shouldn't be that slow,
all that's required is a FISTP opcode, right?! And according to this info :
FISTP costs 6 cycles regardless of the size of the
receiving address (16, 32 or 64 bits).
( FIST & reg2mem is 1 cycle )

By the way, I assume the cycle count from the above source is for pentium I,

( quote from index-page:And in the middle of 1996 we have 166 MHz Pentiums with
the same 66 MHz bus and Pentium Pro's (P6) at 200 MHz with a 66 MHz bus.
In other words, the processors are more powerful. )

so for a p4 or amd+ I would assume much less.

Well, curious, I checked the disassembly from this test-code in MSVC++

int main()
{
    float a = 15.0f;
    int b = (int)a;
    return 0;
}


// From disassembly:

// float a=15.0f;
00401028 mov dword ptr [ebp-4],41700000h
// int b=(int)a;
0040102F fld dword ptr [ebp-4]
00401032 call __ftol (004010fc)
00401037 mov dword ptr [ebp-8],eax

// Not what I wanted.
// I try to make MSVC++ use registers instead...

int main()
{
    register float a = 15.0f;
    register int b = (int)a;
    return 0;
}


// From disassembly:

// register float a=15.0f;
00401028 mov dword ptr [ebp-4],41700000h
// register int b=(int)a;
0040102F fld dword ptr [ebp-4]
00401032 call __ftol (004010fc)
00401037 mov dword ptr [ebp-8],eax

It looks awfully similar to the first compile, doesn't it?
-> Shame on you Microsoft!

Now this is the vastly faster code I would've coded:

fld ['adress']
fistp word ptr ['temporary']
mov ax,['temporary']

That's obvious, it's what all coders would do. I'm just kinda put down by discovering
that the compiler doesn't do what you tell it and doesn't produce speedy code,
cause that's what I've always been lured to believe when reading books/articles by C++ gurus.
Statements like ->

Don't try to recode ansiC++ library routines cause they are so optimized you would waste your time.

Don't use 'register' keyword, the compiler knows when to put values into register and when not to.
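
A note on the FISTP sequence above: a C/C++ cast from float to int truncates toward zero, while FISTP rounds according to the FPU's current rounding mode, which by default is round-to-nearest. The two only agree for values that are already whole numbers, like the 15.0f in the test, which is why the compiler calls a helper instead of emitting a bare FISTP. The sketch below shows the difference in portable C++; the function names are made up, and std::lrint stands in for the FISTP behaviour because it also rounds using the current rounding mode.

```cpp
#include <cmath>

// What (int)a has to do: drop the fraction, rounding toward zero.
int truncateToInt(float v)
{
    return static_cast<int>(v);
}

// What a plain FISTP would do under the default rounding mode:
// round to the nearest integer.
int roundToInt(float v)
{
    return static_cast<int>(std::lrint(v));
}
```

For 2.7f the first returns 2 and the second returns 3, so swapping one for the other is only safe where rounding instead of truncating is acceptable.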


[edited by - brewski on April 30, 2002 5:30:51 PM]

[edited by - brewski on April 30, 2002 5:34:16 PM]
