
This topic is now archived and is closed to further replies.

Optimisation we've been discussing recently


Recommended Posts

Here, at opengl.org, and at flipcode there has been a lot of noise recently about a "faster X": the memcpy thread here, flipcode's faster float-to-int casting, the fastest normalisation at opengl.org, etc., etc. Has anyone ever collected all of these in one place (besides TCS)? Nvidia has a fastmath.h, TCS has one, and most people have a few bits kicking around, but without being an assembly programmer it's hard to tell the difference between one "fastest" asm square root and another. What other resources are there out there for this kind of thing?

For example, my math library is currently written for readability. Soon, however, I'll be writing a structure of classes that allows me to define one routine and have the fastest platform-dependent code fragment run. I currently have a large, detailed CPU class that detects dozens of processor types, times the CPU, counts instructions, etc. It's a bit of a mess now, but when it's clean, what other resources can I put in it?

Here are the optimised bits of code I'm aware of:

- My asm timer (which needs a little work but handles threading)
- Nvidia fast math
- TCS's fast math
- 3DNow! SDK
- ?SSE SDK (never did find that one)
- Fastest memset, memcpy (here)
- Fastest float-to-int (flipcode)
- SSE matrix library
- Fastest normalisation (opengl.org forums)
- Fastest power of two

What else would fit well into this kind of scheme? Are you aware of any links? Depending on the interest, I'm happy to release the results of this quest to the public.

Something I know I need: a good way of telling a function declaration to use either, e.g., the AMD FastMemCpy32 or the SSE FastMemCpy32. Function pointers? Static libs and two exes? Ideas?

I can do the grunt work of gathering the data, organising it, and making it look pretty in code, but I don't know more than assembler basics, so I can't tell more than which code fragment is fastest.

Many thanks, and I hope this turns out to be something useful.

Chris
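One way to wire a declaration to a CPU-specific implementation is a global function pointer bound once at startup. A minimal sketch, in which cpu_has_special_unit() is a hypothetical stand-in for a CPUID query from a CPU-detection class, and both implementations are placeholders:

```cpp
#include <cmath>

// Two interchangeable implementations of the same routine. Both are
// stand-ins: fast_sqrt_special represents a hand-tuned (e.g. SSE or
// 3DNow!) version in a real library.
static float fast_sqrt_generic(float x) { return std::sqrt(x); }
static float fast_sqrt_special(float x) { return std::sqrt(x); }

// One global pointer per routine; every call site calls through it.
static float (*fast_sqrt)(float) = fast_sqrt_generic;

// Hypothetical feature test -- would query CPUID in a real library.
static bool cpu_has_special_unit() { return false; }

// Bind each pointer once at startup, after CPU detection.
void init_dispatch()
{
    if (cpu_has_special_unit())
        fast_sqrt = fast_sqrt_special;
}
```

After init_dispatch() runs, fast_sqrt(9.0f) calls whichever implementation was selected, at the cost of one indirect call.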

I think you'd want to use function pointers; LaMothe says they're better than case statements.

I think you'd want to make a DLL (or two or three), not a static lib. That way you can add support for more CPUs (when the P4 is widely available, or the K8...) a little more easily. You can determine the processor and load the appropriate DLL at run time, rather than having separate builds. If you use function pointers and one build, it's no faster than a DLL.
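The run-time selection itself can be as simple as mapping the detected CPU to a DLL file name before calling LoadLibrary. A sketch, where the tier enum and the file names are invented for illustration:

```cpp
#include <string>

// Hypothetical CPU tiers a detection class might report.
enum CpuTier { CPU_GENERIC, CPU_3DNOW, CPU_SSE };

// Pick which math DLL to load at startup; names are made up here.
std::string math_dll_for(CpuTier tier)
{
    switch (tier) {
    case CPU_3DNOW: return "fastmath_3dnow.dll";
    case CPU_SSE:   return "fastmath_sse.dll";
    default:        return "fastmath_generic.dll";
    }
}
```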

... might kill performance, though. I usually inline most of my math ops - you don't want to add function-call overhead to a vector normalization.


You want to be able to change inlined code at run time. That's not impossible, but it's also not easy. I don't think you're allowed to write to the code area from ring 3 (a user app), but you can (usually not on purpose) from a driver in ring 0. So, you could call a proxy function and insert a bunch of NOPs wherever you want to use specialized code. Then you send the list of proxy locations and a function tag to the "code stomper" driver; it has a look at the CPU type etc., loads the DLL, copies the specialized code to each spot in the list, and adds a jump to the end of the NOPs.

Maybe you could just make specialized builds and have the installer determine which one to run.

...
Cubic spline interpolation?




Magmai Kai Holmlor
- The disgruntled & disillusioned

Try looking up the Win32 function VirtualAlloc and taking note of the flProtect parameter. Would it be possible to load a DLL's functions directly into the EXE's memory, replacing the generic functions at run time? This would take less space for the binaries than if you had compiled multiple EXEs for this purpose, and as far as I know it would be faster than if you had placed the functions into DLLs.

Hmm... I had read about compressed EXE files and the like, so I think it is possible for an EXE to modify its own code. It does sound fairly complicated, though. Does anybody know more about this?

quote:

Would it be possible to load a DLL's functions directly into the EXE's memory, replacing the generic functions at run time?



It'd be a lot easier just to have a small "stubs" DLL that did some CPU detection, then hooked all of its APIs to the appropriate DLL.
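The stub idea can be sketched in portable C++ without the DLL machinery: the exported entry point starts out bound to a resolver, which does the detection once and rebinds the pointer, so every later call is a plain indirect call. The names and the "detection" here are invented placeholders:

```cpp
// add_generic stands in for whichever implementation detection picks.
static int add_generic(int a, int b) { return a + b; }

static int add_resolve(int a, int b);       // resolver, bound initially
static int (*add_impl)(int, int) = add_resolve;

static int add_resolve(int a, int b)
{
    add_impl = add_generic; // pretend CPU detection chose this path
    return add_impl(a, b);  // forward the first call as well
}

// The "exported" entry point callers see.
int fast_add(int a, int b) { return add_impl(a, b); }
```

This is essentially how lazy binding of DLL imports works under the hood, done by hand.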

Do any of the asm people here know what the instruction overhead of calling a function straight from a DLL is?

This seems like the cleanest solution so far. (I don't want the library to be so complicated that people will be afraid of using it.) Of course, the overhead of the switching mechanism will decide how efficient it is to use a CPU-specialized function in the library. If, for example, it takes 10 clocks to call a sqrt that is 10 clocks faster, then nobody would bother.

So, is anyone willing to look at the instruction count of calling a function in a DLL dynamically bound at runtime (not including the initial DLL binding)?

In all honesty, I've never needed DLLs before this (I have a background in COM, though). Here is how I think I would need to load the DLL functions:

void *DllLoad(const std::string &a_Name)
{
    HMODULE handle = LoadLibrary(a_Name.c_str());
    return (void *)handle;
}

then

void *DllGetFunction(void *handle, const std::string &a_Name)
{
    return (void *)GetProcAddress((HMODULE)handle, a_Name.c_str());
}

After that, however, I think the void* it hands back is actually just a code pointer (a function address) and can be used in the same way as a normal function pointer. I.e., I think it's just a bunch of instructions at a memory address, so it should be the same, shouldn't it? If that's correct, we're only talking about a function pointer.

Isn't that just as fast at runtime? I mean, a normal (non-inlined) function call is just a jump to a code block at a different address.
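That's right as far as the call goes: the raw address just needs a cast to the correct function-pointer type before you can call it (Win32 guarantees this round-trip works for GetProcAddress results). A sketch, where double_it is a stand-in for a function you would have fetched from a DLL:

```cpp
// Hypothetical signature of a routine exported by the fast-math DLL.
typedef int (*IntFn)(int);

// Stand-in for an implementation that would really live in the DLL.
static int double_it(int x) { return x * 2; }

// Cast the raw address back to a typed pointer and call it. The call
// itself costs the same as any call through a function pointer.
int call_via_raw_pointer(void *raw, int arg)
{
    IntFn fn = reinterpret_cast<IntFn>(raw);
    return fn(arg);
}
```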

Does anyone want to set me straight on what I've said? Anyone want to read an asm listing of a DLL function call to confirm?


Thanks all...

Chris

There is no SSE SDK. SSE is a set of instructions that Pentium IIIs have. You write assembly code that takes advantage of those instructions. The same goes for MMX and the Pentium 4's SSE2. MMX is integer-only and somewhat limited; SSE deals with single-precision floating-point, and SSE2 adds double-precision floating-point plus 128-bit integer operations. SSE[2] is not a replacement for MMX; in fact, the Pentium III also added new, very useful integer SIMD (MMX) instructions.
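One caveat worth adding: you don't strictly need raw assembly, since compilers expose these instructions through intrinsics headers such as xmmintrin.h. A sketch of the SSE rsqrtss instruction, the usual core of a fast normalize (this requires an SSE-capable x86 CPU and compiler):

```cpp
#include <xmmintrin.h> // SSE intrinsics header

// Approximate reciprocal square root via rsqrtss, accurate to about
// 12 bits -- trade precision for speed when normalizing vectors.
float rsqrt_sse(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
```

A Newton-Raphson step can refine the result where 12 bits isn't enough.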

~CGameProgrammer( );

The overhead of calling a DLL function is the same as using a function pointer.

You'd want to dynamically load the DLL, not statically link it, btw. But that's not a good solution for things like vector normalization; you'd have to load whole sections of the game from the DLLs to minimize the function-call overhead...

Magmai Kai Holmlor
- The disgruntled & disillusioned
