Memory transfer optimizations

Hey, I'm currently using a movq loop to copy memory (sysmem to sysmem and sysmem to vidmem) with DirectDraw. I have two questions:

1. Is there a faster way to copy memory, and can someone give me *the best* optimized code for it? I have also heard about the prefetch instruction. Can I use it from VC++, or do I need MASM or something similar? How exactly does it work? Does anyone have an example?

2. My sysmem-to-sysmem copy routine is almost twice as fast as my sysmem-to-vidmem copy. Is there a way to optimize the vidmem copy? Any tricks, or maybe a way to use AGP for 2D graphics?

Thanks for any answers or help.
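To make the question concrete, here is a rough sketch of the kind of loop being described: a copy that moves 8 bytes per step (the width of a movq) and issues a software prefetch hint for data further ahead. It is not the poster's actual routine. The name `copy_qwords`, the 256-byte prefetch distance, and the 64-byte cache-line size are all assumptions made for illustration. The prefetch uses the GCC/Clang builtin `__builtin_prefetch`; on VC++ the corresponding intrinsic is `_mm_prefetch` from `<xmmintrin.h>`, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of DirectDraw or any library).
// Copies `bytes` bytes from src to dst, 8 bytes per step, and asks the
// CPU to start loading source data that will be needed soon.
inline void copy_qwords(void* dst, const void* src, std::size_t bytes) {
    assert(dst != nullptr && src != nullptr);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    const std::size_t kPrefetchAhead = 256;  // assumed distance; tune by measuring
    const std::size_t kCacheLine = 64;       // assumed cache-line size

    std::size_t i = 0;
    for (; i + 8 <= bytes; i += 8) {
#if defined(__GNUC__) || defined(__clang__)
        // One hint per (assumed) cache line is enough. A prefetch is only a
        // hint: it never changes the result, only when the data arrives.
        if (i % kCacheLine == 0 && i + kPrefetchAhead < bytes) {
            __builtin_prefetch(s + i + kPrefetchAhead);
        }
#endif
        std::uint64_t q;
        std::memcpy(&q, s + i, 8);  // 8-byte load
        std::memcpy(d + i, &q, 8);  // 8-byte store
    }
    // Copy the remaining 0..7 bytes one at a time.
    for (; i < bytes; ++i) {
        d[i] = s[i];
    }
}
```

Whether the prefetch helps at all depends on the CPU and on the destination memory, so any such loop should be timed against a plain `memcpy` before it is trusted.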

