So, here's what's going on: I've got a little memory buffer object, and I want to make it as fast as I can manage in .NET.
I've thrown together this little benchmark program that you see in the picture above. It runs 3 tests; each test writes 500,000 vertices @ 56 bytes per vertex to the memory buffer, and each write is repeated 1000 times. The 3 tests are as follows:
1. Write using GCHandle.Alloc to pin the source vertex array and the destination byte array for every iteration of the 1000-iteration loop, and use unsafe code to copy the data.
2. Pin the arrays outside of the 1000-iteration loop; otherwise essentially the same as test 1.
3. Prepare the vertex data as an array of 28,000,000 bytes (56 bytes * 500,000) and use Buffer.BlockCopy.
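For reference, the copy in test 3 boils down to a single Buffer.BlockCopy call over the flattened byte array. Here's a minimal sketch of that shape (array sizes shrunk way down, and the names are just placeholders, not the actual benchmark code):

```csharp
using System;

class BlockCopySketch
{
    static void Main()
    {
        // Flattened vertex data: 4 "vertices" of 56 bytes each here,
        // instead of the 500,000 used in the real benchmark.
        const int vertexSize = 56;
        const int vertexCount = 4;
        byte[] source = new byte[vertexSize * vertexCount];
        for (int i = 0; i < source.Length; i++)
            source[i] = (byte)(i % 251);

        byte[] destination = new byte[source.Length];

        // Buffer.BlockCopy operates on primitive arrays with byte offsets,
        // so no pinning or unsafe code is needed on the caller's side.
        Buffer.BlockCopy(source, 0, destination, 0, source.Length);

        Console.WriteLine(destination[100] == source[100]);
    }
}
```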
As you can see, it's not exactly swift, and I expect these numbers are probably not as good as they could be, especially compared to C++. You'll note that pinning the arrays on every iteration of the loop has very little impact on my time compared to pinning once. What did surprise me was the difference in time between Buffer.BlockCopy and my own code: 4 seconds seems like a lot. I'm not sure whether, given the amount of data and iterations, these numbers are good or bad, or whether the difference between the 1st and 3rd tests is really that big a deal. Thus I want some expert opinions on this.
I've never been good with optimization. In fact I'm horrible at it, so there's a very high probability that I'm not doing something that I should be to improve performance.
Here's the Writing code:
public override void Write(T[] data, int startIndex, int count)
{
    int dataSize = Marshal.SizeOf(typeof(T)) * count;
    GCHandle srcArrayHandle = GCHandle.Alloc(data, GCHandleType.Pinned);
    try
    {
        Write(Marshal.UnsafeAddrOfPinnedArrayElement(data, startIndex), dataSize);
    }
    finally
    {
        srcArrayHandle.Free();
    }
}

// Note: _lockPointer is an IntPtr to a pinned
// destination byte array and is set up when the
// Lock() function is called.
public override void Write(IntPtr pointer, int count)
{
    unsafe
    {
        byte* src = (byte*)pointer.ToPointer();
        byte* dest = (byte*)_lockPointer.ToPointer();
        MemCopy(src, dest, count);
    }
}
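As an aside, when a copy starts and ends inside one method, the C# `fixed` statement pins the arrays for just the duration of the block, without allocating and freeing a GCHandle. A sketch of that approach (not the code above; `WriteBytes` is a hypothetical stand-in, and this needs compiling with /unsafe):

```csharp
using System;

static class FixedPinSketch
{
    // Hypothetical stand-in for the Write overloads: copies 'count' bytes
    // from a managed source array into a managed destination array,
    // pinning both only for the duration of the fixed block.
    public static void WriteBytes(byte[] source, byte[] destination, int count)
    {
        unsafe
        {
            fixed (byte* src = source)
            fixed (byte* dest = destination)
            {
                // Plain byte-at-a-time copy; the point here is the pinning
                // pattern, not the copy loop itself.
                for (int i = 0; i < count; i++)
                    dest[i] = src[i];
            }
        }
    }

    static void Main()
    {
        byte[] src = { 1, 2, 3, 4 };
        byte[] dst = new byte[4];
        WriteBytes(src, dst, src.Length);
        Console.WriteLine(dst[3]);
    }
}
```

Whether this beats GCHandle.Alloc in practice would need measuring, but it avoids the per-call handle allocation entirely.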
and here's the memcopy code:
private unsafe void MemCopy(byte* src, byte* dest, int count)
{
    if (count >= 16)
    {
        do
        {
            if (IntPtr.Size == 4)
            {
                // Pointer arithmetic is scaled by the element size, so
                // +1/+2/+3 on an int* advances 4 bytes per step.
                int* intSrc = (int*)src;
                int* intDest = (int*)dest;
                *intDest = *intSrc;
                *(intDest + 1) = *(intSrc + 1);
                *(intDest + 2) = *(intSrc + 2);
                *(intDest + 3) = *(intSrc + 3);
            }
            else
            {
                // Likewise, +1 on a long* advances 8 bytes.
                long* longSrc = (long*)src;
                long* longDest = (long*)dest;
                *longDest = *longSrc;
                *(longDest + 1) = *(longSrc + 1);
            }
            src += 16;
            dest += 16;
            count -= 16;
        } while (count >= 16);
    }
    if ((count & 8) != 0)
    {
        if (IntPtr.Size == 4)
        {
            int* intSrc = (int*)src;
            int* intDest = (int*)dest;
            *intDest = *intSrc;
            *(intDest + 1) = *(intSrc + 1);
        }
        else
        {
            *(long*)dest = *(long*)src;
        }
        src += 8;
        dest += 8;
    }
    if ((count & 4) != 0)
    {
        *(int*)dest = *(int*)src;
        src += 4;
        dest += 4;
    }
    if ((count & 2) != 0)
    {
        *(short*)dest = *(short*)src;
        src += 2;
        dest += 2;
    }
    if ((count & 1) != 0)
    {
        *dest = *src;
    }
}
Note that here I try to force it to use long when I'm on a 64-bit system (which I am). I don't know if this is a performance boost or not; it didn't really seem to make much of a difference when I took it out, so I left it in.
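If the hand-unrolled loop turns out not to buy much, Marshal.Copy will copy between a managed array and an IntPtr with no unsafe code at all, which could stand in for both Write overloads. A minimal sketch under that assumption (the names here are illustrative, not the buffer class's actual members):

```csharp
using System;
using System.Runtime.InteropServices;

class MarshalCopySketch
{
    static void Main()
    {
        byte[] source = { 10, 20, 30, 40, 50 };
        byte[] destination = new byte[source.Length];

        // Simulate a pinned destination buffer (the role _lockPointer
        // plays in the original code).
        GCHandle destHandle = GCHandle.Alloc(destination, GCHandleType.Pinned);
        try
        {
            IntPtr destPointer = destHandle.AddrOfPinnedObject();

            // Managed source -> pinned destination, lengths in bytes.
            Marshal.Copy(source, 0, destPointer, source.Length);
        }
        finally
        {
            destHandle.Free();
        }

        Console.WriteLine(destination[4]);
    }
}
```

It would still be worth benchmarking this against Buffer.BlockCopy, since the two take different paths through the runtime.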
Feel free to chime in with any advice on how to make this faster or if you want to use this code, then by all means have at it.