Disk stalling on rapid r/w

Started by
11 comments, last by Khatharr 11 years, 5 months ago
I still haven't had a chance to go home and get the project yet, but the code is:

xor [edi],imm32 #7 cycles (imm-to-memory)
add edi,4 #2 cycles (imm-to-register)



From the 80386 Programmer's Reference Manual:

Exclusive-OR immediate dword to r/m dword
XOR r/m32,imm32 2/7

Add immediate dword to r/m dword
ADD r/m32,imm32 2/7
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.
@samoth: thanks! I assume you don't have a small design document of your own, is your algorithm proprietary or do you have the code somewhere?
No design doc exists; if anything, one would have to write the design doc after the implementation. Also, no publicly available sources exist at this time, sorry.

xor [edi],imm32 #7 cycles (imm-to-memory)
add edi,4 #2 cycles (imm-to-register)
OK; so assuming that this very inefficient code is not at all pipelined, that's 9 cycles (realistically it's more like 7). At a clock speed of, say, 2 GHz, that will be 211 MiB/s. That's a little over 2 seconds to encode 500 MiB.

Plain C code will easily run that fast, and likely faster. Note that this kind of code (xor with single immediate) is exactly what a modern compiler will trivially vectorize (and most probably unroll 8 times), hiding pipeline stalls and being 100% memory bandwidth limited.
This is what my C compiler (gcc 4.7.1) produces for a simple for(...) x ^= immediate; loop without doing anything special:

L2:
.loc 1 122 0 discriminator 2
movdqa (%eax), %xmm0
pxor %xmm1, %xmm0
movdqa %xmm0, (%eax)
addl $16, %eax
cmpl %edx, %eax
jne L2
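For reference, a plain C loop along these lines is all it takes to get that auto-vectorized output (a sketch; the function name is mine, and the buffer is assumed to be suitably aligned and sized for the vectorized path):

```cpp
#include <stddef.h>
#include <stdint.h>

/* XOR every dword in buf with the same key. A compiler like
   gcc -O3 will typically turn this into the movdqa/pxor loop
   shown above, processing 16 bytes per iteration. */
void xor_buffer(uint32_t *buf, size_t count, uint32_t key)
{
    for (size_t i = 0; i < count; ++i)
        buf[i] ^= key;
}
```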
The original was a nested loop, since it had to rotate through the key values. It had the overhead of both loops, continually checking whether it was at the end of the key or the chunk, and it had to fetch the key values from memory in order to apply them, since the key is provided at runtime. The optimized version works by compiling the key into the bytecode as immediate values and unrolling both loops to create a straight-line procedure for the entire chunk. The end result looks like:

[source lang="asm"]#start
push edi
mov edi,[esp+8]

#section repeated (literally repeated, not looped) N times
xor [edi],imm32
add edi,4

.....

#tail section
pop edi
ret 4[/source]

Then once the buffer is loaded I just

[source]unsigned char* chunk = chunk_ptr;
unsigned char* bytecode = bc_storage;
__asm {
push chunk
call bytecode
}[/source]

(It's necessary to duplicate the pointers because the 'real' ones are class members so MASM doesn't recognize them.)

The chunk length is allocated as a multiple of the key length, so there are no fiddly carry-overs. The same bytecode can be applied to every full chunk. On the final chunk (which is usually smaller than a full chunk) the 'tail' section gets copied into the appropriate location to truncate the bytecode.
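A hedged sketch of how that bytecode might be composed (the function and buffer names are mine; the opcode bytes are the standard IA-32 encodings for the instructions in the listing above):

```cpp
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* Emit the straight-line xor routine described above into 'out'.
   Prologue:  push edi / mov edi,[esp+8]
   Body:      reps * nkeys repetitions of  xor dword [edi],key / add edi,4
   Epilogue:  pop edi / ret 4
   Returns the number of bytes written; the caller must ensure 'out'
   is large enough (5 + reps*nkeys*9 + 4 bytes). */
static size_t emit_xor_bytecode(uint8_t *out, const uint32_t *keys,
                                size_t nkeys, size_t reps)
{
    uint8_t *p = out;
    *p++ = 0x57;                                        /* push edi        */
    *p++ = 0x8B; *p++ = 0x7C; *p++ = 0x24; *p++ = 0x08; /* mov edi,[esp+8] */
    for (size_t r = 0; r < reps; ++r) {
        for (size_t k = 0; k < nkeys; ++k) {
            *p++ = 0x81; *p++ = 0x37;          /* xor dword [edi],imm32 */
            memcpy(p, &keys[k], 4); p += 4;    /* the key as immediate  */
            *p++ = 0x83; *p++ = 0xC7; *p++ = 0x04;      /* add edi,4    */
        }
    }
    *p++ = 0x5F;                                        /* pop edi      */
    *p++ = 0xC2; *p++ = 0x04; *p++ = 0x00;              /* ret 4        */
    return (size_t)(p - out);
}
```

Truncating for the final, shorter chunk then amounts to writing the epilogue bytes at an earlier offset, as described.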

Actually, I just thought of another optimization (it's an illness). At present, if the key length is not a multiple of 4, either a zero-extended dword or a byte-length version of the xor is used to end the key. It occurred to me that I could just duplicate the key within its own buffer until its length is a multiple of 4, prior to composing the bytecode.

So a key like "testkey" of length 7 would become "testkeytestkeytestkeytestkey" of length 28, eliminating the need for partial 32-bit or 8-bit operations.

Eventually this will become so optimal that all I'll have to do is hover over the icon and the process will have been completed the day before. -.-
As for your stalling problem, reading or writing 64 MB to a hard-drive probably takes about 1 second, so likely your data is flushed to disk at that time. Since you say it occurs on the second read close after a write, I would guess that the system file cache is full and the OS decides to flush your recently written data to disk before reading in another 64 MB into the cache.
Could be. It's the whining of the disk that sort of freaks me out. It makes this sound like "wah-wah-wah-wah-wah" while it's stalling. After that one stall though it does all of the rest of the read/write operations without any complaints.
Here's the project if you still want to look at it.

https://docs.google....aGMxckhGTkR5bWM

The relevant code is in "cl_Bink.cpp".

It should go without saying that this won't run on a 64-bit machine, but I'm saying it anyway.

Also, I benched the original, unoptimized version against the current version with a 1GB file from the start of the copy/encrypt process to the end of it:

original: 30.045313 sec
optimized: 16.677706 sec

Not as good as I'd gathered from the cycle count, but I'm thinking that I was probably looking at a debug disassembly when I did the count (like a noob), so the compiler probably gives the original a bit of a speedup.
This example, for comparison, uses memory mapping (apologies for the very stripped down version of the MappedFile class, I would otherwise have had to include a lot more code with some dependencies -- the original has a few more features).
[source lang="cpp"]#include <windows.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

class MappedFile
{
    HANDLE file;
    HANDLE mapping;
    void* data;
    uint32_t len;
    unsigned int mode;

public:
    enum map_mode_t { readwrite };

    MappedFile(const char *name, map_mode_t map_mode = readwrite)
        : file(), mapping(), data(), len(), mode(map_mode) { Open(name, map_mode); }
    ~MappedFile() { Close(); }

    void Open(const char *name, map_mode_t map_mode = readwrite);
    void Close();

    void* Map(unsigned int map_size = 0);
    void UnMap();

    void* Data() const { return data; }
    uint32_t Length() const { return len; }
};


void MappedFile::Open(const char *name, map_mode_t map_mode)
{
    Close();
    mode = map_mode;
    int os_mode = GENERIC_READ | GENERIC_WRITE;
    file = CreateFile(name, os_mode, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
    len = GetFileSize(file, 0);
    Map();
}

void MappedFile::Close()
{
    UnMap();
    if(file) CloseHandle(file);
    file = 0;
}

void* MappedFile::Map(unsigned int map_size)
{
    int os_mode = PAGE_READWRITE;
    int os_mode2 = FILE_MAP_WRITE;
    UnMap();
    mapping = CreateFileMappingA(file, 0, os_mode, 0, map_size, 0);
    data = MapViewOfFile(mapping, os_mode2, 0, 0, map_size);
    return data;
}

void MappedFile::UnMap()
{
    if(data)
    {
        UnmapViewOfFile(data);
        CloseHandle(mapping);
        data = 0;
    }
}

int main()
{
    printf("%u create mapping\n", (unsigned)time(0));
    MappedFile f("test.dat");

    if(!f.Data()) { puts("fail."); return 1; }

    printf("len = %u (%u MiB)\n", f.Length(), f.Length()/(1024*1024));

    printf("%u begin xor\n", (unsigned)time(0));

    for(char* c = (char*) f.Data(); c < (char*) f.Data() + f.Length(); ++c)
        *c ^= 'x';

    printf("%u end xor, closing mapping\n", (unsigned)time(0));
    f.Close();
    printf("%u finished\n", (unsigned)time(0));

    return 0;
}
[/source]

Compiled using Code::Blocks / MinGW-TDM 4.7.1-1, no special stuff, just pressing the "build" button, on a 7200rpm Samsung SATA-150 disk.

Output:

1352892208 create mapping
len = 546904331 (521 MiB)
1352892208 begin xor
1352892208 end xor, closing mapping
1352892208 finished
Process returned 0 (0x0) execution time : 0.891 s
Press any key to continue.

These timings show three things:
1. The Sysinternals Cacheset program is broken (I used this to clear the cache prior to the test)
2. The data is quite obviously fetched from the buffer cache.
3. Writing (or modifying) half a gigabyte worth of data does not "stall" (if the computer has sufficient RAM), just like I told you.
Numbers after rebooting (cache is certainly gone now):

1352893547 create mapping
len = 546904331 (521 MiB)
1352893547 begin xor
1352893557 end xor, closing mapping
1352893557 finished
Process returned 0 (0x0) execution time : 10.234 s
Press any key to continue.

This shows the following:
1. The disk stinks... 50 MiB/s, I really need to buy a new one some day (though I don't know whether it's fragmented)
2. Closing the mapping still takes "zero time", no stall visible for writing back the data.


EDIT:
Defraggler tells me that test.dat has 371 fragments, so maybe defragmenting would indeed have given a bit less of a "disk stinks" impression (considering that probably around 2 seconds are now being spent in seeking, which is 20%). Either way, it doesn't matter much for the result.

EDIT 2:
Well, wow... the gains from defragmenting are more like 50%.

1352894459 create mapping
len = 546904331 (521 MiB)
1352894459 begin xor
1352894464 end xor, closing mapping
1352894464 finished

Process returned 0 (0x0) execution time : 5.469 s
Press any key to continue.
Thanks, I've been reading a little about mapping in the MSDN docs.

Probably I'll try to implement it next to see what the performance difference is like. Your example gives good execution times, but it's simply xor'ing an existing (or newly created) file. I'd have to copy the file from the original to make that work with my implementation. I'll have to fiddle around with how to get the best result from it (copy first then mod vs. read-to-map then mod, etc.).
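One way to sidestep the "copy first, then mod" question entirely is to fold both steps into a single streaming pass. This is a portable stdio sketch, not the thread's mapping approach (the function name and buffer size are mine); a Windows-specific variant could instead copy with CopyFile and then modify through the MappedFile class above:

```cpp
#include <stdio.h>
#include <stddef.h>

/* Copy src to dst, xor-ing each byte with a repeating key on the way.
   The data is read once and written once, so no separate copy pass is
   needed. Returns 0 on success, -1 on failure to open either file. */
static int copy_and_xor(const char *src, const char *dst,
                        const unsigned char *key, size_t keylen)
{
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }

    unsigned char buf[1 << 16];   /* 64 KiB chunks */
    size_t ki = 0, n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        for (size_t i = 0; i < n; ++i) {
            buf[i] ^= key[ki];
            ki = (ki + 1) % keylen;  /* rotate through the key */
        }
        fwrite(buf, 1, n, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}
```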

Also, concerning the stall, it's funny that you mentioned defraggler. I defragged the disk in question a couple days ago and the stall went away.
