
Disk stalling on rapid r/w



#1 Khatharr   Crossbones+   -  Reputation: 2768


Posted 06 November 2012 - 08:01 PM

I've got a small pet project that I've been playing with just for personal interest. It reads a file from the disk, performs rapid key-based xor encryption on the contents and then saves it to disk as a new file.

I recently optimized the encryption code and it now executes almost immediately (even for 64 MB chunks), so the read and the write are very close together.

When I compile the release version with standard VS2008 optimizations enabled I notice that the program causes a disk stall upon the second read operation that can take up to a full second to resolve (while the hard disk complains audibly). Needless to say it's disturbing, and it occurs with any large files that I use it on, so I don't think it's a bad sector or anything.

Has anyone encountered this before? Is there something I should tell Windows to get it to calm down a bit or should I just back off the r/w frequency a bit?

For reference, I'm using the win32 (XP SP2) file IO functions (CreateFile, ReadFile, WriteFile, CloseHandle, etc.). I'm using synchronous IO (yes, I know that async would remove the need for an optimized encryption algorithm, but it's not as fun - and fun is my main goal here). The disk is a 1 TB NTFS volume on SATA 6 Gb/s.
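In case the shape of the thing matters, the loop is essentially this (a from-memory sketch, not the actual project code; Encrypt() stands in for the xor routine, and error handling is omitted):

[source lang="cpp"]
#include <windows.h>

void Encrypt(unsigned char* buf, DWORD len);  // the xor routine

void CopyEncrypt(const char* src, const char* dst,
                 unsigned char* buf, DWORD bufLen)
{
    HANDLE in  = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, 0,
                             OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    HANDLE out = CreateFileA(dst, GENERIC_WRITE, 0, 0,
                             CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
    DWORD got = 0, wrote = 0;
    // Synchronous pipeline: read a chunk, encrypt it in place, write it out.
    while(ReadFile(in, buf, bufLen, &got, 0) && got > 0)
    {
        Encrypt(buf, got);
        WriteFile(out, buf, got, &wrote, 0);
    }
    CloseHandle(in);
    CloseHandle(out);
}
[/source]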

Any advice would be appreciated. :)
void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.


#2 Bacterius   Crossbones+   -  Reputation: 7963


Posted 07 November 2012 - 07:38 AM

64 MB chunks are far too large. You should be reading the file in blocks of 16 KB or so, perhaps 64 KB; anything more is wasted memory, and you simply won't go any faster. Besides, with a simple XOR encryption you should be encrypting orders of magnitude faster than your hard drive can read or write. Can you show your code? Perhaps there's something there. (Also, you can pass special flags to CreateFile if you want to go a little faster - there are hints you can give to the operating system to help it manage your file, depending on your needs. For instance, you can disable caching, or tell it you'll be reading the file sequentially.)
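For example (a sketch; "input.bin" is just a placeholder name):

[source lang="cpp"]
#include <windows.h>

// Hint that the file will be read front to back, so the cache manager
// reads ahead aggressively:
HANDLE h = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, 0,
                       OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);

// Or bypass the system cache entirely (buffers, offsets and lengths must
// then be sector-aligned):
HANDLE h2 = CreateFileA("input.bin", GENERIC_READ, FILE_SHARE_READ, 0,
                        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, 0);
[/source]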

Edited by Bacterius, 07 November 2012 - 07:39 AM.

The slowsort algorithm is a perfect illustration of the multiply and surrender paradigm, which is perhaps the single most important paradigm in the development of reluctant algorithms. The basic multiply and surrender strategy consists in replacing the problem at hand by two or more subproblems, each slightly simpler than the original, and continue multiplying subproblems and subsubproblems recursively in this fashion as long as possible. At some point the subproblems will all become so simple that their solution can no longer be postponed, and we will have to surrender. Experience shows that, in most cases, by the time this point is reached the total work will be substantially higher than what could have been wasted by a more direct approach.

 

- Pessimal Algorithms and Simplexity Analysis


#3 samoth   Crossbones+   -  Reputation: 4464


Posted 07 November 2012 - 08:15 AM

64 MiB chunks are normally not too large by any means. I've written 200 MiB chunks and modified 700 MiB chunks via memory mapping, and it all works just fine with no measurable delay. Unless I explicitly sync, writeback is totally invisible on my machine (WinXP SP3), so it must all be lazy writes, much like e.g. under Linux.

Is there a chance that your computer is a bit low on RAM? I'm asking because of WinXP SP2, which admittedly is a bit elderly (heck, even my WinXP SP3 is a dinosaur!).

Reads and writes are served from the buffer cache, which is internally (without you knowing) implemented via memory mapping. Normally, "synchronous" really means "instantly" (or nearly so), because all that happens in WriteFile is a memcpy to the buffers.
Of course, if the machine doesn't have enough physical RAM to hold all of the data (or rather, twice as much) in one go, it can only copy part of your data, and then it must write back dirty pages. Only when that is done does it copy the rest of the data to the buffers. In the meantime, your application is stalled (that's truly "synchronous" then).

Tip: You can cut the physical RAM needed in half by using memory mapping. In that case, you just modify what's mapped and close the mapping; the data finds its way to the hard disk "magically".

About XOR encryption per se: it's probably not necessary to point out that it is immensely weak as a cipher.

Edited by samoth, 07 November 2012 - 08:17 AM.


#4 Khatharr   Crossbones+   -  Reputation: 2768


Posted 07 November 2012 - 09:49 PM

Thank you for your responses.

I can typically allocate well over 500MB without trouble. In this case I don't think it's a memory issue since the same buffer is being used in all cases. The program starts with a 64MB baseline, then if the input file is smaller it clips to that length. Once it has determined the desired length it starts allocation attempts, cutting the length in half with each failure until it either gets a success or reaches something like 1KB (I forget where I set the cutoff.)
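Roughly this (from memory, not the actual code):

[source lang="cpp"]
#include <windows.h>
#include <stdlib.h>

// Start at 64 MB (or the file size, if smaller) and halve on each
// failed allocation, down to a ~1 KB floor.
unsigned char* AllocChunk(DWORD fileSize, DWORD& chunkLen)
{
    const DWORD MAX_CHUNK = 64 * 1024 * 1024;
    const DWORD MIN_CHUNK = 1024;
    chunkLen = (fileSize < MAX_CHUNK) ? fileSize : MAX_CHUNK;
    while(chunkLen >= MIN_CHUNK)
    {
        unsigned char* p = (unsigned char*)malloc(chunkLen);
        if(p) return p;
        chunkLen /= 2;  // halve and retry
    }
    return 0;
}
[/source]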

At any rate I don't think it's a memory issue.

Memory mapping sounds interesting. I'll have to look into it.

Really what I'm concerned about, though, is why the stall is happening, so that I can be aware of what limitation is causing this.

Oh, yeah. I know the encryption isn't grand. This project is just a programming toy, really.

The optimization is essentially a runtime compile of the encryption code. The program writes optimized bytecode into a char array and then calls it as a function. I'll bring it along next time I pass through just so you can get a chuckle out of it. It actually is an enormous optimization in terms of execution time per key-application, and for a 500MB file it saves about 20 seconds, which makes the thing really snappy.

The problem, of course, is that no matter how optimal the encryption is, using a circular buffer with async file reads would be faster.

Edited by Khatharr, 07 November 2012 - 09:54 PM.


#5 samoth   Crossbones+   -  Reputation: 4464


Posted 08 November 2012 - 03:53 AM

enormous optimization [...] for a 500MB file it saves about 20 seconds, which makes the thing really snappy.

Most block ciphers have implementations that run at speeds upwards of 200 MiB per second. I have an implementation of an optimized block cipher (pure SSE2 implementation) that encrypts/decrypts at rates close to 1 GiB/s.

XOR should, depending on how fast your RAM is, run at a minimum of 4-6 GiB/s (possibly twice that on a modern machine); at that rate, 500 MiB takes on the order of a tenth of a second. Saving 20 seconds implies your normal runtime is at least 20 seconds (otherwise you could not possibly save 20 seconds), so something must be wrong there. XOR-ing 500 MiB should be effectively instantaneous without any special optimizations, assuming the data is in RAM.

The problem, of course, is that no matter how optimal the encryption is, using a circular buffer with async file reads would be faster.

No. Although DMA may save you a memcpy if you use unbuffered overlapped reads/writes, and you may get reasonably good rates, overlapped IO as such is not faster (it is, however, more complicated). In fact, buffered overlapped IO is even considerably slower than buffered synchronous IO.
The assumption that directly DMAing into your address space is advantageous is correct in some special cases, for example when you stream large amounts of data off an optical disk and have harsh latency requirements. However, for the general case, and in particular for reading from a hard disk on a computer that is not struggling with low memory conditions, using the buffer cache is a total win.

File mapping is the best you can possibly do. The operating system does the readahead and the writeback without you worrying or even knowing, and it's as efficient as it gets.

#6 Bacterius   Crossbones+   -  Reputation: 7963


Posted 08 November 2012 - 04:09 AM

Most block ciphers have implementations that run at speeds upwards of 200 MiB per second. I have an implementation of an optimized block cipher (pure SSE2 implementation) that encrypts/decrypts at rates close to 1 GiB/s.

Off-topic, but what block cipher would that be? Multithreaded or per core, and on what CPU? Just curious, because that is unusually high for a block cipher (a hand-written assembly implementation of Threefish-256 gives roughly 350 MB/s per core, and hardware-accelerated AES is about the same).



#7 samoth   Crossbones+   -  Reputation: 4464


Posted 08 November 2012 - 11:20 AM

Off-topic but what block cipher would that be? Multithreaded or per core? (what CPU). Just curious because that is unusually high for a block cipher (a hand-written assembly implementation of Threefish256 gives roughly 350MB/s per-core, and hardware-accelerated AES is about the same).

That's per-core. A plain, normal Feistel network with 16 rounds, using an s-box that fits into 64 cache lines of L1, and stock techniques as used in "typical" algorithms (modular multiplication, substitution, compression, differently sized shifts, xor/add). It is very fast because it is not implemented from a design document, but written around exactly what the SSE2 instruction set gives you, nothing else (well, that, and careful tuning).

It looks "plausibly ok" both from the operations it does and from statistical tests on the output. Although it is possibly (but not necessarily) less secure than a proper algorithm like e.g. AES, that doesn't matter to me. The main requirements are being fast, low footprint, and low overhead for switching keys. Especially the latter is something that almost no established algorithm offers. Key setup in Twofish, for example, is more expensive than the actual encryption for most messages.

So, my own block cipher may or may not be slightly less secure, but that's fine; it won't guard nuclear secrets. If you can really find an attack that can decrypt my game's network traffic given a week of number crunching on a cluster, good for you, but it won't gain you much :-)
It's of course highly non-portable (though one could probably re-implement it very inefficiently in plain C), but that is fine for me, too.

Edited by samoth, 08 November 2012 - 11:22 AM.


#8 Bacterius   Crossbones+   -  Reputation: 7963


Posted 08 November 2012 - 05:21 PM

@samoth: thanks! I assume you don't have a small design document of your own, is your algorithm proprietary or do you have the code somewhere? I'd be interested in looking at it :)



#9 Khatharr   Crossbones+   -  Reputation: 2768


Posted 08 November 2012 - 08:53 PM

The 20s number is an estimate from cycle counts for the bytecode, in terms of performance on my 2.41 GHz CPU (running on a single core). I haven't actually benched it against the original, although I'm now considering it. There was a noticeable increase in speed, though. The original version, using nested loops, compiled out to something like 80 cycles per byte, whereas the optimized one is 9 cycles per 4 bytes.

At any rate, I'm confident that the existing cryp code is at the limits of optimization for the 32-bit range (not really looking at SIMD here yet). Basically it boils down to a long series of 'xor [edi],imm32' with the occasional 'xor [edi],imm8' where appropriate, interspersed with 'add/inc edi' of course.

So you're saying that starting an async read into a second buffer while the cryp is running wouldn't essentially make the cryp's speed irrelevant? If so, then I'm very much confused about what 'asynchronous' means. What I mean is that regardless of the speed of the cryp code, so long as it's faster than a load from disk (which it always is, barring nonsense), reading the next chunk from disk while the previous chunk is being modified means there's close to zero wait between reads and writes.

I know overlapped runs slow as hell if you have multiple IO operations competing. I'd only async the read operations and do the writes synchronously. Would (r+cryp)->w really be slower than r->cryp->w? Honestly, I don't know what the overlapped stuff does under the hood; I'd just assumed it was a form of threaded disk IO. You're saying it's for disk DMA?
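What I have in mind is roughly this (an untested sketch; EncryptChunk() stands in for the bytecode call):

[source lang="cpp"]
#include <windows.h>

void EncryptChunk(char* p, DWORD n);   // stand-in for the bytecode call

void CopyEncryptOverlapped(const char* src, const char* dst)
{
    const DWORD CHUNK = 64 * 1024;
    HANDLE in  = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, 0,
                             OPEN_EXISTING,
                             FILE_FLAG_OVERLAPPED | FILE_FLAG_SEQUENTIAL_SCAN, 0);
    HANDLE out = CreateFileA(dst, GENERIC_WRITE, 0, 0,
                             CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
    char* front = new char[CHUNK];     // chunk being encrypted/written
    char* back  = new char[CHUNK];     // chunk being read in the background

    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventA(0, TRUE, FALSE, 0);
    LONGLONG offset = 0;
    ReadFile(in, front, CHUNK, 0, &ov);            // kick off the first read

    for(;;)
    {
        DWORD got = 0, wrote = 0;
        if(!GetOverlappedResult(in, &ov, &got, TRUE) || got == 0)
            break;                                 // EOF (or error)
        offset += got;

        // Start the next read into the back buffer *before* encrypting,
        // so the disk works while the CPU does.
        ResetEvent(ov.hEvent);
        ov.Offset     = (DWORD)offset;
        ov.OffsetHigh = (DWORD)(offset >> 32);
        ReadFile(in, back, CHUNK, 0, &ov);

        EncryptChunk(front, got);                  // overlaps with the read
        WriteFile(out, front, got, &wrote, 0);     // write stays synchronous

        char* t = front; front = back; back = t;   // swap buffers
    }

    delete[] front; delete[] back;
    CloseHandle(ov.hEvent);
    CloseHandle(in);
    CloseHandle(out);
}
[/source]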

#10 Bacterius   Crossbones+   -  Reputation: 7963


Posted 08 November 2012 - 09:01 PM

Can you show your encryption code? 9 cycles per 4 bytes is certainly not the limit of optimization for 32-bit processors. With a single XOR operation you should be looking at much less than that, even without SIMD instructions...



#11 Khatharr   Crossbones+   -  Reputation: 2768


Posted 08 November 2012 - 09:30 PM

I still haven't had a chance to go home and get the project yet, but the code is:

xor [edi],imm32 #7 cycles (imm-to-memory)
add edi,4 #2 cycles (imm-to-register)



From the 80386 Programmer's Reference Manual (clock counts are register/memory):

Exclusive-OR immediate dword to r/m dword
XOR r/m32,imm32    2/7

Add immediate dword to r/m dword
ADD r/m32,imm32    2/7

Edited by Khatharr, 08 November 2012 - 09:32 PM.


#12 samoth   Crossbones+   -  Reputation: 4464


Posted 09 November 2012 - 05:39 AM

@samoth: thanks! I assume you don't have a small design document of your own, is your algorithm proprietary or do you have the code somewhere?

No design doc exists; if anything, one would have to be written after the implementation. And no publicly available sources exist at this time, sorry.

xor [edi],imm32 #7 cycles (imm-to-memory)
add edi,4 #2 cycles (imm-to-register)

OK, so assuming this very inefficient code is not pipelined at all, that's 9 cycles per 4 bytes (realistically it's more like 7). At a clock speed of, say, 2 GHz, that works out to roughly 850 MiB/s, or somewhat over half a second to encode 500 MiB.

Plain C code will easily run that fast, and likely faster. Note that this kind of code (xor with a single immediate) is exactly what a modern compiler will trivially vectorize (and most probably unroll 8 times), hiding pipeline stalls and ending up 100% memory-bandwidth limited.

Edited by samoth, 09 November 2012 - 05:39 AM.


#13 samoth   Crossbones+   -  Reputation: 4464


Posted 09 November 2012 - 05:47 AM

This is what my C compiler (gcc 4.7.1) produces for a simple for(...) x[i] ^= immediate; loop without doing anything special:

L2:
    .loc 1 122 0 discriminator 2
    movdqa (%eax), %xmm0    # load 16 bytes
    pxor %xmm1, %xmm0       # xor with the immediate, splatted into xmm1
    movdqa %xmm0, (%eax)    # store them back
    addl $16, %eax          # advance 16 bytes per iteration
    cmpl %edx, %eax
    jne L2                  # until the end pointer is reached

Edited by samoth, 09 November 2012 - 05:49 AM.


#14 Khatharr   Crossbones+   -  Reputation: 2768


Posted 09 November 2012 - 03:06 PM

The original was a nested loop, since it had to rotate through the key values. It had the overhead of both loops, continually checking whether it was at the end of the key or the chunk, and it had to fetch the key values from memory in order to apply them, since the key is provided at runtime. The optimized version works by compiling the key into the bytecode as immediate values and unrolling both loops to create a straight-line procedure for the entire chunk. The end result looks like:

[source lang="asm"]
#start
push edi
mov edi,[esp+8]

#section repeated (literally repeated, not looped) N times
xor [edi],imm32
add edi,4
...

#tail section
pop edi
ret 4
[/source]

Then once the buffer is loaded I just

unsigned char* chunk = chunk_ptr;
unsigned char* bytecode = bc_storage;
__asm {
  push chunk
  call bytecode
}


(It's necessary to duplicate the pointers because the 'real' ones are class members so MASM doesn't recognize them.)

The chunk length is allocated as a multiple of the key length, so there's no fiddly carry-overs. The same bytecode can be applied to every full chunk. On the final chunk (which is usually less than the size of a full chunk) the 'tail' section gets copied into the appropriate location to truncate the bytecode.
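Composing the bytecode is just pasting opcode bytes into a buffer, roughly like this (a simplified sketch rather than the actual project code; I've used VirtualAlloc with PAGE_EXECUTE_READWRITE here so DEP won't object, and I'm assuming the key length is already a multiple of 4 and the chunk length a multiple of the key length):

[source lang="cpp"]
#include <windows.h>
#include <string.h>

// Matches 'push chunk; call bytecode' with the callee doing 'ret 4'.
typedef void (__stdcall *CrypFn)(unsigned char* chunk);

CrypFn CompileCryp(const unsigned char* key, DWORD keyLen, DWORD chunkLen)
{
    DWORD reps = chunkLen / 4;
    DWORD size = 5 + reps * 9 + 4;  // prologue + (xor+add)*reps + epilogue
    unsigned char* bc = (unsigned char*)VirtualAlloc(
        0, size, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    unsigned char* p = bc;

    *p++ = 0x57;                                         // push edi
    *p++ = 0x8B; *p++ = 0x7C; *p++ = 0x24; *p++ = 0x08;  // mov edi,[esp+8]

    for(DWORD i = 0; i < reps; ++i)
    {
        *p++ = 0x81; *p++ = 0x37;                        // xor dword ptr [edi],imm32
        memcpy(p, key + ((i * 4) % keyLen), 4); p += 4;  // key dword as immediate
        *p++ = 0x83; *p++ = 0xC7; *p++ = 0x04;           // add edi,4
    }

    *p++ = 0x5F;                                         // pop edi
    *p++ = 0xC2; *p++ = 0x04; *p++ = 0x00;               // ret 4
    return (CrypFn)bc;
}
[/source]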

Actually, I just thought of another optimization (it's an illness). Presently, if the key length is not a multiple of 4, then either a zero-extended dword or a byte-length version of the xor is used to finish off the key. It occurred to me that I could just duplicate the key within its own buffer until its length is a multiple of 4, prior to composing the bytecode.

So a key like "testkey" of length 7 would become "testkeytestkeytestkeytestkey" of length 28, eliminating the need for partial 32's or for 8's.
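Something like (sketch):

[source lang="cpp"]
#include <string>

// Pad the key out to a multiple of 4 bytes by repeating it, so the
// generated bytecode only ever needs full dword xors.
// "testkey" (7 bytes) becomes 28 bytes after three extra copies.
std::string PadKeyToDwords(const std::string& key)
{
    std::string padded = key;
    while(padded.size() % 4 != 0)
        padded += key;              // append another full copy
    return padded;
}
[/source]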

Eventually this will become so optimal that all I'll have to do is hover over the icon and the process will have been completed the day before. -.-

Edited by Khatharr, 09 November 2012 - 03:35 PM.


#15 Erik Rufelt   Crossbones+   -  Reputation: 3018


Posted 09 November 2012 - 03:39 PM

As for your stalling problem, reading or writing 64 MB to a hard-drive probably takes about 1 second, so likely your data is flushed to disk at that time. Since you say it occurs on the second read close after a write, I would guess that the system file cache is full and the OS decides to flush your recently written data to disk before reading in another 64 MB into the cache.

#16 Khatharr   Crossbones+   -  Reputation: 2768


Posted 09 November 2012 - 09:40 PM

Could be. It's the whining of the disk that sort of freaks me out. It makes this sound like "wah-wah-wah-wah-wah" while it's stalling. After that one stall though it does all of the rest of the read/write operations without any complaints.

Edited by Khatharr, 09 November 2012 - 09:40 PM.


#17 Khatharr   Crossbones+   -  Reputation: 2768


Posted 13 November 2012 - 04:14 PM

Here's the project if you still want to look at it.

https://docs.google....aGMxckhGTkR5bWM

The relevant code is in "cl_Bink.cpp".

It should go without saying that this won't run on a 64-bit machine, but I'm saying it anyway.

Also, I benched the original, unoptimized version against the current version with a 1GB file from the start of the copy/encrypt process to the end of it:

original: 30.045313 sec
optimized: 16.677706 sec

Not as good as I'd gathered from the cycle-count, but I'm thinking that I was probably looking at debug disassembly when I did the count (like a noob), so the compiler probably gives the original a bit of a speedup.

Edited by Khatharr, 13 November 2012 - 04:58 PM.


#18 samoth   Crossbones+   -  Reputation: 4464


Posted 14 November 2012 - 05:36 AM

This example, for comparison, uses memory mapping. (Apologies for the very stripped-down version of the MappedFile class; I would otherwise have had to include a lot more code with some dependencies - the original has a few more features.)
[source lang="cpp"]
#include <windows.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>

class MappedFile
{
    HANDLE file;
    HANDLE mapping;
    void* data;
    uint32_t len;
    unsigned int mode;
public:
    enum map_mode_t { readwrite };

    MappedFile(const char *name, map_mode_t map_mode = readwrite)
        : file(), mapping(), data(), len(), mode(map_mode)
    {
        Open(name, map_mode);
    }
    ~MappedFile() { Close(); }

    void Open(const char *name, map_mode_t map_mode = readwrite);
    void Close();
    void* Map(unsigned int map_size = 0);
    void UnMap();

    void* Data() const { return data; };
    uint32_t Length() const { return len; };
};

void MappedFile::Open(const char *name, map_mode_t map_mode)
{
    Close();
    mode = map_mode;
    int os_mode = GENERIC_READ | GENERIC_WRITE;
    file = CreateFile(name, os_mode, FILE_SHARE_READ | FILE_SHARE_WRITE, 0,
                      OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
    len = GetFileSize(file, 0);
    Map();
}

void MappedFile::Close()
{
    UnMap();
    CloseHandle(file);
    file = 0;
}

void* MappedFile::Map(unsigned int map_size)
{
    int os_mode = PAGE_READWRITE;
    int os_mode2 = FILE_MAP_WRITE;
    UnMap();
    mapping = CreateFileMappingA(file, 0, os_mode, 0, map_size, 0);
    data = MapViewOfFile(mapping, os_mode2, 0, 0, map_size);
    return data;
}

void MappedFile::UnMap()
{
    if(data)
    {
        UnmapViewOfFile(data);
        CloseHandle(mapping);
        data = 0;
    }
}

int main()
{
    printf("%u create mapping\n", time(0));
    MappedFile f("test.dat");
    if(!f.Data()) { puts("fail."); return 1; }
    printf("len = %u (%u MiB)\n", f.Length(), f.Length()/(1024*1024));

    printf("%u begin xor\n", time(0));
    for(char* c = (char*) f.Data(); c < f.Data() + f.Length(); ++c)
        *c ^= 'x';
    printf("%u end xor, closing mapping\n", time(0));

    f.Close();
    printf("%u finished\n", time(0));
    return 0;
}
[/source]

Compiled using Code::Blocks / MinGW-TDM 4.7.1-1, no special stuff, just pressing the "build" button, on a 7200rpm Samsung SATA-150 disk.

Output:

1352892208 create mapping
len = 546904331 (521 MiB)
1352892208 begin xor
1352892208 end xor, closing mapping
1352892208 finished
Process returned 0 (0x0) execution time : 0.891 s
Press any key to continue.

These timings show three things:
1. The Sysinternals Cacheset program is broken (I used this to clear the cache prior to the test)
2. The data is quite obviously fetched from the buffer cache.
3. Writing (or modifying) half a gigabyte worth of data does not "stall" (if the computer has sufficient RAM), just like I told you.

Edited by samoth, 14 November 2012 - 05:38 AM.


#19 samoth   Crossbones+   -  Reputation: 4464


Posted 14 November 2012 - 05:52 AM

Numbers after rebooting (cache is certainly gone now):

1352893547 create mapping
len = 546904331 (521 MiB)
1352893547 begin xor
1352893557 end xor, closing mapping
1352893557 finished
Process returned 0 (0x0) execution time : 10.234 s
Press any key to continue.

This shows the following:
1. The disk stinks... 50 MiB/s, I really need to buy a new one some day (though I don't know whether it's fragmented)
2. Closing the mapping still takes "zero time", no stall visible for writing back the data.


EDIT:
Defraggler tells me that test.dat has 371 fragments, so maybe defragmenting would indeed have given a bit less of a "disk stinks" impression (considering that probably around 2 seconds are now being spent in seeking, which is 20%). Either way, it doesn't matter much for the result.

EDIT 2:
Well, wow... the gains from defragmenting are more like 50%.

1352894459 create mapping
len = 546904331 (521 MiB)
1352894459 begin xor
1352894464 end xor, closing mapping
1352894464 finished

Process returned 0 (0x0) execution time : 5.469 s
Press any key to continue.

Edited by samoth, 14 November 2012 - 06:03 AM.


#20 Khatharr   Crossbones+   -  Reputation: 2768


Posted 14 November 2012 - 05:52 PM

Thanks, I've been reading a little about mapping in the MSDN docs.

Probably I'll try implementing it next to see what the performance difference is like. Your example gives good execution times, but it's simply xor'ing an existing (or newly created) file in place. I'd have to copy the file from the original to make that work with my implementation, so I'll have to fiddle around with how to get the best result (copy first then modify, vs. read-to-map then modify, etc.).
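E.g. the copy-first variant could look like this (untested sketch, reusing your MappedFile class from post #18, with a fixed 'x' standing in for the real key loop):

[source lang="cpp"]
#include <windows.h>
#include <stdint.h>

// Let the OS do the copy, then xor the destination in place
// through a mapping; writeback happens lazily on Close().
bool CopyThenEncrypt(const char* src, const char* dst)
{
    if(!CopyFileA(src, dst, FALSE))   // FALSE = overwrite existing
        return false;
    MappedFile f(dst);                // samoth's class from post #18
    if(!f.Data())
        return false;
    char* p = (char*)f.Data();
    for(uint32_t i = 0; i < f.Length(); ++i)
        p[i] ^= 'x';                  // stand-in for the real key loop
    f.Close();
    return true;
}
[/source]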

Also, concerning the stall, it's funny that you mentioned defraggler. I defragged the disk in question a couple days ago and the stall went away.

Edited by Khatharr, 14 November 2012 - 06:10 PM.




