Disk stalling on rapid r/w

11 comments, last by Khatharr 11 years, 5 months ago
I've got a small pet project that I've been playing with just for personal interest. It reads a file from disk, performs rapid key-based XOR encryption on the contents, and then saves the result to disk as a new file.

I recently optimized the encryption code and it now executes almost immediately (even for 64 MB chunks), so the read and the write happen very close together.

When I compile the release version with standard VS2008 optimizations enabled, I notice that the program causes a disk stall on the second read operation that can take up to a full second to resolve (while the hard disk complains audibly). Needless to say, it's disturbing, and it occurs with any large file I use it on, so I don't think it's a bad sector or anything.

Has anyone encountered this before? Is there something I should tell Windows to get it to calm down a bit or should I just back off the r/w frequency a bit?

For reference, I'm using the Win32 (XP SP2) file IO functions (CreateFile(), ReadFile(), WriteFile(), CloseHandle(), etc.). I'm using synchronous IO (yes, I know that async would remove the need for an optimized encryption algorithm, but it's not as much fun, and fun is my main goal here). The disk is a one-terabyte NTFS SATA 6Gb/s volume.
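
Stripped down, the loop is something like this sketch (error handling elided; the single-byte key is just for illustration, the real thing applies a longer key):

    #include <windows.h>

    // Read -> XOR -> write, synchronously, one chunk at a time.
    void XorCopy(const wchar_t* inPath, const wchar_t* outPath, BYTE key)
    {
        HANDLE in  = CreateFileW(inPath, GENERIC_READ, FILE_SHARE_READ, NULL,
                                 OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE out = CreateFileW(outPath, GENERIC_WRITE, 0, NULL,
                                 CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        static BYTE buf[64 * 1024];   // the real project uses chunks up to 64 MB
        DWORD got = 0, put = 0;
        while (ReadFile(in, buf, sizeof(buf), &got, NULL) && got > 0)
        {
            for (DWORD i = 0; i < got; ++i)
                buf[i] ^= key;                    // the "encryption" pass
            WriteFile(out, buf, got, &put, NULL); // synchronous write
        }
        CloseHandle(in);
        CloseHandle(out);
    }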

Any advice would be appreciated. :)
64 MB chunks are far too large. You should be reading the file in blocks of 16 KB or so, perhaps 64 KB; anything more is wasted memory, and you simply won't go any faster. Besides, with a simple XOR encryption you should be encrypting orders of magnitude faster than your hard drive can read or write. Can you show your code? Perhaps there's something there. Also, you can pass special flags to CreateFile if you want to go a little faster; there are hints you can give the operating system to help it manage your file depending on your needs. For instance, you can disable caching, or tell it you'll be reading the file sequentially.
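
The sequential-read hint looks like this (the path and the rest of the call are just illustrative):

    HANDLE h = CreateFileW(L"input.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_SEQUENTIAL_SCAN, // hint: front-to-back access
                           NULL);
    // FILE_FLAG_NO_BUFFERING disables the cache entirely, but then file
    // offsets, transfer sizes, and buffer addresses must be sector-aligned.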


64 MiB chunks are normally not too large by any means. I've been writing 200 MiB chunks and modifying 700 MiB chunks via memory mapping, all of which works just fine with no measurable delay. Unless I explicitly sync, writeback is totally invisible on my machine (WinXP SP3), so it must all be lazy writes, much like under Linux.

Is there a chance that your computer is a bit low on RAM? I'm asking because of WinXP SP2, which admittedly is a bit elderly (heck, even my WinXP SP3 is a dinosaur!).

Reads and writes are served from the buffer cache, which is internally (without you knowing) implemented via memory mapping. Normally, "synchronous" really means "instantly" (or nearly so), because all that happens in WriteFile is a memcpy to the buffers.
Of course, if the machine doesn't have enough physical RAM to hold all of the data (or rather, twice as much) in one go, it can only copy part of your data, and then it must write back dirty pages. Only when that is done does it copy the rest of the data into the buffers. In the meantime, your application is stalled (that's when it's really "synchronous").

Tip: You can halve the amount of physical RAM needed by using memory mapping. In that case, you just modify what's mapped and close the mapping. The data finds its way to the hard disk "magically".
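
A minimal sketch (error handling omitted; assumes the file fits in the address space and is under 4 GiB):

    #include <windows.h>

    void XorInPlace(const wchar_t* path, BYTE key)
    {
        HANDLE file = CreateFileW(path, GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        DWORD  size = GetFileSize(file, NULL);
        HANDLE map  = CreateFileMappingW(file, NULL, PAGE_READWRITE, 0, 0, NULL);
        BYTE*  view = (BYTE*)MapViewOfFile(map, FILE_MAP_WRITE, 0, 0, 0);

        for (DWORD i = 0; i < size; ++i)
            view[i] ^= key;            // dirtied pages are lazily written back

        UnmapViewOfFile(view);
        CloseHandle(map);
        CloseHandle(file);
    }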

As for XOR encryption per se, it probably goes without saying that it's exceedingly weak.
Thank you for your responses.

I can typically allocate well over 500 MB without trouble. In this case I don't think it's a memory issue, since the same buffer is being used in all cases. The program starts with a 64 MB baseline, then if the input file is smaller it clips to that length. Once it has determined the desired length it starts allocation attempts, cutting the length in half with each failure until it either succeeds or reaches something like 1 KB (I forget where I set the cutoff).
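
In rough code (fileSize stands in for the input length; the 1 KB cutoff is from memory):

    size_t want = 64u * 1024 * 1024;          // 64 MB baseline
    if (fileSize < want) want = fileSize;     // clip to the input file
    char* buf = NULL;
    while (want >= 1024 && (buf = (char*)malloc(want)) == NULL)
        want /= 2;                            // halve on each failed allocation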

At any rate I don't think it's a memory issue.

Memory mapping sounds interesting. I'll have to look into it.

Really, what I'm concerned about is why the stall happens, so that I know what limitation is causing it.

Oh, yeah. I know the encryption isn't grand. This project is just a programming toy, really. :P

The optimization is essentially a runtime compile of the encryption code. The program writes optimized bytecode into a char array and then calls it as a function. I'll bring it along next time I pass through, just so you can get a chuckle out of it. It actually is an enormous optimization in terms of execution time per key application; for a 500 MB file it saves about 20 seconds, which makes the thing really snappy.
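
The trick itself is nothing exotic. A bare-bones sketch, where the "generated" code is just a single ret instead of the real XOR run (the buffer must be allocated executable, or DEP will object):

    #include <windows.h>
    #include <string.h>

    typedef void (*CrypFn)(void);

    CrypFn MakeCryp(void)
    {
        static const unsigned char code[] = { 0xC3 };     // x86 'ret' placeholder
        void* mem = VirtualAlloc(NULL, sizeof(code), MEM_COMMIT | MEM_RESERVE,
                                 PAGE_EXECUTE_READWRITE); // executable memory
        memcpy(mem, code, sizeof(code));
        return (CrypFn)mem;                               // call it like a function
    }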

The problem, of course, is that no matter how optimal the encryption is, using a circular buffer with async file reads would be faster.
Quote: "enormous optimization [...] for a 500 MB file it saves about 20 seconds, which makes the thing really snappy."

Most block ciphers have implementations that run at speeds upwards of 200 MiB per second. I have an implementation of an optimized block cipher (a pure SSE2 implementation) that encrypts/decrypts at rates close to 1 GiB/s.

XOR should, depending on how fast your RAM is, do at least 4-6 GiB/s (possibly twice that on a modern machine). Saving 20 seconds suggests that your normal runtime is at least 20 seconds (otherwise you could not possibly save 20 seconds), so something must be wrong there. XOR-ing 500 MiB should be practically instantaneous without any special optimizations, assuming the data is in RAM.

Quote: "The problem, of course, is that no matter how optimal the encryption is, using a circular buffer with async file reads would be faster."

No. Although DMA may save you a memcpy if you use unbuffered overlapped reads/writes, and you may get reasonably good rates, overlapped IO as such is not faster (it is, however, more complicated). In fact, buffered overlapped IO is considerably slower than buffered synchronous IO.
The assumption that DMAing directly into your address space is advantageous is correct in some special cases, for example when you stream large amounts of data off an optical disc under harsh latency requirements. For the general case, however, and in particular for reading from a hard disk on a computer that is not struggling with low memory, using the buffer cache is a total win.

File mapping is the best you can possibly do. The operating system does the readahead and the writeback without you worrying or even knowing, and it's as efficient as it gets.

Quote: "Most block ciphers have implementations that run at speeds upwards of 200 MiB per second. I have an implementation of an optimized block cipher (pure SSE2 implementation) that encrypts/decrypts at rates close to 1 GiB/s."

Off-topic, but what block cipher would that be? Multithreaded or per core? (What CPU?) Just curious, because that is unusually high for a block cipher (a hand-written assembly implementation of Threefish-256 gives roughly 350 MB/s per core, and hardware-accelerated AES is about the same).


Quote: "Off-topic, but what block cipher would that be? Multithreaded or per core? (What CPU?) Just curious, because that is unusually high for a block cipher (a hand-written assembly implementation of Threefish-256 gives roughly 350 MB/s per core, and hardware-accelerated AES is about the same)."

That's per core. A plain, normal Feistel network with 16 rounds, using an s-box that fits into 64 lines of L1, and stock techniques as used in "typical" algorithms (modular multiplication, substitution, compression, differently sized shifts, xor/add). It is very fast because it is not implemented from a design document, but written around exactly what SSE2 instructions give you, nothing else (well, that, and careful tuning).

It looks "plausibly ok" both from the operations it does and from statistical tests on the output. Although it is possibly (but not necessarily) less secure than a proper algorithm like e.g. AES, that doesn't matter to me. The main requirements are being fast, low footprint, and low overhead for switching keys. Especially the latter is something that almost no established algorithm offers. Key setup in Twofish, for example, is more expensive than the actual encryption for most messages.

So, my own block cipher may be (or not) slightly less secure, but that's fine, it won't guard nuclear secrets. If you can really find an attack that can decrypt my game's network traffic given a week of number crunching using a cluster, that's nice for you, but it won't give you much :-)
It's of course highly non-portable (though one could probably re-implement it very inefficiently in plain C), but that's also fine for me.
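
For the curious, the generic structure looks roughly like this; this is emphatically not my actual cipher, just the textbook Feistel skeleton with a throwaway round function:

    #include <stdint.h>

    static uint32_t RoundF(uint32_t half, uint32_t k)
    {
        half += k;                                // add
        half ^= (half << 7) | (half >> 25);       // rotate + xor
        return half * 0x9E3779B1u;                // modular multiplication
    }

    // One 16-round Feistel pass over a 64-bit block held as two halves.
    void FeistelEncrypt(uint32_t* L, uint32_t* R, const uint32_t key[16])
    {
        for (int round = 0; round < 16; ++round)
        {
            uint32_t t = *R;
            *R = *L ^ RoundF(*R, key[round]);     // mix one half into the other
            *L = t;                               // swap halves
        }
    }
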
@samoth: thanks! I assume you don't have a small design document of your own; is your algorithm proprietary, or do you have the code somewhere? I'd be interested in looking at it :)


The 20s number is an estimate from the cycle count of the bytecode on my 2.41 GHz machine (running on a single core). I haven't actually benchmarked it against the original, although I'm now considering it. There was a noticeable increase in speed, though. The original version using nested loops compiled out to something like 80 cycles per byte, whereas the optimized one is 9 cycles per 4 bytes.

At any rate, I'm confident that the existing cryp code is at the limits of optimization for the 32-bit range (not really looking at SIMD here yet). Basically it boils down to a long series of 'xor [edi],imm32' with the occasional 'xor [edi],imm8' where appropriate (interspersed with 'add/inc edi', of course).

So you're saying that starting an async read into a second buffer while the cryp is running wouldn't essentially make the speed of the cryp irrelevant? If so, then I'm very much confused about what 'asynchronous' means. What I mean is that regardless of the speed of the cryp code, so long as it's faster than a load-from-disk (which it always is, barring nonsense), reading the next chunk from disk while the previous chunk is being modified means there's close to zero wait between r/w.

I know overlapped runs slow as hell if you have multiple IO operations competing, so I'd only async the reads and do the writes synchronously. Would (r+cryp)->w really be slower than r->cryp->w? Honestly, I don't know what the overlapped stuff is doing under the hood; I just assumed it was a form of threaded disk IO. You're saying it's for disk DMA?
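
Something like this sketch is what I have in mind (hypothetical: input pre-opened with FILE_FLAG_OVERLAPPED, output opened normally; error handling and files over 4 GB ignored, and 0x5A is a placeholder key):

    #include <windows.h>

    // Double-buffered pipeline: read chunk N+1 while XOR-ing chunk N,
    // then write chunk N synchronously.
    void Pipeline(HANDLE in, HANDLE out, BYTE* buf[2], DWORD chunk, DWORD fileSize)
    {
        OVERLAPPED ov = {0};
        ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

        int   cur = 0;
        DWORD offset = 0, got = 0, put = 0;
        ReadFile(in, buf[cur], chunk, NULL, &ov);       // kick off the first read
        while (offset < fileSize)
        {
            GetOverlappedResult(in, &ov, &got, TRUE);   // wait for chunk N
            offset += got;
            if (offset < fileSize)                      // start reading chunk N+1...
            {
                ov.Offset = offset;
                ResetEvent(ov.hEvent);
                ReadFile(in, buf[cur ^ 1], chunk, NULL, &ov);
            }
            for (DWORD i = 0; i < got; ++i)             // ...and cryp chunk N meanwhile
                buf[cur][i] ^= 0x5A;
            WriteFile(out, buf[cur], got, &put, NULL);  // synchronous write
            cur ^= 1;
        }
        CloseHandle(ov.hEvent);
    }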
Can you show your encryption code? 9 cycles per 4 bytes is certainly not the limit of optimization for 32-bit processors. With a single XOR operation you should be looking at much less than that, even without SIMD instructions...
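
For comparison, even the naive portable loop should come out well under that once the compiler is done with it:

    #include <stdint.h>
    #include <stddef.h>

    // One 32-bit XOR per word; an optimizing compiler will unroll (and
    // possibly vectorize) this to well under 9 cycles per 4 bytes.
    void XorBuffer(uint32_t* data, size_t words, uint32_t key)
    {
        for (size_t i = 0; i < words; ++i)
            data[i] ^= key;
    }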


