| Overlapped file I/O |
|
![]() BradDaBug Member since: 5/2/2001 From: Cuba, MS, United States |
||||
|
|
||||
| I've been playing with asynchronous file I/O on *nix and it seems to work pretty well, but I'm having trouble getting it to work on Windows. It seems like no matter how much data I try to read with ReadFile() it still behaves synchronously. When I try to read in around 200 MB ReadFile() fails with an ERROR_NO_SYSTEM_RESOURCES error. Here's some code:

    #include <windows.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char* argv[])
    {
        HANDLE file = CreateFile("setup.exe.part", FILE_READ_DATA, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE)
            cout << "Unable to open file!" << endl;
        else
            cout << "Opened file!" << endl;

        OVERLAPPED o;
        o.Offset = 0;
        o.OffsetHigh = 0;
        o.hEvent = CreateEvent(NULL, true, false, NULL);

        int size =100000000;
        char* buffer = new char[size];
        DWORD bytesRead;
        bool r = ReadFile(file, buffer, size, &bytesRead, &o);
        if (!r)
        {
            int error = GetLastError();
            if (error != ERROR_IO_PENDING)
            {
                cout << "ERROR! " << GetLastError() << endl;
                return 1;
            }
        }
        else
            cout << "Success! bytesread= " << bytesRead << endl;

        while(!HasOverlappedIoCompleted(&o))
        {
            cout << "working..." << endl;
        }

        cout << "finished! bytesRead = " << bytesRead << endl;
        delete[] buffer;
        return 0;
    }

I have very little experience with Win32 programming, so I have no idea if all the flags and whatnot are set correctly. Anyway, in this particular bit of code, I always see "Opened file!" then maybe a pause, then "Success! bytesread=???" and then immediately "finished! bytesRead = ???" (the ??? is an actual number). I never see "working..." printed at all. Anyone see what I'm doing wrong? |
||||
|
||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
||||
|
|
||||
| Your code looks like it could be correct, but I've only used ReadFileEx and WriteFileEx to do asynchronous transfers. You might want to try using those instead. Also, there is a limit to the number of bytes that can be moved in a single transfer, which is why you're getting an error with ~200MB. You will need to split large reads up into multiple transfers. Personally, I use a block size of 32MB. |
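A minimal sketch of the splitting suggested above, kept portable so it compiles without `<windows.h>`. The names `Block` and `splitRead` are hypothetical and do not come from the thread; each resulting block would be handed to one `ReadFile` or `ReadFileEx` call with the block's offset placed in the `OVERLAPPED` structure.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One piece of a larger read: where it starts in the file and how long it is.
// (Hypothetical helper type, not part of any Windows API.)
struct Block {
    std::uint64_t offset;
    std::uint32_t size;
};

// Split the byte range [start, start + total) into blocks of at most
// blockSize bytes. The last block carries whatever remainder is left.
std::vector<Block> splitRead(std::uint64_t start, std::uint64_t total,
                             std::uint32_t blockSize) {
    assert(blockSize != 0);
    std::vector<Block> blocks;
    std::uint64_t done = 0;
    while (done < total) {
        const std::uint64_t left = total - done;
        const std::uint32_t size =
            left < blockSize ? static_cast<std::uint32_t>(left) : blockSize;
        blocks.push_back({start + done, size});
        done += size;
    }
    return blocks;
}
```

For example, `splitRead(0, 10, 4)` yields three blocks of 4, 4 and 2 bytes. The block size itself is a tuning choice that the rest of the thread debates.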
||||
|
||||
![]() BradDaBug Member since: 5/2/2001 From: Cuba, MS, United States |
||||
|
|
||||
| I changed the line with ReadFile() to this:

    int r = ReadFileEx(file, buffer, size, &o, NULL);

but it still doesn't seem to be working. Or maybe it is working and the results aren't quite as obvious as they are on *nix. But if I can read in several MB without seeing any evidence that it's actually asynchronous, what's the point? I hope I'm still doing something wrong. |
||||
|
||||
![]() Jan Wassenberg Member since: 9/16/2002 From: Karlsruhe, Germany |
||||
|
|
||||
Quote: I bet the cause is this: your CreateFile flags do not include FILE_FLAG_NO_BUFFERING, so the IO still goes through the OS's (synchronous) file cache. Avoiding that is key to achieving peak throughput - it allows direct DMA. Unfortunately this requires you to align everything to sector boundaries, which is quite a bit of work to do efficiently. Quote: Wow, that's a lot. Did you mean 32KB? That's what I have found to be optimal with my aio code. If not, what's the application? AFAIK the disk driver receives a scatter-gather list of pages to write into, which is by default limited to 17 pages = 64KB (since these SG lists reside in nonpaged pool, which is not too plentiful). I figure any larger transfer will be split up anyway, so you might as well set up smaller blocks and just queue more of them (enough to prevent starvation). Disclaimer for craziness: am writing thesis on this topic ;) |
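The sector alignment mentioned above mostly comes down to rounding the requested range outward. The following is a rough sketch of that arithmetic only, assuming a power-of-two sector size; `AlignedRange` and `alignToSectors` are hypothetical names, and real code would query the volume for its sector size and also allocate a buffer whose address is sector-aligned.

```cpp
#include <cassert>
#include <cstdint>

// An aligned request that covers a caller's unaligned one.
// (Hypothetical helper type for illustration.)
struct AlignedRange {
    std::uint64_t offset;  // file offset rounded down to a sector boundary
    std::uint64_t length;  // whole number of sectors
    std::uint64_t skip;    // bytes to discard at the front of the buffer
};

// Round [offset, offset + length) outward to sector boundaries.
// sectorSize must be a power of two; 512 and 4096 are common values,
// but that is an assumption here, not something to hard-code.
AlignedRange alignToSectors(std::uint64_t offset, std::uint64_t length,
                            std::uint64_t sectorSize) {
    assert(sectorSize != 0 && (sectorSize & (sectorSize - 1)) == 0);
    const std::uint64_t mask = sectorSize - 1;
    const std::uint64_t begin = offset & ~mask;
    const std::uint64_t end = (offset + length + mask) & ~mask;
    return {begin, end - begin, offset - begin};
}
```

The caller reads `length` bytes starting at `offset` and then ignores the first `skip` bytes. The extra bookkeeping is the "quite a bit of work" referred to in the post.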
||||
|
||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
|||||
|
|
|||||
Quote: While that would help increase the speed of transfers, it is not a requirement to enable asynchronous read/writes. Quote: Er.. I took a look at my source code again to make sure what I was using after you asked me that. :) I actually use a variable block transfer size depending on the total amount of data that needs to be transferred -- but this block size is capped between 64KB and 1MB. So no, I don't use 32MB, sorry. However, I noticed in my source that I had made a comment about asynchronous transfers not behaving properly with block sizes larger than 32MB -- that must have been where I picked it up from the first time. :) Quote: First of all, try cutting your transfer sizes down to nothing larger than 32MB (better yet, even smaller). Secondly, try to use varying data sets, instead of the same file over and over again. Under Windows, asynchronous transfers are _not_ guaranteed to be asynchronous. If the OS feels like it (usually if the data is already in a cache in memory, or if the dataset is small), then the transfer will complete immediately, and ReadFileEx/WriteFileEx will return success. Only if GetLastError returns ERROR_IO_PENDING does it mean that an asynchronous transfer has been queued. |
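The three outcomes described above can be written down as a small decision function. This is a sketch only: it models the `ReadFile` form used in the original post, takes the call's return value and the `GetLastError()` code as plain parameters so that it compiles without `<windows.h>`, and uses hypothetical names (`ReadState`, `classifyOverlappedRead`). The constant 997 is the value of `ERROR_IO_PENDING` in the Windows headers.

```cpp
#include <cstdint>

// Value of ERROR_IO_PENDING, repeated here so the sketch is portable.
constexpr std::uint32_t kErrorIoPending = 997;

enum class ReadState {
    CompletedInline,  // the data was delivered before the call returned
    Queued,           // the request is running in the background
    Failed            // a real error; the buffer contents are not valid
};

// ok        : result of an overlapped ReadFile call
// lastError : GetLastError(), sampled right after the call when ok is false
ReadState classifyOverlappedRead(bool ok, std::uint32_t lastError) {
    if (ok) {
        return ReadState::CompletedInline;
    }
    if (lastError == kErrorIoPending) {
        return ReadState::Queued;
    }
    return ReadState::Failed;
}
```

Only the `Queued` case needs a later completion check; treating `CompletedInline` as an error, or polling after `Failed`, are the two easy mistakes.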
|||||
|
|||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
||||
|
|
||||
| By the way, I liked your easy-to-understand tutorial on how to do asynchronous transfers under Linux / MacOS. It'll come in handy whenever I decide to get off my lazy duff and write a Linux port. :) |
||||
|
||||
![]() Jan Wassenberg Member since: 9/16/2002 From: Karlsruhe, Germany |
||||
|
|
||||
Quote: hm. The way I understand it, without this flag the data read from file will have to be added to the OS file cache. Since the OS doesn't necessarily want to double-buffer your data and can't guarantee its validity at a later date, I figure the data will have to be added immediately after IO completion and before the app can do anything to it. That means ReadFile would wait for completion, add to cache, and only then return. Can the OP confirm whether adding this flag triggers the desired asynchronous behavior? Quote: hm, ok. Can you please go into more detail as to the application? Any publication/article on the topic? I'm nearing completion on my thesis / fast IO library, so any related work is interesting :) |
||||
|
||||
![]() Catafriggm Member since: 5/15/2005 From: La Mirada, CA, United States |
||||
|
|
||||
| I think you critically misunderstand how the cache works. Data from files is read INTO the cache, then copied out of it as needed. If I recall correctly, the only time this would cause async I/O to complete synchronously is if the entire amount of data for the read is in the cache, and so just needs to be copied from memory. No buffering is nice when applicable because it doesn't pollute the cache. For example, one program I wrote searches a particular PS2 game's disc looking for audio files, then converts them to PCM and writes them to the HD. This involves reading in several hundred megs of stuff, and writing out a couple gigs (neither of which will ever be used again). I have 1 gig RAM. See where this is going? :P Also, no buffering mode is slightly faster because it doesn't have to be copied out of the cache once read. [Edited by - Catafriggm on March 20, 2006 10:21:32 PM] |
||||
|
||||
![]() BradDaBug Member since: 5/2/2001 From: Cuba, MS, United States |
||||
|
|
||||
| This is odd... If I add that flag then there's a second or so delay from the time the "finished!" message pops up and the "Press any key to continue" message shows up. If I remove that flag then the next time the program runs the delay is still there, but each time after that the delay is gone. So it sounds like there really is a buffering issue going on. Also it seems like the HasOverlappedIoCompleted() macro call isn't doing its job, and that execution is moving past that point, printing the "finished!" message, then finally stops and waits for the request to finish before showing the "Press any key to continue" message. BTW, this morning I wrote a simple program similar to this with C# and .NET and it worked perfectly. Grr... |
||||
|
||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
||||
|
|
||||
| Ok, here we go. I hacked some source together in about 15 minutes. This shows how asynchronous reads could be implemented with a callback system (using ReadFileEx) and how transferring in blocks allows the amount of data transferred to be retrieved. This sample does not do any error checking at all -- this is left as an exercise for the reader. :)

    #include <stdio.h>
    #include <windows.h>

    #define BLOCK_SIZE 65536
    #define TEST_FILENAME "C:\\Games\\Quake 4 Demo\\q4base\\pak001.pk4"

    static HANDLE hdl;
    static unsigned long offset, bytesRead, blockCount, remainder;
    static void *basePtr;
    static bool asyncDone;
    static OVERLAPPED ov;

    static void _readNextBlock(void);

    static VOID CALLBACK _blockComplete(DWORD dwErrorCode, DWORD dwNumberOfBytesTransferred, LPOVERLAPPED lpOverlapped)
    {
        basePtr = (void *) ((char *)basePtr + dwNumberOfBytesTransferred);
        bytesRead += dwNumberOfBytesTransferred;

        // check for remaining blocks
        if ((blockCount > 0) || (remainder > 0))
            _readNextBlock();
        else
            asyncDone = true;
    }

    static void _readNextBlock(void)
    {
        LARGE_INTEGER li;
        unsigned long size;

        // determine block size
        if (blockCount > 0)
        {
            size = BLOCK_SIZE;
            blockCount--;
        }
        else
        {
            size = remainder;
            remainder = 0;
        }

        // setup overlapped structure
        memset(&ov, 0, sizeof(ov));
        li.QuadPart = offset + bytesRead;
        ov.Offset = li.LowPart;
        ov.OffsetHigh = li.HighPart;
        ov.hEvent = NULL;

        ReadFileEx(hdl, (LPVOID)basePtr, size, &ov, _blockComplete);
    }

    void asyncRead(void *ptr, unsigned long offs, unsigned long size)
    {
        blockCount = size / BLOCK_SIZE;
        remainder = size % BLOCK_SIZE;
        basePtr = ptr;
        asyncDone = false;
        bytesRead = 0;
        offset = offs;
        _readNextBlock();
    }

    void main(void)
    {
        void *ptr;
        unsigned long size = 1024 * 1024 * 128;
        unsigned long offset = 1024 * 1024 * 0;

        printf("opening file...\n");
        hdl = CreateFile(TEST_FILENAME, GENERIC_READ, 0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, NULL);

        printf("allocating memory...\n");
        ptr = malloc(size);

        printf("reading...\n");
        asyncRead(ptr, offset, size);

        while (!asyncDone)
        {
            printf("read %d bytes\n", bytesRead);
            // allow our thread to enter an alertable state so the async callback function will be called
            SleepEx(0, TRUE);
        }

        printf("done! total bytes read: %d (size=%d)\n", bytesRead, size);
        gets((char *)ptr);
        free(ptr);
        CloseHandle(hdl);
    }

Ideally, you would use the hEvent member in the OVERLAPPED structure to keep a pointer to a structure which holds the important variables, like blockCount, remainder, etc. rather than the static variables as I've done. Don't forget to change the TEST_FILENAME -- unless you happen to have the Quake 4 demo installed there too. :) Also note that running this multiple times will cache the data, so you will need to tweak the "offset" value to start reading from a different location. Feel free to ask any questions. I'll do my best to help out. Quote: I'll send you a PM about it. |
||||
|
||||
![]() Catafriggm Member since: 5/15/2005 From: La Mirada, CA, United States |
||||
|
|
||||
| There's an ancient post on my blog about the different types of asynchronous I/O notification mechanisms and their advantages/disadvantages. |
||||
|
||||
![]() Jan Wassenberg Member since: 9/16/2002 From: Karlsruhe, Germany |
||||
|
|
||||
Quote: "Critically misunderstand"? heh. Maybe, but I'm still not seeing how to "read INTO the cache" without making IO synchronous (given the Windows APIs). The HD DMAs into the *application's* buffer, which may change the data when it *believes* the IO has completed (either by seeing its data, or polling via macro, or actual GetOverlappedResult). If the OS copies data into its file cache after that, we have screwed the pooch. Now in my system, things are different: I control all file buffers; they are shared between multiple users and locked-down via MMU to prevent any modification at all. Zero-copy IO means IO goes directly into the file cache, which is what is returned to the user. That solves the problems related to asynchronous caching. Quote: Aha. I believe you need to zero out the overlapped structure. This macro is checking the OVERLAPPED.Internal(or InternalHigh?) field, which is set when transfer is complete. Quote: Thanks! |
||||
|
||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
||||
|
|
||||
Quote: Good eye! In my code, I have the memset there, but it didn't even occur to me that that could be the problem with the OP's code. I'd be willing to bet that's what the problem is... |
||||
|
||||
![]() Catafriggm Member since: 5/15/2005 From: La Mirada, CA, United States |
||||
|
|
||||
Quote: I have no clue where you heard that, but I suggest you forget you ever did so at the earliest possible time. When the data is to be cached, the HD reads either into an internal driver buffer, or into the cache page directly. It does NOT read directly into the application buffer, for the obvious reason you listed: if it did, it would be possible to corrupt the system cache, and that absolutely can't happen. Think about it for a second. Even if the call executed synchronously (were it read into the application buffer directly), it would still be possible to modify the data from other threads (possibly even other processes, if the buffer is a memory mapped segment), corrupting the cache (there are lots of other reasons it's impossible to read directly into the application buffer, but I think this one should be sufficient to convince you). Quote: I just looked at the disassembly, and ReadFile does set Internal to STATUS_PENDING (which is what I had suspected). [Edited by - Catafriggm on March 21, 2006 2:15:57 PM] |
||||
|
||||
![]() BradDaBug Member since: 5/2/2001 From: Cuba, MS, United States |
||||
|
|
||||
| Apparently at some point after I posted I added a line to memset() the overlapped structure to 0 after I declare it, so that's not the problem. I haven't had time to give bpoint's code a try. I'll do that a little later. |
||||
|
||||
![]() Jan Wassenberg Member since: 9/16/2002 From: Karlsruhe, Germany |
||||
|
|
||||
| Catafriggm: I have done some digging and the NT file cache works as follows: 1) grab memory in form of VM sections 2) upon cache miss, map (not copy!) file into one of these sections. Since Mm doesn't support async page faults, we have explained why synchronous operation is creeping in. => use FILE_FLAG_NO_BUFFERING. OK, makes sense that zero copy IO is not allowed for security reasons. These 2 findings make me even happier I decided to implement a separate file cache and bypass the Windows cache manager :) Quote: Namely? As stated it works great in my user-mode library (which uses FILE_FLAG_NO_BUFFERING; the only copying going on is for scatter-gather list, if at all). |
||||
|
||||
![]() Catafriggm Member since: 5/15/2005 From: La Mirada, CA, United States |
||||
|
|
||||
Quote: Lots of requirements have to be met to do DMA (though not all are necessary in all cases). The buffer must be properly aligned in memory, must be the proper size, must be reading from an aligned disk offset, must be contiguous in physical memory, etc. Some of these are impossible to implement in user mode, and almost all of them are very rare in the general case (in code not specifically made to use non-buffered I/O). Quote: You could have just read Inside Windows 2000, like I have. You're missing a key point here: asynchronous reads aren't performed by the thread that issued the request. As such, even if it did block for a page-in, the calling thread wouldn't block. Looking back at that book, it appears I was wrong: even if the entire contents of the read are in the cache it won't complete synchronously (because of the possibility of paging delay). UPDATE: After doing some digging in MSDN and the IFS kit, and then having that information confirmed by two friends (more knowledgeable than I in driver matters), I've determined that async I/O reads that involve the cache do NOT block when a page-in is necessary. It's possible to detect when a page-in would occur, and offload the operation to a worker thread. The only time that page-in blocking is possible (apart from crappy drivers that don't check for that) is using the fast I/O path, which is only used for synchronous reads in which all of the data is in the cache (in an earlier post I mistakenly thought this path was used for async I/O, as well). [Edited by - Catafriggm on March 21, 2006 4:57:41 PM] |
||||
|
||||
![]() Jan Wassenberg Member since: 9/16/2002 From: Karlsruhe, Germany |
|||||
|
|
|||||
Quote: Yep. AFAIK all except being in contiguous physical memory below 24MB are requirements of non-buffered IO anyway, which leads me to say copying from scatter-gather list <-> DMA buffer should be the only copy. Quote: Unfortunately my book budget is low and every time I've tried, that book has been checked out of the Uni library. It's almost Kafkaesque :P Quote: Confirmed, saw this here in section "Data Is not in Cache". Interesting; again learned something. However, they go on to say that these limited worker threads may be overloaded by too many IOs, thus again causing blocking. Quote: hrm? Are we talking about FastIO? IIRC the only point of those is to (by means of caching) prevent having to assemble an IRP. If they can handle the IO from their cache, fine; otherwise, they are not allowed to block and must return FALSE; the IO will then proceed via the normal mechanism, i.e. IRP. Interesting topic :) |
|||||
|
|||||
![]() BradDaBug Member since: 5/2/2001 From: Cuba, MS, United States |
||||
|
|
||||
Whew! I finally got it working! Here is the Windows equivalent of the same stuff I had working on Linux and Mac:

    #include <windows.h>
    #include <iostream>
    using namespace std;

    int main(int argc, char* argv[])
    {
        HANDLE file = CreateFile("some really big file", FILE_READ_DATA, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE)
            cout << "Unable to open file!" << endl;
        else
            cout << "Opened file!" << endl;

        OVERLAPPED o;
        memset(&o, 0, sizeof(OVERLAPPED));
        o.Offset = 0;
        o.OffsetHigh = 0;
        o.hEvent = 0;

        int size =51200000;
        char* buffer = new char[size];
        DWORD bytesRead = 0;

        cout << "reading..." << endl;
        bool r = ReadFile(file, buffer, size, &bytesRead, &o);

        while(!HasOverlappedIoCompleted(&o))
        {
            cout << "working..." << endl;
        }

        cout << "done!" << endl;
        delete[] buffer;
        CloseHandle(file);
        return 0;
    }

I get the same behavior with and without the FILE_FLAG_NO_BUFFERING that I mentioned earlier, which is fine with me. Also, you'll see that I'm using ReadFile() instead of ReadFileEx(). For some reason I could never get HasOverlappedIoCompleted() or GetOverlappedResult() to work when I was using ReadFileEx(). As soon as I started using ReadFile() instead it started working. I guess maybe those two functions only work with ReadFile() and when you use ReadFileEx() you have to use the callback or event system. I don't like the callback way to do it because it requires you to put your thread into an interruptible state (like with a SleepEx() call), and I don't want to do that. I haven't looked at using the hEvent field that closely. Do you have to use SleepEx() or something like that to use the event? Is there any significant performance overhead with calling HasOverlappedIoCompleted() once per frame per outstanding IO request? |
||||
|
||||
![]() bpoint Member since: 7/10/2005 From: Okinawa, Japan |
||||
|
|
||||
Quote:

    #define HasOverlappedIoCompleted(lpOverlapped) (((DWORD)(lpOverlapped)->Internal) != STATUS_PENDING)

I would say "no". :) Unfortunately, I've only ever used ReadFileEx/WriteFileEx, so I won't be able to help with your other questions. |
||||
|
||||
![]() Catafriggm Member since: 5/15/2005 From: La Mirada, CA, United States |
||||
|
|
||||
| Looking at the disassembly, it appears that ReadFileEx does not set the Internal field. So that would explain why those functions wouldn't work. ReadFile always sets Internal, so those functions should always work. |
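The macro quoted earlier in the thread only compares one field against `STATUS_PENDING`, which is why the polling loop depends on something having stored that value first. Below is a portable model of that check, offered as an illustration rather than the real header: `FakeOverlapped` is a hypothetical stand-in for `OVERLAPPED`, `hasCompleted` mirrors the macro's comparison, and 0x103 is the value of `STATUS_PENDING` in the Windows headers.

```cpp
#include <cstdint>

// STATUS_PENDING, repeated here so the sketch compiles without <windows.h>.
constexpr std::uint32_t kStatusPending = 0x103;

// Hypothetical stand-in for OVERLAPPED; only Internal matters for the check.
struct FakeOverlapped {
    std::uintptr_t Internal;
    std::uintptr_t InternalHigh;
    std::uint32_t Offset;
    std::uint32_t OffsetHigh;
    void* hEvent;
};

// Same comparison the HasOverlappedIoCompleted macro performs.
bool hasCompleted(const FakeOverlapped& ov) {
    return static_cast<std::uint32_t>(ov.Internal) != kStatusPending;
}
```

A zero-initialized `FakeOverlapped{}` already reports `true`, so a loop of the form `while (!hasCompleted(ov))` never runs its body unless the read call itself writes `kStatusPending` into `Internal`. That matches the behavior reported for `ReadFileEx` above, and an uninitialized structure would make the result depend on stack garbage.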
||||
|
||||
All times are ET (US) |
|