Searching very large binary files

I'm wondering if someone could help me figure out a good way to search a very large binary file for a particular string (sequence of bytes). What I'm actually doing is reading a game's file archive and looking for file headers so I can extract some images. Since these files can get close to a gig in size, I figured I should break it down into something like 1 meg chunks and then use a fast search algorithm like Boyer-Moore to find my string and go from there.

That's easy enough, but what can I do about a string that gets chopped off at the end and spans two of these chunks? For example, let's say I'm using a 10 byte buffer and I'm searching for the string "FOOBAR": one chunk is aabbccFOOB and the next chunk is ARddeeffgg. When I search the first chunk, it's going to fail to find the string. What can I do about this?

Oh, and if it matters, I'm using C++, but I'm not really looking for code...just ideas. Thanks for your time.
You think 1GB is very large?
If your search string is of length L and your buffer is of size N, then search up to N-L bytes, then start the next read from that point in the file; that way, a match spanning the chunk boundary falls entirely inside the next buffer.

Quote: let's say I'm using a 10 byte buffer

Feel free to use 64MB-512MB buffers.
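
Something like this rough sketch is all it takes (C++, since that's what you're using). The archive name, the needle and the 64MB chunk size are only placeholders, and std::search stands in for the matching; C++17's std::boyer_moore_searcher could be dropped in if you really want Boyer-Moore:

// Chunked search that keeps the last (needle size - 1) bytes of each chunk
// in front of the next one, so a match spanning two chunks is still found.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const std::vector<unsigned char> needle = { 'F','O','O','B','A','R' };
    const std::size_t chunkSize = 64 * 1024 * 1024;      // 64MB per read

    std::FILE* f = std::fopen("archive.dat", "rb");       // placeholder name
    if (!f) return 1;

    std::vector<unsigned char> buf;
    std::uint64_t bufStart = 0;                            // file offset of buf[0]

    for (;;)
    {
        // Carry the tail of the previous chunk over to the front of this one.
        const std::size_t keep =
            buf.size() >= needle.size() ? needle.size() - 1 : buf.size();
        std::copy(buf.end() - keep, buf.end(), buf.begin());
        bufStart += buf.size() - keep;
        buf.resize(keep + chunkSize);

        const std::size_t got = std::fread(buf.data() + keep, 1, chunkSize, f);
        buf.resize(keep + got);
        if (buf.size() < needle.size()) break;             // nothing left to match

        auto it = std::search(buf.begin(), buf.end(), needle.begin(), needle.end());
        while (it != buf.end())
        {
            std::printf("match at file offset %llu\n",
                        (unsigned long long)(bufStart + (it - buf.begin())));
            it = std::search(it + 1, buf.end(), needle.begin(), needle.end());
        }
        if (got < chunkSize) break;                        // reached end of file
    }
    std::fclose(f);
    return 0;
}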
If you want this to go as quickly as possible then you should consider using threads or asynchronous (overlapped) I/O.

The idea is that you create two buffers. You search in one, and read the next bit of the file into the other in the background. As long as your search of one buffer completes quicker than the read into the other, you won't be able to make it go any quicker by optimizing the search.

To work around the overlap problem, read into the second buffer at an offset equal to the search string length, and when the read completes, copy that many bytes from the end of the first buffer into the start of the second buffer.

The linear read speed of a modern hard drive is around 50MB/s. That means searching a 1GB file should take around 20 seconds.
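
Here's a rough sketch of the two-buffer idea using a plain std::async task instead of OS-specific overlapped I/O (the file name, the needle and what you do with a match are placeholders, and error handling is left out):

// Double-buffered search: while one buffer is being scanned, the next chunk
// of the file is read on a background task. A match ending exactly on a
// chunk boundary may be reported twice; fine for a quick tool.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <future>
#include <string>
#include <vector>

int main()
{
    const std::string needle = "FOOBAR";
    const std::size_t overlap = needle.size();          // bytes copied between buffers
    const std::size_t chunkSize = 64 * 1024 * 1024;

    std::FILE* f = std::fopen("archive.dat", "rb");      // placeholder name
    if (!f) return 1;

    std::vector<char> cur(chunkSize + overlap), next(chunkSize + overlap);

    // Prime the first buffer synchronously; there is nothing to overlap with yet.
    std::size_t curBytes = std::fread(cur.data(), 1, chunkSize, f);

    for (;;)
    {
        // Kick off the read of the next chunk in the background, leaving room
        // at the front of the buffer for the carried-over tail.
        auto pending = std::async(std::launch::async, [&] {
            return std::fread(next.data() + overlap, 1, chunkSize, f);
        });

        // Search the current buffer while that read is in flight.
        const auto end = cur.begin() + curBytes;
        if (std::search(cur.begin(), end, needle.begin(), needle.end()) != end)
            std::printf("found a match in this chunk\n");

        const std::size_t got = pending.get();
        if (got == 0) break;                              // end of file

        // Copy the tail of the current buffer in front of the new data so a
        // string spanning the chunk boundary is still found. (Assumes each
        // chunk is at least 'overlap' bytes, which it is for a big file.)
        std::copy(cur.begin() + (curBytes - overlap), cur.begin() + curBytes,
                  next.begin());
        curBytes = overlap + got;
        std::swap(cur, next);
    }
    std::fclose(f);
    return 0;
}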
As long as you just need to find ONE string in the file I'd go with a threaded (overlapped I/O) approach.
If you want to search the file several times using different strings, then it *might* be faster to preprocess the file once and then use the processed data for the actual searching.
There are several ways to optimize a binary chunk for multiple searches, but I doubt it's worth the effort unless you need to find many strings (probably in the hundreds).

My 2c
Quote: Original post by vEEcEE
What I'm actually doing is reading a game's file archive and looking for file headers so I can extract some images. Since these files can get close to a gig in size,...
If I understand this correctly, you don't want to use this in a game or an application that must be anywhere close to real time. You just want to find and extract some data.

So, why don't you forget about buffers and all the complicated stuff?
You could create a file mapping, map the file into your address space, and bluntly call strstr (or a similar function) on the base address.
strstr might not be perfect; it might take a second or two longer than the most sophisticated algorithm, but even that is not guaranteed (it is equally likely that strstr outperforms something self-made). Either way, it doesn't matter, since raw hard disk I/O will almost certainly be the limiting factor, and you don't need it running in under half a second anyway.
Whether the whole thing takes 25 seconds or 3 minutes to run, it won't matter for what you said.

Nearly every operating system has no issues mapping 2GB files, so "close to a gig" should not be anything to worry about. Performance of memory-mapped files is usually good or excellent (it had better be good, too!). And the beauty of it is that you have no buffers and nothing to worry about; the data is "just there".


EDIT:
To clarify, strstr would of course be a bad choice; it was just the first "find" function that came to mind. Obviously, you would want to use one that takes a maximum length rather than stopping at \0. But... you get the idea. :)
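
For reference, a bare-bones POSIX sketch of what I mean (on Windows you would use CreateFileMapping/MapViewOfFile instead of open/mmap); the file name and needle are just placeholders, and std::search is used instead of strstr because it takes an explicit end instead of stopping at \0:

// Map the whole archive into the address space and search the bytes directly.
#include <algorithm>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    const char needle[] = "FOOBAR";                      // placeholder header string
    const std::size_t needleLen = std::strlen(needle);

    int fd = open("archive.dat", O_RDONLY);              // placeholder name
    if (fd < 0) return 1;

    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void* mapped = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;

    // The data is "just there": no buffers, no chunking, no boundary handling.
    const char* base = static_cast<const char*>(mapped);
    const char* end  = base + st.st_size;
    for (const char* p = std::search(base, end, needle, needle + needleLen);
         p != end;
         p = std::search(p + 1, end, needle, needle + needleLen))
    {
        std::printf("header at offset %lld\n", (long long)(p - base));
    }

    munmap(mapped, st.st_size);
    close(fd);
    return 0;
}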
Quote: Original post by samoth
So, why don't you forget about buffers and all the complicated stuff?

Well, as far as we're trivializing:
grep -bU "FOOBAR" myfile.xyz


It always depends on the context in which the result is to be used. Writing a custom application for this is complication enough by itself.
Quote: Original post by Antheus
Well, as far as we're trivializing: grep -bU "FOOBAR" myfile.xyz
Not meaning to trivialize. What I am saying is that memory mapping the file might make the OP's life a lot easier.

The OP wants to:
1. find some string(s) that are image headers (or similar) in a file
2. save the corresponding data to disk

This suggests that he is not intending to do it a million times, and he is not doing it with heavy constraints on runtime (it's an offline tool). Apparently, this is not a data file the OP made himself, or else he could access the data a lot more easily.

If you do some complicated loading and fiddling with buffers, async IO, threads, or whatever, you might make it maybe 5-10% faster than memory mapping, but it is a lot more complicated and error-prone. If you do "normal" synchronous IO, it will be a bit less error-prone, but you still need to manage buffers (and, it's slow, too).

Imagine what happens if you use a load buffer of n bytes and you find an image header that tells you an image of size n+1 bytes (or 2*n, or anything else bigger than the buffer size you chose) follows: you have to save one block, load another, and save another portion. The same happens if the data is smaller than your block size but happens to start in the middle or at the end of a block. Of course it can be done, but it is needlessly complicated.

Having the file memory mapped, none of this is your problem. You told the OS that you expect this data to be in memory when you access it, the OS gave you back a start address, and the OS had better make sure that the data is there when you need it.
So, after finding an image header that tells you an image starts at address X with a size of Y, you can just do a fopen() followed by a fwrite() using address X and size Y. Everything else is not your business; it just works. No loading and counting buffers, no math, no synchronisation issues, nothing of that kind.

The other thing that I was saying was that no matter what you do, it won't be memory/algorithm bound. Reading from hard disk will be the limiting factor.
So, if you already have a function boyer_moore_find() implemented which works well, that's fine, use it. But otherwise, use the cheapest/easiest thing you can find, because it really won't matter. Finding the optimal, fastest search function would be wasting your time, because it won't be one second faster.
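
To make that last part concrete, the whole "save the image" step boils down to something like this (assuming base is the pointer the mapping gave you, and X and Y came out of whatever header parsing you do):

// Write Y bytes starting at mapped address base + X to a new file. The OS
// pages the mapped data in as fwrite touches it; no buffering on our side.
#include <cstddef>
#include <cstdio>

bool saveImage(const char* base, long long X, long long Y, const char* outName)
{
    std::FILE* out = std::fopen(outName, "wb");
    if (!out) return false;

    const std::size_t written = std::fwrite(base + X, 1, (std::size_t)Y, out);
    std::fclose(out);
    return written == (std::size_t)Y;
}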
Memory-mapping sounds like the way to go here. The OP can fine-tune the search algorithm if they wish by using something in place of strstr, but I doubt that's necessary. The only concern is whether there's enough unfragmented address space available to the process. You should be alright at 1 GB.

Memory-mapping is one of the coolest tools available on most OSes and you never hear anything about it.
Thank you all for your replies. I believe I will take the easy route for now and use a memory-mapped file as Samoth suggested. As you figured out, this is not something that's very time critical, so I don't want to complicate things for myself too much. Memory mapping it and basically calling a substring search function to find what I'm looking for sounds perfectly easy. For the future I will definitely look into overlapped I/O and threads, as that looks like a very good approach as well from what I've read on it so far. Once again thanks...much appreciated!

