#define NUM_SEARCH_BYTES 8
//
BYTE search_bytes[8];
search_bytes[0] = 0x11;
search_bytes[1] = 0x22;
search_bytes[2] = 0x33;
search_bytes[3] = 0x44;
search_bytes[4] = 0x55;
search_bytes[5] = 0x66;
search_bytes[6] = 0x77;
search_bytes[7] = 0x88;
//
BYTE b;
//
// search from start to end of file
while(EOF) {
Get_Byte(b);
if(b == search_bytes[0]) { // found 1st byte
Get_Byte(b);
if(b == search_bytes[1]) { // found 2nd byte
Get_Byte(b);
if(b == search_bytes[2]) { // found 3nd byte
Get_Byte(b);
if(b == search_bytes[3]) { // found 4th byte
Get_Byte(b);
if(b == search_bytes[4]) { // found 5th byte
Get_Byte(b);
if(b == search_bytes[5]) { // found 6th byte
Get_Byte(b);
if(b == search_bytes[6]) { // found 7th byte
Get_Byte(b);
if(b == search_bytes[7]) { // found 8th byte
printf("FOUND IT!\n");
}
}
}
}
}
}
}
}
}
}
Binary pattern searching
How to search for a binary byte pattern quickly in a large 100MB+ file? Hex editors
can do it in a matter of seconds, while my first attempt did in a matter of extremely
long minutes?
This is what I tried. Yeah I know it looks bad because it is. :)
Have a look at the different types of string searching algorithms that have been developed. They'll work for binary patterns too (really, they are the same thing). If you Google around you'll probably find all the information and source code snippets you need.
Good luck!
[Edited by - Trapper Zoid on September 6, 2005 4:40:06 AM]
Good luck!
[Edited by - Trapper Zoid on September 6, 2005 4:40:06 AM]
I'd venture to say that perhaps the file is parsed into a tree of some sort when loaded. If this was done, searching the tree for the requested pattern becomes trivial and very fast! But then again, doing so would require a lot of memory if stored, so either they save the tree to a file or just use something different altogether.
Assuming the pattern I want to search is binary, pattern is of any length and of any character (I don't care about text),
searching will be done sequentially in a large 100MB+ file (beginning to end), and I want to access the file
directly using fopen() (don't want to create any temp. copies in memory, file is too large), what would be
the best algorithm for me?
searching will be done sequentially in a large 100MB+ file (beginning to end), and I want to access the file
directly using fopen() (don't want to create any temp. copies in memory, file is too large), what would be
the best algorithm for me?
Quote:Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...
The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.
CM
It would be nice if I could access the file like a buffer as it is. Does CreateFileMapping() have fast file access?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?
Quote:Original post by JakeMThere is never any excuse for writing code with that much duplication. Do that in a job interview and your chances of being hired are slim to none. I'm very pleased that you yourself have recognised that it is bad.while(EOF) { Get_Byte(b); if(b == search_bytes[0]) { // found 1st byte Get_Byte(b); if(b == search_bytes[1]) { // found 2nd byte Get_Byte(b); if(b == search_bytes[2]) { // found 3nd byte Get_Byte(b); if(b == search_bytes[3]) { // found 4th byte Get_Byte(b); if(b == search_bytes[4]) { // found 5th byte Get_Byte(b); if(b == search_bytes[5]) { // found 6th byte Get_Byte(b); if(b == search_bytes[6]) { // found 7th byte Get_Byte(b); if(b == search_bytes[7]) { // found 8th byte printf("FOUND IT!\n"); } } } } } } } } }}
As a matter of fact, not only is it bad, it's actually very wrong!
It may not find a string that is preceeded by a substring of the string you are searching for.
i.e. It will NOT find "abcabcde" in the string "hegafabcabcabcdecfegbadh".
Definately check out the link Trapper Zoid posted.
Quote:Original post by Conner McCloudQuote:Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...
The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.
CM
fopen() et al do buffer - that's why getch() is generally implemented as a macro that just grabs one byte from the buffer if it's available. I wouldn't rely on its buffering for that size of file though. The fastest way would probably be to use Windows asynchronous file reading calls or using two threads on Unix, so you can be loading the file and scanning through it at the same time.
I'm guessing here, but reading byte by byte from file may be the slowest part.
You really should read at least 64k (or other magic number) sized chunks and perform the check on the memory buffer. And since you are dealing with binary data, it may make sense to use the largest binary data type usable, __int32 or __int64.
something like that:
The above code doesn't take into consideration what happens when the first part of the binary string is found at the end of one read chunk and the second in the next read chunk, nor if the read chunk cannot be divided be 8.
You really should read at least 64k (or other magic number) sized chunks and perform the check on the memory buffer. And since you are dealing with binary data, it may make sense to use the largest binary data type usable, __int32 or __int64.
something like that:
int k, i, j;__int64 buf[65536], seq;seq = ...;while(!EOF){ k = fread(buf, 65536, 1, yourfile) / 8; //k read bytes into k/8 _int64 i = 0; while(i<k) { if(buf == seq)break; i++; } if(i<k)_found(i);}
The above code doesn't take into consideration what happens when the first part of the binary string is found at the end of one read chunk and the second in the next read chunk, nor if the read chunk cannot be divided be 8.
I think it's quite obvious that you should use the boyer moore algorithm. It's the most widely used algorithm in pattern searching in editors and viewers.
This topic is closed to new replies.
Advertisement
Popular Topics
Advertisement