Jump to content
  • Advertisement
Sign in to follow this  
JakeM

Binary pattern searching

This topic is 4816 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

How to search for a binary byte pattern quickly in a large 100MB+ file? Hex editors can do it in a matter of seconds, while my first attempt did in a matter of extremely long minutes? This is what I tried. Yeah I know it looks bad because it is. :)
#define NUM_SEARCH_BYTES	8
//
 BYTE search_bytes[8];
 search_bytes[0] = 0x11;
 search_bytes[1] = 0x22;
 search_bytes[2] = 0x33;
 search_bytes[3] = 0x44;
 search_bytes[4] = 0x55;
 search_bytes[5] = 0x66;
 search_bytes[6] = 0x77;
 search_bytes[7] = 0x88;
//
 BYTE b;
//
 // search from start to end of file
 while(EOF) {

    Get_Byte(b);
    if(b == search_bytes[0]) {				// found 1st byte

       Get_Byte(b);
       if(b == search_bytes[1]) {			// found 2nd byte

          Get_Byte(b);
          if(b == search_bytes[2]) {			// found 3nd byte

             Get_Byte(b);
             if(b == search_bytes[3]) {			// found 4th byte

                Get_Byte(b);
                if(b == search_bytes[4]) {		// found 5th byte

                   Get_Byte(b);
                   if(b == search_bytes[5]) {		// found 6th byte

                      Get_Byte(b);
                      if(b == search_bytes[6]) {	// found 7th byte

                         Get_Byte(b);
                         if(b == search_bytes[7]) {	// found 8th byte

                            printf("FOUND IT!\n");
                         }
                      }
                   }
                }
             }
          }
       }
    }
 }
}

Share this post


Link to post
Share on other sites
Advertisement
Have a look at the different types of string searching algorithms that have been developed. They'll work for binary patterns too (really, they are the same thing). If you Google around you'll probably find all the information and source code snippets you need.

Good luck!

[Edited by - Trapper Zoid on September 6, 2005 4:40:06 AM]

Share this post


Link to post
Share on other sites
I'd venture to say that perhaps the file is parsed into a tree of some sort when loaded. If this was done, searching the tree for the requested pattern becomes trivial and very fast! But then again, doing so would require a lot of memory if stored, so either they save the tree to a file or just use something different altogether.

Share this post


Link to post
Share on other sites
Assuming the pattern I want to search is binary, pattern is of any length and of any character (I don't care about text),
searching will be done sequentially in a large 100MB+ file (beginning to end), and I want to access the file
directly using fopen() (don't want to create any temp. copies in memory, file is too large), what would be
the best algorithm for me?

Share this post


Link to post
Share on other sites
Quote:
Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM

Share this post


Link to post
Share on other sites
It would be nice if I could access the file like a buffer as it is. Does CreateFileMapping() have fast file access?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?

Share this post


Link to post
Share on other sites
Quote:
Original post by JakeM
while(EOF) {

Get_Byte(b);
if(b == search_bytes[0]) { // found 1st byte

Get_Byte(b);
if(b == search_bytes[1]) { // found 2nd byte

Get_Byte(b);
if(b == search_bytes[2]) { // found 3nd byte

Get_Byte(b);
if(b == search_bytes[3]) { // found 4th byte

Get_Byte(b);
if(b == search_bytes[4]) { // found 5th byte

Get_Byte(b);
if(b == search_bytes[5]) { // found 6th byte

Get_Byte(b);
if(b == search_bytes[6]) { // found 7th byte

Get_Byte(b);
if(b == search_bytes[7]) { // found 8th byte

printf("FOUND IT!\n");
}
}
}
}
}
}
}
}
}
}
There is never any excuse for writing code with that much duplication. Do that in a job interview and your chances of being hired are slim to none. I'm very pleased that you yourself have recognised that it is bad.

As a matter of fact, not only is it bad, it's actually very wrong!
It may not find a string that is preceeded by a substring of the string you are searching for.
i.e. It will NOT find "abcabcde" in the string "hegafabcabcabcdecfegbadh".

Definately check out the link Trapper Zoid posted.

Share this post


Link to post
Share on other sites
Quote:
Original post by Conner McCloud
Quote:
Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM


fopen() et al do buffer - that's why getch() is generally implemented as a macro that just grabs one byte from the buffer if it's available. I wouldn't rely on its buffering for that size of file though. The fastest way would probably be to use Windows asynchronous file reading calls or using two threads on Unix, so you can be loading the file and scanning through it at the same time.

Share this post


Link to post
Share on other sites
I'm guessing here, but reading byte by byte from file may be the slowest part.

You really should read at least 64k (or other magic number) sized chunks and perform the check on the memory buffer. And since you are dealing with binary data, it may make sense to use the largest binary data type usable, __int32 or __int64.

something like that:


int k, i, j;
__int64 buf[65536], seq;

seq = ...;
while(!EOF)
{
k = fread(buf, 65536, 1, yourfile) / 8; //k read bytes into k/8 _int64
i = 0;
while(i<k)
{
if(buf == seq)break;
i++;
}
if(i<k)_found(i);
}



The above code doesn't take into consideration what happens when the first part of the binary string is found at the end of one read chunk and the second in the next read chunk, nor if the read chunk cannot be divided be 8.

Share this post


Link to post
Share on other sites
I think it's quite obvious that you should use the boyer moore algorithm. It's the most widely used algorithm in pattern searching in editors and viewers.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!