Back to General and Gameplay Programming

Binary pattern searching

General and Gameplay Programming Programming

Started by JakeM September 05, 2005 10:37 PM

14 comments, last by JakeM 18 years, 7 months ago

JakeM

168

Author

September 05, 2005 10:37 PM

How to search for a binary byte pattern quickly in a large 100MB+ file? Hex editors can do it in a matter of seconds, while my first attempt did in a matter of extremely long minutes? This is what I tried. Yeah I know it looks bad because it is. :)


#define NUM_SEARCH_BYTES	8
//
 BYTE search_bytes[8];
 search_bytes[0] = 0x11;
 search_bytes[1] = 0x22;
 search_bytes[2] = 0x33;
 search_bytes[3] = 0x44;
 search_bytes[4] = 0x55;
 search_bytes[5] = 0x66;
 search_bytes[6] = 0x77;
 search_bytes[7] = 0x88;
//
 BYTE b;
//
 // search from start to end of file
 while(EOF) {

    Get_Byte(b);
    if(b == search_bytes[0]) {				// found 1st byte

       Get_Byte(b);
       if(b == search_bytes[1]) {			// found 2nd byte

          Get_Byte(b);
          if(b == search_bytes[2]) {			// found 3nd byte

             Get_Byte(b);
             if(b == search_bytes[3]) {			// found 4th byte

                Get_Byte(b);
                if(b == search_bytes[4]) {		// found 5th byte

                   Get_Byte(b);
                   if(b == search_bytes[5]) {		// found 6th byte

                      Get_Byte(b);
                      if(b == search_bytes[6]) {	// found 7th byte

                         Get_Byte(b);
                         if(b == search_bytes[7]) {	// found 8th byte

                            printf("FOUND IT!\n");
                         }
                      }
                   }
                }
             }
          }
       }
    }
 }
}

Trapper Zoid

1,370

September 05, 2005 10:40 PM

Have a look at the different types of string searching algorithms that have been developed. They'll work for binary patterns too (really, they are the same thing). If you Google around you'll probably find all the information and source code snippets you need.

Good luck!

[Edited by - Trapper Zoid on September 6, 2005 4:40:06 AM]

Drew_Benton

1,865

September 05, 2005 10:41 PM

I'd venture to say that perhaps the file is parsed into a tree of some sort when loaded. If this was done, searching the tree for the requested pattern becomes trivial and very fast! But then again, doing so would require a lot of memory if stored, so either they save the tree to a file or just use something different altogether.

JakeM

168

Author

September 05, 2005 10:44 PM

Assuming the pattern I want to search is binary, pattern is of any length and of any character (I don't care about text),
searching will be done sequentially in a large 100MB+ file (beginning to end), and I want to access the file
directly using fopen() (don't want to create any temp. copies in memory, file is too large), what would be
the best algorithm for me?

Conner McCloud

1,135

September 05, 2005 10:58 PM

Quote:Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM

JakeM

168

Author

September 05, 2005 11:08 PM

It would be nice if I could access the file like a buffer as it is. Does CreateFileMapping() have fast file access?
I'm only seeking sequentially forward, not jumping the file pointer around, file access is read only, so that should help?

iMalc

2,466

September 06, 2005 03:44 AM

Quote:Original post by JakeM

while(EOF) {    Get_Byte(b);    if(b == search_bytes[0]) {				// found 1st byte       Get_Byte(b);       if(b == search_bytes[1]) {			// found 2nd byte          Get_Byte(b);          if(b == search_bytes[2]) {			// found 3nd byte             Get_Byte(b);             if(b == search_bytes[3]) {			// found 4th byte                Get_Byte(b);                if(b == search_bytes[4]) {		// found 5th byte                   Get_Byte(b);                   if(b == search_bytes[5]) {		// found 6th byte                      Get_Byte(b);                      if(b == search_bytes[6]) {	// found 7th byte                         Get_Byte(b);                         if(b == search_bytes[7]) {	// found 8th byte                            printf("FOUND IT!\n");                         }                      }                   }                }             }          }       }    } }}

There is never any excuse for writing code with that much duplication. Do that in a job interview and your chances of being hired are slim to none. I'm very pleased that you yourself have recognised that it is bad.

As a matter of fact, not only is it bad, it's actually very wrong!
It may not find a string that is preceeded by a substring of the string you are searching for.
i.e. It will NOT find "abcabcde" in the string "hegafabcabcabcdecfegbadh".

Definately check out the link Trapper Zoid posted.

"In order to understand recursion, you must first understand recursion."
My website dedicated to sorting algorithms

ZQJ

496

September 06, 2005 04:51 AM

Quote:Original post by Conner McCloud
Quote:Original post by JakeM
...I want to access the file directly using fopen() (don't want to create any temp. copies in memory, file is too large)...

The entire file might be too large, but parts of it aren't. I don't believe fopen type functions buffer for you [not sure...shouldn't take too long to test, though], and file access is extremely slow. If you're going for speed, you want to be reading from memory as much as possible.

CM

fopen() et al do buffer - that's why getch() is generally implemented as a macro that just grabs one byte from the buffer if it's available. I wouldn't rely on its buffering for that size of file though. The fastest way would probably be to use Windows asynchronous file reading calls or using two threads on Unix, so you can be loading the file and scanning through it at the same time.

ttdeath

100

September 06, 2005 05:26 AM

I'm guessing here, but reading byte by byte from file may be the slowest part.

You really should read at least 64k (or other magic number) sized chunks and perform the check on the memory buffer. And since you are dealing with binary data, it may make sense to use the largest binary data type usable, __int32 or __int64.

something like that:

int k, i, j;__int64 buf[65536], seq;seq = ...;while(!EOF){   k = fread(buf, 65536, 1, yourfile) / 8; //k read bytes into k/8 _int64   i = 0;   while(i<k)   {      if(buf == seq)break;      i++;   }   if(i<k)_found(i);}

The above code doesn't take into consideration what happens when the first part of the binary string is found at the end of one read chunk and the second in the next read chunk, nor if the read chunk cannot be divided be 8.

-------------------My Company My RentACoder's Profile

toxigun

122

September 06, 2005 06:57 AM

I think it's quite obvious that you should use the boyer moore algorithm. It's the most widely used algorithm in pattern searching in editors and viewers.

Binary pattern searching

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Binary pattern searching

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines