Jump to content
  • Advertisement
Sign in to follow this  
AndreTheGiant

Simple String Search Index

This topic is 2988 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.

Recommended Posts

Ive got an array of strings. Its a pretty big array, and the strings themselves are pretty big (paragraphs, not single words).

I want to write code allowing someone to search the strings and return the index in the array where the string was found.

So far I have done the naive approach:

for each string in the array
if string.indexOf(searchString) >= 0
return array index


I want to speed up the algorithm, which I guess means pre-computing some kind of lookup index (right? - or is there a better algorithm without pre-computing?).

I dont need to write a fancy google search index or anything. Just something that will significantly speed up searches. The simpler the better actually, as long as its faster than the above.

Any ideas or suggestions?

Share this post


Link to post
Share on other sites
Advertisement
Most speed ups for string searching depend on making some assumptions about either the needle string or the haystack string and you haven't indicated what shortcuts are acceptable (for example, can you limit the searches to whole or partial words or do you have to allow for search strings that cross word boundaries i.e. "ptions abo").

One speed up that will work in any case is the Boyer-Moore or Turbo Boyer-Moore algorithm which, for a small amount of set up per search, allows quicker searching through a string. There are lots of worked examples, source and pseudo-code available for these algorithms on the internet, but do be careful if you try to alter them. I made the mistake of attempting to change one set of source between using signed numbers for string indicies to unsigned numbers and found bugs for months.

Share this post


Link to post
Share on other sites
Yeah I remember learning a lot of cool string search algorithms in a bioinformatics course in Uni. (eg Searching for ACGATCCA in a one million character long string of ACGATATA....)

The reason I didnt implement one of those algorithms is, I wasnt sure that it would end up being faster than C#'s String.indexOf() function. I figured C# did some fancy low level stuff, and me implementing one of those string search algorithms would end up being slower. Does anyone have any evidence one way or another? And if I can see some real speed gains by implementing one of the above algorithms, then why doesnt C# use those same algorithms for its indexOf() (and similar) functions?


Anyway I was hoping to tackle the problem from a higher level. My first idea was to precompute, for each paragraph, which letters are not contained in the paragraph. So for example if a paragraph does not contain the letter 'z', then any search containing a 'z' can immediately skip that paragraph. I quickly realized this was pretty naive because the distribution of letters would be all wrong. IE, probably every paragraph would contain the letter 'a'. This technique would really only help if my search word contained a wierd letter like 'z', and even then its not much of an improvement.

I once downloaded an 'indexer' which analyzed some text files on my computer, and built an index. Then I could search against that index, and find strings of text a lot faster than doing a brute force search on the files themselves. What I really want to know, is what kind of things was that indexer doing, and which of the simpler techniques can I implement easily.

Thanks!

Share this post


Link to post
Share on other sites
Quote:
Original post by AndreTheGiant
Ive got an array of strings. Its a pretty big array, and the strings themselves are pretty big (paragraphs, not single words).


Just how big is it?

Quote:
will significantly speed up searches.

How long is it taking right now?

Quote:
then why doesnt C# use those same algorithms for its indexOf() (and similar) functions?

Try the RegEx instead, it should use Boyer Moore.

Share this post


Link to post
Share on other sites
Sign in to follow this  

  • Advertisement
×

Important Information

By using GameDev.net, you agree to our community Guidelines, Terms of Use, and Privacy Policy.

We are the game development community.

Whether you are an indie, hobbyist, AAA developer, or just trying to learn, GameDev.net is the place for you to learn, share, and connect with the games industry. Learn more About Us or sign up!

Sign me up!