Sign in to follow this  
godsenddeath

finding a string within a string

Recommended Posts

i'm making a word processor, and for the "find" function i was thinking of reading all the text into a buffer, then searching for the string within that buffer, so yeah, how would i find a string within a text buffer?

Share this post


Link to post
Share on other sites
Lots of different ways to do this. Probably whatyou want is to avoid doing comparisons on everything in the text area. What I mean by this is that if the user searches for 'Hello' and you have 10 words, 1 of which is Hello the others not appearing like Hello at all, do you really want to compare all strings completely? Not really. Hence, depending on your language restrictions you might want to think about generating an array that contains all strings in your text area -- so, read everything from text area and use whitespace as a delimiter (anything that is whitespace != string, we assume).

Once you've done that, look through your array and check the first letter of each string to see if it matches the search item first char and so forth. For things that do not match even the first char, you will be able to skip checking the rest of the stuff -- unless you want to get tricky and figure out if a word contains the search token.

Example: Search for 'cap'

Words that have cap in them: cap, handicapped, capsize (etc).

Finding 'cap' in handicapped means you have to search the entire contents of every single word in your text area. The naive approach of only matching the exact search term is much simpler to handle :)

Hopefully that gives you a hint -- I sure hope this is NOT homework :P.

~Shiny.

Share this post


Link to post
Share on other sites
To add to Shiny's;

if you want to treat the text as block, you can do what Shiny said, but skip every "searched word length" characters and check if the character is contained within your searched term. So if you have "I like doggies" and you sear for "tea" you would first take I, not in "tea," skip to 'i' in like, again not in tea, then skip to ' ', still not in tea, then to 'g', not in tea, then to 'e' - that is in 'tea' so you would then look of there's a 't' in front and 'a' behind.

This, of course, may turn out to run like shit, but more ideas never hurt, aye?

Share this post


Link to post
Share on other sites
CaspianB has offered the best advice so far: many of these things are already covered by the standard library of the language you're using.


Anyway, if you insist on doing this yourself (which you shouldn't, heh ;)), finding a substring isn't a tricky job at all: for every character in the string, you check if it's the first character of your substring. If it is, check if the next character in the string is the second character in the substring, and so on. I just wrote a function in Python that does exactly that and it works fine. There are probably ways to optimize it, but I'd only spend time on that if searches take too long.


@Shiny: I find your advice somewhat confusing. Searching for a substring usually does not take different words in the string into account, it's a per-character search operation. I don't see the need to tokenize a string first when that's not really what you want in the first place - unless that's what godsenddeath is after, of course. If that's the case, then your approach makes sense. To optimize this a bit, don't create an array of tokens, but check every first non-whitespace character that comes after a whitespace character against the substrings first character directly.

@Koobazaur: your approach is interesting, but also somewhat more complicated, introducing some special cases - without being more efficient. I wrote an implementation in Python and compared it to my proposed approach above and in most cases, it had to do more checks, especially when searching for long substrings. It's not too bad - between 0% and 6% more checks - but it's nothing better than the naive approach. Alas. :)

Share this post


Link to post
Share on other sites
If you need an already-written solution, then look in your language's standard library for that function. Almost every single language in existence has one of these.

If you are looking for common algorithms, then you have three main substring matching algorithms.

The first and easiest is the naive search: for every index i, does substring(haystack, i, i+strlen(needle)) equal needle ? The theoretical upper bound complexity here is quadratic, but in practice the equality test for a given i fails right away after approximately one comparison, which means for real-world texts the search is linear and reads every character once.

The second is the Knuth-Morris-Pratt algorithm, which constructs a nondeterministic finite automaton with ε-transitions, compiles that automaton to a very lightweight representation, and reads the text with that automaton (which, by design of automaton recognition, will act as if every character was read at most twice). This has a very low time complexity and is reasonably easy to implement once you understand the theory. However, it's not that better than naive search in practice.

The third is the Rabin-Karp algorithm, which computes in constant-time the hash of substring(haystack, i, i+strlen(needle)) and compares it to the hash of needle. This results in examining every character exactly once, plus any possible collisions (which means the sort works better on short needles written in human languages, and works badly on long random needles in small alphabets).

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

Sign in to follow this