deadlydog

"Simple" regular expression

deadlydog    170
I need a JavaScript regular expression that searches for a word within a string and matches only if the word is a whole word and is not contained within double quotes.

Finding the _Word itself is simple. I just use: new RegExp("\\b" + _Word + "\\b", "i"); which creates a regular expression that matches the given _Word in a string, making sure it is a whole word (not part of another word).

My problem is making the regular expression also refuse to match the word if it is contained in quotes, such as "_Word", or "some times _Word happens". I'm sure this is a simple problem for those who speak regexp often. Any help would be appreciated. Thanks in advance.

[Edited by - deadlydog on July 12, 2007 10:09:12 AM]

rollo    366
You should be able to use a negative lookahead to make sure it matches properly.

This seems to work fine in Ruby:

regexp = /\b(?!\")myfancyword(?!\")\b/
regexp.match 'I like myfancyword'
=> Match
regexp.match 'I like "myfancyword"'
=> No Match

I don't know whether JavaScript supports lookahead, though...
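JavaScript does support negative lookahead, so the same pattern can be tried there. The following is only a minimal sketch using the sample word from the post; the third test string is an assumption added for illustration. Note that the lookahead placed before the word inspects the word's own first letter, so only the trailing check does any work, and a word that sits inside quotes without touching them still matches.

```javascript
// Same pattern as the Ruby example, written as a JavaScript regex literal.
const regexp = /\b(?!")myfancyword(?!")\b/;

const results = {
  // No quotes at all: matches.
  plain: regexp.test('I like myfancyword'),
  // Closing quote directly after the word: rejected by the trailing lookahead.
  quotedTight: regexp.test('I like "myfancyword"'),
  // Word inside quotes but not adjacent to them: still matches.
  quotedLoose: regexp.test('I like "my own myfancyword here"'),
};

console.log(results);
```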

deadlydog    170
I believe I tried that regular expression. It worked when the _Word had a double quote directly in front of it or directly behind it, but not otherwise. For example, it would handle "_Word ..." and "..._Word", but not "..._Word...". I will try your method again tomorrow when I get into work just to be sure, but I'm pretty sure it has that problem. Are there any other suggestions? Thanks.

Zahlman    1682
I doubt your problem is really as simple as that. In most situations where you care about "whether something is inside a double-quoted string", it's because you're parsing something source-code-like - which means you also have to handle escaped quotes within the string.

What I would do is first write a regexp that detects double-quoted strings:

"(\\.|[^\\"])*"


That is, a quote, followed by zero or more things which are each either a backslash followed by any character (as that would always be part of the string) or a non-quote, non-backslash character, followed by a quote. (Actually, detecting escape sequences properly might be a *little* more complicated.)

Replace all instances of this pattern with nothing (in a new string, if you need to leave the original intact). Then search the *remaining* text for the word.

If you have to replace the word in the original string - good luck :)
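A rough sketch of the strip-then-search idea in JavaScript follows. The helper name `containsUnquotedWord` is hypothetical, the word is assumed to contain no regular-expression metacharacters, and the quoted strings are replaced with a space rather than with nothing (a small deviation, so that text on either side of a quoted string does not fuse into one word).

```javascript
// Matches a double-quoted string, allowing backslash-escaped characters inside.
const QUOTED_STRING = /"(\\.|[^\\"])*"/g;

// Hypothetical helper: true if `word` occurs as a whole word outside quotes.
function containsUnquotedWord(text, word) {
  // Blank out every quoted string; the original text is left untouched.
  const stripped = text.replace(QUOTED_STRING, ' ');
  // Search what remains for the whole word, ignoring case.
  return new RegExp('\\b' + word + '\\b', 'i').test(stripped);
}

console.log(containsUnquotedWord('I like dogs', 'dogs'));
console.log(containsUnquotedWord('I like "dogs"', 'dogs'));
```

This only answers whether the word is present; as noted above, mapping a hit back to a position in the original string is the hard part.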

deadlydog    170
Thanks for the replies. Zahlman, while your idea might work, I can't use it for what I am trying to do. I basically have a function which returns a regular expression which can find the given word in a string...I am not actually given the string to search through or anything like that. So my function to get the regular expressions looks like:

function GetRegularExpressionToFindWholeWord(_Word)
{
    return new RegExp("\\b" + _Word + "\\b", "i");
}

Now, I know how to find the whole _Word in a string (by simply using the regular expression above), and I know how to find if the given _Word is between quotes, using

new RegExp("\".*" + _Word + ".*\"", "i");

I am just not sure how I can combine the two into one regular expression, since regular expressions don't seem to have an AND operator (even though they have an OR operator, which seems weird to have one and not the other). Any other suggestions on my problem would be greatly appreciated. Thanks.
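One way to get an AND-like effect in a single expression is a lookahead, which adds a second condition at the same position without consuming text. The sketch below is not something proposed in this thread; it assumes quotes are balanced and never escaped, and the function name is hypothetical. It accepts the word only when an even number of double quotes remains between the word and the end of the string.

```javascript
// Hypothetical helper; `word` is assumed to contain no regex metacharacters.
function unquotedWordByLookahead(word) {
  // \bword\b            : the whole word
  // (?=(?:[^"]*"[^"]*")*[^"]*$) : AND an even number of quotes follows it
  return new RegExp('\\b' + word + '\\b(?=(?:[^"]*"[^"]*")*[^"]*$)', 'gi');
}

const sample = 'I like dogs and "dogs" and dogs';
const swapped = sample.replace(unquotedWordByLookahead('dogs'), 'cats');

console.log(swapped);
```

Because the lookahead consumes nothing, the match covers only the word itself, which can matter if the expression is later used for replacing.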

Vorpy    869
Your expression to find the word between quotes won't work if there are multiple quoted sections within the string being searched. For example:

"This is quoted" Match this _Word "But not this one! _Word"

Both _Words are between quotes (the first quote in the string and the last one), but you only want to match the one that is outside of matching quotes.
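The effect is easy to reproduce. In this small sketch, which uses the expression from the previous post and the example string above, the greedy `.*` stretches from the first quote in the string to the last one, so the match swallows the unquoted _Word as well.

```javascript
const word = '_Word';

// The "is it between quotes" expression from the earlier post.
const betweenQuotes = new RegExp('".*' + word + '.*"', 'i');

const text = '"This is quoted" Match this _Word "But not this one! _Word"';
const match = text.match(betweenQuotes);

// The match spans the entire string, not just the second quoted section.
const coversWholeText = match !== null && match[0] === text;
console.log(coversWholeText);
```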

I think this can be done by matching 0 or more pairs of quotes, followed by 0 or more non-quote characters, followed by the word you are trying to match.

deadlydog    170
Ahh, thank you, I did not notice that, but it is a feature I will want. I still am not sure how to get it to work with finding a whole word though. Any more suggestions anyone? Thanks

deadlydog    170
Does nobody here know regular expressions well enough to solve this problem? I figured this would be a simple problem, but I guess it's harder than I thought. I'm still open to any suggestions anyone might have. Thanks.

Nathan Baum    1027
This seems to work:

^([^"]|("(([^"\\]|\\.)+)"))*\b(_Word)\b

It matches _Word at word boundaries, preceded by an even (possibly zero) number of unescaped quotation marks. That ensures that _Word isn't inside a string.

Vorpy    869
I think this regex works:

^([^"]|("[^"]*"))*\b(word)\b

It matches any number of non-quote characters or quoted strings and then the word you are looking for.
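Here is a quick sketch of how that pattern behaves in JavaScript. The wrapper name `unquotedWordRegex` is hypothetical, and the word is assumed to contain no regular-expression metacharacters. Note that the overall match starts at the beginning of the string, so it includes everything before the word.

```javascript
// Hypothetical wrapper around the pattern above.
function unquotedWordRegex(word) {
  return new RegExp('^([^"]|("[^"]*"))*\\b(' + word + ')\\b', 'i');
}

const re = unquotedWordRegex('_Word');
const text = '"This is quoted" Match this _Word "But not this one! _Word"';
const match = text.match(re);

// match[0] is the whole matched prefix plus the word; match[3] is the word.
console.log(match[0]);
console.log(match[3]);
console.log(re.test('"But not this one! _Word"'));
```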

deadlydog    170
Awesome. Thanks for the replies guys! I tried both solutions and they both seemed to work the same. This is what I ended up using:

return new RegExp("^([^\"]|(\"(([^\"])*)\"))*\\b(" + _Word + ")\\b");

My problem now is that one of the functions which uses this regular expression uses it to find and replace the given word. The problem is that this regular expression matches not only the _Word, but also everything before it. So for example if I try to replace 'dogs' with 'apples' in the sentence "I like dogs and cats", instead of getting "I like apples and cats", I get "apples and cats". Does anybody know a way I can get around this so the regular expression matches only the _Word, and not everything before it?

I thought about using the $1,$2,... parameters to try and remember all of the text before the _Word so I could replace the word with "$1$2$3$4 apples", but the $1 variables always appear to be blank (I was using 4 array values since there are 4 '(' brackets before the brackets around the _Word).

Any suggestions would be appreciated. Thanks.

Vorpy    869
Add another set of parentheses that contains the first pair as well as the * immediately after it. This will make $1 correspond to the part of the string that matched before the word. Each pair of capturing parentheses corresponds to one of the $x values, numbered by the position of its opening parenthesis (in a JavaScript replacement string the whole match is $&). So all you need is a pair of parentheses that captures everything before the word.
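A small sketch of the difference, using the 'dogs' example from the previous post and a simplified form of the pattern (both regex literals here are written out for illustration). A capturing group inside a repetition keeps only its last iteration, which is why the inner groups are of little use for rebuilding the prefix; a group wrapped around the whole repetition keeps all of it.

```javascript
const text = 'I like dogs and cats';

// Group 1 is inside the repetition: it ends up holding only the last
// character consumed before the word.
const inner = /^([^"]|("[^"]*"))*\b(dogs)\b/i;

// Extra outer parentheses around the repetition and its *:
// group 1 now holds the entire text before the word.
const outer = /^(([^"]|("[^"]*"))*)\b(dogs)\b/i;

const innerGroup1 = text.match(inner)[1];
const outerGroup1 = text.match(outer)[1];
const replaced = text.replace(outer, '$1apples');

console.log(JSON.stringify(innerGroup1));
console.log(JSON.stringify(outerGroup1));
console.log(replaced);
```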

deadlydog    170
Haha, I actually thought of this and tried it right before you posted about it, and it works. So for anyone who cares, this is what my final regular expression looks like:

return new RegExp("(^(?:[^\"]|(?:\"(?:(?:[^\"])*)\"))*)\\b(" + _Word + ")\\b", "i");

So everything before the word is stored in RegExp.$1, and the word itself is stored in RegExp.$2.

Thanks for all the help guys!!
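For anyone who wants to try the final expression, here is one possible way to wrap it for find-and-replace. This is a sketch under assumptions, not the code actually used above: `escapeRegExp` and `replaceUnquotedWord` are hypothetical helpers, and escaping the word is an addition so that characters such as `.` or `(` in the word are matched literally.

```javascript
// Hypothetical helper: escape regex metacharacters in `text`.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Same pattern as the final expression in the thread, with the word escaped.
function getRegularExpressionToFindWholeWord(word) {
  return new RegExp(
    '(^(?:[^"]|(?:"(?:(?:[^"])*)"))*)\\b(' + escapeRegExp(word) + ')\\b',
    'i'
  );
}

// Hypothetical helper: replace one unquoted whole-word occurrence,
// putting back the captured prefix (group 1) in front of the replacement.
function replaceUnquotedWord(text, word, replacement) {
  return text.replace(
    getRegularExpressionToFindWholeWord(word),
    // A function replacer inserts `replacement` literally, even if it contains "$".
    (match, prefix) => prefix + replacement
  );
}

console.log(replaceUnquotedWord('I like dogs and cats', 'dogs', 'apples'));
```

Because the pattern is anchored at the start of the string, each call replaces at most one occurrence; replacing every unquoted occurrence would need repeated calls or a different approach.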
