Regex advice

Started by
4 comments, last by Zahlman 18 years, 8 months ago
Basically, I'm trying to find all the HTML image tags without the alt text attribute. Rather than manually searching through 130 FrontPage-generated HTML files, I was hoping I could find a regular expression that would point me to the offenders. The problem is that I just started learning about regular expressions about an hour ago. So is it even possible to get a match on "<img", but only when there's no "alt" on the same line? The closest thing I can come up with so far is
<img(?!alt)
but this matches any "<img" that isn't immediately followed by "alt", even when there is an "alt" later in the tag. Can anyone offer any help? Am I on the right path? Thanks.
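To see what that first attempt actually does, here is a small sketch using Python's re module (the thread never names a regex engine, so Python and the sample tags are assumptions). The lookahead only inspects the three characters right after "<img".

```python
import re

# The first attempt from the question, unchanged.
pattern = re.compile(r"<img(?!alt)")

samples = [
    '<img src="a.gif">',          # no alt at all
    '<img src="a.gif" alt="A">',  # alt present, but not right after "<img"
    '<imgalt>',                   # "alt" immediately after "<img"
]

results = [bool(pattern.search(s)) for s in samples]
print(results)  # [True, True, False]
```

Only the contrived third sample is rejected, so the pattern cannot tell the first two tags apart.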
You likely want to look into the use of . and *

Though for things like that, I prefer to use Perl (or a second grep call): first find the lines that match <img, then run a secondary search on just those lines. Much more straightforward.
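A minimal sketch of that two-pass idea, written in Python rather than Perl or grep (the language, the function name lines_missing_alt, and the sample text are assumptions for illustration). It works line by line, so an img tag that spans several lines could be misreported.

```python
import re

IMG_TAG = re.compile(r"<img\b", re.IGNORECASE)
ALT_ATTR = re.compile(r"\balt\s*=", re.IGNORECASE)

def lines_missing_alt(text):
    """Return (line_number, line) pairs that contain <img but no alt=."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Pass 1: keep only lines with an img tag.
        if not IMG_TAG.search(line):
            continue
        # Pass 2: among those, keep the ones with no alt attribute.
        if not ALT_ATTR.search(line):
            hits.append((number, line))
    return hits

sample = "\n".join([
    "<p>hello</p>",
    '<img src="a.gif">',
    '<img src="b.gif" alt="B">',
    '<IMG SRC="c.gif">',
])
found = lines_missing_alt(sample)
print(found)  # [(2, '<img src="a.gif">'), (4, '<IMG SRC="c.gif">')]
```

For the 130 files, the same function could be applied to each file's contents in a loop.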
Hmm. Assuming your regex implementation supports negative lookahead assertions, you could do something like this:

<img\s+(?![^>]*?\s+alt\s*=\s*['"]?).*?>

Breaking it down:

opening "<img" tag followed by at least one whitespace character
<img\s+

The negative lookahead assertion: it succeeds only if no alt attribute is found before the closing ">".
(?![^>]*?\s+alt\s*=\s*['"]?)

Breaking that down further: the alt= match looks for whitespace preceding "alt", followed by any amount of whitespace, then "=", then again any amount of whitespace, and then an optional quote character. Requiring the leading whitespace and the "=" allows the text "alt" to appear within other attributes, so long as it's not an attribute name itself. It will match the following:
alt=cows
alt = cows
alt = "cows"
alt='cows'
etc..
but not:
src="http://.../alternates/"
src="alt.gif"
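The two lists above can be checked with a short sketch in Python's re module (the engine is an assumption). Each sample is written with the leading space it would have inside a tag, since the sub-pattern requires whitespace before "alt".

```python
import re

# The alt-attribute part of the lookahead, taken on its own.
alt_attr = re.compile(r"""\s+alt\s*=\s*['"]?""")

should_match = [' alt=cows', ' alt = cows', ' alt = "cows"', " alt='cows'"]
should_not_match = [' src="http://.../alternates/"', ' src="alt.gif"']

matched = [bool(alt_attr.search(s)) for s in should_match]
not_matched = [bool(alt_attr.search(s)) for s in should_not_match]
print(matched)      # [True, True, True, True]
print(not_matched)  # [False, False]
```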




Then match anything up to the first closing ">". The '?' after .* makes it a non-greedy match.
.*?>
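Putting the pieces back together, here is a quick check of the full expression, sketched with Python's re module (an assumption; the samples are made up). One case worth knowing about: when alt is the first attribute, the \s+ in front of the lookahead has already consumed the whitespace that \s+alt needs, so the tag is still reported. The variant below, which uses \b instead, is a hypothetical tweak and not something from this thread.

```python
import re

# The expression from the post, unchanged.
posted = re.compile(r"""<img\s+(?![^>]*?\s+alt\s*=\s*['"]?).*?>""")

# Hypothetical variant (not from the thread): \b instead of \s+ leaves the
# whitespace in front of a leading alt attribute visible to the lookahead.
variant = re.compile(r"<img\b(?![^>]*\salt\s*=)[^>]*>")

samples = [
    '<img src="a.gif">',          # no alt: should be reported
    '<img src="a.gif" alt="A">',  # alt after src: should be skipped
    '<img alt="A" src="a.gif">',  # alt first: should be skipped
]

posted_hits = [bool(posted.search(s)) for s in samples]
variant_hits = [bool(variant.search(s)) for s in samples]
print(posted_hits)   # [True, False, True]
print(variant_hits)  # [True, False, False]
```

Neither pattern understands quoting, so text such as " alt=" inside another attribute's value would still fool both.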

Regular expressions are very powerful. I highly recommend getting a good reference book on them, or spending some time with the online tutorials and experimenting. Like most things, practice will make you better.

Disclaimer: The above expression may be buggy; there are many situations where it might not work. In my experience, any time I've used complex expressions I've had to revisit and tweak them many times. It may be easier for you to simply extract all the img tags and then run another expression on those tags to look for the 'alt' attribute.
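One way to take the "parse out all the img tags first" route is to skip regexes for the tag extraction and let an HTML parser do it. This is a sketch using Python's standard html.parser module (the choice of Python, the class name ImgAltChecker, and the sample markup are assumptions, not something from this thread).

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Collect (line_number, src) for every img tag with no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            line_number, _offset = self.getpos()
            self.missing.append((line_number, attributes.get("src")))

checker = ImgAltChecker()
checker.feed('<p>\n<img src="a.gif">\n<img alt="B" src="b.gif">\n<IMG SRC="c.gif">\n</p>')
checker.close()
print(checker.missing)  # [(2, 'a.gif'), (4, 'c.gif')]
```

Because the parser handles quoting, attribute order, and tags that span lines, the "alt" check reduces to a dictionary lookup.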

Some online regular expression resources:
http://www.zvon.org/other/reReference/Output/
Another quick-reference. JavaScript-regex oriented, but basic syntax is identical
Google search: http://www.google.com/search?hl=en&lr=&q=regular+expression+reference&btnG=Google+Search
That doesn't work, I don't think... suppose there is a src attribute and then an alt attribute: the src matches (asserts, really, since nothing is 'consumed') "something that's not an alt attribute", and then matches the rest of the tag successfully.

I think the best approach here is to match img, followed by zero or more attributes, where the "attribute" match negative-asserts 'alt'. Let's build it from the bottom up:

Attribute name:
[^>]+
Quoted item (attribute value):
['"][^>'"]*['"]
Attribute = attribute name, possible space, equals sign, possible space, and quoted item:
[^>]+\s*=\s*['"][^>'"]*['"]
Attribute not starting with alt:
(?!alt)[^>]+\s*=\s*['"][^>'"]*['"]
Some attributes not starting with alt, with spaces before them:
(?:\s+(?!alt)[^>]+\s*=\s*['"][^>'"]*['"])*
(I use the ?: construct to avoid capturing.)
Whole tag = open bracket, possible space, tag name "img", the previous regex (its leading \s+ also covers the space between 'img' and the first attribute, if there is one), possible space, and close bracket:
<\s*img(?:\s+(?!alt)[^>]+\s*=\s*['"][^>'"]*['"])*\s*>
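A quick check of the assembled expression, sketched with Python's re module (an assumption; the samples are made up). Because the attribute-name piece [^>]+ can stretch across spaces, quotes, and equals signs, it may swallow a whole preceding attribute together with the name "alt". The tightened version below swaps in [^\s=>]+ for the name; that change is a hypothetical adjustment, not something posted in this thread.

```python
import re

# The expression as posted.
posted = re.compile(r"""<\s*img(?:\s+(?!alt)[^>]+\s*=\s*['"][^>'"]*['"])*\s*>""")

# Hypothetical tightening: an attribute name may not contain whitespace, '=' or '>'.
tightened = re.compile(r"""<\s*img(?:\s+(?!alt)[^\s=>]+\s*=\s*['"][^>'"]*['"])*\s*>""")

samples = [
    '<img src="a.gif">',          # no alt: should be reported
    '<img src="a.gif" alt="A">',  # alt after src: should be skipped
    '<img alt="A" src="a.gif">',  # alt first: should be skipped
]

posted_hits = [bool(posted.search(s)) for s in samples]
tightened_hits = [bool(tightened.search(s)) for s in samples]
print(posted_hits)     # [True, True, False]
print(tightened_hits)  # [True, False, False]
```

Both versions require quoted values, so a tag such as <img src=a.gif> would not be reported by either.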

Whew. Sometimes it's better to use several expressions and/or combine them with programming logic. :)
Quote: Original post by Zahlman
That doesn't work, I don't think... suppose there is a src attribute and then an alt attribute: the src matches (asserts, really, since nothing is 'consumed') "something that's not an alt attribute", and then matches the rest of the tag successfully.


It does, and you can test it here, if you like
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

<img\s+(?![^>]*?\s+alt\s*=\s*['"]?)
will match "<img " and then scan ahead, without consuming anything, through the characters before the closing ">"; if it finds an alt attribute there, the match fails.
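For anyone without access to the tester, the same check can be sketched in Python's re module (the engine and the sample tags are assumptions). The matched text is only "<img " because a lookahead is zero-width.

```python
import re

# The expression quoted just above: the tag opening plus the negative lookahead.
prefix = re.compile(r"""<img\s+(?![^>]*?\s+alt\s*=\s*['"]?)""")

no_alt = prefix.search('<img src="a.gif">')
with_alt = prefix.search('<img src="a.gif" alt="A">')

print(repr(no_alt.group(0)))  # '<img '
print(with_alt)               # None
```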
Oh, I think I missed the ? in the '*?' near the beginning of that. :)

