kingnosis

Regex advice

This topic is 4845 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


Basically, I'm trying to find all the HTML image tags without the alt text attribute. Rather than manually searching through 130 FrontPage-generated HTML files, I was hoping I could find a regular expression that would point me to the offenders. The problem is that I just started learning about regular expressions about an hour ago. So is it even possible to get a match on "<img", but only when there's no "alt" on the same line? The closest thing I can come up with so far is
<img(?!alt)
but this matches any "<img" that isn't immediately followed by "alt", whether or not there is an "alt" later on the line. Can anyone offer any help? Am I on the right path? Thanks.

You likely want to look into the use of . and *

Though for things like that, I prefer to use perl [or a second grep call] and then just run a secondary search on the lines that match <img. Much more straightforward.
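
For what it's worth, the two-pass idea above could be sketched in Python roughly like this. The function name and the sample text are made up for illustration, and it assumes each img tag sits on a single line, as in the original question:

```python
import re

# Pass 1: lines that contain an <img tag. Pass 2: drop those that also have alt=.
IMG = re.compile(r"<img\b", re.IGNORECASE)
ALT = re.compile(r"\balt\s*=", re.IGNORECASE)

def lines_missing_alt(text):
    """Return (line_number, line) pairs that contain <img but no alt=."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if IMG.search(line) and not ALT.search(line):
            hits.append((number, line.strip()))
    return hits

sample = """<p>Hello</p>
<img src="logo.gif" alt="Logo">
<img src="spacer.gif" width="1">
"""
print(lines_missing_alt(sample))
```

A line holding two img tags, only one of which has alt, would slip through this check, so treat it as a first filter rather than a guarantee.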

Hmm. Assuming your regex implementation supports negative lookahead assertions, you could do something like this:

<img\s+(?![^>]*?\s+alt\s*=\s*['"]?).*?>

Breaking it down:

opening "<img" tag followed by at least one whitespace character
<img\s+

Assert that no alt attribute appears before the closing ">". (this is the negative lookahead assertion)
(?![^>]*?\s+alt\s*=\s*['"]?)

Breaking that down further, the alt=" match looks for whitespace preceding "alt", followed by any amount of whitespace, followed by "=", followed again by any amount of whitespace, and then an optional quote character. This allows the text "alt" to appear within any other attribute, so long as it's not an attribute itself. This will match the following:
alt=cows
alt = cows
alt = "cows"
alt='cows'
etc..
but not:
src="http://.../alternates/"
src="alt.gif"




And then match anything up to the first closing ">". The '?' after .* makes it a non-greedy match.
.*?>
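
A quick way to try the expression is to run it over a few sample tags. The Python sketch below uses the expression unchanged. The sample tags and the VARIANT pattern are my own assumptions, not part of the post. One case worth checking is alt as the very first attribute: the \s+ after "<img" has already consumed the whitespace that the lookahead wants to see before "alt", so that tag still gets reported. The variant moves the whitespace inside the lookahead to cover it.

```python
import re

# The expression from the breakdown above, unchanged.
NO_ALT = re.compile(r"""<img\s+(?![^>]*?\s+alt\s*=\s*['"]?).*?>""")

# Hypothetical variant: \b instead of \s+, so the lookahead still sees the
# whitespace in front of a leading alt attribute.
VARIANT = re.compile(r"<img\b(?![^>]*\salt\s*=)[^>]*>")

tags = [
    '<img src="cow.gif">',             # no alt
    '<img src="cow.gif" alt="cows">',  # alt after another attribute
    '<img src="alt.gif" width="10">',  # "alt" only inside a value
    '<img alt="cows" src="cow.gif">',  # alt as the first attribute
]

original = [bool(NO_ALT.search(tag)) for tag in tags]
variant = [bool(VARIANT.search(tag)) for tag in tags]
print(original)  # [True, False, True, True]
print(variant)   # [True, False, True, False]
```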

Regular expressions are very powerful. I highly recommend getting a good reference book on them, or spending some time looking at the online tutorials and playing with them. Like most things, practice will make you better.

Disclaimer: The above expression may be bugged. There are many situations where it might not work. In my experience, any time I've used complex expressions I've had to revisit and tweak them many times. It may be easier for you to simply parse out all the img tags and then do another expression on those tags to look for the 'alt' parameter.
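
The simpler route mentioned in the disclaimer might look something like this in Python: pull out every img tag first, then test each one for an alt attribute. The names here are illustrative, and it assumes no attribute value contains a literal ">".

```python
import re

# Step 1: grab whole img tags; [^>]* also crosses newlines, so tags that
# wrap over several lines are captured in one piece.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Step 2: look for an alt attribute inside a single tag.
ALT_ATTR = re.compile(r"\salt\s*=", re.IGNORECASE)

def img_tags_missing_alt(html):
    """Return every img tag in html that has no alt attribute."""
    return [tag for tag in IMG_TAG.findall(html) if not ALT_ATTR.search(tag)]

page = '<p><img src="a.gif" alt="A"><img\n  src="b.gif"\n  width="5"></p>'
print(img_tags_missing_alt(page))
```

Because each tag is checked on its own, two tags on one line or one tag spread over several lines are both handled.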

Some online regular expression resources:
http://www.zvon.org/other/reReference/Output/
Another quick reference. JavaScript-regex oriented, but the basic syntax is identical
Google search

That doesn't work, I don't think... suppose there is a src attribute and then an alt attribute: the src matches (asserts, really, since nothing is 'consumed') "something that's not an alt attribute", and then matches the rest of the tag successfully.

I think the best approach here is to match img, followed by zero or more attributes, where the "attribute" match negative-asserts 'alt'. Let's build it from the bottom up:

Attribute name (anything up to whitespace, "=" or ">", so that it can't run across several attributes):
[^\s=>]+
Quoted item (attribute value):
['"][^>'"]*['"]
Attribute = attribute name, possible space, equals sign, possible space, and quoted item:
[^\s=>]+\s*=\s*['"][^>'"]*['"]
Attribute not starting with alt:
(?!alt)[^\s=>]+\s*=\s*['"][^>'"]*['"]
Some attributes not starting with alt, with spaces before them:
(?:\s+(?!alt)[^\s=>]+\s*=\s*['"][^>'"]*['"])*
(I use the ?: construct to avoid capturing.)
All contents of the tag = open bracket, possible space, tag name "img", the previous regex (the "space" matched before the first-if-present attribute handles the space between 'img' and the first attribute), possible space, and close bracket:
<\s*img(?:\s+(?!alt)[^\s=>]+\s*=\s*['"][^>'"]*['"])*\s*>

Whew. Sometimes it's better to use several expressions and/or combine them with programming logic. :)
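
For anyone who wants to experiment, the same bottom-up construction can be written out in Python by gluing the pieces together as strings. This is only a sketch: the variable names are mine, and the attribute name is restricted to [^\s=>]+ so that it cannot run across several attributes.

```python
import re

NAME = r"[^\s=>]+"                 # attribute name
VALUE = r"""['"][^>'"]*['"]"""     # quoted attribute value
ATTRIBUTE = NAME + r"\s*=\s*" + VALUE
# Zero or more attributes, none of which starts with "alt".
NON_ALT_ATTRIBUTES = r"(?:\s+(?!alt)" + ATTRIBUTE + r")*"
IMG_WITHOUT_ALT = re.compile(r"<\s*img" + NON_ALT_ATTRIBUTES + r"\s*>")

checks = {
    '<img src="cow.gif" width="10">': True,   # no alt: reported
    '<img src="alt.gif">': True,              # "alt" only inside a value
    '<img src="cow.gif" alt="cows">': False,  # alt present: skipped
    '<img alt="cows" src="cow.gif">': False,  # alt first: skipped
}
for tag, expected in checks.items():
    assert bool(IMG_WITHOUT_ALT.search(tag)) == expected
print("all checks passed")
```

Note that this pattern only understands quoted values, so a tag with something like width=10 would not match even when it has no alt.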

Quote:
Original post by Zahlman
That doesn't work, I don't think... suppose there is a src attribute and then an alt attribute: the src matches (asserts, really, since nothing is 'consumed') "something that's not an alt attribute", and then matches the rest of the tag successfully.


It does, and you can test it here, if you like
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html

<img\s+(?![^>]*?\s+alt\s*=\s*['"]?)
will match "<img " and then look ahead, without actually consuming anything, through all characters up to the closing ">". If it finds an alt attribute on the way, the match fails.
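
If no online tester is handy, the src-then-alt case can be checked in a few lines of Python. The sample tags are made up. Since a lookahead is zero-width, the text actually matched by this fragment is just "<img ", and a tag with alt after src produces no match at all.

```python
import re

# Only the opening part of the expression, as quoted above.
PREFIX = re.compile(r"""<img\s+(?![^>]*?\s+alt\s*=\s*['"]?)""")

without_alt = PREFIX.search('<img src="cow.gif" width="10">')
with_alt = PREFIX.search('<img src="cow.gif" alt="cows">')

print(repr(without_alt.group(0)))  # '<img '
print(with_alt)                    # None
```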
