Parsing a file name from URL with regex

Started by
7 comments, last by Splo 15 years, 11 months ago
Hi folks, I'm working on something where I have a list of URLs and need to extract the file name without the extension. I am not experienced enough with regular expressions to get just what I need, but here is what I have so far:
Given the URL: http://kraid/folder/someotherfolder/Guide/2562.html

[^/]*$      this will find the file name and extension.
\.[^\.]*$   this will find the extension and the period.
I could use both of these and it would probably work, but all I really need is the "2562" of the html. I cannot match against the rest of the URL (hard-code it) because I need to reuse this for multiple directories. It would be great if I could match the inverse of [^/]*$ and replace it with "", because then I could just match against the extension and have my magic number. I don't believe this is too difficult to solve, but it's been hard to find a solid resource that provides examples...
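For reference, the two patterns above can be chained in two steps. This is only a minimal sketch in Python (the language and the helper name `file_stem` are assumptions, nothing the original post specified): the first pattern grabs the file name plus extension, the second strips the extension from it.

```python
import re

def file_stem(url):
    # Step 1: everything after the last "/" (file name and extension).
    name = re.search(r"[^/]*$", url).group(0)
    # Step 2: delete the final period and whatever follows it.
    return re.sub(r"\.[^\.]*$", "", name)

print(file_stem("http://kraid/folder/someotherfolder/Guide/2562.html"))
```

Two passes are not elegant, but each pattern stays simple and nothing about the directory part has to be hard-coded.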
It's kind of sad. I could easily do it with c-string functions.
You just made me want to know regexp a bit better.
(Sorry if I couldn't help you - I am trying atm :P)

Edit:
I cannot do it in one step, but
[0123456789] gets you all numbers - in vim.
It could also be [0123456789]*.

If you use this as the second regexp, it should work - given there is no number in the extension.

[Edited by - hydroo on May 15, 2008 1:48:12 PM]
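A quick sketch of that two-step idea, written in Python rather than vim (an assumption; `first_digit_run` is a made-up helper name). It uses `+` instead of `*` so that an empty match is not reported as a result, and it only helps when the part you want is purely numeric.

```python
import re

def first_digit_run(url):
    # First regexp from the original post: file name plus extension.
    name = re.search(r"[^/]*$", url).group(0)
    # Second regexp: the first run of digits inside that name.
    m = re.search(r"[0-9]+", name)
    return m.group(0) if m else None
```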
Another way to look at it is I need to match the section between the last occurrence of "/" and the last occurrence of ".". The problem is that they are matched from left to right, so it ends up matching the first "/" and getting everything up to the extension.
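Stated that way, the problem can also be solved without any regex, which is presumably what the C-string remark earlier in the thread was getting at. A sketch in Python, assuming `str.rpartition` as a stand-in for strrchr-style searching from the right (`stem_between` is a hypothetical name):

```python
def stem_between(url):
    # Keep what follows the last "/" ...
    name = url.rpartition("/")[2]
    # ... then keep what precedes the last "." in that piece.
    stem, dot, _ext = name.rpartition(".")
    # With no "." at all, rpartition leaves the whole name in the last
    # slot, so fall back to the name itself.
    return stem if dot else name
```

Searching from the right sidesteps the left-to-right matching issue entirely, at the cost of not being a single pattern.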
Yeah you have to include the $ in some slick way. - dunno how myself :(
I don't know a lot about regex either, but this might help you:

www.regular-expressions.info
PCRE Workbench

If you have a library that enables you to get not only the matching text but also the so-called "capture groups", you can use the following regex (might be pretty dumb ^^) and take capture group #1, which results in "2562".

.*/([^./]*)\.*


The only capture group in this regex is the part which stands in parentheses!
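Here is roughly how reading capture group #1 looks in practice. The sketch assumes Python's `re` module as the library; any engine that exposes capture groups should behave similarly.

```python
import re

url = "http://kraid/folder/someotherfolder/Guide/2562.html"

# The greedy ".*/" runs to the last "/"; group 1 then stops at the
# first "." or "/" it meets.
m = re.match(r".*/([^./]*)\.*", url)
stem = m.group(1) if m else None
print(stem)
```

Because the class `[^./]` excludes dots, a name such as "the.result.html" would yield only "the" with this pattern.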
(?<magic>[^/]+)\..+$

I think that will capture what you want in the "magic" group.
That's .NET Regex syntax, not sure what other regex flavors use for capturing groups.
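For comparison, Python's `re` module spells a named group `(?P<name>...)`. The sketch below ports the pattern on that assumption (the names `LOOSE`, `STRICT` and `magic` are mine). It also shows a caveat: the search is not anchored on the left and `.+` is free to cross slashes, so a dot in the host name gets matched first. `STRICT` is one possible tightening, not the original poster's pattern.

```python
import re

# Direct port of the .NET pattern to Python's named-group syntax.
LOOSE = re.compile(r"(?P<magic>[^/]+)\..+$")
# Hypothetical variant: the extension may not contain "/", which pins
# the match to the last path segment.
STRICT = re.compile(r"(?P<magic>[^/]+)\.[^/]+$")

def magic(pattern, url):
    m = pattern.search(url)
    return m.group("magic") if m else None
```

With the original URL both patterns give "2562". With a host like mysite.com, `LOOSE` captures "mysite", while `STRICT` still finds the file name.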

I love regex, honestly, that's just so much fun! :)

Yeah for your problem, you may need to use capture groups which are really useful and powerful.
Example:
http://[^/]+/(.+/)([^/]+)\.(.+)$

Applied to your URL, and using the following string for replacement:
Path: \1 | File: \2 | Extension: \3

The result:
Path: folder/someotherfolder/Guide/ | File: 2562 | Extension: html
So \1 gives what has been captured by the first pair of parentheses, \2 by the second one, etc.

Applied to:
http://mysite.com/path/to/the/target.directory/test/the.result.html
(Watch out for the dots inside the URL!)
Result:
Path: path/to/the/target.directory/test/ | File: the.result | Extension: html
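The same search-and-replace can be reproduced with a few lines of code. A sketch assuming Python's `re.sub`, where `\1`, `\2` and `\3` in the replacement string refer to the capture groups (`describe` is a hypothetical helper name):

```python
import re

PATTERN = r"http://[^/]+/(.+/)([^/]+)\.(.+)$"
TEMPLATE = r"Path: \1 | File: \2 | Extension: \3"

def describe(url):
    # The pattern matches the whole URL, so the whole URL is replaced
    # by the filled-in template.
    return re.sub(PATTERN, TEMPLATE, url)

print(describe("http://kraid/folder/someotherfolder/Guide/2562.html"))
```

One thing to keep in mind: `(.+/)` needs at least one directory, so a file sitting directly under the host (for example http://kraid/2562.html) does not match and comes back unchanged.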

Quote:Original post by hydroo
I cannot do it in one step, but
[0123456789] gets you all numbers - in vim.
It could also be [0123456789]*.

If you use this as the second regexp, it should work - given there is no number in the extension.
[0-9] is faster to type than [0123456789]. :)

[Edited by - Splo on May 16, 2008 3:24:04 AM]
Would a greedy dot solve the leading slash problem?

I've only really used regexes a lot in .NET, so if this isn't the same format as what you need, I apologize. Also, I'm not sure if the backslash is required inside a character class to escape a dot, either.

.*/(?<capture>[^\\.]+)
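A quick way to check the greedy-dot idea, sketched in Python (an assumption, so the named group is written `(?P<capture>...)`; the function name is mine). On the escaping question: inside a character class a dot is already literal, so `[^.]` is enough. Taken literally as a raw pattern, `[^\\.]` excludes a backslash as well as a dot, which is harmless for URLs.

```python
import re

def capture(url):
    # ".*" first swallows the whole string, then backtracks until a "/"
    # can match, which is the last one. The group stops at the first ".".
    m = re.search(r".*/(?P<capture>[^\\.]+)", url)
    return m.group("capture") if m else None
```

So yes, the greedy dot takes care of the leading slashes, though a name with extra dots is cut at the first one.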
Here's a simpler version that takes care of filenames with dots before the extension (such as "foo.bar.html"):
.*/([^/]+)\.[^\.]+$
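To try this last pattern out, a minimal sketch in Python (the language and the names `STEM` and `stem` are assumptions):

```python
import re

STEM = re.compile(r".*/([^/]+)\.[^\.]+$")

def stem(url):
    # Group 1 is the last path segment minus its final ".extension".
    m = STEM.match(url)
    return m.group(1) if m else None

for u in ("http://kraid/folder/someotherfolder/Guide/2562.html",
          "http://kraid/Guide/foo.bar.html"):
    print(stem(u))
```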
