Maxamor

Parsing a file name from URL with regex

Maxamor    361
Hi folks, I'm working on something where I have a list of URLs and need to extract the file name without the extension. I am not experienced enough with regular expressions to get just what I need, but here is what I have so far:
Given the URL: http://kraid/folder/someotherfolder/Guide/2562.html

[^/]*$      this will find the file name and extension.
\.[^\.]*$   this will find the extension and the period.
I could use both of these and it would probably work, but all I really need is the "2562" from 2562.html. I cannot match against the rest of the URL (hard-code it) because I need to reuse this for multiple directories. It would be great if I could match the inverse of [^/]*$ and replace it with "", because then I could just match against the extension and have my magic number. I don't believe this is too difficult to solve, but it's been hard to find a solid resource that provides examples...
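
For illustration, here is a minimal sketch of the two-step idea using the two patterns above. Python is an assumption (the post does not name a language), and the helper name `stem_two_step` is hypothetical.

```python
import re

URL = "http://kraid/folder/someotherfolder/Guide/2562.html"


def stem_two_step(url: str) -> str:
    # Step 1: keep only what follows the last "/" (file name plus extension).
    # This search always succeeds, because the pattern can match an empty
    # string at the very end of the input.
    name = re.search(r"[^/]*$", url).group(0)
    # Step 2: delete the final "." and everything after it.
    return re.sub(r"\.[^.]*$", "", name)


print(stem_two_step(URL))  # 2562
```

Only the last extension is removed, so a name such as "foo.bar.html" would come back as "foo.bar".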

hydroo    295
It's kind of sad. I could easily do it with c-string functions.
You just made me want to know regexp a bit better.
(Sorry if I couldn't help you - I am trying at the moment :P)

Edit:
I cannot do it in one step, but
[0123456789] matches any single digit - in vim.
It could also be [0123456789]* to match a whole run of digits.

If you use this as the second regexp, it should work - given there is no number in the extension.
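
A rough sketch of this two-pass idea, written in Python rather than vim (an assumption; the function name `digits_in_file_name` is hypothetical). It uses `+` instead of `*`, because `*` is also satisfied by an empty match.

```python
import re
from typing import Optional


def digits_in_file_name(url: str) -> Optional[str]:
    # First pass: everything after the last "/" (file name plus extension).
    name = re.search(r"[^/]*$", url).group(0)
    # Second pass: the first run of one or more digits in that name.
    digits = re.search(r"[0-9]+", name)
    return digits.group(0) if digits else None


print(digits_in_file_name("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
```

The caveat about the extension shows up with a name like "track.mp3", where the only digits found are the "3" of the extension.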

[Edited by - hydroo on May 15, 2008 1:48:12 PM]

Maxamor    361
Another way to look at it is that I need to match the section between the last occurrence of "/" and the last occurrence of ".". The problem is that patterns are matched from left to right, so it ends up matching from the first "/" and getting everything up to the extension.
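
The left-to-right behaviour can be reproduced with a small Python sketch (Python is an assumption here; the variable names are only for illustration):

```python
import re

url = "http://kraid/folder/someotherfolder/Guide/2562.html"

# The engine takes the leftmost position where a match can start, so the
# match begins at the first "/" (the one in "http://"), not the last one.
leftmost = re.search(r"/.*\.", url).group(0)
print(leftmost)  # //kraid/folder/someotherfolder/Guide/2562.

# Forbidding "/" between the slash and the dot narrows the match down to
# the last path segment for this particular URL.
last_segment = re.search(r"/([^/]*)\.", url).group(1)
print(last_segment)  # 2562
```

The second pattern only works here because no directory name in this URL contains a dot; without an anchor at the end of the string, a dotted directory would be matched first.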

AndiDog    145
I don't know a lot about regex either, but this might help you:

www.regular-expressions.info
PCRE Workbench

If you have a library that gives you not only the matching text but also the so-called "capture groups", you can use the following regex (might be pretty dumb ^^) and take capture group #1, which results in "2562".

.*/([^./]*)\.*


The only capture group in this regex is the part which stands in parentheses!
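
As a sketch of what reading capture group #1 looks like in practice, here is the regex above run through Python's `re` module (the choice of Python and the helper name `capture_group_one` are assumptions):

```python
import re
from typing import Optional

PATTERN = re.compile(r".*/([^./]*)\.*")


def capture_group_one(url: str) -> Optional[str]:
    # match() anchors at the start of the string but does not have to
    # consume all of it, so the trailing "html" is simply left over.
    m = PATTERN.match(url)
    return m.group(1) if m else None


print(capture_group_one("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
```

Because the group excludes dots, a name like "foo.bar.html" yields only "foo".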

Splo    133
I love regex, honestly, that's just so much fun! :)

Yeah, for your problem you may need to use capture groups, which are really useful and powerful.
Example:
http://[^/]+/(.+/)([^/]+)\.(.+)$

Applied to your URL, and using the following string for replacement:
Path: \1 | File: \2 | Extension: \3

The result:
Path: folder/someotherfolder/Guide/ | File: 2562 | Extension: html
So \1 gives what has been captured by the first pair of parentheses, \2 by the second one, etc.

Applied to:
http://mysite.com/path/to/the/target.directory/test/the.result.html
(Watch out for the dots inside the URL!)
Result:
Path: path/to/the/target.directory/test/ | File: the.result | Extension: html
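
The same search-and-replace can be sketched in Python, which also accepts \1, \2, \3 in the replacement string (Python and the names `PATTERN`, `TEMPLATE` and `describe` are assumptions for illustration):

```python
import re

PATTERN = re.compile(r"http://[^/]+/(.+/)([^/]+)\.(.+)$")
TEMPLATE = r"Path: \1 | File: \2 | Extension: \3"


def describe(url: str) -> str:
    # The pattern spans the whole URL, so the substitution replaces the
    # entire string with the filled-in template.
    return PATTERN.sub(TEMPLATE, url)


print(describe("http://kraid/folder/someotherfolder/Guide/2562.html"))
print(describe("http://mysite.com/path/to/the/target.directory/test/the.result.html"))
```

Note that the pattern expects at least one directory between the host and the file name; a URL that does not match is returned unchanged by `sub`.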

Quote:
Original post by hydroo
I cannot do it in one step, but
[0123456789] gets you all numbers - in vim.
It could also be [0123456789]*.

If you use this as the second regexp, it should work - given there is no number in the extension.

[0-9] is faster to type than [0123456789]. :)

[Edited by - Splo on May 16, 2008 3:24:04 AM]

Nypyren    12061
Would a greedy dot solve the leading slash problem?

I've only really used a lot of regexes in .NET, so if this isn't the same format as what you need, I apologize. Also, I'm not sure if the backslash is required inside a character class to escape a dot, either.

.*/(?<capture>[^\\.]+)
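
For comparison, a sketch of the same idea in Python (an assumption; the post targets .NET). Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, and a dot inside a character class is already literal, so no backslash is needed there. The helper name `named_capture` is hypothetical.

```python
import re
from typing import Optional

# Greedy ".*/" runs to the last "/", then the named group takes everything
# up to (but not including) the first "." that follows.
PATTERN = re.compile(r".*/(?P<capture>[^.]+)")


def named_capture(url: str) -> Optional[str]:
    m = PATTERN.match(url)
    return m.group("capture") if m else None


print(named_capture("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
```

As with any group that stops at the first dot, "foo.bar.html" gives "foo" rather than "foo.bar".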

Splo    133
Here's a simpler version that takes care of file names with dots before the extension (such as "foo.bar.html"):
.*/([^/]+)\.[^\.]+$
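
A short sketch of this version wrapped in a helper, assuming Python (the name `file_stem` is hypothetical):

```python
import re
from typing import Optional

STEM = re.compile(r".*/([^/]+)\.[^.]+$")


def file_stem(url: str) -> Optional[str]:
    # Group 1 is the last path segment minus its final ".extension".
    m = STEM.match(url)
    return m.group(1) if m else None


print(file_stem("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
print(file_stem("http://kraid/folder/foo.bar.html"))                     # foo.bar
```

The helper returns None when the pattern does not match, for example when the last segment of "http://kraid/folder/README" has no extension.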
