
Parsing a file name from URL with regex

This topic is 3500 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

If you intended to correct an error in the post then please contact us.


Hi folks, I'm working on something where I have a list of URLs and need to extract the file name without the extension. I don't have enough experience with regular expressions to get exactly what I need, but here is what I have so far:
Given the URL: http://kraid/folder/someotherfolder/Guide/2562.html

[^/]*$      this will find the file name and extension.
\.[^\.]*$   this will find the extension and the period.
I could use both of these and it would probably work, but all I really need is the "2562" from "2562.html". I cannot match against (hard-code) the rest of the URL because I need to reuse this for multiple directories. It would be great if I could match the inverse of [^/]*$ and replace it with "", because then I could just match against the extension and have my magic number. I don't believe this is too difficult to solve, but it's been hard to find a solid resource that provides examples...
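For reference, the two expressions above can be chained in two steps. The following is a minimal sketch, assuming Python's `re` module (the thread does not name a language); the helper name `stem_two_step` is made up for illustration.

```python
import re

def stem_two_step(url):
    # Step 1: [^/]*$ matches everything after the last slash ("2562.html").
    filename = re.search(r"[^/]*$", url).group(0)
    # Step 2: \.[^\.]*$ matches the last dot plus the extension; replace it with "".
    return re.sub(r"\.[^\.]*$", "", filename)

print(stem_two_step("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
```

Because the second pattern is anchored at the end and excludes dots, only the final extension is stripped, so a name like "foo.bar.html" would come back as "foo.bar".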

It's kind of sad. I could easily do it with C string functions.
You just made me want to know regexp a bit better.
(Sorry if I couldn't help you; I am trying at the moment :P)

Edit:
I cannot do it in one step, but
[0123456789] gets you all numbers - in vim.
It could also be [0123456789]*.

If you use this as the second regexp, it should work - given there is no number in the extension.

[Edited by - hydroo on May 15, 2008 1:48:12 PM]

Another way to look at it is I need to match the section between the last occurrence of "/" and the last occurrence of ".". The problem is that they are matched from left to right, so it ends up matching the first "/" and getting everything up to the extension.
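The "between the last / and the last ." description also maps directly onto plain string operations, with no regex involved. This is a minimal sketch of that alternative, assuming Python; the function name is hypothetical and not from the thread.

```python
def stem_between_last_slash_and_dot(url):
    # Everything after the last "/" (the whole string if there is no slash).
    filename = url.rpartition("/")[2]
    # Split at the last "."; rpartition returns an empty separator if none is found.
    stem, dot, _extension = filename.rpartition(".")
    # With no dot, the whole file name is the stem.
    return stem if dot else filename

print(stem_between_last_slash_and_dot(
    "http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
```

Searching from the right sidesteps the left-to-right matching issue entirely, at the cost of not being a single regex.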

I don't know a lot about regex either, but this might help you:

www.regular-expressions.info
PCRE Workbench

If you have a library that gives you not only the matching text but also the so-called "capture groups", you can use the following regex (might be pretty dumb ^^) and take capture group #1, which results in "2562".

.*/([^./]*)\.*


The only capture group in this regex is the part enclosed in parentheses!
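Here is a sketch of reading capture group #1, assuming Python's `re` module as the library (the post does not name one); `first_group` is a hypothetical helper. One thing to be aware of: the trailing `\.*` matches zero or more literal dots, not the extension, so the pattern relies on search-style matching and would fail if the library insisted on matching the whole string.

```python
import re

PATTERN = re.compile(r".*/([^./]*)\.*")

def first_group(url):
    # search() accepts a match that stops before the end of the string.
    match = PATTERN.search(url)
    return match.group(1) if match else None

url = "http://kraid/folder/someotherfolder/Guide/2562.html"
print(first_group(url))         # 2562
print(PATTERN.fullmatch(url))   # None: "html" is left unmatched
```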

I love regex, honestly, that's just so much fun! :)

Yeah, for your problem you may need capture groups, which are really useful and powerful.
Example:
http://[^/]+/(.+/)([^/]+)\.(.+)$

Applied to your URL, and using the following string for replacement:
Path: \1 | File: \2 | Extension: \3

The result:
Path: folder/someotherfolder/Guide/ | File: 2562 | Extension: html
So \1 gives what has been captured by the first pair of parentheses, \2 by the second one, etc.

Applied to:
http://mysite.com/path/to/the/target.directory/test/the.result.html
(Watch out for the dots inside the URL!)
Result:
Path: path/to/the/target.directory/test/ | File: the.result | Extension: html
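The replacement above can be reproduced with Python's `re.sub`, which accepts the same `\1`-style backreferences in the replacement string. This is only a sketch, assuming Python is the tool in use; `describe` is a hypothetical name.

```python
import re

PATTERN = r"http://[^/]+/(.+/)([^/]+)\.(.+)$"
TEMPLATE = r"Path: \1 | File: \2 | Extension: \3"

def describe(url):
    # The pattern spans the whole URL, so the result is just the filled-in template.
    return re.sub(PATTERN, TEMPLATE, url)

print(describe("http://kraid/folder/someotherfolder/Guide/2562.html"))
print(describe("http://mysite.com/path/to/the/target.directory/test/the.result.html"))
```

The file group `([^/]+)` is greedy, so it first grabs "the.result.html" and then backs off to the last dot, which is why the dots earlier in the name end up in the File part.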

Quote:
Original post by hydroo
I cannot do it in one step, but
[0123456789] gets you all numbers - in vim.
It could also be [0123456789]*.

If you use this as the second regexp, it should work - given there is no number in the extension.

[0-9] is faster to type than [0123456789]. :)

[Edited by - Splo on May 16, 2008 3:24:04 AM]

Would a greedy dot solve the leading slash problem?

I've only really used regexes much in .NET, so if this isn't the same syntax as what you need, I apologize. Also, I'm not sure whether the backslash is required inside a character class to escape a dot.

.*/(?<capture>[^\\.]+)
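To see what the greedy dot does here, the following sketch compares it with a lazy one. It assumes Python's `re`, where named groups are written `(?P<name>...)` instead of the .NET form `(?<name>...)`, and where a dot inside a character class is literal without a backslash. The constant names are made up for illustration.

```python
import re

# Greedy: .* runs to the end, then backtracks to the LAST slash.
GREEDY = re.compile(r".*/(?P<capture>[^.]+)")
# Lazy: .*? stops at the FIRST slash, which is the leading slash problem.
LAZY = re.compile(r".*?/(?P<capture>[^.]+)")

url = "http://kraid/folder/someotherfolder/Guide/2562.html"
print(GREEDY.search(url).group("capture"))  # 2562
print(LAZY.search(url).group("capture"))    # /kraid/folder/someotherfolder/Guide/2562
```

So the greedy version does land on the last slash. It still stops at the first dot after that slash, so "foo.bar.html" would yield "foo".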

Here's a simpler version that takes care of file names with dots before the extension (such as "foo.bar.html"):
.*/([^/]+)\.[^\.]+$
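A quick way to check this pattern against a few inputs, again assuming Python's `re`; `stem` and the `STRICT` variant are hypothetical additions, not part of the post. One edge case worth knowing: `[^\.]` still allows "/", so a URL with no extension on a host name that contains a dot can match the host instead. Excluding "/" from the extension class is one possible tightening.

```python
import re

STEM = re.compile(r".*/([^/]+)\.[^\.]+$")
# Hypothetical tightening: the extension may not contain "/" either.
STRICT = re.compile(r".*/([^/]+)\.[^./]+$")

def stem(url, pattern=STEM):
    match = pattern.search(url)
    return match.group(1) if match else None

print(stem("http://kraid/folder/someotherfolder/Guide/2562.html"))  # 2562
print(stem("http://kraid/a/foo.bar.html"))                          # foo.bar
print(stem("http://mysite.com/folder/README"))                      # mysite
print(stem("http://mysite.com/folder/README", STRICT))              # None
```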
