[java] I need some help with regex...

8 comments, last by Oluseyi 16 years, 5 months ago
Hello all,

Alright, since most of you will suspect this, I'll just say it: this is for a homework problem, but regex isn't part of the homework. They want us to parse a file by reading in each line and then going through it character by character. I am trying to do it with regular expressions instead, because that seems like a more elegant solution and a good opportunity to learn about them.

OK, so the file contains info about books in this format:

"The title goes here" [publishing year] [price] "Author's name here"

The quotes are actually there around the title and the author. So, starting with the title, I am doing something like this:

String title = lineScan.next("\".*\"");

where lineScan is a Scanner object for the current line in the file. That expression \".*\" means a string with at least one character between quotes, right? But when I run this, it throws a java.util.InputMismatchException.

It works if I do this, though:

lineScan.useDelimiter("\"");
String title = lineScan.next();

But then I try to change the delimiter back to whitespace like this:

lineScan.useDelimiter("\\s+");

That means any number of spaces, doesn't it? But then it throws a NumberFormatException, apparently because it reads in extra whitespace.

Anyway, can anybody give me some help with this? I have read the Java API docs on regular expressions, the Scanner object, all that, but I could have missed or misunderstood something.

Thanks,
Svenjamin
Ugh, the regex and string escapes...

The regex string you're looking for can be written in a single line. As it happens, the Scanner tutorial gives you the source:
Scanner s = new Scanner(input);
s.findInLine( ??? );
MatchResult result = s.match();
for (int i = 1; i <= result.groupCount(); i++)
    System.out.println(result.group(i));
The ??? is what you must figure out - and yes, it's a single string.

The problem with quoted text is that . matches any character, so the quote apparently gets matched too. Hard to tell, since there's no output. So to match everything within quotes, search for just that - "everything that is not quote".

If you want to capture something, it'll usually be in the form: " (\\d+) " - the part inside () is what you get back as a group, and everything outside of () will be matched literally.

So your final expression will look something like the Scanner example:
"\"(???)\" [(???)] [(???)] \"(???)\"";

??? is for you to find out - those are the regexes that will match the variable parts, I'm telling too much already.
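
A minimal sketch of the capture-group idea from the " (\\d+) " form above: the spaces outside the parentheses must appear literally in the input, and only the digits inside them come back as group 1. The class name, method name and sample strings are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureDemo {
    // Hypothetical helper: returns the digits captured by group 1, or null if nothing matches.
    static String firstNumber(String input) {
        // The leading and trailing spaces are literal; (\\d+) is the captured part.
        Matcher m = Pattern.compile(" (\\d+) ").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("width 42 px")); // 42
        System.out.println(firstNumber("width42px"));   // null, the literal spaces are missing
    }
}
```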


Also - homework may or may not be about exceeding yourself. Sometimes, getting something done on your own matters more, so an elaborate solution you cannot fully explain (and regex is a very annoying topic) might not have the expected results.
Hey, Thanks for the quick reply.
I think I can figure it out now, I didn't even think that the . might include quotes as well. Thanks for pointing that out. Thanks for not giving the solution as well, I will enjoy figuring it out myself.

Also, I appreciate your advice about homework, but this is just an intro to Java course. I have been doing C++ for a few years, so there is nothing new conceptually, and I enjoy trying to make things harder this way. But again, thanks for the advice.

Thanks again for the help,
Svenjamin
Ok, I'm back. I think I've got the expressions I'm going to use figured out, but I'm having some trouble implementing them. I have been testing different expressions using this program.
So to extract the title which is surrounded by quotes I use this expression:

\"[\w[\s]]+\"

And that works like a charm in the test program, but then when I try and write it into my code like this:

String title = input.next("\"[\\w[\\s]]+\"");

I still get the input mismatch exception.
I also tried just printing out the results using the method in the example that you pointed out, but it throws another exception saying that no results were found.
Is there something I am missing when putting the expression into the code?

Thanks,
Svenjamin
I think you have to escape your backslashes as well as the quotes, because \" is also an escape for C strings (which I assume is the same for Java, BTW, sorry about my Java ignorance [smile])

so try:

String title = input.next('\\"[\\w[\\s]]+\\"');

or

String title = input.next("\\\"[\\w[\\s]]+\\\"");

Edit: This post has been heavily modified from the original because I found more problems than the ones I first noticed.
Thanks for the reply Kwizatz,
Unfortunately, the first method you suggested doesn't compile in Java; only one character can go between single quotes. And the second method still throws the exception. Thanks for trying though, I appreciate it.

Svenjamin
Quote:Original post by Kwizatz
I think you have to escape your backslashes as well as the quotes...

No.

The issue is that regex, by default, is greedy. For example:
# this is Perl, because regex is most elegant in Perl [smile]
$str = "\"Ponderous Book Title: Opaque Subtitle\" [2007] [100] \"Sonaiya, Oluseyi\"";
$str =~ /\".*\"/;
print "$&\n";

The output is "Ponderous Book Title: Opaque Subtitle" [2007] [100] "Sonaiya, Oluseyi". The .* grabs as many characters as it can while still letting the rest of the regex match, so it runs all the way to the last quote on the line. What I need to do is tell it to match the smallest valid sequence, that is, make it non-greedy:
# this is Perl, because regex is most elegant in Perl [smile]
$str = "\"Ponderous Book Title: Opaque Subtitle\" [2007] [100] \"Sonaiya, Oluseyi\"";
$str =~ /\".*?\"/;      # NOTE THE QUESTION MARK!
print "$&\n";

The output this time is "Ponderous Book Title: Opaque Subtitle".

Enlightenment beckons (with help, if your Perl-fu is weak).
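
For readers whose Perl-fu is weak, here is a rough Java equivalent of the two Perl snippets above, using java.util.regex directly. It is only a sketch; the class and method names are made up, and the third pattern (a negated character class) is an alternative added for comparison, not something from the Perl code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
    // Same sample line as the Perl example.
    static final String LINE =
        "\"Ponderous Book Title: Opaque Subtitle\" [2007] [100] \"Sonaiya, Oluseyi\"";

    // Hypothetical helper: returns the first substring matching regex, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Greedy: runs from the first quote to the last quote on the line.
        System.out.println(firstMatch("\".*\"", LINE));
        // Reluctant (non-greedy): stops at the first closing quote.
        System.out.println(firstMatch("\".*?\"", LINE));
        // Negated class: "anything that is not a quote" also stops at the first closing quote.
        System.out.println(firstMatch("\"[^\"]*\"", LINE));
    }
}
```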
Alright, what you are saying about greediness makes sense, but I didn't think it would matter, since neither \w nor \s matches the double quote character, right? So wouldn't it terminate anyway as soon as it hits the second quote?

Aside from that, I tried this:

input.next("(\"[\w[\s]]+\"){1}");

but that still didn't work. I'm not sure it is correct syntax though. There seem to be a lot of subtle nuances with regex syntax.

Thanks for the help so far,
Svenjamin

EDIT:

The first title in the file is "The Poky Little Puppy", so I tried doing:

input.next("\"The Poky Little Puppy\"");

And even that threw the exception still. Any insights there?

EDIT 2:
I solved it! (sort of) I gave up on the next() method because it still threw the exception even if I used the exact string for the pattern. Anyway, I used the findInLine(Pattern p) method to extract the different parts.

Thanks for all your help,
Svenjamin

[Edited by - Svenjamin on October 31, 2007 5:17:31 PM]
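
A likely explanation for why next() kept throwing even with the exact title as the pattern: Scanner.next(pattern) first splits the input into tokens using the current delimiter (whitespace by default) and then requires the whole next token to match. The first token of the line is just "The, which can never match a pattern that spans several words. findInLine ignores delimiters, which is why it works. The sketch below illustrates this; the class and method names are made up, and the year, price and author in the sample line are placeholders.

```java
import java.util.InputMismatchException;
import java.util.Scanner;

class TokenDemo {
    // Illustrative line in the thread's format; year, price and author are placeholders.
    static final String LINE = "\"The Poky Little Puppy\" [2000] [10] \"Some Author\"";

    // The first whitespace-delimited token, which is all that next(pattern) looks at.
    static String firstToken(String line) {
        return new Scanner(line).next();
    }

    // True if next(regex) rejects the first token with InputMismatchException.
    static boolean nextThrows(String line, String regex) {
        try {
            new Scanner(line).next(regex);
            return false;
        } catch (InputMismatchException e) {
            return true;
        }
    }

    // findInLine ignores delimiters and searches the rest of the line.
    static String findQuoted(String line) {
        return new Scanner(line).findInLine("\"[^\"]+\"");
    }

    public static void main(String[] args) {
        System.out.println(firstToken(LINE));                                // "The
        System.out.println(nextThrows(LINE, "\"The Poky Little Puppy\""));   // true
        System.out.println(findQuoted(LINE));                                // "The Poky Little Puppy"
    }
}
```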
Quote:Original post by Oluseyi
Quote:Original post by Kwizatz
I think you have to escape your backslashes as well as the quotes...

No.

The issue is that regex, by default, is greedy. For example:
# this is Perl, because regex is most elegant in Perl [smile]
$str = "\"Ponderous Book Title: Opaque Subtitle\" [2007] [100] \"Sonaiya, Oluseyi\"";
$str =~ /\".*\"/;
print "$&\n";


This is why I gave a pretty strong hint in:
Quote:"everything that is not quote"


Since this is solved now, the expression I used for the original problem was:
s.findInLine("\"([^\"]+)\" \\[(\\d+)\\] \\[(\\d+)\\] \"([^\"]+)\"");


The "any character" is implied, so for quoted strings, I simply match "not quote".
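
Put together with the Scanner skeleton from earlier in the thread, the expression above can be exercised roughly as follows. This is a sketch, not the poster's actual program: the class and method names are made up, and the sample line is borrowed from the Perl example. Note that \\d+ only accepts whole-number prices; a price with a decimal point would need a different sub-expression.

```java
import java.util.Scanner;
import java.util.regex.MatchResult;

class BookLineDemo {
    // The expression quoted above: title, year, price, author as groups 1..4.
    static final String BOOK = "\"([^\"]+)\" \\[(\\d+)\\] \\[(\\d+)\\] \"([^\"]+)\"";

    // Hypothetical helper: returns the four captured fields, or null if the line does not match.
    static String[] parse(String line) {
        Scanner s = new Scanner(line);
        if (s.findInLine(BOOK) == null) {
            return null; // no match; calling s.match() here would throw
        }
        MatchResult result = s.match();
        String[] fields = new String[result.groupCount()];
        for (int i = 1; i <= result.groupCount(); i++) {
            fields[i - 1] = result.group(i);
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"Ponderous Book Title: Opaque Subtitle\" [2007] [100] \"Sonaiya, Oluseyi\"";
        for (String field : parse(line)) {
            System.out.println(field);
        }
    }
}
```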
Quote:Original post by Svenjamin
Aside from that, I tried this:

input.next("(\"[\w[\s]]+\"){1}");

You have a lot of superfluous nesting there. \"[\w\s]+\" is sufficient.

Glad you found the execution problem, though. [smile]
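
A quick check of the nesting point, as a sketch with made-up class and method names: in Java's regex syntax [\w[\s]] is a union of two classes and accepts the same characters as [\w\s], and the {1} adds nothing. Either form still rejects titles containing punctuation, which is one reason the "not quote" class [^"] is the more robust choice.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    static final String NESTED  = "(\"[\\w[\\s]]+\"){1}"; // the nested form tried in the thread
    static final String FLAT    = "\"[\\w\\s]+\"";        // the simplified form
    static final String NEGATED = "\"[^\"]+\"";           // "everything that is not a quote"

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String plain = "\"The Poky Little Puppy\"";
        String punctuated = "\"Ponderous Book Title: Opaque Subtitle\"";

        System.out.println(matches(NESTED, plain));        // true
        System.out.println(matches(FLAT, plain));          // true
        System.out.println(matches(FLAT, punctuated));     // false, ':' is neither \w nor \s
        System.out.println(matches(NEGATED, punctuated));  // true
    }
}
```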

