Splitting a string with regex

This topic is 2838 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.


I am using Qt and trying to parse a simple INI file. I want to split each line at the commas, treating all characters inside quotes as a string literal (i.e. like C-style strings). Here are some examples of what my INI file contains:
1234, "text", "more text"
"textWithNoSpaces", "text with spaces, and even a comma"
I would like my parsing function to return an array of strings (QString) containing the text inside and outside the quotes, e.g.:
s1 = "1234"  s2 = "text" s3 = "more text"   // Note that these quotes are not part of the string
s1 = "textWithNoSpaces" s2 = "text with spaces, and even a comma"

The problem is that Qt (rather surprisingly) does not have a simple substring function that takes a start and end index, like std::string's substr. What it does have, however, is a more powerful split function that takes a regular expression. So I would like to know how I can use regex to split each line at the commas while preserving the commas within quotes (and also trimming any extra whitespace around the separating commas). Thanks in advance.

I've never used Qt before, but what you want is something similar to .NET's System.Text.RegularExpressions.Regex.Matches, which returns a collection of the substrings that match a regular expression (or a way to emulate that functionality). Off the top of my head your regex might look something like this:

\s*((\"(.*?)\",?)|(([^\",]+),?))

But I'm no regex expert, so it could be completely wrong (can't even remember if '"' needs to be escaped), and it doesn't handle cases where you want to have escaped quotes be included as part of a token. However the basic idea is that you need to eat whitespace, try to match a quoted or unquoted string, and handle any special escaped characters you want to support. If you can iteratively find each matching substring, you're set.
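The "eat whitespace, match a quoted or unquoted token, repeat" idea can be sketched without Qt. The following is only an illustration using the C++ standard library's std::regex instead of Qt's classes; the function name matchFields and the exact pattern are assumptions made for this example, not something from the posts above, and escaped quotes inside a field are not handled.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the comma-separated fields of one line.
// Quoted fields may contain commas; the quotes themselves are dropped.
std::vector<std::string> matchFields(const std::string& line) {
    // Leading whitespace is eaten first.
    // Group 1: the contents of a quoted field (without the quotes).
    // Group 2: an unquoted field, starting at its first non-space character.
    // A trailing comma is consumed if one is present.
    static const std::regex field(R"re(\s*(?:"([^"]*)"|([^",\s][^",]*))\s*,?)re");

    std::vector<std::string> fields;
    for (std::sregex_iterator it(line.begin(), line.end(), field), end; it != end; ++it) {
        const std::smatch& m = *it;
        if (m[1].matched) {
            fields.push_back(m[1].str());
        } else {
            std::string s = m[2].str();
            // An unquoted field may have picked up trailing blanks; trim them.
            s.erase(s.find_last_not_of(" \t") + 1);
            fields.push_back(s);
        }
    }
    return fields;
}
```

Called on the first example line from the question, this should give the three fields 1234, text and more text. Because the iterator searches for the next match, any stray characters that fit neither alternative are silently skipped, which may or may not be what you want.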

Zipster's regex looks pretty accurate, and if you're interested in learning more about regex this is a site I've used heavily in the past:

http://www.regular-expressions.info/

For your situation, though, regex seems like a bit of overkill. I've never used Qt before, but the first result of a search for "Qt substring" points me to a function called "mid" that is documented as:

"Returns a string that contains the len characters of this string, starting at position i."

which sounds exactly like what you were looking for.

QString::mid

Thanks for the help, Zipster and karwosts (karwosts, I almost missed your post, you posted it the second I hit the reply button [grin]).

@karwosts: The QString::mid function is exactly what I was initially looking for. It is very poorly named, though; mid suggests that it finds the middle index or something.
I will have a look at that website. There is also extensive regex documentation with Qt (they have excellent documentation). I need to learn how to use regex better.

@Zipster: Thanks for the regex, it should get me started. I need to keep looking at the Qt docs to see if there is a function like that .NET one. The split function takes a regex describing where to split (so in my case it would be the comma and the whitespace around it). I'm not sure if I can tell it to split at a comma except when the comma is within quotes. In other words, I think it would be better to try to match each string between the separating commas, if that makes sense.
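For what it's worth, splitting at a comma "except if it's within quotes" can be expressed with a lookahead: a comma counts as a separator only if an even number of quote characters follows it on the rest of the line. Here is a rough sketch of that idea using std::regex from the C++ standard library rather than Qt's split; the function name splitOutsideQuotes is made up for this example, and it assumes the quotes on each line are balanced and that the line has no leading or trailing whitespace.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits a line at commas that lie outside double quotes.
std::vector<std::string> splitOutsideQuotes(const std::string& line) {
    // A separator is a comma plus surrounding whitespace, followed (lookahead)
    // by zero or more complete "..." pairs and then no further quote up to the
    // end of the line. A comma inside quotes is followed by an odd number of
    // quotes, so the lookahead rejects it.
    static const std::regex separator(R"re(\s*,\s*(?=(?:[^"]*"[^"]*")*[^"]*$))re");

    std::vector<std::string> fields;
    // Submatch index -1 asks for the text between the matches, i.e. a split.
    std::sregex_token_iterator it(line.begin(), line.end(), separator, -1), end;
    for (; it != end; ++it) {
        std::string field = it->str();
        // Quoted fields still carry their quotes at this point; strip them.
        if (field.size() >= 2 && field.front() == '"' && field.back() == '"')
            field = field.substr(1, field.size() - 2);
        fields.push_back(field);
    }
    return fields;
}
```

The lookahead rescans the rest of the line for every candidate comma, so this is best kept to short lines. Matching each field directly, as suggested above, avoids that rescanning.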

Anyway, I'll have a look at it tomorrow. If all else fails, I will just write my own little parser. Shouldn't be too hard.

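FYI, I have found out how to get a list of matches using Qt, and have written a regex to do so (partially thanks to the site karwosts posted). It is shown here as it appears inside a C++ string literal, with the backslashes and quotes escaped: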

\\s*((?:\"[^\"]*\")|(?:\\w|\\d)+)\\s*(,|$)


The code gives the string to the regex object then asks it for the text it has matched. Then it moves the index along to find the next match, and loops until no more are found:

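#include <QRegExp>
#include <QString>
#include <QStringList>

// Splits one line into its comma-separated fields; quoted fields may contain commas.
QStringList parseLine(const QString &line) {
    QStringList strings;

    // Capture 1 is the field: either a quoted string or a run of word characters.
    // Capture 2 is what must follow it: a comma or the end of the line.
    QRegExp rx("\\s*((?:\"[^\"]*\")|(?:\\w|\\d)+)\\s*(,|$)");
    int pos = 0;
    while ((pos = rx.indexIn(line, pos)) != -1) {
        QString str = rx.cap(1);
        // The capture includes the surrounding quotes, so strip them.
        if (str.size() >= 2 && str.startsWith('\"') && str.endsWith('\"'))
            str = str.mid(1, str.size() - 2);
        strings << str;
        // Continue searching just past the end of this match.
        pos += rx.matchedLength();
    }

    return strings;
}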


I guess in the end I could have just written a simple parser that loops through each character and checks for commas and quotes, but the regex solution is a lot neater and more robust. It may be slower, but I only need to parse the text once.
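For comparison, the character-by-character alternative could look roughly like the sketch below. It uses std::string and std::vector instead of QString and QStringList so that it stands alone; the name parseLineByHand is invented for this example, and, like the regex version, it makes no attempt to handle escaped quotes.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: splits a line at commas, keeping commas that are inside quotes.
std::vector<std::string> parseLineByHand(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool inQuotes = false;   // true while between an opening and a closing quote
    bool wasQuoted = false;  // true once the current field has contained a quoted part

    // Stores the field collected so far and resets the per-field state.
    auto finishField = [&]() {
        if (!wasQuoted) {
            // Unquoted fields may have collected trailing whitespace; trim it.
            while (!current.empty() &&
                   std::isspace(static_cast<unsigned char>(current.back())))
                current.pop_back();
        }
        fields.push_back(current);
        current.clear();
        wasQuoted = false;
    };

    for (char c : line) {
        if (inQuotes) {
            if (c == '"') inQuotes = false;  // closing quote, not part of the field
            else current += c;               // everything else is literal, commas too
        } else if (c == '"') {
            inQuotes = true;                 // opening quote, not part of the field
            wasQuoted = true;
        } else if (c == ',') {
            finishField();                   // separator outside quotes
        } else if (std::isspace(static_cast<unsigned char>(c))) {
            // Keep whitespace only in the middle of an unquoted field.
            if (!current.empty() && !wasQuoted) current += c;
        } else {
            current += c;
        }
    }
    finishField();  // the last field has no trailing comma
    return fields;
}
```

One difference from the regex version: this sketch always emits the final field, so an empty line or a trailing comma produces an empty string in the result.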
