Regular expressions to tokenize

Started by
5 comments, last by Deflinek 7 years, 5 months ago

Hi everyone.

I hope you can give me a hand with this:

I have a list of tuples like this:


table = [
("aaa", "a", "b", 0),
("aaa", "a", "c", 1),
("aaa", "b", "a", 2),
("aaa", "b", "c", 3),
("aaa", "c", "a", 4),
("aaa", "c", "b", 5),
("aaa", "a", "*", 6),
("aaa", "b", "*", 7),
("aaa", "c", "*", 8),
...
]

More or less like that. It's huge.
Now, the * means one digit.
So, as you can see, this table holds some kind of (not so) regular expressions.
The idea is that if you write "aaa a b" you get 0. If you write "aaa c 1" you get 8.

The program actually works, but I want to change it to use Python regular expressions.

I managed to write the regular expressions to match the strings and keep it in groups:


r'(?P<matcha>aaa[\s]+(a|b|c)[\s]+(a|b|c|[\d]))'

This matches all the tuples in the example table.

My question:

Is there a way to get a specific integer from a match (like the one in the table)?

Or maybe translate the match into the "table-regular-expression-format"?
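For reference, one way the first question could be answered with the re module alone is to give every table row its own named group and read back which group matched. This is only a sketch under assumptions: the helper names (row_to_pattern, lookup) and the group names r0, r1, ... are made up for illustration, and only three sample rows are used. With a huge table the combined pattern would get very large.

```python
import re

# Three sample rows in the thread's table format; "*" stands for one digit.
table = [
    ("aaa", "a", "b", 0),
    ("aaa", "a", "c", 1),
    ("aaa", "c", "*", 8),
]

def row_to_pattern(row):
    """Turn the three text fields of a row into a regex fragment."""
    parts = [r"\d" if field == "*" else re.escape(field) for field in row[:3]]
    return r"\s+".join(parts)

# One named group per row: r0, r1, r8 (hypothetical naming scheme,
# the number after "r" is the row's integer).
combined = re.compile(
    "|".join(f"(?P<r{row[3]}>{row_to_pattern(row)})" for row in table)
)

def lookup(text):
    """Return the integer of the matching row, or None if nothing matches."""
    m = combined.fullmatch(text.strip())
    # Match.lastgroup is the name of the group that matched, e.g. "r8".
    return int(m.lastgroup[1:]) if m else None
```

Here lookup("aaa a b") finds row 0 and lookup("aaa c 1") finds row 8 through the "*" wildcard.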

It looks like some sort of assignment, but why not throw a PLY scanner at it?
That does all the hard work for you.

Otherwise, I am not quite convinced that a RE is a good solution for sequence recognition when you're in a hurry.

I think I'm missing something here.

If you want to retrieve the fourth element of the tuple based on the other three, is there any reason for not using a good ol' dictionary and having "aaa a b" as the key and 0 as the value?
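A minimal sketch of that dictionary idea, assuming the key is simply the three text fields joined by single spaces. The name by_key is made up for illustration, and only a few sample rows are shown.

```python
# A few sample rows in the thread's table format.
table = [
    ("aaa", "a", "b", 0),
    ("aaa", "a", "c", 1),
    ("aaa", "c", "*", 8),
]

# Key: the three text fields joined by spaces. Value: the integer.
by_key = {" ".join(row[:3]): row[3] for row in table}

result = by_key["aaa a b"]        # exact key, direct lookup
missing = by_key.get("aaa c 1")   # None: the "*" wildcard is not handled here
```

Note that a plain dict only finds exact keys, so the "*" rows would still need some extra handling for digits.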

Also, how is a regular expression that matches all the possible strings you are using going to help to retrieve the number associated with a specific string?

Try to avoid regular expressions whenever possible. They are very powerful for what they are designed for (ensuring a text matches a pattern), but they are often overused, hurting performance and readability. Your case is not what regex is for. There is no source text and no pattern to match. You will have problems later trying to extend your solution or debugging bizarre edge cases.

Try to avoid regular expressions whenever possible.

That's an error in the other direction. Use regex when it's the right tool and avoid it when it's not. This is a case where it's definitely not.

I think Avalander is on the right path here with just having an associative container, but I think OP may have a very wrong idea about how regex is typically used.

void hurrrrrrrr() {__asm sub [ebp+4],5;}

There are ten kinds of people in this world: those who understand binary and those who don't.

Ok. Thanks for the advice.

I think I will use regular expressions for matching some of the "generic characters": numbers to *, etc.

I thought I could use re to get the number.

By the way, thanks for helping me see that a dict is far better than a list of tuples.
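That plan could look roughly like this. It is a sketch, assuming a standalone single digit is the only "generic character" and that the dict keys use single spaces. The names by_key, normalize and find are made up for illustration.

```python
import re

# Sample entries, keyed by the table's own format ("*" = one digit).
by_key = {"aaa a b": 0, "aaa a *": 6, "aaa c *": 8}

def normalize(text):
    """Collapse whitespace and replace a standalone single digit with "*"."""
    collapsed = " ".join(text.split())
    return re.sub(r"\b\d\b", "*", collapsed)

def find(text):
    """Return the integer for the input, or None if there is no such key."""
    return by_key.get(normalize(text))
```

So find("aaa c 1") normalizes the input to "aaa c *" and returns 8, while an input like "aaa c 12" is left unchanged and finds nothing.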

That's an error in the other direction


Yes, but I've just seen it too many times. A dev learns about regex, then: "Whoa! Shiny! I can do so many things with that!" And you get abominations like parsing HTML to get the page title, bloated beyond repair to eliminate false positives in headers, comments and JS. Thanks, but no thanks :) I would rather err in this direction: use an old-fashioned search if it's viable, and use regex only when I actually gain something, since it always sacrifices readability.

This topic is closed to new replies.
