Question about parsing

4 comments, last by ApochPiQ 10 years, 4 months ago

What is the best way to approach parsing? Say you take some HTML source, for example, and you're asked to parse out all file sizes, which appear in the format xk, xxk or xxxk (k obviously being kilobytes). How do you best do this without also parsing out similar text that may not be a file size?

I take it there is no 100% bug-free way of doing this. It's about finding common delimiters that allow you to isolate the terms you are trying to find.

For example, you could search through the HTML source and first look for text that ends with the char 'k', then test that there are one or more digits before the k, then find the preceding whitespace to the left of the digits and grab the file size (text). Example HTML:

<script language = "js"> text here 200k more text here.
                                  ^^^^^
                                  |||||
                                 / num \
     (whitespace last) whitespace       k (search for this first)
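
The scan described above could be sketched roughly like this in Python. The function name and the one-to-three digit limit are my own assumptions taken from the xk/xxk/xxxk format; it only checks what comes before the number, so it will still pick up tokens that merely look like file sizes.

```python
DIGITS = "0123456789"


def find_sizes_by_scanning(text):
    """Return tokens such as '200k' found by looking for 'k' first."""
    results = []
    for i, ch in enumerate(text):
        if ch != "k":
            continue
        # Walk left over the digits that sit directly before the 'k'.
        start = i
        while start > 0 and text[start - 1] in DIGITS:
            start -= 1
        digits = text[start:i]
        if not 1 <= len(digits) <= 3:
            continue  # no digits at all, or more than the xxxk format allows
        # Whitespace (or the start of the text) must come before the number.
        if start > 0 and not text[start - 1].isspace():
            continue
        results.append(digits + "k")
    return results


sample = '<script language = "js"> text here 200k more text here.'
print(find_sizes_by_scanning(sample))  # ['200k']
```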

But you could get a case where the xxxk format appears elsewhere in the text and is not a file size, and hence parse data that is not valid.

So what is the best practice for finding delimiters (or common points) where you can isolate the text/data you need to parse out?


You have to have a priority.

For example, quotes (") have the highest priority, followed by comment delimiters (// or #).

My way is to find opening symbols and stop at closing symbols.

The first thing I do is remove comments, but to do that you need to check if you are inside a quote, so ignore anything inside quotes*.

Then I look for a comment opening (#) and its end (a newline), and delete everything in between.
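
A minimal sketch of that comment-stripping pass in Python, assuming # comments that run to the end of the line and double-quoted strings that never span lines. The function and parameter names are made up for illustration, and escaped quotes are not handled here.

```python
def strip_comments(source, comment_char="#", quote_char='"'):
    """Remove comments that run from comment_char to the end of the line."""
    cleaned_lines = []
    for line in source.splitlines():
        in_quotes = False  # assumption: quoted strings do not span lines
        cut = len(line)
        for i, ch in enumerate(line):
            if ch == quote_char:
                in_quotes = not in_quotes  # opening or closing quote
            elif ch == comment_char and not in_quotes:
                cut = i  # comment starts here; drop the rest of the line
                break
        cleaned_lines.append(line[:cut].rstrip())
    return "\n".join(cleaned_lines)


config = 'name = "a#b" # this part is a comment\nsize = 400k'
print(strip_comments(config))
```

The # inside the quoted string survives because the quote check runs before the comment check.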

Then, depending on the format you choose, you will have to create your rules, and the users need to follow them; otherwise it's a syntax error on the user's part.

You mention finding the file size; you have to specify some way to define it, and "any number with a k" is definitely not a good way. Something like <filesize = 400k> would be better. For the header: get the string after < and before =, strip spaces on both ends, and compare it with a string in your code to see if it is "filesize". Then for the value: get the string after = and before >, strip the whitespace, drop the trailing k, and assign it to an int.
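
Those steps might look something like this in Python. This is only a sketch: parse_filesize_tag is a hypothetical name, and accepting an upper-case K is my assumption, not part of the rule above.

```python
def parse_filesize_tag(tag):
    """Return the size in kilobytes from '<filesize = 400k>', or None."""
    open_pos = tag.find("<")
    equals_pos = tag.find("=", open_pos + 1)
    close_pos = tag.find(">", equals_pos + 1)
    if open_pos == -1 or equals_pos == -1 or close_pos == -1:
        return None  # one of the delimiters is missing

    # Header: the text between '<' and '=', with surrounding spaces removed.
    name = tag[open_pos + 1:equals_pos].strip()
    if name != "filesize":
        return None

    # Value: the text between '=' and '>', with surrounding spaces removed.
    value = tag[equals_pos + 1:close_pos].strip()
    if not value.lower().endswith("k"):
        return None
    number = value[:-1].strip()
    if not (number.isascii() and number.isdigit()):
        return None
    return int(number)


print(parse_filesize_tag("<filesize = 400k>"))  # 400
```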

*You also have the case where you want to put quotes inside quotes; then you need a special symbol, like \. The way I do it is to check the previous character when I find a quote, to be sure it's a valid quote. It's simple and serves my needs. The symbol \ is generally used in a pair with another symbol to specify special things.
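
As a rough Python sketch of that quote check: the version below goes one step further than looking at a single previous character and counts the run of backslashes, since a quote after an escaped backslash (\\") would otherwise be misread. That extension and both function names are my own assumptions.

```python
def is_escaped(text, index):
    """True if text[index] is preceded by an odd number of backslashes."""
    backslashes = 0
    pos = index - 1
    while pos >= 0 and text[pos] == "\\":
        backslashes += 1
        pos -= 1
    return backslashes % 2 == 1


def real_quote_positions(text, quote_char='"'):
    """Return the indexes of quotes that actually open or close a string."""
    return [i for i, ch in enumerate(text)
            if ch == quote_char and not is_escaped(text, i)]


print(real_quote_positions(r'say "a \"b\" c" end'))  # [4, 14]
```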

I would use a regular expression such as [0-9]+[kK] and continue modifying the regex until no false positives are included.
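
To illustrate what "continue modifying the regex" could mean in practice, here is a small Python sketch. The tightened pattern is just one possible refinement (one to three digits, not glued to letters or digits on either side), and the sample string is invented.

```python
import re

# Starting point quoted above: any run of digits followed by k or K.
LOOSE = re.compile(r"[0-9]+[kK]")

# One possible tightening (an assumption, not a rule from this thread):
# one to three digits, with no letter or digit directly before or after.
STRICT = re.compile(r"(?<![0-9A-Za-z])[0-9]{1,3}[kK](?![0-9A-Za-z])")

sample = "Download (200k), mirror 15K, id=abc123k9, resolution 4096kpx"

print(LOOSE.findall(sample))   # ['200k', '15K', '123k', '4096k']
print(STRICT.findall(sample))  # ['200k', '15K']
```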

If false positives are too hard to eliminate with a regex, I would likely use a full-blown LR parser.

I would suggest that you put your file size directly in the brackets, like IceBone mentioned (XML format):


something like <filesize = 400k> would be better

If you do it this way for all your specific information, you could use, for example, boost::ptree (Boost.PropertyTree). It would read your complete file, and you could easily access your data without touching the file directly yourself. ptree helped me a lot because I am personally not very good at handling files and the information in them.

But if you don't want to do it this way, you should try a regular expression:

I would use a regular expression such as [0-9]+[kK]

The most reliable and fastest way is to generate a parser using Flex/Bison or a similar alternative.

For languages that already exist, as yours seems to, there should already be grammars available to plug into Flex/Bison to generate the type of parser you need.

L. Spiro


You need to examine the inputs to your code. Trying to bulletproof a parsing or regular-expression matching problem is extremely difficult (if not impossible) if you don't know much about the data you're feeding into it.

What does the input look like? Is it well-formatted HTML? Is it well-formatted XML? i.e. can you use an existing parsing system to narrow down the specific chunks of text you need to examine? Are there consistencies to the "real" needle you're trying to find in the haystack? Are there common inconsistencies?

Basically, you just need to understand what you're going to be looking at. This could be even simpler than a regular expression if the input data is formatted nicely. A fairly straightforward regular expression will handle most other cases, with some false positives that you might be able to tune out (or might not).
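
As one way to "use an existing parsing system to narrow down the chunks", here is a sketch using Python's standard html.parser module: it only looks at text nodes outside script and style elements, then applies a regular expression to those. The class name, the pattern and the sample page are assumptions for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed pattern: one to three digits plus k/K, not glued to letters/digits.
SIZE = re.compile(r"(?<![0-9A-Za-z])[0-9]{1,3}[kK](?![0-9A-Za-z])")


class VisibleTextSizes(HTMLParser):
    """Collect size-like tokens from text nodes, skipping script and style."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.sizes = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never arrive here, only text between tags.
        if self.skip_depth == 0:
            self.sizes.extend(SIZE.findall(data))


page = """
<html><body>
<script>var limit = "64k";</script>
<a href="/files/map_300k.zip">map.zip</a> 200k
<p>Patch notes, 15K</p>
</body></html>
"""
parser = VisibleTextSizes()
parser.feed(page)
parser.close()
print(parser.sizes)  # ['200k', '15K']
```

The 64k inside the script and the 300k inside the href are never seen by the regex, which removes two false positives before any pattern tuning.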

Also, a full-blown parser is ludicrous overkill for this problem in virtually any case. That's kind of like building your own car and racetrack so you can "move" to the refrigerator to get a drink. Yes, cars and racetracks have to do with moving things, and moving your butt to the fridge has to do with moving too. Extracting needles from a text haystack has to do with things that vaguely overlap with parser theory. But they're really not the same at all, and trying to solve one problem with the wrong tool is going to waste a lot of time and effort. It'll probably also be wrong.


This topic is closed to new replies.
