Home » Community » Forums » » Algorithmic Forays Part 6
  Intel sponsors gamedev.net search:   
[Control Panel] [Register] [Bookmarks] [Who's Online] [Active Topics] [Stats] [FAQ] [Search]

Add Forum to Favorites |  Send Topic To a Friend | View Forum FAQ | Track this topic


 Last Thread Next Thread 
 Algorithmic Forays Part 6
Post Reply 
Ok, I understand everything except the 5 states right at the end.

We have A, the start state. We then split that into eps-closure(move(A, a)) and eps-closure(move(A, b)). Those are our two new states, B and C. We then mark A.

Why don't we break B into eps-closure(move(B, a)) and eps-closure(move(B,b)), and do the same with C? What is the operation that creates the new set? In other words, we are generating D and E using move(B, ?) and move(C, ?). Or am I just totally off?

 User Rating: 1025   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

We do break B and C. See the following:

~~~~~~~~
... C is added, unmarked, to dfa_states. Since all the alphabet symbols are done for A, we go back to the outer loop. Are there any unmarked states ? Yes, there are B and C. So we pick B and go over the process again and again, until all the sets that are states of the DFA are marked. ...
~~~~~~~~

If this doesn't answer your question, please be more specific. What is the precise sentence you don't understand ?

 User Rating: 1023   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Hi,

I started a regex parser before I read these articles (and had it working), but I must say that this is so far a nice series. Regular expressions can be confusing at times so you've done a good job of trying to explain it in a bit simpler terms.

I have a question. I was gonna start putting in some features to my regex parser like the RepeatN structure (ie {m,n}) and charsets (ie [x-yz]) and was wondering if you'll consider doing an article or something describing those. For instance, I thought of a coupla ways to do the RepeatN structure. You could have a transition that has some values stating the min/max times it can be travelled and work with that or you could expand the NFA. (ie (a|b){2,4} --> (a|b)(a|b)((a|b)|eps)|(a|b)The second example obviously could become memory-intensive for the complex regexs. For the charset, how is that handled? For something like [a-z] would you have (a|b|c|d|e|f...|y|z) or something else?

If you could point me in the right direction, or give a link, on how to implement these things I would be very grateful!

Thanks,
Porthios

 User Rating: 1020   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by spur4444
We do break B and C. See the following:

~~~~~~~~
... Yes, there are B and C. So we pick B and go over the process again and again, until all the sets that are states of the DFA are marked. ...
~~~~~~~~


Right, but we should have two sets from B (move(B, a) and move(B, b)) and two sets from C (move(C, a) and move(C, b)) right? How do you know that you only need two more sets (D and E) instead of four more, and how are D and E generated (i.e. move(B, ?) and move(C, ?))?

 User Rating: 1025   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by sirSolarius
Right, but we should have two sets from B (move(B, a) and move(B, b)) and two sets from C (move(C, a) and move(C, b)) right? How do you know that you only need two more sets (D and E) instead of four more, and how are D and E generated (i.e. move(B, ?) and move(C, ?))?


Ill try my best to answer your questions, if im wrong, go ahead and correct me! :)

For your first question, the reason there's only two instead of four because we have to check to see if the set created from EpsilonClosure(Move(SET, input)) is already in the DFA table, and if it is, we wont add it to the unmarked set. That means That not every set generated from EpsilonClosure(Move(SET, input)) wil be required.

For your second, I believe they are created like so:
D = EpsilonClosure(Move(C, 'b'))
E = EpsilonClosure(Move(D, 'b'))

The answer to the first question explains why there is no set for EpsilonClosure(Move(C, 'a')) and EpsilonClosure(Move(D, 'a'))

Hope this helps!

 User Rating: 1020   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by sirSolarius
Quote:
Original post by spur4444
We do break B and C. See the following:

~~~~~~~~
... Yes, there are B and C. So we pick B and go over the process again and again, until all the sets that are states of the DFA are marked. ...
~~~~~~~~


Right, but we should have two sets from B (move(B, a) and move(B, b)) and two sets from C (move(C, a) and move(C, b)) right? How do you know that you only need two more sets (D and E) instead of four more, and how are D and E generated (i.e. move(B, ?) and move(C, ?))?


I don't *know* it - it's a result of the algorithm's run :-)

Read the algorithm (the pseudo code) again - and pay attention to the condition:

if U is not in dfa_states
add U to dfa_states, unmarked

When we generate a new U with:

U = eps-closure(N.move(T, i))

We don't just insert it into the DFA states, we first check that it's not already there. If you just go over the algorithm with B and C, you'll easily see this.

As an intuitive hint, note that B contains all states of C, except 5, but 5 leads to no state not already contained in B, so the real *interesting* transition is on 'b' from 8. This gives us set D. Then, again the only *interesting* transition is on 'b' from 9...

 User Rating: 1023   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

Quote:
Original post by porthios
Hi,

I started a regex parser before I read these articles (and had it working), but I must say that this is so far a nice series. Regular expressions can be confusing at times so you've done a good job of trying to explain it in a bit simpler terms.

I have a question. I was gonna start putting in some features to my regex parser like the RepeatN structure (ie {m,n}) and charsets (ie [x-yz]) and was wondering if you'll consider doing an article or something describing those. For instance, I thought of a coupla ways to do the RepeatN structure. You could have a transition that has some values stating the min/max times it can be travelled and work with that or you could expand the NFA. (ie (a|b){2,4} --> (a|b)(a|b)((a|b)|eps)|(a|b)The second example obviously could become memory-intensive for the complex regexs. For the charset, how is that handled? For something like [a-z] would you have (a|b|c|d|e|f...|y|z) or something else?

If you could point me in the right direction, or give a link, on how to implement these things I would be very grateful!

Thanks,
Porthios


As I see it, there are two ways to think about an implementation:

1) A simple way - create something that works. It's the best for educational purposes, and it's the approach I'm taking in Algorithmic Forays.

2) Optimize - various optimizations "over" the simple way. For instance, though the "real" tools (Lex, Perl...) use some variation of the algorithms I've presented, they definitely also have many optimizations and cut-offs, and hand-tailorings, etc. I'm not focusing on this approach.

As for your questions:

A charset is definitely a simple alternation. [a-c] is just (a|b|c).

Repeats can be achieved with eps, similarly to what you presented. For example: t{2,4} is tt(t|eps)(t|eps)

As for links/references, you can get the Mastering Regular Expressions book if you're really into the subject. You can also try your luck with the source code / internal docs of Flex.

 User Rating: 1023   |  Rate This User  Send Private MessageView Profile Report this Post to a Moderator | Link

All times are ET (US)

Post Reply
 Last Thread Next Thread 
Forum Rules:
You may not post new threads
You may post replies
You may not edit your posts
You may not use HTML in your posts
Jump To:
Administrative Options: