# Algorithm to reduce string via rules?

This topic is 2177 days old which is more than the 365 day threshold we allow for new replies. Please post a new topic.

## Recommended Posts

This has to be a really common problem, but I can't for the life of me figure out what it's called in order to google it. Any pointers appreciated :)

In essence, I have an input string along the lines of "ABAAACBAABCBBA", and a set of rules for replacing substrings that look like "AA->B", "BCB->A"... (rules may be arbitrarily long substrings, but always result in a single token output).

Given an input string and a set of rules, I need to figure out (a) the unique order in which to apply the rules (potentially recursively) to arrive at a single token result, or (b) if no such ordering exists, or (c) if more than one such ordering exists.

The input strings may be of arbitrary length, but are generally somewhere between 5 and 30 tokens. Rules are generally short (2-8 tokens), but there may be many thousands of rules, and hundreds of distinct tokens in the grammar. It is however acceptable to pre-process the set of rules rules.

Bonus points for pointers to a solution that runs in reasonable time complexity...

##### Share on other sites

Sort the tokens first, longest length followed by alphabetical for fast matching.

Longest length is important so you match AAA before AA. Unless it's a weird problem where the aim is to find the shortest encoding rather than a left to right encoding (so, like tokenisation)?

EDIT: If it's not tokenisation left to right (i.e. more like compression?) it's going to be pretty complex I reckon.

##### Share on other sites

Sort the tokens first, longest length followed by alphabetical for fast matching.

Sorry, I failed to specify - the order of the input string is important. Or do you mean to sort the rules?

Longest length is important so you match AAA before AA. Unless it's a weird problem where the aim is to find the shortest encoding rather than a left to right encoding (so, like tokenisation)?

Overwhelming goal is to reduce to one token. There are cases where longer matches may not achieve that goal:

input: AAAB
rules: (1) AAA->B, (2) AA->A, (3) AB->B

Applying rule (1) never results in a single token solution, even though a single token solution exists.

For small inputs, brute force exploration of all possible rule substitutions may be feasible. Looking for a method with better performance characteristics given large numbers of rules.

##### Share on other sites

Yeah I meant sort the rules.

It's a hard problem then, it's sounding like a tree search algorithm like AI/game theory stuff...

EDIT: Alvaro needs some kind of bat signal, I bet this sort of thing is right up his street.

##### Share on other sites
You might be able to get away with a tree search algorithm, but this smells suspiciously like it could be NP-hard depending on the inputs and rules.

##### Share on other sites
I would use a GLR parser for this. Swap the left and right sides of the replacement rules and you get grammar productions. With some tweaking you don't need to differentiate between terminals and nonterminals (realize that shifts and gotos in a table-based parser are mechanically identical). GLR will handle the ambiguities.

This particular GLR parser will handle nasty cyclic rules as well:
http://kohsuke.org/compling/tomita/ Edited by Nypyren

##### Share on other sites

What you described is a context free grammar, though the rules are usually written like:

B -> AA
A -> BCB

and you need a start symbol, which in this simple case would look like:

S -> A | B | C

Since its a context free grammar, you have a whole plethora of algorithms to choose from.  For simple grammars linear time parsers can be written of various complexity: LL, SLR, LR, LALR, and various deviations.  More complex grammars (one's that can have recursion and/or ambiguous rules) require more complex parsers: tomita (as mentioned above) as well as CYK chart parser (surprisingly easy to program) and a few others.  Worst case time complexity is cubic.

No point in going into much more detail here, as there is a TON of research on context free grammars (being the most popular and most used).  I personally have written a LR, LL, CYK and most of a tomita parser.  If you need to write a parser from scratch, and it has to handle full generic context free grammars, I'd suggest go with CYK, they are much easier to write than the others I found.

##### Share on other sites

and you need a start symbol, which in this simple case would look like:
S -> A | B | C

Ok, I had the feeling this reduced to a parsing problem, but somehow I failed to make the cognitive leap to make a start production containing all the rules.

Having found time to actually sit down and implement a brute-force solver, this problem isn't as hard as I had automatically assumed. Complexity is cubic allright, but the inputs are small enough that probably doesn't matter much...

#!/usr/bin/env python3

def _solve(input_string, rules, route):
matched = False
for index, rule in enumerate(rules):
if input_string.find(rule[0]) >= 0:
matched = True
next = input_string.replace(rule[0], rule[1], 1)
_solve(next, rules, route + [index])

if len(input_string) == 1:
print('success', input_string, route)
elif not matched:
print('failure', input_string, route)

def solve(input_string, rules):
_solve(input_string, rules, [])

if __name__ == '__main__':
import sys

if len(sys.argv) < 3:
print('usage: %s <input-string> <rule-1> <rule-2> ... <rule-n>')
sys.exit()

input_string = sys.argv[1]
rules = []
for arg in sys.argv[2:]:
rule, substituite = arg.split(':')
rules.append( (rule, substituite) )

print('input: %s, rules: %s' % (input_string, rules))

solve(input_string, rules)


##### Share on other sites

I don't know python, but from what I can see that would be exponential worst case time complexity, not cubic.  Granted you rarely hit worst case, so its probably fine for small inputs/simple grammars.  All generic context free parses essentially do a recursive search of the solution.  But what separates them from the a naive search is that they 'coalesce' identical subtrees.  In the case of CYK they use a pyramid like structure/chart with 'buckets' or cells.  In the case of tomita they use a parallel stack or direct graph.  In both cases the data structures are required to prevent spurious operations, its what limits them to cubic.

Though for what you're using it for I'm sure its fine.

##### Share on other sites

EDIT: Alvaro needs some kind of bat signal, I bet this sort of thing is right up his street.

That was funny. My wife wants you to know I have three children under three, so I am rather busy.

In this case, Nypyren and Ryan have covered the answer very well. I actually only know of one general algorithm to parse a string using an arbitrary context-free grammar, and it runs in time O(n^3), where n is the length of the input. I don't know its name, but it's basically the straight-forward dynamic-programming algorithm you come up with if you try to solve all the problems of the form "what is the set of letters that I can reduce this substring to?" for each substring.

I think I'll write some code while the kids are in nap time. ;)

• 18
• 29
• 11
• 21
• 16