[Math] String matching automata preprocessing

algorithmsautomata

I have an alphabet A = {a,b,c} and a pattern P = "abcaab".

The task is to build a finite automaton of the transition function (delta) for {0,6} (the length of the pattern) and each element of the alphabet A.

My incorrect solution:

enter image description here

New correct solution:

enter image description here

I've highlighted the field that I am having trouble with in yellow. States run along the top and my alphabet along the side.

Although I believe my solution is correct I don't understand how that field is calculated.

The rest seem obvious since if there isn't a match return to the initial state 0. If 'a' is the next input return to 1 unless in states 3 or 4. However whilst in the accepting state if the next input is 'b' I need to return to state 2.

I know this is because 'a' is guaranteed to before the 'b' otherwise state 6 would not have been reached. What I don't know however is how this is pre-calculated in the COMPUTE-TRANSITION-FUNCTION algorithm.

More details can be found on page 996-1002 of Introduction to Algorithms Third Edition.

I used the table above to match strings in the following text:

enter image description here

Best Answer

It might be easiest to think of this as minimizing a NFA that recognizes .*abcaab. The NFA has one state for each position in the search string you could have reached yet, and the states in your DFA will consist of sets of the NFA states. By coincidence (because you're only searching for a fixed string) each of the DFA states can be identified by the highest NFA state it contains, and you'll then only need to keep track of which other states you could also have reached.

Thus, in your state 4 you have already seen something that ends with abca, but that really mean that what you have seen so far ends with abca and a both. So If you see an a in this situation you move to state 5 (abcaa), but if you see a b in state 4, you need to discard abca and work on the NFA state b. So a b in state 4 should lead to state 2 rather than to state 0. Otherwise you won't find the match in abcabcaaba.

Also, if you get to state 6 you have already seen all of abcaab. If you move to state 2 for a b in this situation you will have seen abcaabb with two bs at the end, which is not what state 2 expects. Is that a typo in your transition table? I notice that your example moves to state 3 rather than state 2 here.

Depending on exactly how your automaton formalism works, I suspect that you don't actually want to have a state 6 at all. Instead seeing a b in state 5 should accept and move to state 2 because once you have matched abcaab you still have an ab at the end that could be the start of yet another match.

Alternatively, you might want to move to state 6 and have state 6 be an accepting state with exactly the same transitions out of it as for state 2.

Related Question