Skip to content Skip to sidebar Skip to footer

Non Overlapping Pattern Matching With Gap Constraint In Python

I want to find total no. of non-overlapping matches of a pattern appearing in a sequence, with the gap constraint 2. Eg. 2982f 2982l 2981l is a pattern found using some algorit

Solution 1:

This can be solved elegantly with regex. We just have to convert the pattern into a regex and then count how often that regex matches in the input sequence.

For example, given the input pattern = 'A B C' and max_gap = 2, we want to create regex like

A(arbitrary_word){,2}?B(arbitrary_word){,2}?C

Matching arbitrary words separated by spaces can be done with (?:\S+\s+), so we get:

import re

defcount_matches(pattern, seq, max_gap):
    parts = map(re.escape, pattern.split())
    sep = r'\s+(?:\S+\s+){{,{}}}?'.format(max_gap)
    regex = r'\b{}\b'.format(sep.join(parts))
    returnsum(1for _ in re.finditer(regex, seq))

Test runs:

count_matches('2982f  2982l  2981l', '2982f  2982f  2982l  2982l  2981l', 2)
# result: 1

count_matches('A B C', 'A B D E C B A B A B C', 2)
# result: 2

Solution 2:

I think you've done the hard part. Now just loop through the indices of the first word looking for indexes of the second word that are less than the gap, and so on. Then go back to the indices of the first word and if you found a match last time, skip any indices that fall into that match.

For example, here's the solution with only two words in pt:

i=0while i < len(pt_dic[pt_split[0]]):
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)
    j=0while j < len(pt_dic[pt_split[1]]):
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)if jj > ii and jj <= ii + 2:
            print"Match: (" + str(ii) + "," + str(jj) + ")"# Now that we've found a match, skip indices within that match.
            i = next(x[0] for x inenumerate(pt_dic[pt_split[0]]) if x[1] > jj)
            i -= 1# counteract the increment at the end of the outer loopbreak
        j += 1
    i += 1#print "i=" + str(i)

And with three words in pt:

i=0while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)# Start loop at next index after ii
    j = next(x[0] for x inenumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) andnot match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)if jj > ii and jj <= ii + 2:

            # Start loop at next index after jj
            k = next(x[0] for x inenumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) andnot match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)if kk > jj and kk <= jj + 2:
                    print"Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + ")"# Now that we've found a match, skip indices within that match.
                    i = next(x[0] for x inenumerate(pt_dic[pt_split[0]]) if x[1] > kk)
                    i -= 1# counteract the increment at the end of the outer loop
                    match=True
                k += 1
        j += 1
    i += 1#print "i=" + str(i)

And with four words in pt:

i=0while i < len(pt_dic[pt_split[0]]):
    match=False
    ii = pt_dic[pt_split[0]][i]
    #print "ii=" + str(ii)# Start loop at next index after ii
    j = next(x[0] for x inenumerate(pt_dic[pt_split[1]]) if x[1] > ii)
    while j < len(pt_dic[pt_split[1]]) andnot match:
        jj = pt_dic[pt_split[1]][j]
        #print "jj=" + str(jj)if jj > ii and jj <= ii + 2:

            # Start loop at next index after ii
            k = next(x[0] for x inenumerate(pt_dic[pt_split[2]]) if x[1] > jj)
            while k < len(pt_dic[pt_split[2]]) andnot match:
                kk = pt_dic[pt_split[2]][k]
                #print "kk=" + str(kk)if kk > jj and kk <= jj + 2:

                    # Start loop at next index after kk
                    l = next(x[0] for x inenumerate(pt_dic[pt_split[3]]) if x[1] > kk)
                    while l < len(pt_dic[pt_split[2]]) andnot match:
                        ll = pt_dic[pt_split[3]][l]
                        #print "ll=" + str(ll)if ll > kk and ll <= kk + 2:
                            print"Match: (" + str(ii) + "," + str(jj) + "," + str(kk) + "," + str(ll) + ")"# Now that we've found a match, skip indices within that match.
                            i = next(x[0] for x inenumerate(pt_dic[pt_split[0]]) if x[1] > ll)
                            i -= 1
                            match=True
                        l += 1
            k += 1
        j += 1
    i += 1#print "i=" + str(i)

I think the pattern has been established now, so generalisation to an arbitrary number of words left as exercise for reader!

Post a Comment for "Non Overlapping Pattern Matching With Gap Constraint In Python"