My favorites | Sign in
Logo
                
New issue | Search
for
| Advanced search | Search tips
Issue 1601: match trouble and tutorial
2 people starred this issue and may be notified of changes. Back to list
Status:  Accepted
Owner:  smichr
Type-Enhancement
Priority-Medium
Matching


Sign in to add a comment
 
Reported by smichr, Aug 14, 2009
This is quoted from  issue 1429 :
    a = Wild('a', exclude=[f(x)])
    b = Wild('b')
    p = a*f(x) + b
    eq1 = sin(x)
    eq2 = f(x) + sin(x)
    eq3 = f(x) + x + sin(x)
    eq4 = x + sin(x)
    print eq1.match(p)
    print eq2.match(p)
    print eq3.match(p)
    print eq4.match(p) # This one does not work

to match or not to match...as I talked this over with asmuerer I found this
to be a very challenging thing to understand. It seems like it should be so
easy to understand this pattern matching. I've been successfully greping
for years, but the behavior above lets you know there is maybe a little
more to it than you first think. And if you try to trace the code to see
what is going on...well, be patient and think recursively. So, after
generating all combinations of expression terms and pattern terms and
seeing which ones failed and which ones succeed and after looking at the
code, some clarity that I hope to preserve for posterity has come out of
the process.

Oh, before we get started, there is expr.match(pattern) syntax and
pattern.matches(expression) syntax which I rememeber the order by looing
for the ES in matchES and think of it being near the "ES" or "ESPRESSION"
(as they say in some parts of the world :)

Here is the expression matching algorithm:
If the pattern isn't the same as the expression then
    1) the pattern is split into wild and non-wild parts
    2) the non-wild parts are moved in a manner (consistent with the
pattern type, e.g. subtracted from a pattern that is an Add) to the
expression. (wild parts are also moved if they have been recognized as a
replacement in the process or if they appear in the expression being
tested.) You may still end up with non-wild parts and wild parts combined:
3 + w*x (where w is a wild symbol) will have the 3 subtracted from the
expression but the x will not also be moved, thus the wild pattern that
will be tested is the w*x.
    3) now for each term in the expression--in the order they appear when
printed (which is the reverse order that they are given by args)--each
piece of the wild pattern (in the same order) is checked to see if it
matches the expression. If it does, that wild term in the pattern is
replaced with what it matched and a test is done to see if the resulting
pattern yet matches the expression. If it does, you are done, otherwise you
test the next wild part of the pattern. If you go through all the patterns
without success, you go on to the next part of the expression and repeat
the tests of the wild parts. And finally, if you haven't successfully
converted the pattern to the expression through matched parts then the
match has failed.

So, what will 3 + w*x match? letting m be what it matches, we can write:
3 + w*x = m
w*x = m - 3
w = (m - 3)/x

Now, if there are not restrictions on w, m can be anything. 3+w*x will
match sin(x)+pi giving w = (sin(x)+pi-3)/x. If you were using the pattern
w*x you were probably didn't want an expression like (sin(x)+pi-3)/x coming
backk for w. If not, you have to put and exclusion on w saying what can't
appear in (m-3)/x. e.g. if you make w = Wild('w', exclude=[x]). Then
neither the sin(x) nor the x can be on the right, so m cannot have x in it.
Not x, not sin(x) nor f(x).diff(x) which all give True when tested with "x
in " as "x in sin(x) == True". So what *can* m be? g*x + 3 where g is
anything not containing x. e.g. sin(y) or 2 or (2+1/y), etc...it can be
anything BUT the 3 must be there for without it, the x will not vanish when
the 3 of the pattern is subtracted and the result is divided by x.

So let's do the same thing with the patterns above:

a = Wild('a', exclude=[f(x)])
b = Wild('b')
p = a*f(x) + b
eq1 = sin(x)
eq2 = f(x) + sin(x)
eq3 = f(x) + x + sin(x)
eq4 = x + sin(x)
print eq1.match(p)
print eq2.match(p)
print eq3.match(p)
print eq4.match(p)

let's write the pattern as w + nf*f(x) where w is wild and nf is wild with
the "not f(x)' exclusion.
w + nf*f(x) = sin(x) = eq1
there are no constant parts, so start testing the patterns. In order to
know what to do next you have to know what order the wild parts will come
out of the pattern. Remember that they are processed in the same order as
they appear when printed, so we test the w first:
does w match sin(x)? yes w = sin(x), now
  replace w with sin(x) and check
      nf*f(x) + sin(x) = f(x) + sin(x), now subtract constants...
      nf*f(x) = 0
      nf = 0/f(x) = 0
      and nf matches so 
    nf = 0 and w = sin(x), check
        sin(x) + 0*nf = sin(x) is True and we are done

With less commentary now,
w+nf*f(x) =? f(x) + sin(x) = eq2 [args processed L to R]
    test f(x)
        w =? f(x) YES, replace w with f(x) in pattern; check if equal
            nf*f(x)+f(x) =? f(x) + sin(x)
            nf*f(x) =? sin(x)
            nf =? sin(x)/f(x) NO go to next wild part
        nf*f(x) = f(x)
        nf = 1 YES, replace and check
            w+1*f(x) =? f(x) + sin(x)
            w =? sin(x) YES
                sin(x)+1*f(x) = f(x) + sin(x) YES and done.
                ==> w=sin(x) and nf = 1

w+nf*f(x) =? x + f(x) + sin(x) = eq3 [args processedL to R]
    test x
        w =? x YES
            x+nf*f(x) =? x + f(x) + sin(x)
            nf*f(x) =? f(x)+sin(x)
            nf = (f(x)+sin(x))/f(x); NO by exclusion
        nf*f(x) =? x
        nf =? x/f(x) NO by exclusion
    test f(x)
        w=? f(x) YES
            f(x) + nf*f(x) =? x + f(x) + sin(x)
            nf*f(x) =? x + sin(x)
            nf =? (x + sin(x))/f(x) NO by exclusion
        nf*f(x) =? f(x)
        nf = 1 YES
            w+f(x) = x + f(x) + sin(x)
            w = x + sin(x) YES and done
            w ==> x + sin(x) and nf = 1

w+nf*f(x) =? x + sin(x) = eq4 [args processedL to R]
    test x
        w = x YES
            x+nf*f(x) =? x+sin(x)
            nf*f(x) = sin(x)
            nf = sin(x)/f(x) NO by exclusion
        nf*f(x) = x
        nf = x/f(x) NO
    test sin(x)
        w = sin(x) YES
            sin(x)+nf*f(x) = x + sin(x)
            nf*f(x) =? x
            nf = x/f(x) NO
        nf*f(x) = sin(x)
        nf = sin(x)/f(x) NO
    ==> no match

So if you want to know if a pattern will match something, print the two
quantities 
and process what you see from left to right as follows:

        match(pattern,expr):
            if pattern == expression:
                pattern succeeded
            else:
                move any constants in pattern to expr
                    for term in expr as printed L to R
                        for p in pattern as printed L to R
                            if match(p,term):
                                replace p in pattern with term
                                if match(pattern,expr):
                                    pattern succeeded
                    pattern failed

If you've followed this far you might have noticed that basically, having
an exclude on a quantity means that it will have to appear in the
expression unless some other term will have already knocked it out. This
happened in eq1 with the w knocking out the sin(x) so the f(x) didn't have
to match against anything and the pattern succeeded. But in eq4 the x got
processed first and matched to the w; the expr doesn't go to zero, however
and so an f(x) *had* to be present in the remaining expression (or else it
would be impossible for f(x) to divide the expression and cancel out). So
the pattern failed.

That's tedious to think about. And there's a moral to the story. You
*probably* just want to know if an expr matches a certain form, answering a
question like (A) "is it linear?" or (B) "what are the coefficients on
certain terms?" 

If you are doing (A) you can make your life a lot easier if you just put
and exclusion on *every* symbol to keep it from matching anything that is
in the non-wild parts of the pattern. If your non-wild terms are all
functions of x then just having exclude=[x] will keep wild patterns from
matching anything with x in it. So if you want to know if an expression is
a+b*x...make a and b exclude x, e.g. Wild('a', exclude =[x]) and the same
for b.  This will match 3, a*x, x+3, a*x+3, but not a+x**2 (since there is
no way to knock out bout x's with the pattern as long as constants exclude
x). You can also make sure your pattern succeeds if you use all wild
characters with no exclusions. But if you do, then it is likely that the
wild patterns will "eat" part of your pattern or give you matches where the
non-wild parts of your pattern are on the rhs--you'll have to test for
that. e.g. u_*x will give u_ = x if matched against x**2. 

If you are doing (B) there may be a much faster and simpler way to get what
you want using as_independent(). If you already know the terms that you are
interested in (or a range of terms, say, in a polynomial or differential
equation) then ask for the coefficients of the terms of interest as in:

>>> eq
3 + 2*x + 4*D(f(x), x) + f(x)
>>> eq.coeff(x)
2
>>> eq.coeff(f(x).diff(x))
4

Or you can get them all as in:

>>> for a in eq.args:
...     print a.as_independent(x)
...     
(3, 1)
(4, D(f(x), x))
(1, f(x))
(2, x)

For the expression, 
    eq=3 + x + f(x)+4*f(x).diff(x)
and the pattern with all non-x wilds excluding f(x)
    pa = u+v*x+w*f(x)+z*f(x).diff(x)

A is about 100 x faster than loop B:

A)
    for a in eq.args:
        a.as_independent(x)
B)
    ok=eq.match(pa)



>>> eq.as_base_exp()
(3 + 2*x + 4*D(f(x), x) + f(x), 1)
>>> eq.as_basic()
3 + 2*x + 4*D(f(x), x) + f(x)
>>> eq.as_coeff_exponent(f(x))
(3 + 2*x + 5*f(x), 0)
>>> eq.as_coeff_exponent((x))
(3 + 2*x + 4*D(f(x), x) + f(x), 0)
>>> eq.as_coeff_exponent(3)
(3 + 2*x + 4*D(f(x), x) + f(x), 0)
>>> eq.as_coeff_factors()
(3, (4*D(f(x), x), f(x), 2*x))
>>> eq.as_coeff_terms()
(1, (3 + 2*x + 4*D(f(x), x) + f(x),))
>>> eq.as_coefficient(x)
>>> eq.as_coefficient(f(x))
>>> eq.as_coefficient(S(2))
>>> eq.as_independent(x)
(3, 2*x + 4*D(f(x), x) + f(x))
>>> eq.as_independent(f(x))
(3 + 2*x, 4*D(f(x), x) + f(x))
>>> eq.as_leading_term()
3 + 2*x + 4*D(f(x), x) + f(x)
>>> eq.as_powers_dict()
{3 + 2*x + 4*D(f(x), x) + f(x): 1}
>>> eq.coeff(f(x))
5
>>> eq.coeff((x))
2
>>> eq
3 + 2*x + 4*D(f(x), x) + f(x)

Perhaps there is interest in fleshing out part of the documentation with
modifications of the above.

/c
Comment 1 by asmeurer, Aug 14, 2009
Yep, this is pretty much what I saw when I stepped through it in the debugger.  It seems like there should be a 
way for the pattern to check to see if it can still match the x to b after matching sin(x) in (4), instead of just 
throwing it out.  Thanks for pointing out .as_independent().  I was writing my own function that did that for 
another part of my code.

Also, I think args are ordered based on hash values, whereas printing is based on Basic._compare_pretty.  So they 
are not necessarily reverse of each other (args ordering can be different on different machines that compute 
hashes differently).  

Maybe you should convert this into a Sphinx doc and put it in the docs.  It would have been nice to have 
whenever I was trying to figure out how match works.  

I searched for papers on matching mathematical expressions and found:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.139

Maple's docs don't cite any papers for their pattern matching.  

I wonder, how do other open-source CAS's do it?  
Comment 2 by smichr, Aug 15, 2009
That's useful to know about the ordering issue...just to be precise, the terms of the
pattern and the expression are procesed in reverse order of how they come out of
args, so,
  for w in reversed(pattern.args):
    etc...

As far as whether to match that remaining x after matching the sin(x)...that's, I
believe, the same sort of issue you have with greedy or non-greedy regex. If the
first pattern takes the maximal bite then there may be nothing left for other parts
of the pattern. I think that's why it goes term-wise.

/c
Comment 3 by asmeurer, Nov 21, 2009
This also gives an interesting approach based on series expansion:
http://www.inf.ethz.ch/personal/gonnet/CAII/HeuristicAlgorithms/node31.html
Comment 4 by Ronan.L...@gmail.com, Dec 02 (6 days ago)
(No comment was entered for this change.)
Labels: Matching
Sign in to add a comment

Hosted by Google Code