
google-highly-open-participation-psf - issue #323

Add support for the 're' module using PCRE


Posted on Jan 9, 2008 by Helpful Giraffe

Download and install Shed Skin, and read the included README for usage instructions:

http://shedskin.sourceforge.net

Especially read the part about how to implement libraries. Have a look at the lib/ directory for examples of several standard library module implementations.

Download and install PCRE, the Perl-Compatible Regular Expression library:

http://www.pcre.org

Using PCRE, add support for basic 're' functionality (i.e., add lib/re.py and lib/re.?pp), and illustrate the compatibility between 're' and PCRE by compiling a test program that uses most basic 're' features.

You'll need to have experience with C to pull this off!

Completion:

Submit a patch as an attachment to this ticket.

Task duration: please complete this task within 5 days (120 hours) of claiming it.

Comment #1

Posted on Jan 9, 2008 by Happy Camel

Comment deleted

Comment #2

Posted on Jan 9, 2008 by Grumpy Rhino

I claim this task.

Comment #3

Posted on Jan 9, 2008 by Helpful Giraffe

(No comment was entered for this change.)

Comment #4

Posted on Jan 9, 2008 by Happy Camel

good luck! ;)

Comment #5

Posted on Jan 9, 2008 by Grumpy Rhino

I've got a working version down (with windows, too [!]) which supports the following methods:

match_object::expand
match_object::group
match_object::start
match_object::end
re_object::match
re_object::search
compile
match
search
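For readers who want to see what those methods look like from the Python side, here is a minimal sketch that exercises each of them with CPython's standard `re` module. The sample string and pattern are made up for illustration and are not taken from the patch or its test script.

```python
import re

# Hypothetical sample data; not from the patch or its test script.
text = "user: bob, pass: secret"

# compile returns a pattern object (re_object in the list above).
pat = re.compile(r"(\w+): (\w+)")

# re_object.match only succeeds at the start of the string.
m = pat.match(text)
first = (m.group(1), m.group(2), m.start(), m.end())

# re_object.search scans forward, here starting where the first match ended.
s = pat.search(text, m.end())
second = (s.group(0), s.start(1), s.end(2))

# match_object.expand substitutes groups into a template string.
expanded = m.expand(r"\2 is the \1")

# Module-level match and search take the pattern as a string.
top_match = re.match(r"\w+", text).group()
top_search = re.search(r"pass: (\w+)", text).group(1)

print(first, second, expanded, top_match, top_search)
```

Running the same script under CPython and as a compiled binary, then diffing the output, is one way to check compatibility.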

Comment #6

Posted on Jan 9, 2008 by Grumpy Rhino

(so essentially I've added support for expand, start and end, and got it working with windows finally :-P)

Comment #7

Posted on Jan 9, 2008 by Happy Camel

good work!

did you implement expand yourself, or did you convert a python version somehow? I don't know how difficult it is to build the other methods on pcre, but perhaps it's useful to use this approach in one or more cases..?

have you seen anything that might be particularly hard to add, or any incompatibilities between re and pcre?

Comment #8

Posted on Jan 9, 2008 by Grumpy Rhino

so far it appears to me that pcre supports everything python does

I made the expand function myself; it was pretty simple, and even simpler with C++ strings. friday I'll implement at least split and findall via the pcre callout functionality (can't guarantee anything tomorrow b/c we've got exams)

Comment #9

Posted on Jan 11, 2008 by Grumpy Rhino

I've added split, but I have a question about findall. the documentation says that in the case of multiple captured subpatterns it returns a list of tuples. I checked and didn't see anything, so I was wondering if it's possible with your library to create + populate a tuple2 object when the size isn't known before compilation (so it'd be constant, just unknown). maybe like a setitem function or something?

Comment #10

Posted on Jan 11, 2008 by Happy Camel

sure, you can just create a 'new tuple2()', and then add things to the underlying STL vector (if x is a pointer to the object, use x->units.push_back(..))

but the problem here seems to be that the return type of findall is dynamic! sometimes it's a list of strings, sometimes a list of tuples of string :P so it cannot be supported at all..

maybe it would be best to throw an exception here when you detect there are multiple groups? it's always possible to use finditer and then match.group(x), so users can easily code around it..
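The dynamic return type is easy to see in CPython itself. The sketch below uses two sample e-mail addresses and illustrative patterns (the patterns are an assumption, not taken from any attached file) to show how the element type of findall changes with the number of capturing groups, and how finditer sidesteps that.

```python
import re

text = "BoB@gmaiL.com sally123_43.d@hOtmail.co.uk"

# No groups: a list of strings (the whole match).
no_groups = re.findall(r"[\w.]+@[\w.]+", text)

# One group: still a list of strings (the group's text).
one_group = re.findall(r"([\w.]+)@[\w.]+", text)

# Two or more groups: a list of tuples of strings.
two_groups = re.findall(r"([\w.]+)@([\w.]+)", text)

# finditer always yields match objects, so the element type never changes;
# the caller picks groups explicitly.
pairs = [(m.group(1), m.group(2))
         for m in re.finditer(r"([\w.]+)@([\w.]+)", text)]

print(no_groups)
print(one_group)
print(two_groups)
print(pairs == two_groups)
```

Because the finditer version has a single static type, it is the form a statically typed implementation can support without special cases.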

Comment #11

Posted on Jan 11, 2008 by Happy Camel

tuple2<str *, str *> *t = new tuple2<str *, str *>();
t->units.push_back(new str("blah"));
t->units.push_back(new str("bleh"));

Comment #12

Posted on Jan 11, 2008 by Happy Camel

actually, with just an exception there will still be possible type inference problems, if the user expects tuples. maybe it's better if you skip findall for now?

Comment #13

Posted on Jan 11, 2008 by Grumpy Rhino

yeah, heh. somehow I had figured with c++'s function overloading there was a way around that. guess things like that are what happens when I'm really tired... :-P

so I've done everything in re_object except findall, and there're a few things in match_object which I haven't done yet. the relevant files are attached.

Attachments

Comment #14

Posted on Jan 12, 2008 by Happy Camel

great, good work! one thing I was worrying about was that re.cpp would become rather big, but clearly that hasn't happened.. :)

if arguments to a type model are dynamic, but not the return type, their types don't interfere with the result and we can often use a manual C++ template function. (I just did this for getopt.gnu_getopt, and there are a few other examples).

what would you say if we assume findall returns a list of strings, and have the compiler always give a warning when it is used (warning assuming re.findall returns list of strings (use re.finditer for multiple groups)), and we also throw an exception when there are multiple capturing patterns? so it will work fine in most cases, and just generate a warning.
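A rough Python model of the runtime half of that proposal is below. The helper name `findall_strings` and the choice of `ValueError` are assumptions for illustration only; the compile-time warning is not modeled, and nothing here reflects how the actual patch is written.

```python
import re


def findall_strings(pattern, text):
    """Hypothetical helper: findall that always returns a list of strings.

    Raises ValueError when the pattern has more than one capturing group,
    since findall would then return tuples instead of strings.
    """
    compiled = re.compile(pattern)
    # Pattern.groups is the number of capturing groups in the pattern.
    if compiled.groups > 1:
        raise ValueError("multiple capturing groups; use re.finditer instead")
    return compiled.findall(text)


print(findall_strings(r"\d+", "a1 b22 c333"))
print(findall_strings(r"(\d)\d*", "a1 b22"))
```

With zero or one group the result is a plain list of strings, so the common cases keep working and only the multi-group case is rejected.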

Comment #15

Posted on Jan 13, 2008 by Grumpy Rhino

I think this is everything...

I added the flag -lpcre.dll in LFLAGS field of FLAGS, but I'm not sure if that's different for linux systems...

Attachments

Comment #16

Posted on Jan 13, 2008 by Grumpy Rhino

oh yeah, I noticed that your string concatenation function __add_strs assumes all its parameters are non-null, which leads to a crash if one of them does happen to be a 'None' type. you probably want to throw an exception (like python does) if it encounters a null-type or something.
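For reference, this is the CPython behavior being compared against: concatenating a string with None raises TypeError rather than crashing. The `concat` helper is just a stand-in for illustration and has nothing to do with the internals of __add_strs.

```python
def concat(a, b):
    # Stand-in for string concatenation; hypothetical, for illustration only.
    return a + b


try:
    concat("abc", None)
    outcome = "no error"
except TypeError:
    # CPython refuses to concatenate str and NoneType.
    outcome = "TypeError"

print(outcome)
```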

Comment #17

Posted on Jan 13, 2008 by Happy Camel

I'm impressed, and marking your task as completed. I'll test your patch later this week and commit it to SVN. btw, do you have a test program as well? I'd also like to add that to unit.py..

yes, there are probably other places where checks should be added.. it's good practice to test code with CPython first, and only compile it as a last step, but if checks are easy and don't take much time, they should be added.

of course now I have to ask - would you be interested in doing another SS task, or in staying with the project after the GHOP? I could surely use your help!

Comment #18

Posted on Jan 13, 2008 by Happy Camel

oh, could you please document the steps it took to get pcre working under windows?

Comment #19

Posted on Jan 13, 2008 by Happy Camel

I had a quick scan of re.py. shouldn't re_object.groupindex have a type (e.g., dict of string to int)?

Comment #20

Posted on Jan 13, 2008 by Grumpy Rhino

er, yeah, guess so :-P

I managed to already find a bug and something I forgot to implement (groupdict and groups), so I'll reupload in a tiny bit

Comment #21

Posted on Jan 13, 2008 by Grumpy Rhino

think everything's fixed now...

I've also attached a test script which I whipped up, but you're welcome to change it if you want

for compilation on windows I put libpcre.dll.a in the /lib directory (the top level one), libpcre-0.dll in a place accessible by the compiled application (eg. in the same directory as it), and added -lpcre.dll to LFLAGS in FLAGS

I might be up to one more task, but preferably one that's not so big. I have some other things I want to work on, exams the rest of this week then a short ski trip during the weekend, so my schedule is sort of full :-P I could help out some with development, but as a warning I don't really like commitments, and a lot of the time I have a project of mine that needs work as well.

Attachments

Comment #22

Posted on Jan 13, 2008 by Happy Camel

I tried it here under Ubuntu, using -lpcre, and it seems to work fine. but when I diff the python and compiled versions, I get a few minor differences:

< ['BoB@gmaiL.com', 'sally123_43.d@hOtmail.co.uk']
> [('BoB', 'gmaiL.com'), ('sally123_43.d', 'hOtmail.co.uk')]
10d9
< user: bob
13a13
> user: bob
17,19d16
< pass: hoho
< user: haha
< path: /files
21a19,21
> user: haha
> pass: hoho
> path: /files

Comment #23

Posted on Jan 13, 2008 by Happy Camel

btw, where did you get these dlls, or how did you make them?

Comment #24

Posted on Jan 13, 2008 by Happy Camel

hmm. how about adding support for the time module, or part of it? if I'm correct, this is mostly a wrapper around standard C calls.

Comment #25

Posted on Jan 13, 2008 by Grumpy Rhino

lol, that first difference is b/c of our findall solution :-P it's actually supposed to return an array of tuples (as you see in the second line), but since dynamic return types aren't really feasible it only returns an array of strings (the first line). the other differences are simply because the dict iterator traversed the dictionary in a different order from python, which is an utter non-issue in my book b/c dictionaries are associative not ordered anyhow.
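If the ordering differences get in the way of diffing the two outputs, one option is to have the test script iterate over sorted keys. A minimal sketch, assuming the script prints key/value pairs from a dictionary (the dict below is a stand-in; the real test script's contents aren't shown in this thread):

```python
# Stand-in for a dictionary such as the result of match.groupdict().
fields = {"user": "haha", "pass": "hoho", "path": "/files"}

# Sorting the keys makes the output independent of dict iteration order,
# so interpreted and compiled runs print the lines in the same sequence.
lines = ["%s: %s" % (key, fields[key]) for key in sorted(fields)]

for line in lines:
    print(line)
```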

and I compiled the libraries (it's actually a static library, libpcre.dll.a, and a dynamic one, libpcre-0.dll) with mingw. I originally tried to link it statically (with libpcre.a), but I wasn't able to get that to work so I fell back to dynamic linking...

Comment #26

Posted on Jan 13, 2008 by Grumpy Rhino

the time module might work, but I wouldn't be able to guarantee a starting time (heh)

Comment #27

Posted on Jan 13, 2008 by Happy Camel

okay, good! thanks.

I am planning on requesting a few more tasks, including one for the time module. I'll let you know if/when they come online.

Status: Completed

Labels:
C thirdparty coding shedskin ClaimedBy-SirNot Due-20080114.1000