google-highly-open-participation-psf - issue #341
Add support for fnmatch and glob to Shed Skin
Download and install Shed Skin, and read the included README for usage instructions:
http://shedskin.sourceforge.net
Especially read the part about how to implement libraries. Have a look at the lib/ directory for examples of several standard library module implementations.
Add support for fnmatch and glob, by compiling glob.py and fnmatch.py (as used by CPython). This involves changing them a bit so they can be compiled.
Completion:
Submit a patch for each module as an attachment to this ticket.
Task duration: please complete this task within 5 days (120 hours) of claiming it.
Comment #1
Posted on Jan 21, 2008 by Grumpy Rhino
I claim this task
Comment #2
Posted on Jan 21, 2008 by Happy Bear
(No comment was entered for this change.)
Comment #3
Posted on Jan 21, 2008 by Happy Camel
both use 're' iirc, so you can use your own re module.. :-)
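For context on the 're' dependency: CPython's fnmatch turns a shell-style pattern into a regular expression and matches it with re. A minimal sketch using the standard-library fnmatch.translate (the pattern and the file names are made up for illustration):

```python
import fnmatch
import re

# Translate a shell-style pattern into a regular expression string.
regex = fnmatch.translate("*.py")
compiled = re.compile(regex)

# Illustrative file names only; nothing is read from disk here.
names = ["glob.py", "fnmatch.py", "README"]
matches = [n for n in names if compiled.match(n)]
print(matches)
```

This is why a working re module is enough to get fnmatch compiled with few changes.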
Comment #4
Posted on Jan 22, 2008 by Grumpy Rhino
hmm, so it looks like posix-compliant systems have built-in support for both glob and fnmatch, but apparently windows doesn't... shall I just write them from scratch? (it'd be more interesting, anyhow)
Comment #5
Posted on Jan 22, 2008 by Happy Camel
why not take glob.py and fnmatch.py, as used by python..? (see attachment!)
- glob.py 1.96KB
- fnmatch.py 2.95KB
Comment #6
Posted on Jan 23, 2008 by Grumpy Rhino
I've implemented a working prototype in C, which I would attach here but it seems that I've exceeded my attachment storage quota... (huh, didn't expect that'd happen) you can find it here instead: http://sirnot.110mb.com/c/glob.c
it's only ~4x faster than the python (///* took ~0.14s vs. ~0.64s), which doesn't seem too great to me, but I'm not sure if that's a consequence of the extensive file enumerations (i.e. lots of system calls), or just a crappy algorithm on my part. I'd appreciate anyone's opinion on it before I start integrating it into shedskin.
Comment #7
Posted on Jan 23, 2008 by Grumpy Rhino
wait, scratch that. I guess I sort of overthought it; a recursive implementation seems better in the end...
Comment #8
Posted on Jan 23, 2008 by Grumpy Rhino
ok there we are. still the same performance, though...
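To illustrate what a recursive glob can look like, here is a rough Python sketch. It is not the C prototype linked above and not CPython's glob.py; the function name simple_glob and the demo directory layout are hypothetical. It matches one path component per recursion level with fnmatch, and unlike the real glob it does not treat dot-files specially.

```python
import fnmatch
import os
import tempfile

def simple_glob(root, parts):
    """Recursively match the pattern components in `parts` under `root`."""
    if not parts:
        return [root]
    head, rest = parts[0], parts[1:]
    results = []
    try:
        names = sorted(os.listdir(root))
    except OSError:
        return results  # root is not a directory, or is not readable
    for name in names:
        if fnmatch.fnmatch(name, head):
            # Descend one level with the remaining pattern components.
            results.extend(simple_glob(os.path.join(root, name), rest))
    return results

# Demo on a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("a/x.py", "a/y.txt", "b/z.py"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = simple_glob(tmp, ["*", "*.py"])
    relative = [os.path.relpath(p, tmp).replace(os.sep, "/") for p in found]
print(relative)
```

Each level costs one os.listdir call per matching directory, which fits the observation that much of the run time may go to file enumeration rather than to the matching itself.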
Comment #9
Posted on Jan 23, 2008 by Happy Camel
your C coding is impressive, but the assignment is not about coding.. :-) for several reasons, compiling modules from their original source is highly preferable:
-it's always good to 'eat your own dogfood'
-it minimizes the chance of bugs and discrepancies with CPython
-it helps locate bugs and potential optimizations in shed skin
-'autogenerate whenever you can' - I already have 20,000 lines of code to maintain :)
I appreciate your efforts, but I just don't think it is a good approach to code support for these modules manually. if you really want to, you can complete your task this way, but I will probably replace your code later on with autogenerated code..
Comment #10
Posted on Jan 23, 2008 by Grumpy Rhino
*grumbles incoherently....
alright, I'll try compiling it with shedskin...
some bugs that I've encountered with the path module implementation (so far):
- OSError is not declared with errno attribute (actually should be derived from EnvironmentError), yet such attribute is used in makedirs of init.cpp
- getmtime is prototyped as returning a double yet in the windows implementation returns an integer
- something wacky's happening with mkdir; for some reason it's prototyped in io.h as having only one argument...
Comment #11
Posted on Jan 23, 2008 by Happy Camel
:-) please keep a list of all problems you encounter.
about the problems you mention here:
-OSError has an __ss_errno attribute instead of an errno attribute. to avoid really annoying problems, I prefix every identifier that might be a C macro on some platform with __ss_. this has no effect on Python code (e.g., 'e.errno' becomes translated as 'e.__ss_errno')
-I think getmtime returns whatever type os.stat_float_times indicates.. I tried stat_float_times on Windows and it said 'float'. anyway, to avoid dynamic typing, shed skin has to choose either floats or ints here..
-yes, I saw the argument problem with mkdir under windows.. please let me know if you have a better solution.
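The __ss_ prefixing described above could be sketched roughly as follows. This is only an illustration of the idea, not Shed Skin's actual code: the function name mangle and the contents of MACRO_PRONE are hypothetical.

```python
# Hypothetical set of identifiers that may be C macros on some platform.
MACRO_PRONE = {"errno", "stdin", "stdout", "stderr"}

def mangle(name):
    """Return the name to emit in C++, prefixing macro-prone identifiers."""
    if name in MACRO_PRONE:
        return "__ss_" + name
    return name

# 'e.errno' in Python source would be emitted as 'e.__ss_errno' in C++.
print("e." + mangle("errno"))
# Ordinary identifiers pass through unchanged.
print(mangle("basename"))
```

Because the renaming happens only in the generated C++, Python code keeps using the original attribute name.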
Comment #12
Posted on Jan 24, 2008 by Grumpy Rhino
so the compiled glob is a bit better than python -- for //// it ran at a tad more than ~3/4 the speed -- but is still at least 3 times slower than my 'hand-coded' version...
the relevant files are at http://sirnot.110mb.com/c/fnglob.zip (again thanks to my quota)
Comment #13
Posted on Jan 24, 2008 by Happy Camel
good work! task completed, code committed to SVN. you didn't run into any bugs?? well, it's not a bug, but I notice that the generator class should have its attributes grouped by type..
you don't think it's extremely cool to use your own re module and a python-to-C++ compiler to bootstrap a few more modules..?? :-)
I don't think speed is really important for these modules. but of course shed skin should generate as fast code as possible. do you see anything that shed skin could do to improve performance?
Comment #14
Posted on Jan 24, 2008 by Grumpy Rhino
haha, well it was sort of neat when it worked, but it still feels like cheating :-P
nothing immediately strikes me as easily-optimizable (although I'm not looking really hard). I think the main problem as far as optimization is concerned is that, in directly translating python to C++, you're effectively creating code which approaches a problem in C++ with a python mindset. obviously, considering the level-gap, the best approach in python isn't often going to be the best method in C/C++.
one thing I did notice though (which isn't a huge deal at the moment) is your variable usage. instead of reusing temporary variables it looks like you're just making new ones, which could pile up in bigger programs. for example, in the following section of compiled code from glob:
    if (has_magic(basename)) {
        FOR_IN(dirname,dirs,5)
            FOR_IN_SEQ(name,glob1(dirname, basename),7,9)
                //bla bla bla
            END_FOR
        END_FOR
    } else {
        FOR_IN(dirname,dirs,11)
            FOR_IN_SEQ(name,glob0(dirname, basename),13,15)
                //bla bla bla
            END_FOR
        END_FOR
    }
maybe during compilation you could have a stack of unused variables of each type, and only make new ones if there isn't one available?
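The suggestion could be sketched like this. It is a hypothetical illustration only: the class TempPool, its methods, and the "__N" naming are assumptions made for the example and are not taken from Shed Skin's code generator.

```python
from collections import defaultdict

class TempPool:
    """Hypothetical pool of temporary variable names, keyed by type."""

    def __init__(self):
        self._free = defaultdict(list)  # type name -> stack of unused temporaries
        self._count = 0
        self.declared = []              # (type, name) pairs to declare once

    def acquire(self, type_name):
        # Reuse an unused temporary of this type if one is available.
        if self._free[type_name]:
            return self._free[type_name].pop()
        # Otherwise create (and record) a new one.
        name = "__%d" % self._count
        self._count += 1
        self.declared.append((type_name, name))
        return name

    def release(self, type_name, name):
        # Called when the construct that used the temporary has ended.
        self._free[type_name].append(name)

pool = TempPool()
a = pool.acquire("int")  # temporary for the loop in the 'if' branch
pool.release("int", a)   # that loop has ended
b = pool.acquire("int")  # the loop in the 'else' branch reuses it
print(a, b, pool.declared)
```

With a pool like this, the two branches in the snippet above could share their loop temporaries instead of declaring separate ones.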
Comment #15
Posted on Jan 24, 2008 by Grumpy Rhino
I reuploaded a better version in which I went through and removed all the unneeded temporary variables (there were actually a lot of them declared that weren't even used)
Comment #16
Posted on Jan 25, 2008 by Happy Camel
sure, manual C is always faster. but I'm much more interested in closing the performance gap between Python and 'badly written' C++ than between C++ and manual C.. of course you can always change your style a bit so translation to C++ becomes more effective. the nice thing now is that you are still using pure Python!
about cleaning up the code - this work will also disappear when I regenerate fnmatch/glob at some point (e.g. because of adding some optimization to shed skin). and don't you think C++ compilers already do optimizations like this..?
Comment #17
Posted on Jan 25, 2008 by Grumpy Rhino
it's not so much that the C++ is 'badly written', it's that in directly translating python to C++, you often get suboptimal implementations. E.g. with glob, translating it literally yielded the needless creation of several linked lists. I had figured you might as well want to optimize the libraries, but if you'd rather save work and compile them all that's fine as well.
Comment #18
Posted on Jan 25, 2008 by Happy Camel
I don't think speed is very important for these two modules.. so I prefer to have them in a more maintainable state, at least for now. but of course I am somewhat biased towards autogenerating! :-)
btw, Python lists are not implemented as linked lists, but as STL vectors (practically as efficient as C arrays).
if you'd like to help out more with Shed Skin, please let me know. I can probably come up with a few more nice tasks.. :)
Comment #19
Posted on Jan 26, 2008 by Happy Camel
something that has a lot of manual C and should be extremely efficient is lib/builtin.?pp. if you're interested, you can probably find lots in there that can still be optimized further.. see the shedskin-discuss thread called 'performance of different loop styles'.. :)
Status: Completed
Labels:
C
thirdparty
coding
shedskin
ClaimedBy-SirNot
Due-20080126.2100