I often need to make line-by-line modifications to large files (1,000,000 lines or more), but pyp is not currently suited to this as it reads in the complete input before producing output. Would it be possible to add a mode for pyp that produces line-by-line output without loading the entire file at once? Obviously this would prohibit some operations, such as those involving 'pp', but much useful functionality would remain.
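For illustration, the requested behaviour can be sketched in plain Python. This is a minimal sketch and not pyp's actual code: the names `stream_lines` and `run` are hypothetical, and it only assumes that each output line depends on a single input line.

```python
import sys
from typing import Callable, Iterable, Iterator, TextIO


def stream_lines(source: Iterable[str], transform: Callable[[str], str]) -> Iterator[str]:
    """Yield one transformed line per input line.

    Only the current line is held in memory, so the input can be
    arbitrarily large (or an unbounded pipe).
    """
    for line in source:
        yield transform(line.rstrip("\n"))


def run(source: TextIO, sink: TextIO, transform: Callable[[str], str]) -> None:
    """Write each result as soon as it is computed."""
    for out in stream_lines(source, transform):
        sink.write(out + "\n")
```

Calling `run(sys.stdin, sys.stdout, str.upper)` would upper-case a piped file line by line. Operations that need every line at once, such as sorting, cannot be written this way, which is the trade-off mentioned above.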
Comment #1
Posted on Mar 12, 2012 by Grumpy Lion
That's a good point. I looked into this when I started, but it had a number of drawbacks. I'll revisit it when I get a chance. If I can find a straightforward way of doing this, maybe with a "--turbo" flag, I'll incorporate it. I'm definitely open to input from the open source community regarding this as well.
Comment #2
Posted on Mar 15, 2012 by Grumpy Lion
Hi, please try this beta and let me know how it goes. Use --quick_output for large files.
- pyp_beta_2.11.1 90.11KB
Comment #3
Posted on Mar 17, 2012 by Grumpy Lion
I've tested it with commands like
cat big.file | pyp --quick_output "t[0]+'\t'+t[1]+'\t\t'+t[2]"
and it works as expected. Thanks!
Comment #4
Posted on Apr 3, 2012 by Grumpy Lion
I found a couple of cases for which passing -q leads to some missing lines of output:
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "rel(r'^\d [23]')"
1 1
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "(int(w[1]) not in {2,3})"
1 1
4 4
5 5
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp -q "(int(w[1]) not in {2,3})"
1 1
$ for i in 1 2 3 4 5; do echo "$i $i"; done | pyp "rel(r'^\d [23]')"
1 1
4 4
5 5
Comment #5
Posted on Apr 3, 2012 by Grumpy Lion
OK, thanks for the update on the beta. I'll look into this.
Comment #6
Posted on Apr 4, 2012 by Grumpy Lion
I think the issue is that Pyp.n is not being incremented, so safe_eval() does nothing after the first line that evaluates to False. But I don't understand the code well enough to know how to fix it.
Comment #7
Posted on Apr 14, 2012 by Grumpy Lion
OK, please try this one. Curly brackets don't work for me, but the command is essentially the same. Let me know if it works. This is the latest beta, which should also deal with unintentional stripping. It's still a beta though, so please let me know if you see any weirdness.
for i in 1 2 3 4 5; do echo "$i $i"; done | pyp_beta_2.11.5.py -q "(int(w[1]) not in [2,3])"
1 1
4 4
5 5
- pyp_beta_2.11.5.py 90.92KB
Comment #8
Posted on May 16, 2012 by Grumpy Lion
The new pyp_beta should fix this: http://code.google.com/p/pyp/downloads/detail?name=pyp_beta&can=2&q=#makechanges
Comment #9
Posted on May 20, 2012 by Happy Elephant
It seems that the new pyp_beta (2.11.23) does not include the --quick_output feature, right?
Comment #10
Posted on May 20, 2012 by Grumpy Lion
Quick output mode is now on by default unless you use one of the list operators (pp, spp, fpp), so we removed the flag. Cheers, t
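One way to picture this kind of switch is sketched below. It is a hypothetical illustration, not pyp's actual switching routine: `needs_full_input` and `process` are made-up names, and the check is a simple word-boundary search for `pp`, `spp`, or `fpp` in the expression text.

```python
import re
from typing import Callable, Iterable, Iterator, List

# Matches pp, spp, or fpp as whole words (so "apple" does not count).
_LIST_OPERATOR = re.compile(r"\b(?:pp|spp|fpp)\b")


def needs_full_input(expression: str) -> bool:
    """Return True if the expression mentions a list operator."""
    return _LIST_OPERATOR.search(expression) is not None


def process(
    lines: Iterable[str],
    expression: str,
    per_line: Callable[[str], str],
    whole_input: Callable[[List[str]], Iterable[str]],
) -> Iterator[str]:
    """Buffer everything only when a list operator is present."""
    if needs_full_input(expression):
        # List operators see all lines, so the input must be read first.
        yield from whole_input(list(lines))
    else:
        # Otherwise each result can be emitted immediately.
        for line in lines:
            yield per_line(line)
```

A text search like this would also fire on a quoted string such as `'pp'`, so a real implementation would presumably need something more careful.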
Comment #11
Posted on May 20, 2012 by Happy Elephant
Comment deleted
Comment #12
Posted on May 20, 2012 by Grumpy Lion
That's weird. You should see immediate output without the redirection. Is that your exact command? Make sure you are running pyp_beta. Let me know if the older version with the flag runs faster. Thanks,
T
Comment #13
Posted on May 22, 2012 by Grumpy Lion
Hi, are you still seeing these issues with pyp_beta?
Thanks, t
Comment #14
Posted on May 22, 2012 by Happy Elephant
Sorry for the late response. I tested pyp_beta_2.11.1 and 2.11.5 using the --quick_output option, and both worked on a large file without first loading the whole file.
Comment #15
Posted on May 22, 2012 by Grumpy Lion
OK, thanks for checking that out. We'll look into this. That's a key feature.
t
Comment #16
Posted on Jun 2, 2012 by Grumpy Lion
Hi, I think I've found the problem. Could you try this version and let me know how it goes?
Thanks again for your help, it's a good suggestion, and I think it feels more responsive when running simple commands. It's got a fairly complex switching routine, so it's taking a while to iron out the bugs.
t
- pyp_beta 104.75KB
Comment #17
Posted on Jun 2, 2012 by Happy Elephant
With the new pyp_beta, I can get output without loading the whole file into memory. However, pyp_beta seems quite slow when processing large files.
I compared the performance of awk and pyp using the following simple example. The file I used (article_categories_en.nt, around 2 GB) was downloaded from DBpedia and contains about ten million lines.
/usr/bin/time -o awk.time cat article_categories_en.nt | awk '{print $1,$3}' > test.awk
/usr/bin/time -o pyp.time cat article_categories_en.nt | ./pyp_beta 'w[1],w[3]' > test.pyp
I am not sure whether I am doing this right (I am new to both pyp and awk). With the above commands, awk takes around 13 s to produce the output file, which is around 1.5 GB. pyp_beta was still running after half an hour and had produced only about 30 MB of output.
Although this case is too simple to show the true power of pyp, the performance issue is really annoying.
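For comparison, a plain-Python streaming baseline for the same kind of field extraction might look like the sketch below. This is not part of pyp; `select_fields` is a hypothetical helper, and it assumes whitespace-separated fields as in the awk command. Indexes are zero-based here, so `[0, 2]` corresponds to awk's `$1` and `$3`.

```python
import sys
from typing import Iterable, Iterator, Sequence


def select_fields(lines: Iterable[str], indexes: Sequence[int]) -> Iterator[str]:
    """Yield the chosen whitespace-separated fields of each line.

    Fields that do not exist on a short line are skipped rather than
    raising an error.
    """
    for line in lines:
        words = line.split()
        yield " ".join(words[i] for i in indexes if i < len(words))


def main() -> None:
    # Reads stdin lazily and writes each result straight away.
    for out in select_fields(sys.stdin, [0, 2]):
        sys.stdout.write(out + "\n")
```

Timing a script like this on the same file would give a rough idea of how much of the gap is interpreter overhead in general and how much is specific to pyp.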
Comment #18
Posted on Jun 6, 2012 by Grumpy Lion
I assume you see this level of performance with the earlier pyp betas as well. Unfortunately, I think we'll have to go to fully compiled code to get the level of performance you need. Most users of pyp are working on much smaller data sets. Thanks for your help testing. Hopefully we will get this compiled at some point.
Status: Fixed
Labels:
Type-Defect
Priority-Medium