The Pyed Piper


New Multi-stream Support Added 10.07.11: please see bottom of page


Piping Python Through Pipes

Pyp is a Linux command line text manipulation tool similar to awk or sed, but one that uses standard Python string and list methods as well as custom functions evolved to generate fast results in an intense production environment. Pyed Piper was developed at Sony Pictures Imageworks to facilitate the construction of complex image manipulation "one-liner" commands during visual effects work on Alice in Wonderland, Green Lantern, and the upcoming The Amazing Spider-Man.

Because pyp employs its own internal piping syntax ("|") similar to unix pipes, complex operations can be proceduralized by feeding the output of one python command to the input of the next. This greatly simplifies the generation and troubleshooting of multistep operations without the use of temporary variables or nested parentheses. In practice, the ability to easily construct complicated command sequences can largely replace "for each" loops on the command line, thus significantly speeding up workflow using standard unix command recycling.

pyp output has been optimized for typical production scenarios. For example, if text is broken up into an array using the "split()" method, the output will be automatically numbered by field, making it trivial to select a particular field. Numerous other conveniences have been included, such as an accessible history of all inter-pipe sub-results, an ability to perform mathematical operations, and a complement of variables based on common metacharacter split/join operations.

For power users, commands can be easily saved and recalled from disk as macros, providing an alternative to quick and dirty scripting. For the truly advanced user, additional methods can be added to the pyp class via a config file, allowing tight integration with a larger facility's data structures or custom toolsets.


A Quick Tour

The simplest pyp example shows how python string methods can be used easily on the command line. For example, to split up the different columns of a linux long listing, we just use the split method with pyp's line-by-line variable "p":

ls -l | pyp "p.split()"

We can then use standard python indexing to select the column. For example, to select the last column, we can just use this:

ls -l | pyp "p.split()[-1]"

Any other python string methods can be used; for example p.lower() will make everything lowercase.

For a more complicated example, we take a linux long listing, capture every other line from index 5 through index 10 (pp is indexed from zero, so these are the 6th, 8th, and 10th lines), keep the username and file name fields, replace "hello" with "goodbye", capitalize the first letter of every word, and then add the text "is splendid" to the end:

ls -l | pyp "pp[5:11:2] | whitespace[2], w[-1] | p.replace('hello','goodbye') | p.title(),'is splendid'"

This uses pyp's built-in line-by-line and entire input variables (p and pp), as well as the variable whitespace and its shortcut w, which both represent a list based on splitting each line on whitespace (whitespace = w = p.split()).

The other functions and selection techniques are all standard python. Notice the pipes ("|") are inside the pyp command.
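
For comparison, the same steps can be approximated in plain python. This is only a sketch: the function name splendid and the sample listing are invented for illustration, and it assumes that comma-separated results inside a pyp stage are joined with a single space.

```python
def splendid(lines):
    """Plain-python approximation of the pyp one-liner above."""
    out = []
    for p in lines[5:11:2]:                  # pp[5:11:2]
        w = p.split()                        # whitespace = w = p.split()
        p = " ".join([w[2], w[-1]])          # whitespace[2], w[-1]
        p = p.replace("hello", "goodbye")    # p.replace('hello','goodbye')
        out.append(" ".join([p.title(), "is splendid"]))
    return out

# Made-up listing: 12 lines whose owner and file name embed the line index.
sample = [
    "-rw-r--r-- 1 user%d staff 0 Jan 1 10:00 hello_%d.txt" % (i, i)
    for i in range(12)
]
result = splendid(sample)
for line in result:
    print(line)
```

Each stage here needs its own temporary variable or loop body, which is the bookkeeping the internal pipes remove.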

We can then save this as a macro to disk using the flag --macro_save splendid_example

The next time we need to perform this operation, we can simply use this:

ls -l | pyp splendid_example

What's New in Version 2.10


A Quick Review of File and Second Streams

Typically, data is piped into pyp through std-in, just like any other unix utility. There are, however, two other completely independent ways to get data into pyp. These other streams can then be manipulated separately as well as combined in various ways with the main stream.

File inputs can be specified using the --text_file flag, and are referred to on a line-by-line basis using the variable 'fp'. For example, to print out the std-in line-by-line next to each line in a file, use this:

pyp "p + fp" --text_file example.txt  

Second stream inputs, on the other hand, are anything after the pyp command that is not associated with an option flag. This can then be accessed separately from the primary stream by using the variable 'sp'. To print out the std-in line-by-line next to each sp_example, use this:

pyp "p + sp" sp_example1 sp_example2 sp_example3
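
As a mental model, the extra streams can be pictured as lists that are walked in step with std-in. The sketch below uses made-up data and zip(), which stops at the shortest list; how pyp itself handles streams of unequal length is not shown here, so treat that detail as an assumption.

```python
# Made-up stand-ins for the three input streams.
stdin_lines = ["a", "b", "c"]                 # piped in through std-in (p)
file_lines = ["1", "2", "3", "4"]             # lines of example.txt (fp)
second_stream = ["sp_example1", "sp_example2", "sp_example3"]  # trailing args (sp)

# Roughly what "p + fp" and "p + sp" produce, line by line.
p_plus_fp = [p + fp for p, fp in zip(stdin_lines, file_lines)]
p_plus_sp = [p + sp for p, sp in zip(stdin_lines, second_stream)]

print(p_plus_fp)
print(p_plus_sp)
```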

New File and Second Stream List Variables

List operations can now be performed on file inputs and second stream inputs using the variables fpp and spp, respectively. These essentially treat each input as a list, so you can use standard python list methods as well as any of the specialized pyp list methods. Once modified, these changes will "stick", so future references to the list will use the new list.

For example, to sort a file input, use:

fpp.sort()                                                                                

Once this operation takes place, the sorted fpp will be used for all future operations, such as referring to the file input line-by-line using fp.

If you need arbitrary strings added to your list, you can just use simple python list additions:

fpp + ['last list']

You can also add these inputs to the std-in stream like this:

pp+fpp

If pp is 10 lines and fpp is 10 lines, this will result in a new pp stream of 20 lines, with the first 10 being from pp and the last 10 being from fpp. fpp will remain untouched; only pp will change with this operation.

Of course, you can trim these to your needs using standard python list selection techniques:

pp[0:5]+ fpp[0:5]

This will result in a new composite input stream of 10 lines.
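
Since pp and fpp behave like python lists, the operations above can be tried out on ordinary lists. The data below is made up, and the variable names simply mirror pyp's for readability.

```python
# Made-up stand-ins: 10 std-in lines and 10 file lines in reverse order.
pp = ["line%d" % i for i in range(10)]
fpp = ["file%d" % i for i in range(9, -1, -1)]

fpp.sort()                     # in-place sort, so the new order "sticks"
combined = pp + fpp            # 20 lines: pp first, then fpp
trimmed = pp[0:5] + fpp[0:5]   # 10 lines: first 5 of each

print(len(combined), len(trimmed))
```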

Keep in mind that the length of fpp and spp is trimmed to reflect that of std-in. If you need to see more of your file or second stream input, you can simply extend your std-in stream:

pp+['']*10

will add 10 blank lines to std-in, and thus reveal another 10 lines of fpp if available.
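
The effect of the padding can be pictured with plain lists. This sketch models the trimming as a simple slice to the length of std-in, which is an assumption about the behavior described above rather than pyp's actual code.

```python
pp = ["a", "b"]                    # made-up std-in, 2 lines
fpp = ["f1", "f2", "f3", "f4"]     # made-up file input, 4 lines

visible_before = fpp[:len(pp)]     # only as much of fpp as there is std-in

padded = pp + [""] * 10            # same expression as above
visible_after = fpp[:len(padded)]  # now all 4 file lines fit

print(visible_before, visible_after)
```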

Also, there are a few useful python math functions that work on lists of integers or floats, like sum, min, and max. For example, to add up all of the integers in the last column of the input:

whitespace[-1] | int(p) | sum(pp)                                        
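
The three stages map onto plain python as shown in this sketch, which uses made-up input lines whose last column is an integer.

```python
lines = ["apples 3", "pears 10", "plums 29"]   # made-up input

last_column = [p.split()[-1] for p in lines]   # whitespace[-1]
numbers = [int(p) for p in last_column]        # int(p)
total = sum(numbers)                           # sum(pp)

print(total, min(numbers), max(numbers))
```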

Other New Features

  • p.ext string attribute added for file extension
  • PypConfig.py now supports customizable string attributes
  • keep() and lose() now work on split up lines

Fixes

  • list operations are correctly numbered after a filtering op
  • terminal color shift now fixed after list op
  • --no_color just removes color instead of printing raw output
  • PypConfig.py has been speed optimized


What's New in Version 2.04


New Functions

The following string method allows pattern matching using a REGEX:

p.re(REGEX)  returns portion of string that matches REGEX regular expression

echo 123456 | pyp "p.re('.3.')"
>234
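
The same result can be reproduced with python's standard re module. The helper name first_match is made up, and returning an empty string when nothing matches is an assumption, not documented p.re() behavior.

```python
import re

def first_match(p, pattern):
    """Return the first portion of p that matches pattern, or '' if none."""
    m = re.search(pattern, p)
    return m.group(0) if m else ""

print(first_match("123456", ".3."))
```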

The following pyp functions provide REGEX-based line filtering capabilities:

rekeep(REGEX)  keep all lines that match REGEX regular expression
relose(REGEX)  lose all lines that match REGEX regular expression

The above functions have the respective shortcuts rek(REGEX) and rel(REGEX)
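
In plain python, this kind of filtering amounts to a list comprehension around re.search. The helper names and sample lines below are made up, and the sketch assumes the pattern may match anywhere in the line.

```python
import re

def keep_matching(lines, pattern):
    """Keep only the lines in which pattern matches somewhere."""
    return [p for p in lines if re.search(pattern, p)]

def lose_matching(lines, pattern):
    """Drop every line in which pattern matches somewhere."""
    return [p for p in lines if not re.search(pattern, p)]

lines = ["shot_010.exr", "shot_020.jpg", "notes.txt"]   # made-up input
kept = keep_matching(lines, r"shot_\d+")
lost = lose_matching(lines, r"shot_\d+")
print(kept, lost)
```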


These next three functions provide an alternative procedural approach to regex-like operations by allowing you to pick out specific groups of letters, digits, or punctuation. For example:

echo abc123def456ghi789 | pyp "p.letters()"
>[[0]abc[1]def[2]ghi]

p.letters()      returns array of contiguous letters in p
p.digits()       returns array of contiguous numbers in p
p.punctuation()  returns array of contiguous punctuation in p
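
For readers who think in regular expressions, the grouping these functions perform can be approximated with re.findall. This is only a sketch: the exact character sets pyp uses are assumed to be ASCII letters, the digits 0-9, and python's string.punctuation.

```python
import re
import string

p = "abc123def456ghi789"

letter_groups = re.findall(r"[A-Za-z]+", p)   # like p.letters()
digit_groups = re.findall(r"[0-9]+", p)       # like p.digits()

# Character class built from string.punctuation, escaped for safety.
punct_class = "[%s]+" % re.escape(string.punctuation)
punct_groups = re.findall(punct_class, "a--b!?c")   # like p.punctuation()

print(letter_groups, digit_groups, punct_groups)
```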


The next functions are just some nice cleanup tools:

p.kill(STR1, STR2,...)  now takes multiple arguments; all STR* will be removed from p
p.clean()               replaces "bad" metacharacters with underscores
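
Removing several substrings at once can be sketched in plain python as repeated str.replace calls. The helper name remove_all and the sample file name are made up, and the substrings are assumed to be removed in the order given.

```python
def remove_all(p, *strings):
    """Remove every occurrence of each given substring from p."""
    for s in strings:
        p = p.replace(s, "")
    return p

print(remove_all("shot_010_v002.exr", "_v002", ".exr"))
```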


New Variables

letters abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
digits 0123456789
punctuation !"#$%&'()*+,-./:;<=>?@[\]^_{|}~`

letters is great when used with the line counter "n"...for example, to name things a,b,c,d just use "letters[n]"
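
A plain-python sketch of the same naming trick follows, with enumerate() standing in for the counter "n" (starting at 0) and a made-up list of items. Note that indexing runs out after 52 entries.

```python
import string

letters = string.ascii_letters        # lowercase a-z followed by uppercase A-Z
items = ["shot", "shot", "shot", "shot"]   # made-up input lines

named = [p + "_" + letters[n] for n, p in enumerate(items)]
print(named)
```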


New Flag

--keep_false Normally pyp filters out anything that tests False ([],'',0,etc). With this flag, pyp just prints out a blank line if something tests False.
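
The difference the flag makes can be sketched as follows. The function emit and its keep_false parameter are invented for illustration and only model the behavior described above.

```python
def emit(results, keep_false=False):
    """Return output lines, dropping falsy results unless keep_false is set."""
    out = []
    for r in results:
        if r:
            out.append(str(r))
        elif keep_false:
            out.append("")     # blank line in place of the falsy result
    return out

results = ["a", "", 0, [], "b"]        # made-up per-line results
default_output = emit(results)
kept_output = emit(results, keep_false=True)
print(default_output, kept_output)
```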


Fixes

  • The counter variable ("n") now starts at 0 not 1 to keep things more pythonic.
  • pp.uniq and pp.oneline are now methods, not attributes, so use pp.uniq() and pp.oneline() instead
  • history variable has un-nested entries.
  • various minor bugs fixed