NewModules  
Implement your own modules
Updated Today (3 hours ago) by mcra...@gmail.com

Plowshare is designed with modularity in mind, so it should be easy for other programmers to add new modules. Study the code of any existing module (e.g. 2shared) and create your own.

Some hosters expose a public API (a formalized way of downloading or uploading). If one is available, calling it instead of simulating a web browser can save you a lot of time. For example: HotFile.

Script template

Each module implements services for one sharing site:

  • anonymous download
  • free/premium account download
  • anonymous upload (if allowed from host)
  • free/premium account upload
  • free/premium account remote upload (if available from host)
  • delete or kill url (anonymous or not)
  • shared folder (and sub-folders) list (if available from host)

The module must declare the following global variables:

MODULE_XXX_REGEXP_URL

Depending on the module's features, some additional variables should also be declared:

MODULE_XXX_DOWNLOAD_OPTIONS
MODULE_XXX_DOWNLOAD_RESUME
MODULE_XXX_DOWNLOAD_FINAL_LINK_NEEDS_COOKIE

MODULE_XXX_UPLOAD_OPTIONS
MODULE_XXX_UPLOAD_REMOTE_SUPPORT

MODULE_XXX_DELETE_OPTIONS

MODULE_XXX_LIST_OPTIONS

Where XXX is the name of the module (uppercase). No other global variable declaration is allowed.

A module must export one to four entry points:

  • xxx_download()
  • xxx_upload()
  • xxx_delete()
  • xxx_list()
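
Putting the declarations and one entry point together, a module file might start like the sketch below. The module name example, its URL pattern and the values assigned to the variables are illustrative assumptions, not taken from a real module, and the function body is only a placeholder. The \? in the pattern assumes GNU grep.

```shell
# Hypothetical module "example" (it would live in src/modules/example.sh).
# The pattern and the yes/no values below are assumptions for illustration.
MODULE_EXAMPLE_REGEXP_URL='http://\(www\.\)\?example\.com/file/'

MODULE_EXAMPLE_DOWNLOAD_OPTIONS=""
MODULE_EXAMPLE_DOWNLOAD_RESUME=no
MODULE_EXAMPLE_DOWNLOAD_FINAL_LINK_NEEDS_COOKIE=no

# Entry point for downloads: $1 is the cookie file, $2 is the URL.
example_download() {
    local COOKIEFILE=$1
    local URL=$2

    # A real module would fetch and parse pages here. This placeholder
    # only echoes a made-up final link and a made-up filename.
    echo "$URL/get"
    echo 'file.bin'
}
```

Only the entry points that the hoster actually supports need to be defined; the same pattern applies to the upload, delete and list functions.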

Downloading function

Prototype is:

xxx_download() {
    eval "$(process_options xxx "$MODULE_XXX_DOWNLOAD_OPTIONS" "$@")"

    local COOKIEFILE=$1
    local URL=$2

    ...
}

Notes:

  • xxx is the name of the plugin: src/modules/xxx.sh.
  • xxx must not contain dots; use underscores instead.
  • $@ is modified by the call to process_options, so that call must come first in the download function. It is not required if $MODULE_XXX_DOWNLOAD_OPTIONS is empty.
  • Never call the curl_with_log function here; use curl.

Arguments:

  • $1: cookie file (empty content at start, use it with curl)
  • $2: URL string (for example http://x7.to/fwupja)

Warning: If the function does not need a cookie file, do not delete the cookie file provided as argument; plowdown will take care of this.

When a link is correct, the function should return 0 and echo one or two lines, corresponding to the file URL and the filename:

echo "$FILE_URL"
echo "$FILENAME"

$FILENAME can be empty, or even not echoed at all. If so, plowdown will guess the filename from the provided $FILE_URL.

If a cookie file is required for the final download, MODULE_XXX_DOWNLOAD_FINAL_LINK_NEEDS_COOKIE must be set to yes.

$FILE_URL must be the final link (that is, a link that returns a 200 HTTP code, without redirection). Use curl -I and grep_http_header_location when necessary.

Note: $FILE_URL will be encoded right afterwards, so don't worry about unusual characters. For example, spaces will be translated to %20 for you.

Possible return values

Module can return the following codes:

  • 0: Everything is OK (the results have to be echoed, see above). When plowdown is invoked with the -c/--check-link option, it means that the link is alive.
  • $ERR_FATAL: Unexpected result (upstream site updated, etc).
  • $ERR_LOGIN_FAILED: Correct login/password argument is required.
  • $ERR_LINK_TEMP_UNAVAILABLE: Link alive but temporarily unavailable.
  • $ERR_LINK_PASSWORD_REQUIRED: Link alive but requires a password (password protected link).
  • $ERR_LINK_NEED_PERMISSIONS: Link alive but requires some authentication (private or premium link).
  • $ERR_LINK_DEAD: Link is dead (we must be sure of that). Every download function should have at least one code path that returns this value.
  • $ERR_SIZE_LIMIT_EXCEEDED: Can't download the link because the file is too big (permissions needed, probably a premium account).

Additional error codes (returned by plowdown only, module download function should not return these):

  • $ERR_NOMODULE: No module available for provided link. Hoster is not supported yet!
  • $ERR_NETWORK: Specific network error (socket reset, curl, etc).
  • $ERR_SYSTEM: System failure (missing executable, local filesystem, wrong behavior, etc).
  • $ERR_CAPTCHA: Captcha solving failure.
  • $ERR_MAX_WAIT_REACHED: Countdown timeout (see -t/--timeout command line option).
  • $ERR_MAX_TRIES_REACHED: Max tries reached (see -r/--max-retries command line option).

Guidelines

  • If the hoster asks to try again later (and you don't know how long to wait): the download function must return $ERR_LINK_TEMP_UNAVAILABLE.
  • If the hoster asks to try again later (and you do know how long to wait): the download function must echo the wait time (in seconds) and return $ERR_LINK_TEMP_UNAVAILABLE.
  • Respect wait times even if the download seems to work without them. Don't hammer the website!
  • Try to force the English language on the website (usually with a cookie) if you are going to parse human-readable messages (parsing HTML nodes is better, though).
  • If you provide premium download, a bad login must lead to an error ($ERR_LOGIN_FAILED). Never fall back to anonymous download (even if the remote website accepts it).
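
The first two guidelines can be sketched as follows. This is only an illustration: example_check_page is a hypothetical helper, the page messages are made up, the error code values are placeholders (the real constants come from core.sh), and match is a simplified stand-in for the core.sh function described later on this page.

```shell
# Placeholder values: in Plowshare these constants are provided by core.sh.
ERR_LINK_TEMP_UNAVAILABLE=10
ERR_LINK_DEAD=13

# Simplified stand-in for the core.sh match function (assumes GNU grep).
match() {
    printf '%s\n' "$2" | grep -q -- "$1"
}

# Hypothetical helper: decide what a download function should return for
# an already-fetched page ($1). The messages tested here are made up.
example_check_page() {
    local PAGE=$1
    local WAIT

    if match 'File not found' "$PAGE"; then
        return $ERR_LINK_DEAD
    elif match 'Please wait [0-9]\+ seconds' "$PAGE"; then
        # Known delay: echo it (in seconds) before returning.
        # A real module would rather use one of the parse functions.
        WAIT=${PAGE#*Please wait }
        WAIT=${WAIT%% seconds*}
        echo "$WAIT"
        return $ERR_LINK_TEMP_UNAVAILABLE
    elif match 'Try again later' "$PAGE"; then
        # Unknown delay: no output, only the return code.
        return $ERR_LINK_TEMP_UNAVAILABLE
    fi
    return 0
}
```

In a download function this would typically be used as example_check_page "$PAGE" || return, so that the wait time (if any) and the error code reach plowdown unchanged.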

Uploading function

Prototype is:

xxx_upload() {
    eval "$(process_options xxx "$MODULE_XXX_UPLOAD_OPTIONS" "$@")"

    local COOKIEFILE=$1
    local FILE=$2
    local DESTFILE=$3

    ...

    PAGE=$(curl_with_log ...) || return

    ...
}

Notes:

  • xxx is the name of the plugin: src/modules/xxx.sh.
  • xxx must not contain dots; use underscores instead.
  • $@ is modified by the call to process_options, so that call must come first in the upload function. It is not required if $MODULE_XXX_UPLOAD_OPTIONS is empty.
  • Use the curl_with_log function only once, for the file upload itself (it is quite convenient to see progress); otherwise simply use curl.

Arguments:

  • $1: cookie file (empty content at start, use it with curl)
  • $2: local filename (with full path) to upload or (remote) URL
  • $3: remote filename (no path)

Warning: If the function does not need a cookie file, do not delete the cookie file provided as argument; plowup will take care of this.

When the requested file has been successfully uploaded, the function should return 0 and echo one to three lines.

echo "$DL_URL"
echo "$DEL_URL"
echo "$ADMIN_URL_OR_CODE"

$DEL_URL and $ADMIN_URL_OR_CODE are optional (can be empty or not echoed at all).

Example1 (seen in depositfiles module):

echo "$DL_LINK"
echo "$DEL_LINK"

Example2 (seen in 2shared module):

echo "$FILE_URL"
echo
echo "$FILE_ADMIN"

Possible return values

Module can return the following codes:

  • 0: Success. File successfully uploaded.
  • $ERR_FATAL: Unexpected result (upstream site updated, etc).
  • $ERR_LINK_NEED_PERMISSIONS: Authentication required (for example: anonymous users can't do remote upload).
  • $ERR_LINK_TEMP_UNAVAILABLE: Upload service seems temporarily unavailable from upstream.
  • $ERR_SIZE_LIMIT_EXCEEDED: Can't upload the file because it is too big (permissions needed, probably a premium account).
  • $ERR_LOGIN_FAILED: Correct login/password argument is required.

Additional error codes (returned by plowup only, module upload function should not return these):

  • $ERR_NOMODULE: Specified module does not exist or is not supported.
  • $ERR_NETWORK: Specific network error (socket reset, curl, etc).
  • $ERR_SYSTEM: System failure (missing executable, local filesystem, wrong behavior, etc).
  • $ERR_MAX_TRIES_REACHED: Max tries reached (see -r/--max-retries command line option).

Guidelines

  • Remember that $2 can also be a remote file. It should be checked with match_remote_url. Most of the time, the remote upload feature is only available to premium users. If the module does not support it, put this at the top of the file: MODULE_XXX_UPLOAD_REMOTE_SUPPORT=no
  • Upload file size is usually limited (the limit can be quite low for anonymous upload). Handling this is helpful for the user. For example:

    MAX_SIZE=... # hardcoded value or parse it in html page (if possible)
    SIZE=$(get_filesize "$FILE")
    if [ "$SIZE" -gt "$MAX_SIZE" ]; then
        log_debug "file is bigger than $MAX_SIZE"
        return $ERR_SIZE_LIMIT_EXCEEDED
    fi

Deleting function

Prototype is:

xxx_delete() {
    eval "$(process_options xxx "$MODULE_XXX_DELETE_OPTIONS" "$@")"

    local COOKIEFILE=$1
    local URL=$2

    ...
}

Notes:

  • xxx is the name of the plugin: src/modules/xxx.sh.
  • xxx must not contain dots; use underscores instead.
  • $@ is modified by the call to process_options, so that call must come first in the delete function. It is not required if $MODULE_XXX_DELETE_OPTIONS is empty.
  • Never call the curl_with_log function here; use curl.

Arguments:

  • $1: cookie file (empty content at start, use it with curl)
  • $2: kill/admin URL string

Warning: If the function does not need a cookie file, do not delete the cookie file provided as argument; plowdel will take care of this.

There is no output for this function. When the file has been successfully deleted, the function should return 0.

Possible return values

Module can return the following codes:

  • 0: Success. File successfully deleted.
  • $ERR_FATAL: Unexpected result (upstream site updated, etc).
  • $ERR_LOGIN_FAILED: Authentication failed (bad login/password).
  • $ERR_LINK_NEED_PERMISSIONS: Authentication required (anonymous users can't delete files).
  • $ERR_LINK_DEAD: Link is dead. File has been previously deleted.

Additional error codes (returned by plowdel only, module delete function should not return these):

  • $ERR_NOMODULE: No module available for provided link.
  • $ERR_NETWORK: Specific network error (socket reset, curl, etc).

Guidelines

  • On success (return 0), don't print a message; plowdel will call log_notice for you.

Listing function

Prototype is:

xxx_list() {
    eval "$(process_options xxx "$MODULE_XXX_LIST_OPTIONS" "$@")"

    local URL=$1
    local RECURSE=${2:-"0"}

    ...
}

Notes:

  • xxx is the name of the plugin: src/modules/xxx.sh.
  • xxx must not contain dots; use underscores instead.
  • Never call the curl_with_log function here; use curl.

Arguments:

  • $1: list URL (aka root folder URL)
  • $2: recursion flag: list links and recurse into subfolders (if any). If $2 is an empty string, the option is not selected.

As a result, the function must echo download links (one URL per line).

Possible return values

Module can return the following codes:

  • 0: Success. The folder contains one or more files.
  • $ERR_FATAL: Unexpected content (not a folder, parsing error, etc).
  • $ERR_LINK_PASSWORD_REQUIRED: Folder is password protected.
  • $ERR_LINK_DEAD: Folder has been deleted or does not exist or is empty.

Additional error codes (returned by plowlist only, module list function should not return these):

  • $ERR_NOMODULE: No module available for provided link.
  • $ERR_NETWORK: Specific network error (socket reset, curl, etc).

Guidelines

  • You should notify the user with a log_debug message if the module doesn't support the recursive subfolders option. For example, in the 4shared module:

    test "$2" && log_debug "recursive flag is not supported"

  • When recursing into subfolders, don't echo folder URLs.
  • When the recurse subfolders option is enabled: $ERR_LINK_DEAD means that there is no file in any folder.
  • When the recurse subfolders option is disabled: $ERR_LINK_DEAD means that there is no file in the root folder. There might be files in subfolders.
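
The output rule and the $ERR_LINK_DEAD convention can be combined in a few lines. The sketch below assumes the links have already been extracted from the folder page; example_list_output is a hypothetical helper and the error code value is a placeholder.

```shell
ERR_LINK_DEAD=13   # placeholder value; the real constant comes from core.sh

# Hypothetical helper: $1 holds the file links already extracted from the
# folder page (one per line, folder URLs excluded).
example_list_output() {
    local LINKS=$1

    # Nothing to list: the folder is empty, deleted or does not exist.
    test "$LINKS" || return $ERR_LINK_DEAD

    # One URL per line on stdout.
    echo "$LINKS"
}
```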

Output debug messages (stderr)

Do not use echo, which is reserved for function return value(s). Use log_debug() or log_error() instead. The -vN command-line switch changes the debug verbosity.

Note: An intermediate verbosity level exists, log_notice(). It is reserved for core functions; do not use it inside modules.

Module arguments

If MODULE_XXX_DOWNLOAD_OPTIONS / MODULE_XXX_UPLOAD_OPTIONS / MODULE_XXX_DELETE_OPTIONS or MODULE_XXX_LIST_OPTIONS is not empty, you must process arguments at the beginning of the function using:

eval "$(process_options xxx "$MODULE_XXX_LIST_OPTIONS" "$@")"

This will modify $@ and assign values (if the user specified them) to the module options.

Example:

Assuming the module source contains:

MODULE_XXX_DELETE_OPTIONS="
AUTH,a:,auth:,USER:PASSWORD,User account"

Assuming the user invokes plowdel with an account:

$ plowdel -a 'user:password' 'http://www.sharing-site.com/?delete=12D45G5'

Module function xxx_delete() will be called with the following arguments:

$1='-a'
$2='user:password'
$3='http://www.sharing-site.com/?delete=12D45G5'

After the process_options call:

$1='http://www.sharing-site.com/?delete=12D45G5'
AUTH='user:password'
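
From that point on, the option is an ordinary shell variable. As a rough sketch (example_account_mode is a hypothetical helper, and splitting at the first colon is an assumption about the 'user:password' format), a module could branch on it like this:

```shell
# Hypothetical helper: $1 is the AUTH value set by process_options
# (empty when the user did not pass -a/--auth).
example_account_mode() {
    local AUTH=$1
    local USER PASSWORD

    if test -z "$AUTH"; then
        echo 'anonymous'
        return 0
    fi

    # Split at the first colon only (an assumption about the format),
    # so the password part may itself contain colons.
    USER=${AUTH%%:*}
    PASSWORD=${AUTH#*:}
    echo "account: $USER (password length: ${#PASSWORD})"
}
```

In practice the split is rarely needed by hand, since post_login (described further down) takes the whole authentication string.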

curl function

This is probably the most important command in the plowshare API. This wrapper function calls the real curl binary (let's call it true-curl).

Arguments:

  • $1 ... $n: true-curl command-line arguments

Results:

  • $?: 0 for success; $ERR_NETWORK or $ERR_SYSTEM on error

Note: curl_with_log calls curl but forces the verbose level to 3. It is intended for the module upload function (and should be called only once).

It's a good habit to always append || return for error handling.

Examples:

PAGE1=$(curl "http://www.google.com") || return

# Get remote content and take cookies (if any)
PAGE2=$(curl -c "$COOKIE_FILE" "$URL") || return

# Get remote content, providing cookie entries and saving received ones
PAGE3=$(curl -c "$COOKIE_FILE" -b 'lang=en' "$URL") || return
PAGE4=$(curl -c "$COOKIE_FILE" -b "$COOKIE_FILE" "$URL") || return

PAGE5=$(curl "${URL}?param=1") || return
# or
PAGE5=$(curl --get --data 'param=1' "$URL") || return

Notes:

  • curl will add a valid User-Agent for you.
  • curl exit codes are mapped to plowshare error codes. Human-readable debug messages have been added too.
  • curl implicitly applies the plowdown (or plowup) command-line switches (--interface, --max-rate, ...).

Temporary files are deleted in case of error

First example using -D/--dump-header:

HEADERS=$(create_tempfile) || return
HTML=$(curl -D "$HEADERS" http://...) || return
rm -f "$HEADERS"

If something goes wrong in curl (network issue or anything else), $HEADERS will be deleted for you.

Remember, it's only if an error occurs. On curl's success nothing is deleted (as expected).

Another classic example, using -o/--output:

CAPTCHA_URL='http://...'
CAPTCHA_IMG=$(create_tempfile '.png') || return
curl -o "$CAPTCHA_IMG" "$CAPTCHA_URL" || return
...
rm -f "$CAPTCHA_IMG"

If something goes wrong when retrieving the captcha image, curl will delete the temporary file for you.

Split long data string

DATA="action=validate&uid=123456&recaptcha_challenge_field=$CHALLENGE&recaptcha_response_field=$WORD"
RESULT=$(curl -b "$COOKIE_FILE" --data "$DATA" "$URL") || return

Consider passing several -d/--data arguments instead of one (order is not important):

RESULT=$(curl -b "$COOKIE_FILE" -d 'action=validate' \
    -d "uid=123456" \
    -d "recaptcha_challenge_field=$CHALLENGE" \
    -d "recaptcha_response_field=$WORD" \
    "$URL") || return

This is easier to maintain.

Auxiliary functions

You can see the full list of the plowshare public API on the NewModules2 wiki page.

The core.sh script provides the usual auxiliary functions.

Do not use                  But use
basename                    basename_file
grep -o "^http://[^/]*"     basename_url (see examples below)
sleep                       wait (must always be ORed with return keyword)
grep or grep -q             match
grep -i or grep -iq         matchi
sed, awk, perl              parse, parse_all, parse_quiet, parse_last or replace
head -n1, tail -n1          first_line, last_line
mktemp, tempfile            create_tempfile
tr '[A-Z]' '[a-z]'          lowercase
tr '[a-z]' '[A-Z]'          uppercase
sed ...                     strip (delete leading and trailing spaces, tabs), delete_last_line
js                          detect_javascript and javascript
perl                        detect_perl (useful for 3rd-party scripts)
which or command -v         check_exec (in case of external dependency)
stat -c %s                  get_filesize
$RANDOM or $$               random
md5sum                      md5

The goal is to avoid calling non-portable commands in modules.

Function: post_login

This function is useful for registered accounts, because ID information is stored inside the cookie. It submits the HTML login form for you and takes four or more arguments.

Arguments:

  • $1: authentication string 'username:password' (password can contain semicolons)
  • $2: cookie file (must be an existing file)
  • $3: string to post (can contain keywords: $USER and $PASSWORD)
  • $4: URL
  • $5..$n (optional): Additional curl arguments
  • stdin: input data (text)

Example:

# comes from command line
AUTH="mylogin:mypassword"

# important: note the single quotes, $USER and $PASSWORD must not be expanded by the shell.
LOGIN_DATA='login=1&redir=1&username=$USER&password=$PASSWORD'
LOGIN_URL="https://xxx.com/login.php"
  
# or simply use $(create_tempfile) 
COOKIES=/tmp/my_cookie_file
 
post_login "$AUTH" "$COOKIES" "$LOGIN_DATA" "$LOGIN_URL" >/dev/null

Results:

  • $?: 0 for success; $ERR_NETWORK, $ERR_LOGIN_FAILED for error (no cookie return)
  • stdout: HTML result of POST request

A common usage is (snippet taken from filesonic module):

LOGIN_RESULT=$(post_login "$AUTH" "$COOKIE_FILE" "$LOGIN_DATA" 'http://www.fileserve.com/login.php') || return

If no password is provided, post_login will prompt for one.

Warning: Having $?=0 does not mean that your account is valid; it only means that the request was successful from the HTTP protocol point of view. To detect a bad login/password, you have to parse the returned HTML content or sometimes the cookie file.

Note: Sometimes, parsing LOGIN_RESULT can be useful to distinguish free account from premium account. Sometimes parsing cookie (looking for specific entry in it) can help too.
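
A minimal sketch of such a check is shown below. The error message is made up (each hoster has its own wording), example_check_login is a hypothetical helper, the error code value is a placeholder, and match is a simplified stand-in for the core.sh function.

```shell
ERR_LOGIN_FAILED=4   # placeholder value; the real constant comes from core.sh

# Simplified stand-in for the core.sh match function.
match() {
    printf '%s\n' "$2" | grep -q -- "$1"
}

# Hypothetical check on the HTML returned by post_login ($1).
example_check_login() {
    local LOGIN_RESULT=$1

    if match 'Invalid username or password' "$LOGIN_RESULT"; then
        return $ERR_LOGIN_FAILED
    fi
    return 0
}
```

It would be called right after post_login, for instance as example_check_login "$LOGIN_RESULT" || return.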

Use case 1 (seen in netload.in module)

An empty $LOGIN_RESULT is not necessarily an error. You can get an HTTP redirection, for example. You can follow this redirection by passing the -L option to curl:

LOGIN_RESULT=$(post_login "$AUTH" "$COOKIE_FILE" "$LOGIN_DATA" "$BASEURL/login.php" -L) || return

Use case 2 (seen in mediafire module)

You already have valid entries in $COOKIEFILE (language for example) and you want to keep them.

LOGIN_RESULT=$(post_login "$AUTH_FREE" "$COOKIEFILE" "$LOGIN_DATA" \
    "$BASE_URL/dynamic/login.php?popup=1" -b "$COOKIEFILE") || return

Without this additional -b "$COOKIEFILE" given to curl, cookie file would be overwritten.

Functions: match and matchi

Arguments:

  • $1: match regexp (like grep)
  • $2: input data (text)

Results:

  • $?: 0 for success; non-zero on any error
  • stdout: nothing!

The 'i' suffix stands for case-insensitive match.

match does not use the sed command, so you don't have to escape the "/" (slash) character.

Regexps are basic POSIX. Reserved characters (to escape) are: . * [ ].

The coding convention is to use the shortest form:

match 'foo' "$HTML_PAGE" && ...        # right
$(match 'foo' "$HTML_PAGE") && ...     # wrong (useless subshell creation)
match '\(foo\)' "$HTML_PAGE" && ...    # wrong (useless parentheses)

if (! match 'You are ' "$HTML"); then  # wrong (useless subshell creation)
    ...
fi

Typical use:

if ! match '/js/myfiles\.php/' "$PAGE"; then
    log_error "not a folder"
    return $ERR_FATAL
fi

Simple examples:

match '[0-9][0-9]\+' 'Wait 19 seconds'       # true
match '[0-9][0-9]\+' 'Wait 9 seconds'        # false
match 'times\?' 'One time ago'               # true
match 's/n' 'yes/no'                         # true
match '(euros)' '3.5 (euros)'                # true
match '\[euros\]' '3.5 [euros]'              # true

More examples (seen in modules):

match '^http://download' "$LOCATION"   # ^ matches beginning of line
match 'errno=999$' "$LOCATION"         # $ matches end of line
match '.*/#!index|' "$URL"             # . means any character
match 'File \(deleted\|not found\|ID invalid\)' "$ERROR"
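
To experiment with these patterns outside of Plowshare, the two functions can be approximated with grep. The definitions below are simplified stand-ins written for this page, not the core.sh implementation, and the \+ operator assumes GNU grep.

```shell
# Simplified stand-ins: $1 is a basic regexp, $2 the text to search.
match() {
    printf '%s\n' "$2" | grep -q -- "$1"
}

matchi() {
    printf '%s\n' "$2" | grep -iq -- "$1"
}

match '[0-9][0-9]\+' 'Wait 19 seconds' && echo 'two or more digits'
matchi 'WAIT' 'Wait 19 seconds' && echo 'case-insensitive hit'
match '[0-9][0-9]\+' 'Wait 9 seconds' || echo 'only one digit'
```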

Functions: parse, parse_all, parse_quiet, parse_all_quiet and parse_last

parse returns the first match; parse_all returns all matches (multiline result). The sed command is used internally.

Arguments:

  • $1: filter regexp (lines to stop at)
  • $2: catch regexp (enclose with \( \) to retrieve match)
  • stdin: input data (text)

Results:

  • $?: 0 on success or $ERR_FATAL (non matching or empty result)
  • stdout: parsed content (non null string)

Regexps are basic POSIX. Reserved characters (to escape) are: . * / [ ].

Examples:

ID=$(echo "$HTML_PAGE" | parse 'name="freeaccountid"' 'value="\([[:digit:]]*\)"')
HOSTERS=$(echo "$FORM" | parse_all 'checked' '">\([^<]*\)<br')
MSG=$(echo "$RESPONSE" | parse_quiet "ERROR:" "ERROR:[[:space:]]*\(.*\)")

FIXME: Add examples with ^ and $

Should I use parse or parse_quiet?

Use the xxx_quiet functions when a parsing failure is normal behavior, for example when parsing an optional value.

Typical use:

OPT_RESULT=$(echo "$HTML_PAGE" | parse_quiet 'id="completed"' '">\([^<]*\)<\/font>')

If you actually require a result, do not use the quiet variant (you'll get a sed error message if parse fails).

Typical use:

WAIT_TIME=$(echo "$HTML_PAGE" | parse '^[[:space:]]*count=' "count=\([[:digit:]]\+\);") || return

Note: Don't use these functions for HTML parsing. Consider using the parse_tag and parse_attr function families instead (see #Parsing_HTML_markers and #Parsing_HTML_attributes below).

Functions: parse_line_after and parse_line_after_all

Arguments:

  • $1: filter regexp (lines to stop)
  • $2: catch regexp (enclose with \( \) to retrieve match)
  • $3 (optional): number of lines to skip (default is 1).
  • stdin: input data (text)

Results:

  • $?: 0 on success or $ERR_FATAL (non matching or empty result)
  • stdout: parsed content (non null string)

This is useful when the filter regexp and the catch regexp do not apply to the same line. Like all parse* functions, it uses the sed command internally.

This is very useful if you want to grep this:

<div class="dl_filename">
FooBar.tar.bz2</div>

We can find the right line by filtering on dl_filename and then apply the filename regexp to the second line (the line after). This gives:

echo "$PAGE" | parse_line_after 'dl_filename' '\([^<]*\)'

Another example:

function js_fff() {
    R4z5sjkNo = "http://...";
    DelayTime = 60;
...

Get URL with:

DL_LINK=$(echo "$PAGE" | parse_line_after 'js_fff' '"\([^"]\+\)";') || return

Get counter value with (this is some kind of parse_line_after_after):

COUNT=$(echo "$PAGE" | parse_line_after 'js_fff' '=[[:space:]]*\([[:digit:]]\+\)' 2) || return

Function: basename_url

Get the basename (hostname, including the scheme) of a URL.

A=$(basename_url 'http://code.google.com/p/plowshare/wiki/NewModules')
# result: http://code.google.com
B=$(basename_url 'http://code.google.com/')
# result: http://code.google.com
C=$(basename_url 'abc')
# result: abc

Functions: grep_http_header_location, grep_http_header_location_quiet

Argument:

  • stdin: data (HTTP headers)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty string)
  • stdout: parsed header (non null string)

If you think you have reached the final URL (let's call it $FINAL_URL) for the download, but when you curl it (with the -I/--head option) you get an HTTP answer like this:

HTTP/1.1 301 Moved Permanently
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: /download/123/5687/final_filename.xyz
Content-type: text/html
Content-Length: 0
Connection: close
Date: Sun, 17 Jan 2010 14:34:47 GMT
Server: Apache

Use grep_http_header_location to deal with this redirection. Have a look at the sendspace module:

HOST=$(basename_url "$FINAL_URL")
# Do not name this variable PATH: that would clobber the shell's command search path
FILE_PATH=$(curl -I "$FINAL_URL" | grep_http_header_location) || return
echo "${HOST}${FILE_PATH}"

Another example with an absolute URI (comes from euroshare.eu):

HTTP/1.1 302 Found
Date: Sat, 10 Mar 2012 11:14:31 GMT
Server: Apache/2.2.16 (Debian)
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Set-Cookie: sid=61bu6nt3kkh9nsk92mg7otg501; expires=Sun, 11-Mar-2012 11:14:31 GMT; path=/
Location: http://s1.euroshare.eu/download/3598184/aXa2YWy3ytUhu3uVUsAQEgUzUDUseje3/5344113/myfile.zip
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: x-requested-with
Access-Control-Allow-Headers: x-file-name
Access-Control-Allow-Headers: content-type
Vary: Accept-Encoding
Content-Type: text/html

FILE_URL=$(curl -I "$FINAL_URL" | grep_http_header_location) || return
echo "$FILE_URL"
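
The two cases (relative and absolute Location) can be handled by one small helper. The sketch below is only an illustration: example_resolve_location is a hypothetical name, and basename_url is re-implemented here as a simplified stand-in that reproduces the results documented above, not the core.sh code.

```shell
# Simplified stand-in for basename_url, mirroring the documented results.
basename_url() {
    local REST
    case $1 in
        *://*)
            REST=${1#*://}
            echo "${1%%://*}://${REST%%/*}"
            ;;
        *)
            echo "$1"
            ;;
    esac
}

# Hypothetical helper: $1 is the URL that was requested, $2 the value of
# its Location header. Relative locations get the host prepended.
example_resolve_location() {
    case $2 in
        /*) echo "$(basename_url "$1")$2" ;;
        *)  echo "$2" ;;
    esac
}
```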

Note: Like other *_quiet functions, grep_http_header_location_quiet is silent and always returns 0. Use it only in specific cases. For example:

FILE_URL=$(echo "$HTML_PAGE" | grep_http_header_location_quiet) || return
if [ -z "$FILE_URL" ]; then
    ... # not premium
fi

Function: grep_http_header_content_disposition

Argument:

  • stdin: data (HTTP headers)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty string)
  • stdout: parsed filename (non null string)

Sharing websites often return their files as attachments. curl does not care about the Content-Disposition: header, so it will not parse it and keeps the URL as the name reference (see the -O option documentation).

$ curl http://p123.share-site.com/download/dl.php?id=123456456
# saved filename will be: "dl.php?id=123456456"

The reason is that a link can have multiple attachments. Note: This is a difference between curl and wget.

Note: This is not true anymore. Since curl 7.20.0, -J/--remote-header-name option has been added (you must combine it with -O/--remote-name). Plowshare does not use this for now.

Have a look at the divshare module:

FILE_NAME=$(curl -I "$FILE_URL" | grep_http_header_content_disposition) || return

Before the plowdown core script makes the final HTTP GET request, the module makes an HTTP HEAD request in order to parse the attachment header and get the filename.

$ curl -I http://p123.share-site.com/download/dl.php?id=123456456
HTTP/1.0 200 OK
Date: Sun, 28 Feb 2010 11:41:50 GMT
Server: Apache
Last-Modified: Mon, 12 Oct 2009 10:04:20 GMT
ETag: 9852859-16341905311255341860
Cache-Control: max-age=30
Content-Disposition: attachment; filename="kop_standard.pdf"
Accept-Ranges: bytes
Content-Length: 412848
Vary: User-Agent
Keep-Alive: timeout=300, max=100
Connection: keep-alive
Content-Type: application/octet-stream

Note that some sharing sites do not allow HTTP HEAD requests. This restriction may be a security measure.

There is a possible workaround: the HTTP 1.1 protocol allows making an HTTP GET request and specifying a byte range.

FILE_NAME=$(curl -i -r 0-99 "$FILE_URL" | grep_http_header_content_disposition) || return

This is not very elegant, but it can work, unless the sharing site allows one (and only one) HTTP request to the final URL (uploaded.to for example). In that case you cannot get the attachment filename.
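
To make the expected input and output concrete, here is a simplified stand-in that extracts the filename from header text. It is not the core.sh implementation: example_content_disposition is a hypothetical name, the error code value is a placeholder, and only the quoted filename="..." form shown above is handled.

```shell
ERR_FATAL=1   # placeholder value; the real constant comes from core.sh

# Simplified stand-in: reads HTTP headers on stdin, prints the filename.
example_content_disposition() {
    local NAME
    # Headers normally end with CRLF, so carriage returns are dropped first.
    NAME=$(tr -d '\r' | sed -n 's/^[Cc]ontent-[Dd]isposition:.*filename="\([^"]*\)".*$/\1/p')
    test "$NAME" || return $ERR_FATAL
    echo "$NAME"
}

HEADERS='HTTP/1.0 200 OK
Content-Disposition: attachment; filename="kop_standard.pdf"
Content-Length: 412848'

FILE_NAME=$(printf '%s\n' "$HEADERS" | example_content_disposition)
echo "$FILE_NAME"
```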

Function: grep_http_header_content_location

Argument:

  • stdin: data (HTTP headers)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty string)
  • stdout: parsed content (non null string)

Get HTTP "Content-Location:" value.

Function: grep_http_header_content_type

Argument:

  • stdin: data (HTTP headers)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty string)
  • stdout: parsed content (non null string)

Get HTTP "Content-Type:" value.

Functions: parse_cookie, parse_cookie_quiet

Arguments:

  • $1: entry name
  • stdin: data (netscape/mozilla cookie file format)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty string)
  • stdout: parsed content (non null string)

This is often used to get account settings. Sometimes, for premium accounts, the remote site adds an extra key to the cookie file, which makes it convenient to distinguish a free account from a premium account.

LOGIN_ID=$(parse_cookie 'Login' < "$COOKIEFILE") || return
PASS_HASH=$(parse_cookie 'Password' < "$COOKIEFILE") || return
# At this point you are sure that $LOGIN_ID and $PASS_HASH are valid (non empty)

Note: Like other *_quiet functions, parse_cookie_quiet is silent and always returns 0. Use it only in specific cases. For example:

USERNAME=$(parse_cookie_quiet 'login' < "$COOKIEFILE")
if [ -z "$USERNAME" ]; then

    ... # invalid account

    return $ERR_LOGIN_FAILED
fi
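
For reference, a Netscape/Mozilla cookie file stores one cookie per line as seven tab-separated fields, with the name in the sixth field and the value in the seventh. The sketch below is a simplified stand-in built on that layout, not the core.sh implementation; example_parse_cookie is a hypothetical name, the cookie line is made up and the error code value is a placeholder.

```shell
ERR_FATAL=1   # placeholder value; the real constant comes from core.sh

# Simplified stand-in: $1 is the entry name, stdin is the cookie file.
example_parse_cookie() {
    local VALUE
    VALUE=$(awk -F '\t' -v name="$1" '$6 == name { print $7; exit }')
    test "$VALUE" || return $ERR_FATAL
    echo "$VALUE"
}

# Made-up cookie line: domain, subdomain flag, path, secure flag,
# expiry, name, value.
COOKIE_LINE=$(printf '.example.com\tTRUE\t/\tFALSE\t0\tLogin\tjohn')

LOGIN_ID=$(printf '%s\n' "$COOKIE_LINE" | example_parse_cookie 'Login')
echo "$LOGIN_ID"
```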

Parsing HTML markers

Arguments:

  • $1: filtering regexp (lines to stop, put '.' to stop at each line)
  • $2: tag name. This is case sensitive.
  • stdin: data (HTML, XML)

$1 can be omitted; the tag name is then passed as the single argument.

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty marker)
  • stdout: parsed content (non null string)

Name                   Usage example
parse_tag              T=$(echo "$LINE" | parse_tag title)
parse_tag_quiet        Same as parse_tag, but prints nothing on parsing error
parse_all_tag          n/a
parse_all_tag_quiet    Same as parse_all_tag, but prints nothing on parsing error

The parse_all_* functions are for multiline content: one tag is parsed per line.

Important: If you have several matching tags on the same line, the first one is taken.

Remember that this is line-oriented: if the opening tag and the closing tag are not on the same line, it won't work. It's not perfect, but for now it covers all our needs.

Examples:

LINE='... <a href="link1">Link number 1</a> <a href="javascript:;">Link number 2</a>'
LINK1=$(echo "$LINE" | parse_tag a) || return               # 1st link returned
LINE='... <b></b> ...'
CONTENT=$(echo "$LINE" | parse_tag b) || return             # error, return called
# Nested elements: take the deepest one!
WAIT_MSG='<span id="foo">Wait <span id="bar">30</span> seconds</span>'
WAIT_TIME=$(echo "$WAIT_MSG" | parse_tag span) || return    # 30

Note: "parse_tag b" is equivalent to "parse_tag . b" and "parse_tag b b".

Parsing HTML attributes

Arguments:

  • $1: filtering regexp (lines to stop, put '.' to stop at each line)
  • $2: attribute name
  • stdin: data (HTML, XML)

Result:

  • $?: 0 on success or $ERR_FATAL (non matching or empty attribute)
  • stdout: parsed content (non null string)

Name                    Usage example
parse_attr              LINK=$(echo "$IMG" | parse_attr 'img' 'href')
parse_attr_quiet        Same as parse_attr, but prints nothing on parsing error
parse_all_attr          LINKS=$(echo "$PAGE" | parse_all_attr 'Link_[[:digit:]]' 'href')
parse_all_attr_quiet    Same as parse_all_attr, but prints nothing on parsing error

The parse_all_* functions are for multiline content: one attribute is parsed per line.

Important: If you have several matching attributes on the same line, the last one is taken.

Examples:

IMG='<img href="http://foo.com/bar.jpg" alt="">'
CONTENT=$(echo "$IMG" | parse_attr img alt) || return        # error, return called
PAGE='<a href="http://...">click here to download</a>'
LINK=$(echo "$PAGE" | parse_attr 'download' 'href') || return
log_debug "[$LINK]"          # [http://...]
IMG='<img href="http://foo.com/bar.jpg" id = image_id>'
ID=$(echo "$IMG" | parse_attr 'id') || return

Note: "parse_attr b" is equivalent to "parse_attr . b" and "parse_attr b b".

Some websites return a page as a single long line of HTML (without any end-of-line). As the parse_xxx functions are line-oriented, proper parsing can be difficult. Two functions exist to split single-line HTML: break_html_lines and break_html_lines_alt (more aggressive).

Parsing HTML forms

core.sh script provides some functions.

For our example, assume that curl retrieved an HTML page and stored it in the $HTML_PAGE variable.

Name                        Usage example
grep_form_by_order          HTML_FORM=$(grep_form_by_order "$HTML_PAGE" 2)
grep_form_by_name           HTML_FORM=$(grep_form_by_name "$HTML_PAGE" 'named_form')
grep_form_by_id             HTML_FORM=$(grep_form_by_id "$HTML_PAGE" 'id_form')
parse_form_action           ACTION=$(echo "$HTML_FORM" | parse_form_action)
parse_form_input_by_id      VALUE=$(echo "$HTML_FORM" | parse_form_input_by_id 'label')
parse_form_input_by_name    VALUE=$(echo "$HTML_FORM" | parse_form_input_by_name 'login')
parse_form_input_by_type    VALUE=$(echo "$HTML_FORM" | parse_form_input_by_type 'submit')

You are strongly encouraged to append regular || return error handling.

Example:

FORM_URL=$(grep_form_by_order "$HTML_PAGE" 1 | parse_form_action) || return
# We are sure here, that $HTML_PAGE has a form with an action attribute
# We can safely use $FORM_URL now

Note: parse_form_input_by_id_quiet, parse_form_input_by_name_quiet and parse_form_input_by_type_quiet are available.

Like other *_quiet functions, they print no error message and always return 0. You generally use them when you want to parse an HTML form field whose value may be empty. For example:

FORM_SID=$(echo "$FORM_HTML" | parse_form_input_by_id_quiet 'sid')
# $FORM_SID can be 0 for anonymous users and it can be defined (non empty) for account user.

Captcha helper functions

core.sh script provides some functions.

captcha_process

Arguments:

  • $1: local image file (any format),
  • $2: (optional) solving method. "prompt" is the default.
  • $3: (optional) viewing method. "none" is the default.

In most usual cases, $3 should be left empty, the best image viewer (using X or not) is chosen.

Current solving methods:

  • prompt (default, manual entry)
  • ocr_digit (calls tesseract, if not installed fallback to prompt)
  • ocr_upper (calls tesseract, if not installed fallback to prompt)

Other methods are private.

Results:

  • stdout (2 lines) : captcha answer (ascii text) / transaction id
  • $?: 0 for success, or $ERR_CAPTCHA, $ERR_FATAL, $ERR_NETWORK

Typical usage: ($CAPTCHA_IMG is a valid image file)

local WI WORD ID
WI=$(captcha_process "$CAPTCHA_IMG" ocr_digit) || return
{ read WORD; read ID; } <<<"$WI"
rm -f "$CAPTCHA_IMG"

Note: If something goes wrong ($? is not 0), argument image file is deleted.
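
The { read ...; read ...; } <<<"$VAR" idiom used above splits a multi-line result into one variable per line. Here is a minimal sketch of that idiom alone; fake_captcha and solve are hypothetical names that stand in for captcha_process and a module function, so that the snippet runs without core.sh:

```bash
# fake_captcha is a hypothetical stand-in: like captcha_process, it
# prints two lines on stdout (answer, then transaction id)
fake_captcha() {
    echo 'k4p7'
    echo '12345'
}

solve() {
    local WI WORD ID
    WI=$(fake_captcha) || return
    # Each read consumes one line of the here-string
    { read WORD; read ID; } <<<"$WI"
    echo "word=$WORD id=$ID"
}

solve    # word=k4p7 id=12345
```

The braces matter: both read calls must share the same input stream, otherwise the second read would start again from the first line.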

recaptcha_process

Argument:

  • $1: site key

Results:

  • stdout (3 lines) : captcha answer (ascii text) / recaptcha challenge / transaction id
  • $?: 0 for success, or $ERR_CAPTCHA, $ERR_FATAL, $ERR_NETWORK

Typical usage:

local PUBKEY WCI CHALLENGE WORD ID
PUBKEY='6Lftl70SAAABAItWJueKIVvyG5QfLgmAgtKgVbDT'
WCI=$(recaptcha_process $PUBKEY) || return
{ read WORD; read CHALLENGE; read ID; } <<<"$WCI"

Positive or negative acknowledgement

Each time you call captcha_process or recaptcha_process, you get a transaction id as a result. Once the captcha answer has been submitted, you must acknowledge the captcha transaction.

There are two functions: captcha_ack and captcha_nack.

Argument:

  • $1: transaction id

Typical usage:

if match ... wrong captcha ...; then
    captcha_nack $ID
    log_error "Wrong captcha"
    return $ERR_CAPTCHA
fi

captcha_ack $ID
log_debug "correct captcha"

JSON parsing

Official format standard: RFC4627.

If you know nothing about JavaScript Object Notation, try this:

curl http://twitter.com/users/bob.json | python -mjson.tool

Functions: parse_json, parse_json_quiet

Simple and limited JSON parsing. The sed command is used internally.

Arguments:

  • $1: variable name (string)
  • $2 (optional): preprocess option. Accepted values are: join and split.
  • stdin: input JSON data

Results:

  • $?: 0 on success or $ERR_FATAL (non matching or empty result)
  • stdout: parsed content (non null string)

Important notes:

  • Single line parsing oriented (user should strip newlines first): no tree model
  • Array and Object types: no support
  • String type: no support for escaped unicode characters (\uXXXX)
  • No non standard C/C++ comments handling (like in JSONP)
  • If several entries exist on same line: last occurrence is taken, but: consider precedence (order of priority): number, boolean/empty, string.
  • If several entries exist on different lines: all are returned (it's a parse_all_json)

Simple usage:

FILE_URL=$(echo "$JSON" | parse_json 'downloadUrl') || return

Function match_json_true

Arguments:

  • $1: name (string)
  • $2: input data (json data)

Results:

  • $?: 0 for success; non-zero on any error
  • stdout: nothing!

This matches the literal boolean token true only. A "true" string token or any number is considered false.

# Assuming that a curl request can return one of these two $JSON contents:
# JSON='{"err":"Entered digits are incorrect."}'
# JSON='{"ok":true,"dllink":"http:\/\/www.share-me.com\/..."}'

if ! match_json_true 'ok' "$JSON"; then
    ERR=$(echo "$JSON" | parse_json_quiet err)
    test "$ERR" && log_error "Remote error: $ERR"
    return $ERR_FATAL
fi
log_debug "ok answer..."

Module command-line switches

There are some module-specific options (see MODULE_XXX_DOWNLOAD_OPTIONS, MODULE_XXX_UPLOAD_OPTIONS, MODULE_XXX_DELETE_OPTIONS or MODULE_XXX_LIST_OPTIONS):

Authentication

AUTH,a:,auth:,USER:PASSWORD,Premium account
AUTH_FREE,b:,auth-free:,USER:PASSWORD,Free account

Most of the time, when a module can deal with both free and premium, we will see a single option:

AUTH,a:,auth:,USER:PASSWORD,User account

For delete, authentication is usually mandatory, so you'll see:

AUTH,a:,auth:,USER:PASSWORD,User account (mandatory)

Download usual options

LINK_PASSWORD,p:,link-password:,PASSWORD,Used in password-protected files

Ask for password if not supplied:

log_debug "File is password protected"
if [ -z "$LINK_PASSWORD" ]; then
    LINK_PASSWORD="$(prompt_for_password)" || return
fi

Upload usual options

LINK_PASSWORD,p:,link-password:,PASSWORD,Protect a link with a password
DESCRIPTION,d:,description:,DESCRIPTION,Set file description
FROMEMAIL,,email-from:,EMAIL,<From> field for notification email
TOEMAIL,,email-to:,EMAIL,<To> field for notification email

Guidelines

  • Consider module option variables (AUTH, LINK_PASSWORD, ...) as read only, don't reassign them.
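
One way to follow this guideline is to copy what you need into local variables and work on the copies. The sketch below is only an illustration: login_sketch is a hypothetical function, AUTH is hard-coded here although it is normally set by process_options, and core.sh may provide its own helper for splitting credentials.

```bash
# AUTH is normally set by process_options; hard-coded for this sketch
AUTH='bob:s3cr:et'

# login_sketch is a hypothetical function name
login_sketch() {
    local USER PASSWORD
    USER=${AUTH%%:*}       # text before the first colon
    PASSWORD=${AUTH#*:}    # text after the first colon
    echo "user=$USER password=$PASSWORD"
}

login_sketch               # user=bob password=s3cr:et
echo "AUTH is still: $AUTH"
```

The parameter expansions read AUTH without modifying it, so later code still sees the value given on the command line.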

Coding rules

Bash pitfall: quote variable when it contains several lines

WAIT_TIME=$(echo $WAIT_HTML | parse 'foo' '.. \(...\) ..')

This won't give you the expected answer if $WAIT_HTML is multiline (which is the case most of the time). You should write instead:

WAIT_TIME=$(echo "$WAIT_HTML" | parse 'foo' '.. \(...\) ..')

Consider this example for understanding:

$ MYS=$(seq 3)
$ echo "$MYS"
1
2
3
$ echo $MYS
1 2 3
$ echo $MYS | xxd
0000000: 3120 3220 330a                           1 2 3.

Bash pitfall: no local keyword with || return

Unfortunately, this is not correct:

local HTML_PAGE=$(curl "$URL") || return

If the curl function returns an error, it won't be caught by || return, because the exit status tested is the one of the local keyword (which succeeds), not the one of curl.

local HTML_PAGE
...
HTML_PAGE=$(curl "$URL") || return

is correct.
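
The difference can be observed without any network access. In this minimal sketch, fetch_page, broken and fixed are hypothetical names; fetch_page stands in for a curl call that fails with status 7:

```bash
# fetch_page is a hypothetical stand-in for a failing curl call
fetch_page() {
    echo 'partial content'
    return 7
}

# Wrong: local succeeds, so the failure of fetch_page is masked
broken() {
    local PAGE=$(fetch_page) || return
    return 0
}

# Right: declare first, assign afterwards
fixed() {
    local PAGE
    PAGE=$(fetch_page) || return
    return 0
}

BROKEN_STATUS=0
broken || BROKEN_STATUS=$?
FIXED_STATUS=0
fixed || FIXED_STATUS=$?
echo "broken=$BROKEN_STATUS fixed=$FIXED_STATUS"    # broken=0 fixed=7
```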

Bash pitfall: avoid single statement "&& ||" test

$ set -- test
$ [ -z "$1" ] && echo empty || echo nonempty
nonempty
$ set --
$ [ -z "$1" ] && echo empty || echo nonempty
empty
$ set -- test
$ [ -z "$1" ] || echo nonempty && echo empty
nonempty
empty
$ set --
$ [ -z "$1" ] || echo nonempty && echo empty
empty

Looks like "&& ||" is better than "|| &&". But imagine that echo empty does not return $?=0:

$ set --
$ [ -z "$1" ] && { echo empty; false; } || echo nonempty
empty
nonempty

Finally, classic if/then/else/fi is not so bad!

if [ -z "$1" ]; then
    echo empty
else
    echo nonempty
fi

Portability

Plowshare runs on lots of Unix/Linux systems. There are always several ways to write bash code. We try to keep compatibility with the busybox shell.

Things to take care of or avoid in your module functions:

  • no awk invocation
  • no xargs invocation
  • no grep -v (invert match) invocation
  • no wc invocation (wc -c can easily be replaced with bash internal string manipulation)
  • BSD sed has fewer features than GNU sed (you can't use \? or \r, for example). Try to use parse_* functions instead
  • readlink -f is available on GNU but not on BSD.
  • no infinite loops like "while true;" or "while :;".
  • no tr -d, try using bash internal replacement. For example: ${MYSTRING//$'\n'} or replace for multiline content.
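
As an illustration of the wc -c and tr -d items above, here is a minimal sketch using bash parameter expansion only (MYSTRING, LENGTH and ONE_LINE are example names):

```bash
MYSTRING=$'line one\nline two\n'

# Instead of: echo -n "$MYSTRING" | wc -c
# (${#VAR} counts characters, which equals bytes for plain ASCII)
LENGTH=${#MYSTRING}

# Instead of: echo "$MYSTRING" | tr -d '\n'
ONE_LINE=${MYSTRING//$'\n'}

echo "$LENGTH [$ONE_LINE]"    # 18 [line oneline two]
```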

Bash specific construct to avoid:

  • no bash regexp: [[ =~ ]] (requires bash >=3.0). Not using it is a historical choice.
  • no += string concatenation operator (requires bash >=3.1)
  • no for loop expand sequence: for i in {1..10} ; do ... ; done (requires bash >=3.0). You can use seq instead.
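
A minimal sketch of possible replacements for these three constructs follows. The variable names and the URL are made up for the example, and the case statement is only one plain-shell alternative to a regexp test:

```bash
LIST=''
# Instead of: for I in {1..3}; do
for I in $(seq 3); do
    # Instead of: LIST+="$I,"
    LIST="$LIST$I,"
done
echo "$LIST"    # 1,2,3,

# Instead of: [[ $URL =~ ^https?:// ]]
URL='http://www.foo.bar/file'
case "$URL" in
    http://*|https://*)
        SCHEME_OK=yes
        ;;
    *)
        SCHEME_OK=no
        ;;
esac
echo "$SCHEME_OK"    # yes
```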

Busybox specific pitfalls:

  • grep -o and grep -w (word-regexp) are not supported by old versions of busybox. Do not use them.
  • sleep with s/m suffixes or even fractional argument (example: sleep 1m). BusyBox may not be compiled with CONFIG_FEATURE_FANCY_SLEEP option.
  • tr with classes (such as [:upper:]). BusyBox may not be compiled with CONFIG_FEATURE_TR_CLASSES option.
  • sed does not support \xNN escaped sequences. Tested on Busybox 1.13, 1.18 and 1.19.3.
  • sed does not support \r escaped sequence before version 1.19 (commit). Don't use it, find another way!
  • sed does support \s, \S, \w, \W (these are GNU extensions). But prefer using the equivalent: [[:space:]], [^[:space:]], [[:alnum:]_], [^[:alnum:]_].
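
For instance, the sed and sleep items above could be handled as in the following sketch (LINE, WAIT_TIME and SLEEP_SECS are example names; no sleep is actually performed):

```bash
LINE='Please wait   42 seconds'

# Instead of \s, use the POSIX bracket classes
WAIT_TIME=$(echo "$LINE" | \
    sed -e 's/^.*wait[[:space:]]*\([[:digit:]]*\).*$/\1/')

# Instead of "sleep 2m", convert to plain seconds first
SLEEP_SECS=$((2 * 60))

echo "$WAIT_TIME $SLEEP_SECS"    # 42 120
```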

Try being compliant with bash 3.x. Interesting reading:

Miscellaneous remarks

  • Do not create temporary files unless necessary, and don't forget to delete them if you used any.
  • curl calls should not be invoked with the --silent option. The curl wrapper function takes care of the verbosity level.

Why is this so complicated and constrained?

It's because we want to be as portable as possible. We lose flexibility, but Plowshare can run on slow and old embedded hardware, which was the original starting point of the project. But maybe plowshare with bash 4.0 as minimum requirement will pop up one day...

Coding style

General Rules

  • GPL-compatible license.
  • No tabs, use 4 spaces. Also use 4 spaces after lines split with "\"
  • Line lengths should stay within 80 columns.
  • Comments (like Ruby) are written in English. No extra empty line before function declaration. No boxes or ASCII art stuff.
  • Always declare (with local keyword) variables you are using.

Naming Rules

  • Uppercase variables. We suggest using underscores in them. For example: MARY_POPPINS (instead of MARYPOPPINS). This is optional but recommended (especially for names with more than 7 characters). For example APIURL, DESTFILE and FILEURL are accepted. COOKIEFILE is accepted too (but COOKIE_FILE is preferred).
  • Use appropriate names to ease maintainability. For example: FILE_URL (instead of MARY_POPPINS). Don't use too long variable name: for example UPLOADED_FILE_JSON_DATA is too descriptive, JSON_DATA or JSON is enough.
  • For form parsing, usual names are: FILE_ID, FILE_NAME, FILE_URL, BASE_URL, FORM_HTML, FORM_URL (action parameter), FORM_xxx (input field name in uppercase), ADMIN_URL, DELETE_ID, WAIT_TIME.
  • Usual names for curl results are HTML, PAGE, RESPONSE, JSON, STATUS.

Strongly recommended guidelines

1. Put then on the same line as if, and do on the same line as while.

2. Restrict usage of curly braces:

test "$FILE_URL" || { log_error "location not found"; return $ERR_FATAL; }

should be written:

if [ -z "$FILE_URL" ]; then
    log_error "location not found"
    return $ERR_FATAL
fi

3. In comments, insert a space character after the # symbol

#get id of file                  (wrong)
# Get id of file                 (right)

4. Proper indentation on continued lines

HTML=$(curl -b "$COOKIE_FILE" 'http://www.foo.bar/long...url...') \
|| return $ERR_FATAL

should be written:

HTML=$(curl -b "$COOKIE_FILE" 'http://www.foo.bar/long...url...') || \
    return $ERR_FATAL

5. Use single-quoted strings as much as possible (when the string references no variable, of course!)

local BASE_URL="http://shareme.com"        (wrong)
local BASE_URL='http://shareme.com'        (right)

Testing

Test and retest your module. Little check-list of possible cases:

  • File not found
  • File temporarily unavailable
  • File unavailable (server busy), come back in X minutes
  • Download (quota) limit reached
  • Your IP address is already downloading a file
  • Password protected link
  • Premium link download only
  • etc.

Other concerns:

  • Check for geographical location aware sites, it can affect url TLD
  • Don't send incomplete script or nearly-working stuff.
  • Don't use illegal or patented content. If you want to run some tests, use the material here: http://www.thinkbroadband.com/download.html.

If possible, update test suite. One of these:

Send your contribution

Create a patch file (git diff or diff -u, ...) and create a new issue.

Commit message should have the following standard format:

module_name: one-line summary
 

Message body. Describe as much as possible your changeset.

If possible, be more precise in the first line and add the affected feature in parentheses.

For example:

mediafire (download): add password link support

rapidshare (upload): fix account support

External documentation


