parallel-ssh - issue #30

enhancement: option to check for open and active SSH connections on all hosts


Posted on Dec 17, 2010 by Happy Ox

It would be great if parallel-ssh had an option that would just connect and authenticate to all the nodes, reporting status for each one as it does now, and then exit with 0 iff all the nodes were successfully connected and authenticated to.

thoughts?

Comment #1

Posted on Dec 20, 2010 by Happy Camel

I often do something like "pssh -h myhosts -t 5 echo hi" for this purpose. I believe that this would meet the needs that you describe; is there anything that it's missing? Let me know what you think.

Comment #2

Posted on Dec 21, 2010 by Happy Ox

That doesn't really address my problem because if the machines are down, pssh still exits with 0 so the caller can't determine if all the machines are up.

Normally it makes sense for pssh et al to exit(0) even if some commands fail, but not always.

The more I think about it now, the more I think all the tools need an extra option; something like "--exit-one-on-failure" that if passed will cause pssh et al to exit(1) if any of the requests fail.

That would solve my immediate problem by allowing

"pssh -h hosts -t 10 --exit-one-on-failure exit 0" || doFailureCode()

Comment #3

Posted on Dec 21, 2010 by Happy Camel

Hmm. Shouldn't pssh always exit with an error if there's a single failure? I had thought that this was already happening. The current behavior sounds like a bug to me; can you think of any particular reason that it should exit(0) even if some commands fail?

Comment #4

Posted on Dec 22, 2010 by Happy Ox

That certainly isn't happening right now.

My argument for it returning 0 is to be able to distinguish between pssh having a problem and the remote servers and/or ssh having a problem. This could also be accomplished by using different return codes for each. For example 1 for pssh failure (couldn't allocate memory, bad args, etc) and 2 for remote/ssh failure (timeout, key rejected, connection refused, remote command exited with non-zero return, etc). This is similar to how grep et al work. If grep matches anything, it exits 0. If it doesn't match anything, it exits 1.

I'm not against making it exit(somethingNotZero) by default if an ssh command failed, but I figured the current behavior was pretty explicit functionality to have in there, so I assumed it was done on purpose.

Comment #5

Posted on Jan 9, 2011 by Happy Camel

I like the idea of having different error codes to discern between different problems. Do you have any suggestions about what the error codes should mean? One possibility would be to return the number of hosts that failed, perhaps with a "-1" if it's some fatal early error (such as an invalid hosts file). Any thoughts?

Comment #6

Posted on Jan 9, 2011 by Happy Ox

Negative returns can be somewhat of an issue on most systems, as can numbers above 255.

As examples try:

python -c 'import sys; sys.exit(-1)'; echo $?

and

python -c 'import sys; sys.exit(256)'; echo $?

http://www.gnu.org/software/libc/manual/html_node/Exit-Status.html may be helpful to you here.

Since reporting a number of failures above 255 isn't possible, I don't think that's a workable solution: it would limit the use of pssh to fewer than 256 nodes, which would be a real problem.
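To make the wrap-around concrete, here is a small Python sketch of what a parent process sees on a typical POSIX system, where only the low 8 bits of the exit status reach the caller. The helper names `visible_status` and `observed_status` are made up for illustration and are not part of pssh; behavior on non-POSIX platforms may differ.

```python
import subprocess
import sys


def visible_status(requested):
    """Status a POSIX parent would see: only the low 8 bits survive."""
    return requested & 0xFF


def observed_status(requested):
    """Run a child Python that calls sys.exit(requested); return what we see."""
    child = subprocess.run(
        [sys.executable, "-c", "import sys; sys.exit(%d)" % requested]
    )
    return child.returncode


if __name__ == "__main__":
    # -1 and 256 are the two cases from the one-liners above.
    for requested in (-1, 255, 256, 300):
        print(requested, "->", visible_status(requested), observed_status(requested))
```

On such a system a count of 256 failed hosts would be indistinguishable from success, which is the core of the objection.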

I would just do something simple like:

0: OK
1: pssh failure (couldn't execute a subprocess for one or more hosts for some reason)
2: ssh and/or remote failure of one or more hosts (subprocess was executed but returned non-zero)

Personally I don't think anything more is all that useful, as in most cases there is nothing an automated caller could do to fix it and an interactive caller can read the output.

Comment #7

Posted on Jan 10, 2011 by Happy Camel

Doesn't a return code of -1 turn into 255? We could return the number of failed hosts up to 250 or something, with -1 being a pssh failure.

Or more in line with your proposal, it might make sense to have a different return code if all ssh commands fail than if only some of the ssh commands fail.

I suppose either of these would be better than what we're doing right now, but at the moment I don't have a strong preference.

Comment #8

Posted on Jan 18, 2011 by Happy Camel

Hmm. In addition to whether one or more processes failed, there is also the issue of whether a process returned a non-0 exit status. I need to think about this a bit more, but I think there are several different values of exit status that we might want to provide. Here's what I'm thinking right now:

0: all commands successful and returned 0
1: at least one remote command returned a non-0 value (but all commands ran)
2: at least one ssh command returned 255 (connection error, bad password, etc.)
3: at least one ssh process timed out or was killed by a signal
4: internal pssh error

Analogous exit statuses would be used for prsync, pscp, etc. (although some might not exit with a value of 1). Any thoughts? Is there anything else missing from this list? I'll send an email to the mailing list to solicit additional input.

Comment #9

Posted on Jan 19, 2011 by Helpful Rhino

The errors you mention are not necessarily mutually exclusive. Use a bitfield; that is, assign powers of two to them and add them up.

Comment #10

Posted on Jan 19, 2011 by Happy Camel

Indeed they aren't mutually exclusive--my thought was to return the max (most severe). The bitfield idea is clever, but I'm not sure if I've come across it in this context. Is there any precedent for using bitfields for exit status codes? I know that bash provides an arithmetic operator for bitwise AND, but overall it seems like there isn't much shell-level support for this. What do you think?
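The "return the max" idea, applied to the codes listed in comment #8, could be sketched roughly as follows. This is only an illustration: the `HostResult` record and the function names are hypothetical and are not taken from pssh's source, and treating a higher number as more severe is an assumption based on the ordering in comment #8.

```python
from collections import namedtuple

# Hypothetical per-host outcome record; pssh's real internals may differ.
HostResult = namedtuple("HostResult", ["exit_status", "timed_out", "killed"])

OK = 0
REMOTE_NONZERO = 1     # remote command ran but returned non-0
SSH_ERROR = 2          # ssh itself returned 255
TIMEOUT_OR_SIGNAL = 3  # ssh process timed out or was killed by a signal
# 4 (internal pssh error) would be reported elsewhere, outside per-host results.


def classify(result):
    """Map one host's outcome to a code from comment #8."""
    if result.timed_out or result.killed:
        return TIMEOUT_OR_SIGNAL
    if result.exit_status == 255:
        return SSH_ERROR
    if result.exit_status != 0:
        return REMOTE_NONZERO
    return OK


def overall_status(results):
    """Report the highest (assumed most severe) code seen across all hosts."""
    return max((classify(r) for r in results), default=OK)
```

With this approach a single timed-out host hides the fact that other hosts also returned non-0, which is the trade-off against the bitfield proposal.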

Comment #11

Posted on Jan 19, 2011 by Happy Camel

I've looked into this, and so far I haven't been able to find any other programs that use bit fields for exit status. Combined with the fact that the "test" command doesn't have any bitwise operators, I'm edging towards the scheme from comment #8, with the plan to make the semantics clear in the man page.

Comment #12

Posted on Jan 21, 2011 by Happy Camel

Meaningful exit status codes were added to pssh in commit 4ef1fea. The pssh man page includes documentation on the subject. I still need to fix the other commands. Please let me know if you see any problems or if you have any last-minute feedback.

Comment #13

Posted on Jan 21, 2011 by Happy Camel

Okay, this is done for the others as well (although we still need to add man pages for these). I'm going to mark this as closed, but please reopen it if you see any concrete or subjective problems with the implementation. Thanks.

Comment #14

Posted on Jan 24, 2011 by Helpful Rhino

Works in bash:

bash -c 'bash -c "exit 5"; xit=$?; if (( $xit & 1 )); then echo "1 bit set"; fi; if (( $xit & 2 )); then echo "2 bit set"; fi; if (( $xit & 4 )); then echo "4 bit set"; fi;'
1 bit set
4 bit set

Works in tcsh:

/bin/tcsh -c '
/bin/tcsh -c "exit 5"
set xit=$?
if ( ( $xit & 1 ) != 0 ) then
echo "1 bit set"
endif
if ( ( $xit & 2 ) != 0 ) then
echo "2 bit set"
endif
if ( ( $xit & 4 ) != 0 ) then
echo "4 bit set"
endif
'
1 bit set
4 bit set

Generating the errors themselves does not require bitwise operators, just addition.
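For comparison, a rough Python sketch of the bitfield idea follows. The flag values and names are hypothetical (the thread never assigned specific bits), and none of this reflects what pssh actually shipped. Note that plain addition is only safe if each kind of failure is added at most once; adding the same flag twice would carry into the next bit.

```python
# Hypothetical flag values: one power of two per failure kind.
REMOTE_NONZERO = 1
SSH_ERROR = 2
TIMEOUT_OR_SIGNAL = 4
INTERNAL_ERROR = 8

FLAG_NAMES = {
    REMOTE_NONZERO: "remote command returned non-zero",
    SSH_ERROR: "ssh error",
    TIMEOUT_OR_SIGNAL: "timeout or signal",
    INTERNAL_ERROR: "internal error",
}


def combine(flags):
    """Add up the distinct flags; set() keeps each kind from being added twice."""
    status = 0
    for flag in set(flags):
        status += flag
    return status


def decode(status):
    """List the failure kinds whose bit is set, lowest bit first."""
    return [name for flag, name in sorted(FLAG_NAMES.items()) if status & flag]
```

A status of 5, as in the shell examples above, would then decode to the 1 bit and the 4 bit.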

Status: Done

Labels:
Type-Defect Priority-Medium