Skip to content

chaos/nodediag

Repository files navigation

Nodediag provides an extensible TAP framework for executing node diagnostic checks at system startup.

Tests installed in /etc/nodediag.d/ are run in parallel by the nodediag command:

Checking cpucount:                                          [  OK  ]
Checking ecc:                                               [  OK  ]
Checking infiniband:                                        [NOTRUN]
Checking ethernet:                                          [  OK  ]
Checking mpt2sas:                                           [  OK  ]
Checking mptsas:                                            [  OK  ]
Checking network:                                           [  OK  ]
Checking sbe:                                               [  OK  ]
Checking swap:                                              [  OK  ]
Checking dmi:                                               [  OK  ]
Checking tw:                                                [NOTRUN]
Checking tapwrap:                                           [ FAIL ]
Checking hdparm:                                            [  OK  ]

Expected results are configured in /etc/sysconfig/nodediag or /etc/sysconfig/nodediag.d/testname.

Tests can be listed with nodediag -l:

cpucount:        Check number of CPU cores
dmi:             Check dmi table values
ecc:             Check EDAC ECC type
ethernet:        Check ethernet config
hdparm:          Check hard drive read performance
infiniband:      Check infiniband config
mpt2sas:         Check mpt2sas cards
mptsas:          Check mptsas cards
network:         Check network config
sbe:             Check Single Bit Memory Error Count
swap:            Check for expected amount of swap
tapwrap:         Run non-TAP tests: nvidia 
tw:              Check 3ware cards

An init script is provided so diagnostics can be run at startup, with the (verbose) results redirected to /var/log/nodediag. Alternatively nodediag can be run as part of a cluster bringup procedure, determining whether a node is fit to start cluster services; or as part of a resource manager epilog between batch jobs to detect emerging problems.