I am getting a repeated sequence of errors on my worker nodes where the
STARTD aborts due to a "fatal exception". I only have two worker nodes,
and both of them are doing this. An extract of StartLog is below. I am
running Condor 7.0.1. The STARTD on the cluster head nodes does work,
and jobs run there without a problem. Suggestions as to why this is
happening (it appears to be due to "error opening watchdog pipe", but I
can't be certain) and how to resolve it would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.

--
Ian Stokes-Rees                    W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx       T: +1 617 418-4168
SBGrid, Harvard Medical School     F: +1 617 432-5600
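PS: since the open of the procd watchdog pipe fails with "No such file
or directory", my working guess is that some component of that path
under /tmp has gone missing on the worker nodes. A quick check I could
run there to see which component is absent would look roughly like this
(just a sketch in Python; the path is copied straight from the log
above, and the condor-lock directory name is host-specific):

    import os

    # Path copied from the StartLog extract above; the condor-lock
    # directory name is host-specific, so substitute whatever your own
    # log reports.
    pipe_path = ("/tmp/condor-lock.mackenzie0.0513363986547155/"
                 "procd_pipe.STARTD.watchdog")

    # "No such file or directory (2)" means some component of the path
    # is gone.  Walk upwards and report the topmost missing component.
    missing = None
    path = pipe_path
    while path not in ("/", ""):
        if not os.path.lexists(path):
            missing = path
        path = os.path.dirname(path)

    if missing:
        print("first missing component:", missing)
    else:
        print("every component exists; check permissions/ownership instead")

If the condor-lock directory itself turns out to be the piece that has
disappeared, I suppose something may be cleaning out /tmp underneath
Condor, but that is only a guess on my part.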