Subject: [Condor-users] "Claimed Idle" state on XP execute nodes, schedd still thinks they're running
Hi
We have a couple of flocked pools, with the submit machine in one pool
and the Windows XP execute machines in the other. Jobs are matched and
started successfully, but after a while they all go into the "Claimed
Idle" state on the execute nodes, whereas condor_q on the submit node
still reports them as running. Is there a way to automatically recover
from this state? For example, here's a snippet from the StartLog on one
of the XP nodes when it changes state:
2/3 19:58:26 DaemonCore: Command received via UDP from host <172.24.116.233:9500>
2/3 19:58:26 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
2/3 19:58:26 Starter pid 3008 exited with status 4
2/3 19:58:26 State change: starter exited
2/3 19:58:26 Changing activity: Busy -> Idle
The StarterLog has:
2/3 19:58:26 condor_read(): recv() returned -1, errno = 10054, assuming failure.
2/3 19:58:26 ERROR "Assertion ERROR on (result)" at line 270 in file ..\src\condor_starter.V6.1\NTsenders.C
errno 10054 is the Winsock code WSAECONNRESET (connection reset by
peer), so this looks like a network error, e.g. packet loss, right? If
so, why doesn't the schedd pick up on it? This particular machine has
now been in this state for approximately 14 hours. Is there a setting we
can tweak on the submit node to make it notice the situation more
quickly?
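
In case it is useful for spotting affected nodes, here is a minimal
Python sketch (not part of Condor) that scans StartLog text for the
pattern in the excerpt above: a starter exiting with a non-zero status,
followed by the activity change from Busy to Idle. The regular
expressions are based only on the log lines quoted in this message, so
other Condor versions may log differently. The function name
find_stalled_claims is made up for illustration, and treating a non-zero
exit status as the trigger is an assumption.

```python
import re

# Patterns copied from the StartLog excerpt in this message; they are an
# assumption about the log format, not a documented Condor interface.
STARTER_EXIT = re.compile(r"^(\S+ \S+) Starter pid (\d+) exited with status (\d+)")
BUSY_TO_IDLE = re.compile(r"^\S+ \S+ Changing activity: Busy -> Idle")


def find_stalled_claims(lines):
    """Return (timestamp, pid, status) for each non-zero starter exit
    that is later followed by a Busy -> Idle activity change."""
    events = []
    pending = None  # most recent non-zero starter exit not yet paired
    for raw in lines:
        line = raw.strip()
        exit_match = STARTER_EXIT.match(line)
        if exit_match:
            status = int(exit_match.group(3))
            # Assumption: only a non-zero exit status is suspicious.
            if status != 0:
                pending = (exit_match.group(1), int(exit_match.group(2)), status)
            else:
                pending = None
            continue
        if pending is not None and BUSY_TO_IDLE.match(line):
            events.append(pending)
            pending = None
    return events


# Sample input taken from the StartLog lines quoted above.
SAMPLE = [
    "2/3 19:58:26 Starter pid 3008 exited with status 4",
    "2/3 19:58:26 State change: starter exited",
    "2/3 19:58:26 Changing activity: Busy -> Idle",
]

if __name__ == "__main__":
    for timestamp, pid, status in find_stalled_claims(SAMPLE):
        print(f"{timestamp}: starter {pid} exited with status {status}, "
              "then activity went Busy -> Idle")
```

Running this over the StartLog on each execute node would at least show
when each machine dropped into this state. It does not tell the schedd
anything.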