HTCondor Project List Archives




[Condor-devel] procd & starter hang



Hello,

Through rust, I've run across a problem like this:

> 07/04/11 05:23:49 Calling HandleReq <command_release_claim> (0)
> 07/04/11 05:23:49 slot6: Got RELEASE_CLAIM while in Preempting state, ignoring.
> 07/04/11 05:23:49 Return from HandleReq <command_release_claim> (handler: 0.000s, sec: 0.001s)
> 07/04/11 05:23:49 Return from Handler <DaemonCore::HandleReqSocketHandler> 0.0011s
> 07/04/11 05:23:57 Calling Handler <receiveJobClassAdUpdate> (4)
> 07/04/11 05:23:57 Return from Handler <receiveJobClassAdUpdate> 0.0002s
> 07/04/11 05:24:09 slot6: starter (pid 14997) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.
> 07/04/11 05:24:13 error writing to named pipe: watchdog pipe has closed
> 07/04/11 05:24:13 LocalClient: error sending message to server
> 07/04/11 05:24:13 ProcFamilyClient: failed to start connection with ProcD
> 07/04/11 05:24:13 kill_family: ProcD communication error
> 07/04/11 05:24:13 ERROR "ProcD has failed" at line 571 in file /home/condor/execute/dir_17572/userdir/src/condor_utils/proc_family_proxy.cpp
> 07/04/11 05:24:13 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot8: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot7: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 slot5: Changing state and activity: Claimed/Busy -> Preempting/Killing
> 07/04/11 05:24:13 startd exiting because of fatal exception.

I remember that someone (maybe Brian Bockelman?) discovered either
this or a very similar problem, a race condition between these three
daemons (procd, starter, and startd), and fixed it.

Could that someone and I have a talk? :)

Thank you.

-pete