HTCondor Project List Archives



Re: [Condor-devel] procd & starter hang



Dan & I both hit this. Dan fixed the ProcD to stop crashing, and I was
highly dubious of the killing code within the startd.

I noticed it when I was testing signal escalation.  

Cheers,
Tim

On Thu, 2011-07-07 at 16:41 -0500, Peter Keller wrote:
> Hello,
> 
> Through rust, I've run across a problem like this:
> 
> > 07/04/11 05:23:49 Calling HandleReq <command_release_claim> (0)
> > 07/04/11 05:23:49 slot6: Got RELEASE_CLAIM while in Preempting state, ignoring.
> > 07/04/11 05:23:49 Return from HandleReq <command_release_claim> (handler: 0.000s, sec: 0.001s)
> > 07/04/11 05:23:49 Return from Handler <DaemonCore::HandleReqSocketHandler> 0.0011s
> > 07/04/11 05:23:57 Calling Handler <receiveJobClassAdUpdate> (4)
> > 07/04/11 05:23:57 Return from Handler <receiveJobClassAdUpdate> 0.0002s
> > 07/04/11 05:24:09 slot6: starter (pid 14997) is not responding to the request to hardkill its job.  The startd will now directly hard kill the starter and all its decendents.
> > 07/04/11 05:24:13 error writing to named pipe: watchdog pipe has closed
> > 07/04/11 05:24:13 LocalClient: error sending message to server
> > 07/04/11 05:24:13 ProcFamilyClient: failed to start connection with ProcD
> > 07/04/11 05:24:13 kill_family: ProcD communication error
> > 07/04/11 05:24:13 ERROR "ProcD has failed" at line 571 in file /home/condor/execute/dir_17572/userdir/src/condor_utils/proc_family_proxy.cpp
> > 07/04/11 05:24:13 slot4: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot3: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot8: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot7: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 slot5: Changing state and activity: Claimed/Busy -> Preempting/Killing
> > 07/04/11 05:24:13 startd exiting because of fatal exception.
> 
> I remember that someone (maybe Brian Bockelman?) had discovered either
> this, or a very similar problem, concerning a race condition between
> these three daemons and fixed it. 
> 
> Could that someone and I have a talk? :)
> 
> Thank you.
> 
> -pete
> 
> _______________________________________________
> Condor-devel mailing list
> Condor-devel@xxxxxxxxxxx
> https://lists.cs.wisc.edu/mailman/listinfo/condor-devel