Re: [HTCondor-users] jobs stuck; cannot get rid of them.
- Date: Mon, 03 Dec 2018 14:56:55 +0100
- From: Harald van Pee <pee@xxxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] jobs stuck; cannot get rid of them.
Hi all,
is this something new with HTCondor 8.6?
We have never seen it here with HTCondor 8.4 on Debian 7, 8 and 9.
Best
Harald
On Monday, December 3, 2018 10:14:48 AM CET Oliver Freyermuth wrote:
> Hi all,
>
> we also observe this regularly. Users complain that condor_userprio still
> accounts resources to them even though they have no running jobs; I then
> check all nodes and find condor_starter processes running without any job.
>
> I'm currently using:
>
> for A in <allOurComputeNodes>; do
>   echo $A
>   ssh $A 'for P in $(pidof condor_starter); do
>     CHILD_CNT=$(ps --ppid $P --no-headers | wc -l)
>     if [ $CHILD_CNT -eq 0 ]; then echo "HTCondor Bug"; pstree -p $P; kill $P; fi
>   done'
> done
>
> to clean those up, but of course that may also catch starters which are just
> in the middle of a file transfer, or waiting for new jobs (CLAIM_WORKLIFE).
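
A minimal sketch of a slightly safer variant of the loop above: it samples each
childless starter a second time after a grace period before killing it, which
reduces the chance of catching a starter that is merely between jobs or still
transferring files. The node names and the 60-second grace period are
placeholders, and SSH access to the execute nodes is assumed.

  #!/bin/bash
  # Sketch only: kill condor_starter processes that still have no child
  # process on a second check taken 60 seconds later.
  NODES="node01 node02"            # placeholder list of execute nodes
  for NODE in $NODES; do
      echo "== $NODE"
      ssh "$NODE" '
          childless() { [ "$(ps --ppid "$1" --no-headers | wc -l)" -eq 0 ]; }
          for P in $(pidof condor_starter); do
              childless "$P" || continue
              sleep 60                   # grace period (assumption)
              if kill -0 "$P" 2>/dev/null && childless "$P"; then
                  echo "starter $P still has no child after 60s, killing it"
                  pstree -p "$P"
                  kill "$P"
              fi
          done'
  done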
>
> It seems to be triggered when a compute node is busy (swapping, hanging) for
> a short while and does not respond on the network in time. A better fix than
> the hack described above would be greatly appreciated.
>
> Cheers,
> Oliver
>
> On 29.11.18 at 21:13, Collin Mehring wrote:
> > Hi Stephen,
> >
> > We ran into this too. In our case the condor_starter process that was
> > handling each of those jobs didn't exit properly and was still running.
> > Connecting to the host and killing the stuck condor_starter process fixed
> > the issue.
> >
> > Alternatively, restarting Condor on the hosts will also get rid of
> > anything still running and update the collector.
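
A hedged sketch of both options; the host name and the stuck starter's PID are
placeholders, and whether restarting only the startd (rather than all daemons)
is sufficient in a given case is an assumption.

  # Option 1: kill just the stuck starter on the affected host
  # (check with pstree first that it really has no child process)
  ssh <host> 'pstree -p <starter_pid>; kill <starter_pid>'

  # Option 2: restart only the startd (and anything running under it)
  condor_restart -name <host> -daemon startd

  # Option 3: restart all HTCondor daemons on that host
  condor_restart -name <host>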
> >
> > Hope that helps,
> > Collin
> >
> > More Information for the curious:
> >
> > Here's the end of the StarterLog for one of the affected slots:
> > 08/28/18 18:51:52 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at /<IP removed>/ (try 1 of 3): SECMAN:2003:TCP connection to daemon at /<IP removed>/ failed.
> > 08/28/18 18:53:34 ChildAliveMsg: giving up because deadline expired for sending DC_CHILDALIVE to parent.
> > 08/28/18 18:53:34 Process exited, pid=43481, status=0
> >
> > The pid listed was for the job running on that slot, which successfully
> > exited and finished elsewhere.
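
To check whether other slots on a node were hit by the same failure, one option
is to grep the per-slot starter logs for that message. A sketch; the log
directory and the StarterLog.slot* naming are assumptions that depend on the
local LOG configuration.

  # Run on an execute node; adjust the path to your LOG setting.
  grep -l "giving up because deadline expired for sending DC_CHILDALIVE" \
      /var/log/condor/StarterLog.slot*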
> >
> > We noticed this happening because it was affecting the group accounting
> > during negotiation. The negotiator would allocate the correct number of
> > slots using the number of jobs from the Schedd, but would then skip
> > negotiation for that group because it used the incorrect number of jobs
> > from the collector when determining current resource usage.
> >
> > Here's an example where the submitter had only one pending 1-core job and
> > no running jobs, but there were two stuck slots with 32-core jobs from
> > that submitter:
> > 11/26/18 12:40:59 group quotas: group= prod./<group removed>/ quota= 511.931 requested= 1 allocated= 1 unallocated= 0
> > <...>
> > 11/26/18 12:40:59 subtree_usage at prod./<group removed>/ is 64
> > <...>
> > 11/26/18 12:41:01 Group prod./<group removed>/ - skipping, at or over quota (quota=511.931) (usage=64) (allocation=1)
> > On Thu, Nov 29, 2018 at 5:25 AM Stephen Jones <sjones@xxxxxxxxxxxxxxxx> wrote:
> > Hi all,
> >
> > There is a discrepancy between what condor_q thinks is running and
> > what condor_status thinks is running. I run this set of commands to
> > see the difference:
> >
> > # condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
> > # condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
> >
> > # diff s q
> > 1,4d0
> > < 1079641 8
> > < 1080031 8
> > < 1080045 8
> > < 1080321 1
> >
> > See: condor_status has 4 jobs that don't actually exist in condor_q!?!
> >
> > They've been there for days, since I had some Linux problems that needed
> > a reboot (not really related to HTCondor).
> >
> > So I'm losing 25 slots because of this. How can I purge this stale
> > information from the HTCondor system, good and proper?
> >
> > Cheers,
> >
> > Ste
> >
> >
> >
> > _______________________________________________
> > HTCondor-users mailing list
> > To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
> > with a subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/htcondor-users/