Re: [HTCondor-users] jobs stuck; cannot get rid of them.
- Date: Thu, 29 Nov 2018 13:35:24 +0000
- From: Stephen Jones <sjones@xxxxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] jobs stuck; cannot get rid of them.
Don't worry - I found a way. For the record: Get the slot name and the
machine:
# condor_status -af Name -af Machine -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" | grep 1080321
slot1_5@xxxxxxxxxxxxxxxxxxxx r26-n02.ph.liv.ac.uk 1080321 1
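The lookup above could also be done with a small awk filter that matches the cluster ID exactly instead of grepping for a substring. This is only a sketch: `slot_for_job` is a hypothetical helper name, and it assumes the input has the same four columns as the command above (Name, Machine, JobId, Cpus).

```shell
# Hypothetical helper: print "Name Machine" for the slot running a given cluster ID.
# Reads lines of "Name Machine JobId Cpus" on stdin; the cluster ID is $1.
slot_for_job() {
    awk -v job="$1" '
        { split($3, id, ".") }            # JobId looks like "cluster.proc"
        id[1] == job { print $1, $2 }     # exact match on the cluster part
    '
}

# Usage (on a host that can run condor_status):
#   condor_status -af Name -af Machine -af JobId -af Cpus | slot_for_job 1080321
```

Slots with an undefined JobId never match, so the `grep -v undef` step is not needed in this form.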
Go to the machine and get the PID:
# ps -ef | grep condor_ | grep "slot1_1 "
condor   12333 11661  0 Nov19 ?        00:00:05 condor_starter -f -a slot1_1 igrid5.ph.liv.ac.uk
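If this has to be done on several nodes, the PID lookup could be scripted roughly as follows. This is a sketch, not a tested procedure: `starter_pid` is a hypothetical helper name, and it assumes `ps -ef` prints the command in column 8 onwards, as in the output above.

```shell
# Hypothetical helper: print the PID of the condor_starter serving a given slot.
# Reads `ps -ef` output on stdin; the slot name (e.g. slot1_1) is $1.
starter_pid() {
    awk -v slot="$1" '
        $8 ~ "(^|/)condor_starter$" {
            # The slot name follows the -a option on the starter command line.
            for (i = 9; i < NF; i++)
                if ($i == "-a" && $(i + 1) == slot) { print $2; break }
        }
    '
}

# Usage on the execute node (check the PID before killing anything):
#   pid=$(ps -ef | starter_pid slot1_1)
#   [ -n "$pid" ] && kill "$pid"
```

Matching the `-a` argument exactly avoids picking up slot1_10 when looking for slot1_1, which is what the trailing space in the grep pattern above guards against.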
Kill that process; job done. Cheers,
Ste
On 29/11/18 13:24, Stephen Jones wrote:
Hi all,
There is a discrepancy between what condor_q thinks is running, and
what condor_status thinks is running. I run this set of commands to
see the difference.
# condor_status -af JobId -af Cpus | grep -v undef | sort | sed -e "s/\.0//" > s
# condor_q -af ClusterId -af RequestCpus -constraint "JobStatus=?=2" | sort > q
# diff s q
1,4d0
< 1079641 8
< 1080031 8
< 1080045 8
< 1080321 1
See: condor_status has 4 jobs that don't actually exist in condor_q!
They've been there for days, ever since I had some Linux problems that
needed a reboot (not really related to HTCondor).
So I'm losing 25 slots, due to this. How can I purge this stale
information from the HTCondor system, good and proper?
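The comparison above can be wrapped up so that it prints only the stale entries and adds up the CPUs they hold. This is a sketch under the assumption that the `s` and `q` files were produced by the two commands above; `stale_claims` and `stale_cpus` are hypothetical helper names.

```shell
# Hypothetical helper: list entries that condor_status reports but condor_q does not.
# $1: file of "JobId Cpus" lines from condor_status (the "s" file)
# $2: file of "ClusterId RequestCpus" lines from condor_q (the "q" file)
stale_claims() {
    # -F fixed strings, -x whole-line match, -v invert:
    # keep the lines of $1 that do not appear in $2.
    grep -F -x -v -f "$2" "$1"
}

# Hypothetical helper: sum the Cpus column of the stale entries.
stale_cpus() {
    stale_claims "$1" "$2" | awk '{ total += $2 } END { print total + 0 }'
}

# Usage:
#   stale_claims s q
#   stale_cpus s q
```

For the four entries in the diff above (8 + 8 + 8 + 1) the sum comes to the 25 CPUs mentioned.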
Cheers,
Ste
--
Steve Jones sjones@xxxxxxxxxxxxxxxx
Grid System Administrator office: 220
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://www.liv.ac.uk/physics/hep/