Re: [HTCondor-users] Schedd possibly spinning on a job
- Date: Thu, 16 May 2019 20:12:10 +0000
- From: Zach Miller <zmiller@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Schedd possibly spinning on a job
Thanks for the followup!
How large is large in your case? This is likely still something we'll want to fix since nobody wants their schedd taken down.
Cheers,
-zach
On 5/16/19, 2:44 PM, "HTCondor-users on behalf of Larne Pekowsky via HTCondor-users" <htcondor-users-bounces@xxxxxxxxxxx on behalf of htcondor-users@xxxxxxxxxxx> wrote:
Hi all,
Just to close this out in case anyone is curious, the problem originated because this is a parallel universe job and the user inadvertently set the machine count to a very large number.
Cheers,
- Larne
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Larne Pekowsky via HTCondor-users
Sent: Thursday, May 16, 2019 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: Re: [HTCondor-users] Schedd possibly spinning on a job
Hi Michael,
Thanks! With nothing else to lose I backed up the file and did
grep -v 6092282 ~/job_queue.log > job_queue.log
then restarted and that fixed it. Now we just need to figure out what it is about this job that caused this...
Cheers,
- Larne
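For anyone repeating the grep approach above, here is a slightly more defensive sketch of the same idea. It assumes the schedd is already stopped. The function name and its arguments are made up for illustration and are not part of HTCondor. Like the original command, it matches on the bare cluster number, so it also drops any line for another job that happens to contain the same digits.

```shell
#!/bin/sh
# Sketch only: drop every line mentioning a cluster id from job_queue.log,
# keeping an untouched backup. Run it only while the schedd is stopped.
# filter_job_queue is a hypothetical helper, not an HTCondor tool.

filter_job_queue() {
    log=$1        # path to job_queue.log
    cluster=$2    # cluster id to excise, e.g. 6092282

    # Keep an untouched copy first; refuse to clobber an earlier backup.
    backup="$log.bak"
    if [ -e "$backup" ]; then
        echo "backup $backup already exists, refusing to continue" >&2
        return 1
    fi
    cp -p "$log" "$backup" || return 1

    # Report how many lines will be dropped so the change can be sanity-checked.
    dropped=$(grep -c -F -- "$cluster" "$backup")
    echo "dropping $dropped line(s) mentioning $cluster"

    # Rewrite the log in place from the backup, as the original command did.
    # grep -v exits with 1 when it prints nothing, which is not an error here.
    grep -v -F -- "$cluster" "$backup" > "$log"
    [ "$?" -le 1 ]
}
```

With the schedd stopped, a call along the lines of `filter_job_queue "$(condor_config_val SPOOL)/job_queue.log" 6092282` would mirror the command above. If the restart goes badly, the `.bak` copy can be moved back into place.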
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Michael Pelletier
Sent: Thursday, May 16, 2019 2:43 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Schedd possibly spinning on a job
The schedd state is stored in $(SPOOL)/job_queue.log; shutting down the schedd and editing this file by hand to excise the problem job looks like it would be a bit tricky and error-prone, however.
Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Larne Pekowsky via HTCondor-users
Sent: Thursday, May 16, 2019 2:35 PM
To: 'John M Knoeller' <johnkn@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: [External] Re: [HTCondor-users] Schedd possibly spinning on a job
Hi tj,
I didn't know about condor_sos, thanks! Even with -timeoutmult 10 it didn't work though. Whatever the schedd is doing it isn't listening to anyone.
Cheers,
- Larne
From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Thursday, May 16, 2019 2:11 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: RE: Schedd possibly spinning on a job
Did you try using condor_sos before the condor_rm command?
D_ALL will definitely make the problem worse by the way. It's insanely chatty.
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Larne Pekowsky via HTCondor-users
Sent: Thursday, May 16, 2019 12:36 PM
To: 'htcondor-users@xxxxxxxxxxx' <htcondor-users@xxxxxxxxxxx>
Cc: Larne Pekowsky <lppekows@xxxxxxx>
Subject: [HTCondor-users] Schedd possibly spinning on a job
Hi all,
Our schedd has been pegged at 100% cpu for several hours and immediately returns to that state on restart. At D_FULLDEBUG the log floods with the message
05/16/19 12:50:58 satisfyJobs: finding resources for 6092282.0
so it almost looks like the schedd is stuck in a loop on this job. I'd like to remove it to see if that fixes the problem, but of course with the schedd running at 100% condor_rm can't get through. Any suggestions? Also, is there any way to get more detailed information on what's happening? D_ALL didn't seem to have anything useful.
Thanks,
- Larne