[HTCondor-users] high rate of killed jobs
- Date: Wed, 28 Mar 2018 09:50:31 +0200
- From: Almudena Montiel <almudena.montiel@xxxxxx>
- Subject: [HTCondor-users] high rate of killed jobs
Hello,
I am trying to understand this behaviour: very often I find that jobs
exit with status 102. In the configuration we have set the following
variables so that jobs are neither preempted nor killed:
SUSPEND = FALSE
PREEMPT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
KILL = FALSE
One example:
SchedLog
03/27/18 19:42:57 (pid:287435) Starting add_shadow_birthdate(698148.0)
03/27/18 19:42:57 (pid:287435) Started shadow for job 698148.0 on slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033, (shadow pid = 2185496)
..
03/27/18 19:44:03 (pid:287435) Shadow pid 2185496 for job 698148.0 exited with status 102
03/27/18 19:44:03 (pid:287435) Checking consistency running and runnable jobs
03/27/18 19:44:03 (pid:287435) Tables are consistent
03/27/18 19:44:03 (pid:287435) Rebuilt prioritized runnable job list in 0.001s.
03/27/18 19:44:03 (pid:287435) match (slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033) out of jobs; relinquishing
03/27/18 19:44:03 (pid:287435) Match record (slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033, 698148.-1) deleted
03/27/18 19:44:03 (pid:287435) Completed RELEASE_CLAIM to startd slot1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> for group_atlas.prod.atlprod033
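To put a number on "very often", the shadow exit lines in the SchedLog can be tallied by status. What follows is only a minimal Python sketch, not an HTCondor tool: it assumes every log entry occupies a single line in the log file, and the names `SHADOW_EXIT` and `count_shadow_exits` are made up for illustration.

```python
import re
from collections import Counter

# Matches schedd log entries of the form:
#   ... Shadow pid 2185496 for job 698148.0 exited with status 102
SHADOW_EXIT = re.compile(
    r"Shadow pid (?P<pid>\d+) for job (?P<job>\d+\.\d+) "
    r"exited with status (?P<status>\d+)"
)

def count_shadow_exits(lines):
    """Return a Counter mapping shadow exit status -> number of occurrences."""
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            counts[int(match.group("status"))] += 1
    return counts

if __name__ == "__main__":
    # Two entries taken from the excerpt above; only the first is a shadow exit.
    sample = [
        "03/27/18 19:44:03 (pid:287435) Shadow pid 2185496 for job 698148.0 "
        "exited with status 102",
        "03/27/18 19:44:03 (pid:287435) Tables are consistent",
    ]
    for status, n in sorted(count_shadow_exits(sample).items()):
        print(f"status {status}: {n} shadow(s)")
```

On a real submit node one would pass the opened SchedLog file object instead of `sample`; where that file lives depends on the local configuration.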
ShadowLog
03/27/18 19:42:57 (698148.0) (2185496): Request to run on slot1_1@xxxxxxxxxxxxxxxxxx <150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3> was ACCEPTED
[..]
03/27/18 19:42:57 (698148.0) (2185496): File transfer completed successfully.
[...]
03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal of job.
[..]
03/27/18 19:44:03 (698148.0) (2185496): Job 698148.0 is being evicted from slot1_1@xxxxxxxxxxxxxxxxxx
03/27/18 19:44:03 (698148.0) (2185496): **** condor_shadow (condor_SHADOW) pid 2185496 EXITING WITH STATUS 102
And on the node running the job:
/var/log/condor/StarterLog.slot1_1
03/27/18 19:42:57 (pid:32723) Job 698148.0 set to execute immediately
03/27/18 19:42:57 (pid:32723) Starting a VANILLA universe job with ID: 698148.0
03/27/18 19:42:57 (pid:32723) IWD: /var/lib/condor/execute/dir_32723
03/27/18 19:42:57 (pid:32723) Output file: /var/lib/condor/execute/dir_32723/_condor_stdout
03/27/18 19:42:57 (pid:32723) Error file: /var/lib/condor/execute/dir_32723/_condor_stderr
03/27/18 19:42:57 (pid:32723) Renice expr "0" evaluated to 0
03/27/18 19:42:57 (pid:32723) Using wrapper /usr/sbin/mjf-job-wrapper to exec /var/lib/condor/execute/dir_32723/condor_exec.exe
03/27/18 19:42:57 (pid:32723) Running job as user atlprod033
03/27/18 19:42:57 (pid:32723) Create_Process succeeded, pid=32727
03/27/18 19:43:34 (pid:32723) Got SIGTERM. Performing graceful shutdown.
03/27/18 19:43:34 (pid:32723) ShutdownGraceful all jobs.
03/27/18 19:44:03 (pid:32723) Got SIGQUIT. Performing fast shutdown.
03/27/18 19:44:03 (pid:32723) ShutdownFast all jobs.
03/27/18 19:44:03 (pid:32723) Process exited, pid=32727, signal=9
03/27/18 19:44:03 (pid:32723) Last process exited, now Starter is exiting
03/27/18 19:44:03 (pid:32723) **** condor_starter (condor_STARTER) pid 32723 EXITING WITH STATUS 0
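The timing in the starter log is easier to compare across jobs when it is reduced to two intervals: how long the job ran before the SIGTERM, and how long the graceful shutdown lasted before the SIGQUIT. The following is a rough Python sketch under a few assumptions: each entry starts with a `MM/DD/YY HH:MM:SS` timestamp as in the excerpt above, each entry is on one line, and the helper names (`parse_timestamp`, `first_time`, `shutdown_timeline`) are invented here and are not part of HTCondor.

```python
from datetime import datetime

def parse_timestamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp (17 characters)."""
    return datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")

def first_time(lines, needle):
    """Timestamp of the first line containing `needle`, or None if absent."""
    for line in lines:
        if needle in line:
            return parse_timestamp(line)
    return None

def shutdown_timeline(lines):
    """Return (seconds from job start to SIGTERM, seconds from SIGTERM to SIGQUIT).

    Returns None if any of the three events is missing from `lines`.
    """
    started = first_time(lines, "Create_Process succeeded")
    sigterm = first_time(lines, "Got SIGTERM")
    sigquit = first_time(lines, "Got SIGQUIT")
    if started is None or sigterm is None or sigquit is None:
        return None
    return (
        (sigterm - started).total_seconds(),
        (sigquit - sigterm).total_seconds(),
    )

if __name__ == "__main__":
    # Three entries copied from the StarterLog excerpt above.
    excerpt = [
        "03/27/18 19:42:57 (pid:32723) Create_Process succeeded, pid=32727",
        "03/27/18 19:43:34 (pid:32723) Got SIGTERM. Performing graceful shutdown.",
        "03/27/18 19:44:03 (pid:32723) Got SIGQUIT. Performing fast shutdown.",
    ]
    print(shutdown_timeline(excerpt))  # -> (37.0, 29.0)
```

For job 698148.0 this gives 37 seconds of run time before the SIGTERM and 29 seconds between the graceful and the fast shutdown.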
I am a newbie with HTCondor, so I'd appreciate any hint to help me
understand why jobs are exiting this way.
cheers,
Almudena
--
========================================================================
Almudena Montiel Gonzalez e-mail: almudena.montiel@xxxxxx
Dept. Theoretical Physics. Block 15.
Laboratory of High Energy Physics
Universidad Autonoma de Madrid.
Phone: 34 91 497 4541 Fax: 34 91 497 3936
James Watt 2, Cantoblanco, 28049 Madrid, Spain.
========================================================================