Hello,
I am trying to understand this behaviour: very often I find that jobs
exit with status 102. In our configuration we have set the following
variables so that jobs are neither preempted nor killed:
SUSPEND = FALSE
PREEMPT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
KILL = FALSE
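I assume the way to confirm the values actually in effect on a worker
node is something like this (please correct me if that is not the
right check):

  # run on the execute node: print the value each variable resolves to
  condor_config_val SUSPEND PREEMPT PREEMPTION_REQUIREMENTS KILL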
One example:
SchedLog
03/27/18 19:42:57 (pid:287435) Starting add_shadow_birthdate(698148.0)
03/27/18 19:42:57 (pid:287435) Started shadow for job 698148.0 on
slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033, (shadow pid = 2185496)
..
03/27/18 19:44:03 (pid:287435) Shadow pid 2185496 for job 698148.0
exited with status 102
03/27/18 19:44:03 (pid:287435) Checking consistency running and
runnable jobs
03/27/18 19:44:03 (pid:287435) Tables are consistent
03/27/18 19:44:03 (pid:287435) Rebuilt prioritized runnable job list
in 0.001s.
03/27/18 19:44:03 (pid:287435) match (slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033) out of jobs; relinquishing
03/27/18 19:44:03 (pid:287435) Match record (slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033, 698148.-1) deleted
03/27/18 19:44:03 (pid:287435) Completed RELEASE_CLAIM to startd
slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033
ShadowLog
03/27/18 19:42:57 (698148.0) (2185496): Request to run on
slot1_1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
was ACCEPTED
[..]
03/27/18 19:42:57 (698148.0) (2185496): File transfer completed
successfully.
[...]
03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal of
job.
[..]
03/27/18 19:44:03 (698148.0) (2185496): Job 698148.0 is being evicted
from slot1_1@xxxxxxxxxxxxxxxxxx
03/27/18 19:44:03 (698148.0) (2185496): **** condor_shadow
(condor_SHADOW) pid 2185496 EXITING WITH STATUS 102
And on the node running the job:
/var/log/condor/StarterLog.slot1_1
03/27/18 19:42:57 (pid:32723) Job 698148.0 set to execute immediately
03/27/18 19:42:57 (pid:32723) Starting a VANILLA universe job with ID:
698148.0
03/27/18 19:42:57 (pid:32723) IWD: /var/lib/condor/execute/dir_32723
03/27/18 19:42:57 (pid:32723) Output file:
/var/lib/condor/execute/dir_32723/_condor_stdout
03/27/18 19:42:57 (pid:32723) Error file:
/var/lib/condor/execute/dir_32723/_condor_stderr
03/27/18 19:42:57 (pid:32723) Renice expr "0" evaluated to 0
03/27/18 19:42:57 (pid:32723) Using wrapper /usr/sbin/mjf-job-wrapper
to exec /var/lib/condor/execute/dir_32723/condor_exec.exe
03/27/18 19:42:57 (pid:32723) Running job as user atlprod033
03/27/18 19:42:57 (pid:32723) Create_Process succeeded, pid=32727
03/27/18 19:43:34 (pid:32723) Got SIGTERM. Performing graceful shutdown.
03/27/18 19:43:34 (pid:32723) ShutdownGraceful all jobs.
03/27/18 19:44:03 (pid:32723) Got SIGQUIT. Performing fast shutdown.
03/27/18 19:44:03 (pid:32723) ShutdownFast all jobs.
03/27/18 19:44:03 (pid:32723) Process exited, pid=32727, signal=9
03/27/18 19:44:03 (pid:32723) Last process exited, now Starter is exiting
03/27/18 19:44:03 (pid:32723) **** condor_starter (condor_STARTER) pid
32723 EXITING WITH STATUS 0
I am a newbie with HTCondor, so I would appreciate any hint to help me
understand why jobs are exiting this way.
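In case it is relevant: would looking at the reason recorded in the
job history be the right next step? I was thinking of something like
this (using the job id from the example above):

  # show any remove/hold reason stored in the job's classad
  condor_history -l 698148.0 | grep -i -E 'RemoveReason|HoldReason'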
cheers,
Almudena