Hello,
I am trying to understand this behaviour: very often I find that jobs
exit with status 102. In our configuration we have set the following
variables so that jobs are neither preempted nor killed:
SUSPEND = FALSE
PREEMPT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
KILL = FALSE
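I assume the way to confirm the values actually in effect on a worker
node is something like this (please correct me if that is not the
right check):

  # run on the execute node: print the value each variable resolves to
  condor_config_val SUSPEND PREEMPT PREEMPTION_REQUIREMENTS KILL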
One example:
SchedLog
03/27/18 19:42:57 (pid:287435) Starting add_shadow_birthdate(698148.0)
03/27/18 19:42:57 (pid:287435) Started shadow for job 698148.0 on
slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033, (shadow pid = 2185496)
..
03/27/18 19:44:03 (pid:287435) Shadow pid 2185496 for job 698148.0
exited with status 102
03/27/18 19:44:03 (pid:287435) Checking consistency running and
runnable jobs
03/27/18 19:44:03 (pid:287435) Tables are consistent
03/27/18 19:44:03 (pid:287435) Rebuilt prioritized runnable job list
in 0.001s.
03/27/18 19:44:03 (pid:287435) match (slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033) out of jobs; relinquishing
03/27/18 19:44:03 (pid:287435) Match record (slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033, 698148.-1) deleted
03/27/18 19:44:03 (pid:287435) Completed RELEASE_CLAIM to startd
slot1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
for group_atlas.prod.atlprod033
ShadowLog
03/27/18 19:42:57 (698148.0) (2185496): Request to run on
slot1_1@xxxxxxxxxxxxxxxxxx
<150.244.247.188:9618?addrs=150.244.247.188-9618+[2001-720-420-c003--273]-9618&noUDP&sock=30074_27b5_3>
was ACCEPTED
[..]
03/27/18 19:42:57 (698148.0) (2185496): File transfer completed
successfully.
[...]
03/27/18 19:43:34 (698148.0) (2185496): Requesting graceful removal of
job.
[..]
03/27/18 19:44:03 (698148.0) (2185496): Job 698148.0 is being evicted
from slot1_1@xxxxxxxxxxxxxxxxxx
03/27/18 19:44:03 (698148.0) (2185496): **** condor_shadow
(condor_SHADOW) pid 2185496 EXITING WITH STATUS 102
And on the node running the job:
/var/log/condor/StarterLog.slot1_1
03/27/18 19:42:57 (pid:32723) Job 698148.0 set to execute immediately
03/27/18 19:42:57 (pid:32723) Starting a VANILLA universe job with ID:
698148.0
03/27/18 19:42:57 (pid:32723) IWD: /var/lib/condor/execute/dir_32723
03/27/18 19:42:57 (pid:32723) Output file:
/var/lib/condor/execute/dir_32723/_condor_stdout
03/27/18 19:42:57 (pid:32723) Error file:
/var/lib/condor/execute/dir_32723/_condor_stderr
03/27/18 19:42:57 (pid:32723) Renice expr "0" evaluated to 0
03/27/18 19:42:57 (pid:32723) Using wrapper /usr/sbin/mjf-job-wrapper
to exec /var/lib/condor/execute/dir_32723/condor_exec.exe
03/27/18 19:42:57 (pid:32723) Running job as user atlprod033
03/27/18 19:42:57 (pid:32723) Create_Process succeeded, pid=32727
03/27/18 19:43:34 (pid:32723) Got SIGTERM. Performing graceful shutdown.
03/27/18 19:43:34 (pid:32723) ShutdownGraceful all jobs.
03/27/18 19:44:03 (pid:32723) Got SIGQUIT. Performing fast shutdown.
03/27/18 19:44:03 (pid:32723) ShutdownFast all jobs.
03/27/18 19:44:03 (pid:32723) Process exited, pid=32727, signal=9
03/27/18 19:44:03 (pid:32723) Last process exited, now Starter is exiting
03/27/18 19:44:03 (pid:32723) **** condor_starter (condor_STARTER) pid
32723 EXITING WITH STATUS 0
I am a newbie with HTCondor, so I would appreciate any hint to help me
understand why jobs are exiting this way.
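In case it is relevant: would looking at the reason recorded in the
job history be the right next step? I was thinking of something like
this (using the job id from the example above):

  # show any remove/hold reason stored in the job's classad
  condor_history -l 698148.0 | grep -i -E 'RemoveReason|HoldReason'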
cheers,
Almudena