Sorry. I misunderstood what you were asking.   
 
Yes, a smaller value of JobLeaseDuration will allow a job to be rescheduled more quickly when an execute node disappears unexpectedly.
 
The SUBMIT_ATTRS config knob is only read by condor_submit, not by the Schedd, so it is never necessary to reconfigure any daemons to have changes to SUBMIT_ATTRS take effect - but changes to this knob will not affect jobs already in the Schedd.
 
HOWEVER - If the config knob JOB_DEFAULT_LEASE_DURATION is set, then its value will be used by condor_submit when the user does not specify job_lease_duration in their submit file. The default value for JOB_DEFAULT_LEASE_DURATION is 2400.
 
A value of JobLeaseDuration set by SUBMIT_ATTRS will be overridden by any value that the user specifies in their submit file, or by any value of the JOB_DEFAULT_LEASE_DURATION configuration knob. Both of these are read by condor_submit, so no reconfiguration is needed for changes to take effect.
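
The precedence described above can be sketched as a small Python model. This is only an illustration of the ordering stated in this message, not HTCondor code: the function name and parameter names are hypothetical, and the only number taken from the thread is the 2400-second default.

```python
from typing import Optional

# Default for JOB_DEFAULT_LEASE_DURATION as stated in this thread.
DEFAULT_JOB_DEFAULT_LEASE_DURATION = 2400


def effective_job_lease_duration(
    submit_file_value: Optional[int] = None,
    job_default_lease_duration: Optional[int] = DEFAULT_JOB_DEFAULT_LEASE_DURATION,
    submit_attrs_value: Optional[int] = None,
) -> Optional[int]:
    """Toy model of which JobLeaseDuration ends up in the job ad.

    Hypothetical helper: it encodes only the ordering described above
    (submit file, then JOB_DEFAULT_LEASE_DURATION, then SUBMIT_ATTRS).
    """
    # 1. A value the user puts in the submit file always wins.
    if submit_file_value is not None:
        return submit_file_value
    # 2. Otherwise the JOB_DEFAULT_LEASE_DURATION knob is used if it is set.
    if job_default_lease_duration is not None:
        return job_default_lease_duration
    # 3. Only then does a value supplied through SUBMIT_ATTRS survive.
    return submit_attrs_value


# The situation reported in this thread: SUBMIT_ATTRS supplies 300, but the
# knob is left at its default, so the job ad still shows 2400.
print(effective_job_lease_duration(submit_attrs_value=300))
# Lowering the knob itself changes the result for new submissions.
print(effective_job_lease_duration(job_default_lease_duration=300))
```

Under this model, a pool-wide shorter lease would come from lowering JOB_DEFAULT_LEASE_DURATION where condor_submit reads its configuration, rather than from SUBMIT_ATTRS.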
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jin Mao
Sent: Tuesday, May 14, 2019 9:14 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to lower job lease duration on worker failure?
 
The job was submitted after SUBMIT_ATTRS was added and the condor daemons were reconfigured. I am using 8.6.13. So, the questions are: 1) am I using SUBMIT_ATTRS and JobLeaseDuration correctly; 2) is adjusting the job lease duration the right solution to address frequent server resets?
 
 
Changing SUBMIT_ATTRS will affect jobs that are submitted after the change, but it will not affect jobs already in the Schedd.
Use condor_qedit to change jobs in the Schedd.
 
-tj
 
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jin Mao
Sent: Monday, May 13, 2019 10:23 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] How to lower job lease duration on worker failure?
 
Condor users,
I have a use case where my worker nodes may come and go ungracefully after jobs have been scheduled on them. I hope to learn the right way to move the jobs to other worker nodes as soon as possible.
 
In the example below, job_lease_duration was not defined, so it takes the default value of 2400 seconds. When a worker node suddenly leaves without a graceful shutdown, the job is not rescheduled on another spare worker until the job_lease_duration window ends.
 
05/13/19 10:46:58 (2.1) (59635): JobLeaseDuration remaining: 1756
 
05/13/19 10:46:58 (2.1) (59635): Scheduling another attempt to reconnect in 300 seconds
 
05/13/19 10:51:58 (2.1) (59635): Attempting to locate disconnected starter
 
05/13/19 10:52:18 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
 
05/13/19 10:52:18 (2.1) (59635): JobLeaseDuration remaining: 1436
 
05/13/19 10:52:18 (2.1) (59635): Scheduling another attempt to reconnect in 300 seconds
 
05/13/19 10:57:18 (2.1) (59635): Attempting to locate disconnected starter
 
05/13/19 10:57:38 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
 
 
05/13/19 11:13:38 (2.1) (59635): JobLeaseDuration remaining: 156
 
05/13/19 11:13:38 (2.1) (59635): Scheduling another attempt to reconnect in 156 seconds
 
05/13/19 11:16:14 (2.1) (59635): Attempting to locate disconnected starter
 
05/13/19 11:16:34 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
 
05/13/19 11:16:34 (2.1) (59635): JobLeaseDuration remaining: EXPIRED!
 
05/13/19 11:16:34 (2.1) (59635): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (2400 seconds) expired
 
05/13/19 11:16:34 (2.1) (59635): Exiting with JOB_SHOULD_REQUEUE
 
05/13/19 11:16:34 (2.1) (59635): **** condor_shadow (condor_SHADOW) pid 59635 EXITING WITH STATUS 107
 
05/13/19 11:16:34 (pid:3299) Shadow pid 59635 for job 2.1 exited with status 107
 
05/13/19 11:16:41 (pid:3299) Starting add_shadow_birthdate(2.1)
 
05/13/19 11:16:41 Initializing a VANILLA shadow for job 2.1
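
As a rough cross-check of the timestamps above, the arithmetic can be sketched in Python. The helper names are hypothetical, and the sketch assumes that the "JobLeaseDuration remaining" values count down to a fixed expiry time and that the lease clock started 2400 seconds before that expiry; neither assumption is confirmed by the log itself.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the log lines above (MM/DD/YY HH:MM:SS).
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def lease_expiry(log_timestamp: str, remaining_seconds: int) -> datetime:
    """Expiry time implied by a 'JobLeaseDuration remaining: N' log line."""
    logged_at = datetime.strptime(log_timestamp, LOG_TIME_FORMAT)
    return logged_at + timedelta(seconds=remaining_seconds)


def lease_start(expiry: datetime, lease_seconds: int) -> datetime:
    """Assumed start of the lease window: expiry minus the lease length."""
    return expiry - timedelta(seconds=lease_seconds)


# Values copied from the first 'remaining' line in the log above.
expiry = lease_expiry("05/13/19 10:46:58", 1756)
start = lease_start(expiry, 2400)
# Hypothetical: where the expiry would fall with a 300-second lease.
shorter_expiry = start + timedelta(seconds=300)

print(expiry.time(), start.time(), shorter_expiry.time())
# prints: 11:16:14 10:36:14 10:41:14
```

The three "remaining" lines in the log (1756, 1436 and 156 seconds) all point to the same expiry of 11:16:14, and the requeue follows at 11:16:34 once the last connection attempt times out. Under the assumptions above, a 300-second lease would move that expiry about 35 minutes earlier.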
 
 
 
At the moment, the only thing I have seen help is job_lease_duration, but it requires users to update their submit files. I have tried SUBMIT_ATTRS with JobLeaseDuration as shown below. After reconfiguring/restarting various condor daemons, the job ClassAd still shows JobLeaseDuration=2400. I am wondering if there is a better way to let the schedd relocate jobs before JobLeaseDuration expires.
 
SUBMIT_ATTRS = JobLeaseDuration
JobLeaseDuration = 300
 
Thanks.
 
Jin.
 
 
 
 
 
 
 