Changing SUBMIT_ATTRS will affect jobs that are submitted after the change, but it will not affect jobs already in the Schedd. Use condor_qedit to change jobs in the Schedd.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of Jin Mao

Condor users,

I have a use case in which my worker nodes may come and go ungracefully after jobs have been scheduled on them. I would like to learn the right way to move those jobs to other worker nodes as soon as possible. In the example below, job_lease_duration was not defined, so it takes its default value of 2400 seconds. When a worker node suddenly leaves without a graceful shutdown, the job is not rescheduled on another spare worker until the end of the job_lease_duration
window.

05/13/19 10:46:58 (2.1) (59635): JobLeaseDuration remaining: 1756
05/13/19 10:46:58 (2.1) (59635): Scheduling another attempt to reconnect in 300 seconds
05/13/19 10:51:58 (2.1) (59635): Attempting to locate disconnected starter
05/13/19 10:52:18 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
05/13/19 10:52:18 (2.1) (59635): locateStarter(): Failed to connect to startd <10.143.128.35:9618?addrs=10.143.128.35-9618&noUDP&sock=3439_c2b1_3>
05/13/19 10:52:18 (2.1) (59635): JobLeaseDuration remaining: 1436
05/13/19 10:52:18 (2.1) (59635): Scheduling another attempt to reconnect in 300 seconds
05/13/19 10:57:18 (2.1) (59635): Attempting to locate disconnected starter
05/13/19 10:57:38 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
05/13/19 10:57:38 (2.1) (59635): locateStarter(): Failed to connect to startd <10.143.128.35:9618?addrs=10.143.128.35-9618&noUDP&sock=3439_c2b1_3>
...
05/13/19 11:13:38 (2.1) (59635): JobLeaseDuration remaining: 156
05/13/19 11:13:38 (2.1) (59635): Scheduling another attempt to reconnect in 156 seconds
05/13/19 11:16:14 (2.1) (59635): Attempting to locate disconnected starter
05/13/19 11:16:34 (2.1) (59635): attempt to connect to <10.143.128.35:9618> failed: timed out after 20 seconds.
05/13/19 11:16:34 (2.1) (59635): locateStarter(): Failed to connect to startd <10.143.128.35:9618?addrs=10.143.128.35-9618&noUDP&sock=3439_c2b1_3>
05/13/19 11:16:34 (2.1) (59635): JobLeaseDuration remaining: EXPIRED!
05/13/19 11:16:34 (2.1) (59635): Reconnect FAILED: Job disconnected too long: JobLeaseDuration (2400 seconds) expired
05/13/19 11:16:34 (2.1) (59635): Exiting with JOB_SHOULD_REQUEUE
05/13/19 11:16:34 (2.1) (59635): **** condor_shadow (condor_SHADOW) pid 59635 EXITING WITH STATUS 107
05/13/19 11:16:34 (pid:3299) Shadow pid 59635 for job 2.1 exited with status 107
05/13/19 11:16:34 (pid:3299) Match record (slot1@xxxxxxxxxxxxxxxxxxx <10.143.128.35:9618?addrs=10.143.128.35-9618&noUDP&sock=3439_c2b1_3> for jin, 2.1) deleted
05/13/19 11:16:41 (pid:3299) Starting add_shadow_birthdate(2.1)
05/13/19 11:16:41 (pid:3299) Started shadow for job 2.1 on slot1@xxxxxxxxxxxxxxxxxxx <10.143.128.46:9618?addrs=10.143.128.46-9618&noUDP&sock=3377_8c39_3> for jin, (shadow pid = 65855)
05/13/19 11:16:41 Initializing a VANILLA shadow for job 2.1
05/13/19 11:16:41 (2.1) (65855): Request to run on slot1_1@xxxxxxxxxxxxxxxxxxx <10.143.128.46:9618?addrs=10.143.128.46-9618&noUDP&sock=3377_8c39_3> was ACCEPTED

So far, the only thing I have found that helps is job_lease_duration, but it requires users to update their submit files. I have tried SUBMIT_ATTRS with JobLeaseDuration as shown below. After reconfiguring and restarting the various condor daemons,
the job ClassAd still shows JobLeaseDuration=2400. I am wondering whether there is a better way to let the schedd relocate jobs before JobLeaseDuration expires.

SUBMIT_ATTRS = JobLeaseDuration
JobLeaseDuration = 300
Thanks.
Jin.
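
The timing in the shadow log above can be reproduced with a little arithmetic, which also shows why a shorter lease would let the job requeue sooner. The following is only a sketch under stated assumptions: the function name reconnect_schedule and its parameters are hypothetical, and the model (wait the smaller of 300 seconds and the remaining lease, then lose 20 seconds to the connect timeout) is inferred from the log lines rather than taken from the HTCondor source.

```python
def reconnect_schedule(lease_remaining, retry_interval=300, connect_timeout=20):
    """Model the shadow's reconnect attempts as they appear in the log.

    Assumption (inferred from the log, not from HTCondor code): each round
    waits min(retry_interval, remaining lease), then spends connect_timeout
    seconds on a connection attempt that fails.

    Returns (list of "JobLeaseDuration remaining" values at each scheduling
    point, total seconds elapsed until the shadow gives up).
    """
    logged = []
    elapsed = 0
    remaining = lease_remaining
    while remaining > 0:
        logged.append(remaining)
        wait = min(retry_interval, remaining)
        spent = wait + connect_timeout
        elapsed += spent
        remaining -= spent
    return logged, elapsed


if __name__ == "__main__":
    # Start from the first "remaining" value in the log (10:46:58).
    values, seconds = reconnect_schedule(1756)
    print(values)   # lease remaining at each scheduling point
    print(seconds)  # seconds from 10:46:58 until the shadow exits
```

Starting from the 1756 seconds remaining at 10:46:58, this model passes through 1436 and 156 as the log does and gives up 1776 seconds later, which matches the 11:16:34 exit. Under the same assumptions, a job whose JobLeaseDuration is 300 would be given up on within roughly 320 seconds of the lease starting.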