Thanks, Jaime.
I just remembered that I have configured this before. I have a job transform that sets JobLeaseDuration = 1800.
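For anyone finding this thread later, a transform of that kind might look roughly like the sketch below, in the schedd configuration on the submit machine. The transform name SetLeaseDuration is hypothetical, and the native transform syntax is an assumption about what the HTCondor version in use supports.

```
# Hypothetical transform name; add it to whatever transforms already exist.
JOB_TRANSFORM_NAMES = $(JOB_TRANSFORM_NAMES) SetLeaseDuration

# Set JobLeaseDuration to 1800 seconds (30 minutes) on submitted jobs.
JOB_TRANSFORM_SetLeaseDuration @=end
   SET JobLeaseDuration 1800
@end
```

A transform like this overrides the value in the job ad regardless of what the user's submit file requested.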
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Wednesday, June 29, 2022, 23:48
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Dealing with lost submitters.
One option is to set JOB_DEFAULT_LEASE_DURATION in the configuration files on the submitting machines. The default is 2400 seconds (40 minutes). This controls how long the submitter and executor will attempt to reconnect before aborting a job execution. The downside to lowering this value is that you risk killing jobs in situations where an interruption is temporary, for example when upgrading HTCondor or rebooting the submit machine.
- Jaime
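As a minimal sketch of the suggestion above, the setting could go in a local configuration file on the submit machine. The value 1200 is only an illustrative choice, not a recommendation, and as far as I understand it only affects jobs submitted after the change.

```
# Illustrative value: 1200 seconds (20 minutes) instead of the
# 2400-second default. Lower values free execute slots sooner but
# leave less time to recover from a temporary outage.
JOB_DEFAULT_LEASE_DURATION = 1200
```

Individual users can also request a lease for a single job with the job_lease_duration command in the submit file.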
Hi all.
Sometimes the submitting machine runs out of resources, for example disk space. The condor service is then stopped, and the jobs on the execute side wait for it to come back.
So in this situation, resources are wasted just waiting.
Usually I handle this manually by evicting that user's jobs.
How can I deal with this automatically?
Many thanks
David