Re: [HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting
- Date: Tue, 15 Aug 2017 09:03:24 -0500
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Parallel Scheduling - Handling of claims when jobs are on hold or are removed before starting
On 08/13/2017 05:06 AM, Felix Wolfheimer wrote:
> I recently noticed the following behavior when using the parallel
> universe. A job is submitted in the parallel universe and starts
> claiming resources but has not yet started up, e.g., the job requests
> 5 machines/slots but only 4 are free and get claimed, and the parallel
> job waits until a fifth slot becomes available. If the job is then
> removed from the queue or put on hold (condor_rm, condor_hold), the
> claims on the four machines/slots remain indefinitely (in my case I
> waited several hours and the claims were still there, blocking
> resources for the non-existent job). The only way to get rid of them
> was to send a condor_reconfig command to the affected startds.
Thank you for your very descriptive bug report. We've now fixed this in
8.6, but not in time to make the upcoming release. As you point out,
the workarounds are to send condor_reconfig to the affected startds, or
to run very short parallel jobs to consume the claimed slots (perhaps
even a one-core job).
-greg
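
For anyone hitting this before the fix ships, the reconfig workaround discussed above could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names (hosts_from_slots, build_reconfig_commands, run_reconfig) and the example host names are hypothetical, the list of stuck slots is assumed to have been collected by hand, and the `-name` / `-daemon startd` options should be checked against the condor_reconfig manual page for your version. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import Iterable, List


def hosts_from_slots(slot_names: Iterable[str]) -> List[str]:
    """Reduce slot names such as 'slot1@node01' to unique host names, in order."""
    hosts: List[str] = []
    for name in slot_names:
        # Slot names look like 'slotN@host'; a bare host name is kept as-is.
        host = name.split("@", 1)[-1].strip()
        if host and host not in hosts:
            hosts.append(host)
    return hosts


def build_reconfig_commands(slot_names: Iterable[str]) -> List[List[str]]:
    """Build one condor_reconfig command line per affected host (assumed options)."""
    return [
        ["condor_reconfig", "-name", host, "-daemon", "startd"]
        for host in hosts_from_slots(slot_names)
    ]


def run_reconfig(slot_names: Iterable[str], dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them one by one."""
    for cmd in build_reconfig_commands(slot_names):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if condor_reconfig fails.
            subprocess.run(cmd, check=True)
```

Calling run_reconfig(["slot1@node01.example.com", "slot2@node01.example.com"]) would print a single condor_reconfig line for node01.example.com; pass dry_run=False only after confirming the printed commands look right for your pool.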