
[HTCondor-users] partitionable slot and jobs stuck in idle state



Hi all,

I have a problem where all jobs get stuck in the idle state when one job
causes a partitionable slot to be divided but cannot run because the cluster
lacks the total amount of resources it needs. The resulting dynamic slots are
claimed but idle, and no other job can run unless these slots match its
requirements.


Simplified example:

I have two partitionable slots with 100 cpus each. 

First I run a parallel job that consists of 5 jobs, each requiring 50 cpus.
HTCondor divides the partitionable slots into 4 dynamic slots with 50 cpus
each, but that is not enough for the job to run (it needs 5 x 50 = 250 cpus
and only 200 exist). The job stays in the idle state forever, and these slots
remain claimed but idle.

If I then run another parallel job that consists of 3 jobs, each requiring
60 cpus, it is also unable to run, even though the total amount of resources
is larger than it needs and no job is running.
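
The cpu arithmetic of this example can be sketched in Python. This is only a
toy model of the counts described above, not of HTCondor's actual matchmaking;
the function name `carve` and its greedy, slot-by-slot behaviour are
assumptions made purely for illustration.

```python
def carve(free_cpus, nodes_wanted, cpus_per_node):
    """Greedily carve dynamic slots out of partitionable slots (toy model).

    free_cpus: free cpu count of each partitionable slot.
    Returns (dynamic slots created, remaining free cpus per slot).
    """
    remaining = list(free_cpus)
    created = 0
    for i in range(len(remaining)):
        # Keep splitting this partitionable slot while it still has room.
        while created < nodes_wanted and remaining[i] >= cpus_per_node:
            remaining[i] -= cpus_per_node
            created += 1
    return created, remaining


pool = [100, 100]  # two partitionable slots with 100 cpus each

# First parallel job: 5 x 50 cpus = 250 cpus, more than the pool holds.
got1, pool_after_1 = carve(pool, 5, 50)
print(got1, pool_after_1)  # 4 [0, 0]

# Second parallel job: 3 x 60 cpus, submitted while the first job's
# 4 dynamic slots are still claimed, so it only sees the leftover cpus.
got2, pool_after_2 = carve(pool_after_1, 3, 60)
print(got2, pool_after_2)  # 0 [0, 0]
```

In this model the first job obtains only 4 of the 5 slots it needs and leaves
no free cpus, so the second job cannot obtain a single slot for as long as
those claims are held.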

Is there a solution to this problem? Maybe there is a way to put such a job
(the first one) on hold.


One more question, related to partitionable slots and claims: even if I set
CLAIM_WORKLIFE=0, dynamic slots continue to exist for some time (minutes)
after they have run a job, and they stay in the claimed state. Is this
behaviour correct?


Best regards,
Stanislav Markevich