Re: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit
- Date: Thu, 31 May 2018 11:49:43 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit
On 5/30/2018 2:04 PM, Vaurynovich, Siarhei wrote:
Hello,
*Please, let me know if there is a way to force HTCondor matchmaker to
consider a job cluster for scheduling.*
The command "condor_reschedule", issued on the submit host (i.e. where
the schedd is running), will do that. However, by default, this should
happen automatically every few minutes.
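As a rough sketch of where "every few minutes" comes from: the timing is governed by configuration knobs such as SCHEDD_INTERVAL on the submit host and NEGOTIATOR_INTERVAL on the central manager. The values below are assumed to be the stock defaults and are shown only for illustration; the settings actually in effect on a given pool can be checked with condor_config_val.

```
# Submit host (assumed default): maximum seconds between the schedd's
# periodic updates and job evaluations
SCHEDD_INTERVAL = 300
# Central manager (assumed default): seconds between negotiation cycles
NEGOTIATOR_INTERVAL = 60
```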
My jobs often sit unscheduled in the queue for many hours (indefinitely)
if I use condor_qedit to adjust job requirements.
To make sure jobs have enough RAM to run, I sometimes restrict the allowed
SlotID range in requirements. There is probably a better way to do it,
e.g. by somehow declaring RAM as a shared resource with a certain number
of units available, but for now this is my quick hack. Setting ImageSize
does not work since my jobs are almost always bigger than the per-slot
RAM, so if I gave a realistic job size, my jobs would never start.
Creating specialized slots is also a bad idea since my jobs vary
strongly in size.
The above sounds like pretty strange usage. As you suspect, there are
better ways to do this. Assuming you are using a current version of
HTCondor (i.e. HTCondor v8.6 or above), instead of configuring your
nodes to partition resources like memory into statically sized slots,
you could configure your nodes to use dynamic (partitionable) slots.
See the HTCondor Manual section "Dynamic Provisioning: Partitionable and
Dynamic Slots" at URL http://tinyurl.com/y83a9ufo. Once you have set up
your execute nodes to use a partitionable slot as described, your
condor_submit file can look like:
executable = foo
# This job only needs one CPU core in the execute slot
request_cpus = 1
# This job needs 3.5 GB of RAM in the execute slot
request_memory = 3500
queue
and the execute node (startd) will carve off a new slot with 3.5GB of
memory for this job. No messing around with ImageSize required.
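For reference, a minimal sketch of the execute-node configuration this assumes, following the pattern in the manual section cited above: a single partitionable slot that owns all of the machine's resources. Using one slot type that covers 100% of the machine is an assumption made for illustration; a real pool may want to split resources differently.

```
# Local configuration on each execute node (sketch, not a tuned setup)
# Define exactly one slot of type 1 ...
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
# ... that owns all CPUs, memory, and disk on the machine ...
SLOT_TYPE_1 = 100%
# ... and let the startd carve dynamic slots out of it per job request
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Changes to slot types generally take effect only after the startd on the execute node is restarted.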
The problem is that after such an adjustment, my jobs often stop being
scheduled for running; they sit in the queue indefinitely
and ‘condor_q -better-analyze clusterID’ gives “Job has not yet been
considered by the matchmaker.” while claiming that there are slots
“available to run your job”. If I do not use condor_qedit, jobs run
fine. If I kill the same jobs and then submit them again with new
requirements, they also run fine.
This sounds pretty strange. Can you easily reproduce it? Does it
happen every time or only sometimes? What version of HTCondor are you
using, on what platform?
regards,
Todd