Re: [HTCondor-users] Efficiency & centralization of global information gathering?



I notice that the collector shows ConcurrencyLimits attributes in the machine ads for slots which are running ConcurrencyLimited jobs:

 

condor1$ condor_status -any -constraint '!isUndefined(ConcurrencyLimits)' -af MyType ConcurrencyLimits
Machine matlab_dce
Machine matlab_dce
Machine matlab_dce
Machine matlab_dce,testsys:2
Machine matlab_dce,testsys:2
condor1$
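
For keeping an eye on how many units of each limit the running slots hold, the output above can be tallied with a few lines of Python. This is only a sketch: tally_limits is a made-up helper name, and it assumes each line is "MyType" followed by a comma-separated limits string in which a bare name counts as 1 and "name:N" counts as the integer N.

```python
from collections import Counter


def tally_limits(status_output):
    """Sum limit units from `condor_status -af MyType ConcurrencyLimits` lines."""
    totals = Counter()
    for line in status_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue  # blank line, or an ad without a limits string
        for item in parts[1].split(","):
            name, _, count = item.strip().partition(":")
            if not name or name == "undefined":
                continue
            # Assumption: a bare name means one unit, "name:N" means N units.
            totals[name] += int(count) if count else 1
    return dict(totals)


sample = """Machine matlab_dce
Machine matlab_dce
Machine matlab_dce
Machine matlab_dce,testsys:2
Machine matlab_dce,testsys:2"""

print(tally_limits(sample))  # {'matlab_dce': 5, 'testsys': 4}
```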

 

So maybe the trick to get the negotiator to recognize while-running limit claims is to update the machine ad rather than the job ad? Or update both?

 

The job can't update the machine (startd) ad with condor_chirp once it's ready for the licensed step, so maybe the job could set a flag attribute in its own ad, such as "NeedMachineConcurrencyLimitsUpdate = True", along with the job ad's ConcurrencyLimits. A schedd_cron job on the Central Manager could monitor that Boolean and update the ConcurrencyLimits attribute in the machine ad for the job's slot, as identified by the GlobalJobId attribute, to match the ConcurrencyLimits string in the job ad. It would then set "NeedMachineConcurrencyLimitsUpdate = False", which the waiting job would notice before proceeding with the licensed step. (Only after the next negotiation cycle?)
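
The job-side half of that handshake might look roughly like the untested Python sketch below. It assumes condor_chirp is usable from inside the job (which normally means +WantIOProxy = True in the submit file) and that a change made to the flag on the schedd side becomes visible to the job through get_job_attr, which would need checking. request_limit, the poll interval and the poll count are illustrative names and values, not anything HTCondor provides.

```python
import subprocess
import time

FLAG = "NeedMachineConcurrencyLimitsUpdate"  # flag attribute proposed in this thread


def chirp(*args):
    """Run condor_chirp from inside the job and return its stdout."""
    result = subprocess.run(
        ["condor_chirp", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def request_limit(limits, run=chirp, poll_seconds=60, max_polls=30, sleep=time.sleep):
    """Publish the wanted limits, raise the flag, then wait for it to be cleared.

    Returns True once the flag reads False, or False if polling gives up.
    """
    # Attribute values are ClassAd expressions, so the string needs its own quotes.
    run("set_job_attr", "ConcurrencyLimits", '"%s"' % limits)
    run("set_job_attr", FLAG, "True")
    for _ in range(max_polls):
        if run("get_job_attr", FLAG).lower() == "false":
            return True
        sleep(poll_seconds)  # keep polling infrequent to avoid spamming queries
    return False
```

A job wrapper would call something like request_limit("matlab_dce") and start the licensed step only if it returns True. The run and sleep parameters exist so the logic can be exercised without a pool.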

 

Since the concurrency limit would already be set before the licensed step begins, the only risk would be a non-Condor job grabbing the license before the job noticed that its concurrency limit update request had been accepted (and only when a FlexLM feature is shared between Condor and non-Condor users). The window for this could be a bit long, since you wouldn't want the job to query NeedMachineConcurrencyLimitsUpdate too often while waiting, but you could check the license count with an lmstat just before starting the licensed step and wait/poll if it showed zero.
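
A rough Python sketch of that last-minute lmstat check follows. It assumes lmstat prints its usual "Users of <feature>: (Total of N licenses issued; Total of M licenses in use)" summary line. free_licenses and wait_for_license are made-up helper names, the server argument is whatever port@host or license file you would normally pass to -c, and the FlexLM feature name may not match the concurrency limit name.

```python
import re
import subprocess
import time

# Matches the per-feature summary line in lmstat output.
USAGE = re.compile(
    r"Users of (\S+):\s+\(Total of (\d+) licenses? issued;"
    r"\s+Total of (\d+) licenses? in use\)"
)


def free_licenses(lmstat_text, feature):
    """Return issued minus in-use for the feature, or None if it is not listed."""
    for name, issued, in_use in USAGE.findall(lmstat_text):
        if name == feature:
            return int(issued) - int(in_use)
    return None


def wait_for_license(feature, server, poll_seconds=30, max_polls=20):
    """Poll lmstat until at least one license for the feature is free."""
    for _ in range(max_polls):
        out = subprocess.run(
            ["lmstat", "-c", server, "-f", feature],
            capture_output=True, text=True,
        ).stdout
        if free_licenses(out, feature):  # None or 0 both mean "keep waiting"
            return True
        time.sleep(poll_seconds)
    return False
```

This only narrows the window: a non-Condor user can still check out the license between the lmstat call and the start of the licensed step.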

 

That said, it's probably easier to just drag the users into condor_dagman for most things, as you did. In continuous-integration build worker jobs, though, you either have to do the licensed step within the framework of the build worker, or have the worker sit in a slot doing nothing at all while it waits for a second job, which it submitted with the license concurrency limit, to finish running the licensed step.

 

                -Michael Pelletier.

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Edward Labao
Sent: Tuesday, January 10, 2017 12:07 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

 

Hi Michael!

We ran into the same issue a few years ago with user jobs tying up a particularly scarce license for hours before they were actually used. We tested the exact same thing you're thinking of by just running a condor_qedit on a long-running job to update its concurrency limit attribute, but it didn't look like the negotiator ever gets an update of the concurrency limit.

In the end, we had to ask our users to rework their submissions so that the process that actually required licenses was split off into its own job, which is probably for the best in the long run. This wasn't too bad because we use condor_dagman to implement job dependencies, so it's easy to chain jobs together.

Cheers!