
Re: [HTCondor-users] reached max workers



On 2/6/2018 12:01 PM, Michael Pelletier wrote:
> Hm, perhaps this is something different then. The query workers are used for condor_q and condor_status, so you shouldn't see them invoked during a condor_submit as I understand it.
> 

Michael is correct ... as usual! :)

The config knobs COLLECTOR_QUERY_WORKERS and SCHEDD_QUERY_WORKERS just control the maximum number of times the collector and the schedd, respectively, are allowed to fork when servicing a condor_status or condor_q query. I suggest leaving them alone at their defaults.

> When submitting jobs from python I some times get an error connecting to schedd. 

What is the error you see in Python?

The most common reason I have seen for errors connecting to a schedd from Python is a Python process attempting to open multiple connections to the same schedd (perhaps from multiple threads). Be aware that, at least for now, each Python process may only have one schedd transaction open at a time. If you attempt to open a second schedd connection from the same process, it will fail. If you attempt to open a second connection to the schedd from a different process, that second process will wait until the first process closes its connection. As such, it is a good idea for your Python program to do minimal processing while the schedd connection is open, so that the connection can be closed as soon as possible.
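As a sketch of that pattern: do all the expensive preparation before the transaction, then keep the transaction window as short as possible. The context manager below is a stand-in for the real htcondor.Schedd().transaction() call (names from memory; check the bindings' docs for your version), so the sketch runs without HTCondor installed:

```python
from contextlib import contextmanager

@contextmanager
def schedd_transaction():
    # Stand-in for htcondor.Schedd().transaction(). Only one may be
    # open per process, and other processes block while it is open.
    yield object()

def build_submit_description(i):
    # Expensive work (reading files, building submit descriptions)
    # belongs OUTSIDE the transaction.
    return {"executable": "/bin/sleep", "arguments": str(i)}

# 1. Prepare everything first...
descriptions = [build_submit_description(i) for i in range(10)]

# 2. ...then open the transaction, queue quickly, and let it close.
with schedd_transaction() as txn:
    for desc in descriptions:
        pass  # in real code: htcondor.Submit(desc).queue(txn)
    queued = len(descriptions)

print(queued)  # 10
```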

Another possibility is that your schedd is overloaded for some reason. Each schedd ClassAd has an attribute "RecentDaemonCoreDutyCycle" that serves as a load metric of sorts. If this value is greater than 0.98 (98%), that could be the problem. To view this value for all the submit machines in your pool, run

 condor_status -schedd -af name recentdaemoncoredutycycle

> Looking in the log I see this:
> 
> ForkWork: not forking because reached max workers 8

Apparently your schedd is receiving a lot of simultaneous queries (condor_q or similar calls), and/or a lot of very large queries.

Note that querying a schedd that has a lot (many thousands) of submitted jobs and asking for all of the attributes is expensive. If you only need a few job ClassAd attributes, such as the owner, do this:

   condor_q -all-users -af owner

instead of
  
   condor_q -all-users -l | grep -i owner
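The same principle applies from the Python bindings: project only the attributes you need instead of pulling whole job ads (the real call is something like schedd.xquery(projection=["Owner"]) - name from memory, so verify against your bindings' docs). A self-contained sketch of what projection buys you, using illustrative job ads in place of a live schedd:

```python
# Hypothetical sample job ads standing in for a schedd's response;
# the attribute names and values are illustrative.
job_ads = [
    {"Owner": "alice", "Cmd": "/bin/sleep", "RequestMemory": 1024},
    {"Owner": "bob",   "Cmd": "/bin/true",  "RequestMemory": 2048},
]

def query(ads, projection):
    # Return only the requested attributes, the way condor_q -af or
    # a projection argument does, instead of every attribute.
    return [{k: ad[k] for k in projection if k in ad} for ad in ads]

owners = query(job_ads, ["Owner"])
print(owners)  # [{'Owner': 'alice'}, {'Owner': 'bob'}]
```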

> 
> Thanks. The doc says default for SCHEDD_QUERY_WORKERS is 3, but I am not setting it and I get the error that the max is 8.
> 

Are you looking at the version of the manual that corresponds to the version of HTCondor you are running? Be warned that Google searches often end up pointing at ancient versions of the manual. Current versions of the manual have the correct value; it looks like the documentation on this value was updated 3+ years ago - see

http://research.cs.wisc.edu/htcondor/manual/current/3_5Configuration_Macros.html#25800

Also, I suggest always checking values with condor_config_val -v, like so:

% condor_config_val -v schedd_query_workers
SCHEDD_QUERY_WORKERS = 8
 # at: <Default>
 # raw: SCHEDD_QUERY_WORKERS = 8

> In any case, if I have 264 compute nodes would I set that (and
> COLLECTOR_QUERY_WORKERS) to 264 so I could use them all simultaneously?
> 

Nope. Again, I suggest you remove all references to these knobs from your configuration and run condor_reconfig.

Hope the above helps,

regards,
Todd