I usually look to see if the corresponding job has exited the queue
already. If so, there's no harm in killing it.
Even if the job hasn't exited, Condor-G will restart another
jobmanager when it needs to.
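As a rough sketch (the job ID and PID below are made up for illustration), the check is just whether the job is still in the queue before killing its jobmanager:

    condor_q 1234.0    # see whether job 1234.0 is still listed in the queue
    kill 5678          # if it has left the queue, kill the corresponding jobmanager process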
How do you find the corresponding job? I didn't see anything obvious in
the condor_q -l output that would indicate which job they are attached to. And in
many cases, there are more g-j-m jobs than user jobs.
Would it be better/equivalent to condor_rm some of the g-j-m jobs (which
are easily identified by their command: data --dest-url=http://...)?
What effect will condor_rm'ing these jobs have for the user?
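For what it's worth, a rough sketch of what I have in mind, assuming those jobs carry the GridMonitorJob attribute that shows up in the START_LOCAL_UNIVERSE expression further down:

    # list the cluster.proc of every queued grid monitor job
    condor_q -constraint 'GridMonitorJob =?= TRUE' -format "%d." ClusterId -format "%d\n" ProcId
    # remove them all
    condor_rm -constraint 'GridMonitorJob =?= TRUE'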
These are grid monitor jobs. They should never, under any circumstances,
last more than one hour. If they do, then something is really wrong.
Cancelling them will have no effect on whether the user's jobs execute
or not, just on what is reported to their Condor-G client.
Then I think I want to be careful about killing them, as accurate
reporting is important for us.
Some users are correctly setting
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE to limit the number of g-j-m
processes. Is there an equivalent setting that I can use on the
gatekeeper to limit the number of g-j-m processes launched by any given
user?
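For reference, the client-side knob is just a line in the submitter's Condor-G configuration, something like the following (the value of 10 is only illustrative):

    GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10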
Condor-G should be doing the right thing even if that setting isn't
being used, and only running one per user at a time.
You can use the START_LOCAL_UNIVERSE setting
if you are using the managedfork job manager, which you must be,
given what you are saying here. That controls how many local universe
jobs can start at once.
I don't see how the OSG-recommended value for START_LOCAL_UNIVERSE will
limit the number of grid monitor jobs:
START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob == TRUE
Is there some other value that should be here to limit it to one per user?
But if there are that many grid monitor jobs getting hung,
then there's some bad issue on the client machine that is sending them to
you. Those jobs don't hang on their own; it's usually something like
iptables or a disk quota problem.
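A few generic checks someone could run on the submitting host to narrow that down (nothing Condor-specific, just the usual suspects mentioned above):

    iptables -L -n    # firewall rules that might be blocking the monitor's connections
    quota -s          # whether the user has hit a disk quota
    df -h             # or simply run out of space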
If this is true, then that is really bad because it means that issues on
the client side can quite easily take down our gatekeeper. But I
suspect that there is still some configuration problem on our
gatekeeper, because we see these extra g-j-m processes and grid monitor
jobs regardless of the user. The problem is, I still don't understand
the operation of the grid monitor enough to diagnose this, and haven't
found any decent documentation describing it in detail. It's quickly
getting frustrating. :(