Steven Timm wrote:
> On Wed, 19 Jul 2006, Michael Thomas wrote:
>
>> Once again I started seeing high loads on my gatekeeper due to a large
>> number of globus-job-manager processes.
>>
>> I started to kill some of the older (> 1 day) g-j-m processes and saw
>> an immediate reduction in the system load, as I had expected.
>
> Oftentimes you just need to find the right one that is hung, and then
> all the rest of them will clear out on their own. This is especially
> true when you do ps auxwww and see some in state D, waiting on nfs I/O.

Finding the one or two that are hung is not that easy when it appears
that most of them are hung. pstree doesn't show any tree output, just a
flat list of unrelated globus-job-manager processes. Even if I manage to
kill 50% of the supposedly hung g-j-m processes, the rest aren't able to
clear out on their own because there are so darned many across all
users, and more g-j-m processes keep coming back. (There's a sketch of
how I've been hunting for the D-state ones below.)

>> My question: Is it ok to start arbitrarily killing some of these
>> g-j-m processes? What effect will it have on the corresponding jobs?
>
> I usually look to see if the corresponding job has exited the queue
> already. If so, there's no harm in killing it. Even if the job hasn't
> exited, condor-g will start another jobmanager when it needs to.

How do you find the corresponding job? I didn't see anything obvious in
condor_q -l that would indicate which job they are attached to. And in
many cases, there are more g-j-m processes than user jobs.

>> Would it be better/equivalent to condor_rm some of the g-j-m jobs
>> (which are easily identified by their command: data
>> --dest-url=http://...)? What effect will condor_rm'ing these jobs
>> have for the user?
>
> These are grid monitor jobs. They should never, under any
> circumstance, last more than one hour. If they do, something is really
> wrong.

Then something is really wrong.

> Cancelling them will have no effect on whether the user's jobs execute
> or not, just on what is reported to his condor-g client.

Then I think I want to be careful about killing them, as accurate
reporting is important for us. (See the condor_q/condor_rm sketch
below.)

>> Some users are correctly setting
>> GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE to limit the number of g-j-m
>> processes. Is there an equivalent setting that I can use on the
>> gatekeeper to limit the number of g-j-m processes launched by any
>> given user?
>
> Condor-G should be doing the right thing even if that setting isn't
> being used, and only running one per user at a time. You can use the
> setting START_LOCAL_UNIVERSE if you are using the managedfork job
> manager, which you must be, given what you are saying here. That
> controls how many can start simultaneously.

I don't see how the OSG-recommended value for START_LOCAL_UNIVERSE will
limit the number of grid monitor jobs:

    START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || GridMonitorJob == TRUE

Is there some other value that should be here to limit it to one per
user? (I've put an untested guess below.)

> But if there are that many grid monitor jobs getting hung, then
> there's some bad issue on the client machine that is sending them to
> you. Those jobs don't hang on their own: iptables or quota or
> something.

If this is true, then that is really bad, because it means that issues
on the client side can quite easily take down our gatekeeper. But I
suspect that there is still some configuration problem on our
gatekeeper, because we see these extra g-j-m processes and grid monitor
jobs regardless of the user.
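Since Steve suggested looking for state D, here is a rough sketch of
what I've been doing by hand so far. This assumes GNU ps/awk/xargs, and
relies on etime only containing a "-" once a process has been running
for a day or more:

    # Show globus-job-manager processes in uninterruptible sleep
    # (state D), which usually means they are stuck on NFS I/O:
    ps -eo pid,user,stat,etime,args | \
        awk '$5 ~ /globus-job-manager/ && $3 ~ /D/'

    # Kill the ones that have been running for more than a day
    # (etime looks like 1-02:03:04 once a process passes 24 hours):
    ps -eo pid,stat,etime,args | \
        awk '$4 ~ /globus-job-manager/ && $3 ~ /-/ {print $1}' | xargs -r kill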
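On identifying the grid monitor jobs: since the START_LOCAL_UNIVERSE
expression above tests a GridMonitorJob attribute, it looks like they
can be picked out of the queue by that attribute instead of grepping for
the "data --dest-url" command. Something like this, just a sketch on my
part (=?= only matches when the attribute is actually defined, and I'm
assuming the stock QDate/CurrentTime attributes are available):

    # List grid monitor jobs with their owners and submit times
    # (QDate prints as epoch seconds):
    condor_q -constraint 'GridMonitorJob =?= TRUE' \
        -format "%d." ClusterId -format "%-5d " ProcId \
        -format "%-12s " Owner -format "%d\n" QDate

    # If it turns out to be safe, remove only the ones that have been
    # in the queue longer than Steve's one-hour limit:
    condor_rm -constraint 'GridMonitorJob =?= TRUE && (CurrentTime - QDate) > 3600'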
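As for capping them: I couldn't find a per-user knob, but would
something like the following at least bound the total, by giving the
grid monitor jobs their own limit instead of letting them bypass the cap
entirely? This is an untested guess on my part, not anything from the
OSG docs:

    # Untested guess: cap grid monitor jobs at 40 instead of exempting
    # them from the TotalLocalJobsRunning limit altogether.
    START_LOCAL_UNIVERSE = TotalLocalJobsRunning < 20 || \
        (GridMonitorJob =?= TRUE && TotalLocalJobsRunning < 40)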
The problem is, I still don't understand the operation of the grid
monitor well enough to diagnose this, and I haven't found any decent
documentation describing it in detail. It's quickly getting
frustrating. :(

--Mike