Hi Steve,

strace shows that quite a number of the perl globus-job-manager scripts are hanging during a read operation. Both lsof and /proc confirm that each one is blocking while trying to read from a pipe (FIFO), but I am not sure how to figure out what's connected to the other end of the pipe.
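For what it's worth, one way to find the other end of a pipe on Linux is to match the pipe's inode across every process's /proc fd table. A rough sketch, where the backgrounded `cat` is just a demo stand-in for the hung perl script (substitute the real PID and the blocked fd number from strace):

```shell
# Demo: the backgrounded `cat` stands in for a process hung reading a pipe.
sleep 5 | cat &
READER=$!                                  # PID of cat; its pipe read end is fd 0
sleep 1                                    # let the pipeline start
# /proc/PID/fd/N is a symlink like "pipe:[98765]"; extract the inode digits.
INODE=$(readlink /proc/$READER/fd/0 | tr -dc '0-9')
# Every fd in /proc pointing at the same inode is attached to this pipe,
# so this lists both the reader and whatever holds the write end:
PEERS=$(for fd in /proc/[0-9]*/fd/*; do
  [ "$(readlink "$fd" 2>/dev/null)" = "pipe:[$INODE]" ] && echo "$fd"
done)
echo "$PEERS"
```

lsof can do the same match: for FIFOs it prints the pipe inode in its NODE column, so grepping lsof output for the inode number should show the peer process as well.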
In any case, I've started to kill these hanging perl scripts to see if it helps clear things up.
Streaming is disabled, and the grid monitor is enabled. Thanks for the tip,

--Mike

Steven Timm wrote:
There are two types of globus-job-manager processes used by the Globus protocol that Condor refers to as "gt2". The first is a jobmanager-condor script, which stays alive as long as the job is alive. The second is a globus-job-manager-script perl script, which runs once per minute and is forked off from the main jobmanager-condor. Run pstree; I think you will see that several of them are children of the first one.

I've seen this happen in cases where there is some problem with NFS and the globus-job-manager is stuck trying to delete or undelete a hard link across NFS. strace will tell you if this is the case. Often it is just one process that is hung waiting for some NFS file, and once you kill that process the rest of them will clear out.

As to why there is at least one globus-job-manager per Condor job, there are several possible reasons. Did you disable streaming? You have to for the grid monitor to work. Do you see jobmanager-forks trying to start from your client node to your head node? Those are needed to start up the grid monitor. Is ENABLE_GRID_MONITOR set to TRUE in your condor_config file? It needs to be. In the archives of this list there is a procedure for starting the grid monitor manually, to see if any problems are blocking the automatic Condor-G start.

Steve

------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team

On Thu, 29 Jun 2006, Michael Thomas wrote:

I forgot to mention that this is using Condor 6.7.18, and there are > 1300 jobs in the queue right now (all but 200 are idle).

--Mike

Michael Thomas wrote:

While doing some stress testing on our 200-node cluster using Condor-G, we have noticed some extremely large loads on the cluster.
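For reference, the checks Steve lists map onto configuration fragments along these lines (the macro and submit-command names come from the thread and from Condor's gt2 support; verify the exact spelling against the manual for your Condor version):

```
# condor_config on the Condor-G submit host:
ENABLE_GRID_MONITOR = TRUE

# per-job submit file: streaming must be disabled for the grid monitor
stream_output = False
stream_error  = False
```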
The large load seems to be caused by 500+ globus-job-manager processes, sometimes with 2 or 3 globus-job-manager processes for each job. condor_config contains the line:

GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10

...but that setting seems to be ignored. Why would we have multiple globus-job-managers for a single job, and what can we do to reduce the number of globus-job-manager processes so that our gatekeeper doesn't get quite so overloaded?

--Mike
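As a quick way to quantify the problem on the gatekeeper, here is a sketch that counts globus-job-manager processes per owning user (the name match is approximate, since the per-minute perl helper is named globus-job-manager-script):

```shell
# Count globus-job-manager processes per owning user.
# `ps -eo user=,args=` prints "user full-command-line" for each process;
# the !/awk/ clause keeps this pipeline from counting itself.
ps -eo user=,args= | awk '/globus-job-manager/ && !/awk/ { n[$1]++ }
                          END { for (u in n) print u, n[u] }'
```

Comparing the per-user counts against the number of running (not idle) jobs in condor_q should make the 2-3-managers-per-job ratio visible directly.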