Steven Timm wrote:

>>> I usually look to see if the corresponding job has exited the queue
>>> already.  If so, there's no harm in killing it.
>>> Even if the job hasn't exited, Condor-G will restart another
>>> jobmanager when it needs to.
>>
>> How do you find the corresponding job?  I didn't see anything obvious in
>> condor_q -l that would indicate which job they are attached to.  And in
>> many cases, there are more g-j-m jobs than user jobs.
>
> Look in /var/log/messages.  For every job there are three lines of gridinfo,
> including one that says
>
>   Jul 19 18:45:16 fngp-osg gridinfo[31672]: JMA 2006/07/19 18:45:16
>   GATEKEEPER_JM_ID 2006-07-19.18:45:12.0000031649.0000000000 has
>   GRAM_SCRIPT_JOB_ID 450589 manager type managedfork
>
> This ties the Condor job id of the local-universe managedfork job, 450589,
> to the process id of the globus-job-manager process, namely 31672.

That's really useful.  Thanks!  (See the P.S. below for a rough sketch of
pulling those mappings out of the log automatically.)

> Also look at /home/<userid>/gram_job_mgr_31672.log.
> That may give you some idea as to why the process isn't exiting.

I don't see anything obvious, other than the following segment getting
repeated every 15 seconds or so:

7/20 13:35:10 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
7/20 13:35:10 JMI: testing job manager scripts for type managedfork exist and permissions are ok.
7/20 13:35:10 JMI: completed script validation: job manager type is managedfork.
7/20 13:35:10 JMI: in globus_gram_job_manager_poll()
7/20 13:35:10 JMI: local stdout filename = /home/uscms66/.globus/job/cithep67.ultralight.org/11940.1153422970/stdout.
7/20 13:35:10 JMI: local stderr filename = /dev/null.
7/20 13:35:10 JMI: poll: seeking: https://cithep67.ultralight.org:42795/11940/1153422970/
7/20 13:35:10 JMI: poll_fast: ******** Failed to find https://cithep67.ultralight.org/11940/1153422970/
7/20 13:35:10 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/20 13:35:10 JMI: cmd = poll
7/20 13:35:10 JMI: returning with success
Thu Jul 20 13:35:14 2006 JM_SCRIPT: New Perl JobManager created.
Thu Jul 20 13:35:14 2006 JM_SCRIPT: Using jm supplied job dir: /home/uscms66/.globus/job/cithep67.ultralight.org/11940.1153422970
Thu Jul 20 13:35:14 2006 JM_SCRIPT: polling job 74427
7/20 13:35:14 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 1
7/20 13:35:14 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1

The 'Failed to find https...' line does look a little odd, since that URL
goes to our local Apache https server, not to anything associated with
Globus or the VDT.

>>> but if there are that many grid monitor jobs getting hung,
>>> then there's some bad issue on the client machine that is sending them
>>> to you.  Those jobs don't hang on their own.  iptables or quota
>>> or something.
>>
>> If this is true, then that is really bad, because it means that issues on
>> the client side can quite easily take down our gatekeeper.  But I
>> suspect that there is still some configuration problem on our
>> gatekeeper, because we see these extra g-j-m processes and grid monitor
>> jobs regardless of the user.  The problem is, I still don't understand
>> the operation of the grid monitor well enough to diagnose this, and
>> haven't found any decent documentation describing it in detail.  It's
>> quickly getting frustrating. :(
>
> The Condor manual has gotten better, but it's still not perfect.
> The grid_monitor.sh script is what gets submitted as the monitoring job.
> You can submit it manually.  In the archives of the OSG lists there are
> instructions on how to do that.
> But check your iptables.  It is often a culprit in problems like these.

I'm still looking, but I haven't seen anything obvious.

--Mike
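
P.S.  Here is a rough sketch of how those gridinfo lines in /var/log/messages
could be parsed to map each local Condor job id (GRAM_SCRIPT_JOB_ID) to its
globus-job-manager process id.  The log path, the regular expression, and the
script name are all assumptions based only on the example line quoted above,
so treat it as illustrative rather than something tested:

#!/usr/bin/env python
# find_jobmanagers.py (hypothetical name): map GRAM_SCRIPT_JOB_ID to the
# globus-job-manager PID by scanning the gridinfo syslog lines.
import re
import sys

LOGFILE = "/var/log/messages"   # adjust if your syslog goes elsewhere

# Matches lines like:
#   ... gridinfo[31672]: JMA ... has GRAM_SCRIPT_JOB_ID 450589 manager type managedfork
PATTERN = re.compile(
    r"gridinfo\[(?P<pid>\d+)\]:.*"
    r"GRAM_SCRIPT_JOB_ID\s+(?P<jobid>\d+)\s+manager type\s+(?P<mgr>\S+)")

def scan(path):
    mapping = {}
    for line in open(path):
        m = PATTERN.search(line)
        if m:
            # last entry wins if a job id shows up more than once
            mapping[m.group("jobid")] = (m.group("pid"), m.group("mgr"))
    return mapping

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOGFILE
    for jobid, (pid, mgr) in sorted(scan(path).items()):
        print("condor job %s -> globus-job-manager pid %s (%s)" % (jobid, pid, mgr))

Cross-checking the printed job ids against condor_q should then show which
globus-job-manager processes no longer correspond to a job in the queue and,
per the advice above, ought to be safe to kill.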