
Re: [Condor-users] killing globus-job-managers



Steven Timm wrote:
>>>I usually look to see if the corresponding job has exited the queue
>>>already.  If so, there's no harm in killing it.
>>>Even if the job hasn't exited, condor-g will restart another
>>>jobmanager when it needs to.
>>
>>How do you find the corresponding job?  I didn't see anything obvious in
>>condor_q -l that would indicate which job they are attached to.  And in
>>many cases, there are more g-j-m jobs than user jobs.
> 
> 
> Look in /var/log/messages.  For every job there are three lines of gridinfo,
> including one that says
> Jul 19 18:45:16 fngp-osg gridinfo[31672]: JMA 2006/07/19 18:45:16
> GATEKEEPER_JM_ID 2006-07-19.18:45:12.0000031649.0000000000 has
> GRAM_SCRIPT_JOB_ID 450589 manager type managedfork
> 
> This ties the condor job id of the managedfork local universe job, 450589,
> to the process id of the globus-job-manager process, namely 31672.

That's really useful.  Thanks!
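
For future reference, something like this (rough and untested, reusing the
example ids from your message) should pull those mappings out and let me
check whether the corresponding local-universe job is still in the queue:

  # all PID -> condor-id mappings that gridinfo has logged
  grep GRAM_SCRIPT_JOB_ID /var/log/messages

  # the mapping for one particular globus-job-manager PID, e.g. 31672
  grep 'gridinfo\[31672\]' /var/log/messages

  # is the managedfork (local universe) job still around?
  condor_q 450589
  condor_history 450589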

> Also look at /home/<userid>/gram_job_mgr_31672.log
> that may give you some idea as to why the process isn't exiting.

I don't see anything obvious, other than the following segments getting
repeated every 15 seconds or so:

7/20 13:35:10 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL2
7/20 13:35:10 JMI: testing job manager scripts for type managedfork exist and permissions are ok.
7/20 13:35:10 JMI: completed script validation: job manager type is managedfork.
7/20 13:35:10 JMI: in globus_gram_job_manager_poll()
7/20 13:35:10 JMI: local stdout filename = /home/uscms66/.globus/job/cithep67.ultralight.org/11940.1153422970/stdout.
7/20 13:35:10 JMI: local stderr filename = /dev/null.
7/20 13:35:10 JMI: poll: seeking: https://cithep67.ultralight.org:42795/11940/1153422970/
7/20 13:35:10 JMI: poll_fast: ******** Failed to find https://cithep67.ultralight.org/11940/1153422970/
7/20 13:35:10 JMI: poll_fast: returning -1 = GLOBUS_FAILURE (try Perl scripts)
7/20 13:35:10 JMI: cmd = poll
7/20 13:35:10 JMI: returning with success
Thu Jul 20 13:35:14 2006 JM_SCRIPT: New Perl JobManager created.
Thu Jul 20 13:35:14 2006 JM_SCRIPT: Using jm supplied job dir: /home/uscms66/.globus/job/cithep67.ultralight.org/11940.1153422970
Thu Jul 20 13:35:14 2006 JM_SCRIPT: polling job 74427
7/20 13:35:14 JMI: while return_buf = GRAM_SCRIPT_JOB_STATE = 1
7/20 13:35:14 Job Manager State Machine (entering): GLOBUS_GRAM_JOB_MANAGER_STATE_POLL1

The 'Failed to find https...' does look a little odd, since that URL goes
to our local apache https server, not to anything associated with globus or
the VDT.
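
(A quick way to see what is actually answering those URLs -- just a sketch;
the 42795 is only whatever shows up in the "poll: seeking" line above:

  # is anything still listening on the job manager's contact port?
  netstat -tlnp | grep 42795

  # the "Failed to find" URL has no port, so https defaults to 443,
  # which here is the local apache server
  netstat -tlnp | grep ':443 '
)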

>>>but if there are that many grid monitor jobs getting hung,
>>>then there's some bad issue on the client machine that is sending them to
>>>you.  Those jobs don't hang on their own.  iptables or quota
>>>or something.
>>
>>If this is true, then that is really bad because it means that issues on
>>the client side can quite easily take down our gatekeeper.  But I
>>suspect that there is still some configuration problem on our
>>gatekeeper, because we see these extra g-j-m processes and grid monitor
>>jobs regardless of the user.  The problem is, I still don't understand
>>the operation of the grid monitor enough to diagnose this, and haven't
>>found any decent documentation describing it in detail.  It's quickly
>>getting frustrating.  :(
>>
> 
> 
> The condor manual has gotten better but is still not perfect.
> The grid_monitor.sh script is what gets submitted as the monitoring job.
> You can submit it manually.  In the archives of the OSG lists there are
> instructions on how to do that.
> 
> But check your iptables.  It is often a culprit in problems like these.

I'm still looking, but I haven't seen anything obvious.
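
For the record, this is roughly what I'm looking at (just a sketch; the
port-range variables are whatever our VDT setup happens to define):

  # dump the current firewall rules
  iptables -L -n --line-numbers

  # and the port ranges globus has been told to use, so I can compare
  # them against what the firewall actually allows
  echo $GLOBUS_TCP_PORT_RANGE
  echo $GLOBUS_TCP_SOURCE_RANGE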

--Mike
