RE: [Condor-users] Resending: Solaris 10 - All jobs idling for ever...



> 	What do the ShadowLog (on the local machine)
> 
> I'm getting this:
> 
> bgoncal@lab1a> condor_config_val SHADOW_LOG
> /home/condor/hosts/lab1a/log/ShadowLog
> bgoncal@lab1a> more /home/condor/hosts/lab1a/log/ShadowLog
> /home/condor/hosts/lab1a/log/ShadowLog: No such file or directory
> bgoncal@lab1a>

Oh!  That changes things.  (More below)

> 9/17 06:55:46 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 9/17 06:55:46 vm1: match_info called
> 9/17 06:55:46 vm1: Received match <170.140.151.126:38590>#1126713130#988
> 9/17 06:55:46 vm1: State change: match notification protocol successful
> 9/17 06:55:46 vm1: Changing state: Unclaimed -> Matched
> 9/17 06:55:47 DaemonCore: Command received via UDP from host <170.140.151.110:35503>
> 9/17 06:55:47 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
> 9/17 06:55:47 vm2: match_info called
> 9/17 06:55:47 vm2: Received match <170.140.151.126:38590>#1126713130#989
> 9/17 06:55:47 vm2: State change: match notification protocol successful
> 9/17 06:55:47 vm2: Changing state: Unclaimed -> Matched
> 9/17 06:55:52 DaemonCore: Command received via UDP from host <170.140.151.110:35618>
> 9/17 06:55:52 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)
> 9/17 06:55:52 vm1: State change: received RELEASE_CLAIM command
> 9/17 06:55:52 vm1: Changing state: Matched -> Owner
> 9/17 06:55:52 vm1: State change: IS_OWNER is false
> 9/17 06:55:52 vm1: Changing state: Owner -> Unclaimed

This means that the schedd is being matched with the startd, but for
some reason the startd is getting a command (RELEASE_CLAIM) and letting
go of that match.  This is happening before the job even thinks of
starting.
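
If the startd log is long, a small script can pick out every place where a
match was dropped like this.  What follows is only a rough sketch: the sample
lines are copied from the excerpt quoted above, while the regular expression
and the function name released_matches are my own illustration, not anything
that ships with Condor.

```python
import re

# Sample lines copied from the startd log excerpt quoted in this thread.
SAMPLE_LOG = """\
9/17 06:55:46 vm1: Changing state: Unclaimed -> Matched
9/17 06:55:47 vm2: Changing state: Unclaimed -> Matched
9/17 06:55:52 vm1: State change: received RELEASE_CLAIM command
9/17 06:55:52 vm1: Changing state: Matched -> Owner
9/17 06:55:52 vm1: State change: IS_OWNER is false
9/17 06:55:52 vm1: Changing state: Owner -> Unclaimed
"""

# Assumed line layout, inferred only from the excerpt:
# "<month>/<day> <hh:mm:ss> vmN: Changing state: <old> -> <new>"
TRANSITION = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) (?P<vm>vm\d+): "
    r"Changing state: (?P<old>\w+) -> (?P<new>\w+)$"
)


def released_matches(log_text):
    """Return (timestamp, vm) pairs where a VM left Matched for anything but Claimed."""
    dropped = []
    for line in log_text.splitlines():
        # Tolerate mail-quoted lines by dropping any leading "> " first.
        m = TRANSITION.match(line.lstrip("> ").strip())
        if m and m["old"] == "Matched" and m["new"] != "Claimed":
            dropped.append((m["stamp"], m["vm"]))
    return dropped


if __name__ == "__main__":
    for stamp, vm in released_matches(SAMPLE_LOG):
        print(f"{stamp} {vm}: match released before it was claimed")
```

To run it against a real log, read the file into a string and pass that to
released_matches in place of SAMPLE_LOG.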

Can you give us more from your SchedLog?  (condor_config_val SCHEDD_LOG)
(On the submit machine.)

Can you provide us with an IP -> machine mapping for the above?  What
roles do 170.140.151.126 and 170.140.151.110 play?
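
To help build that mapping, something along these lines would list every
address that shows up in the log together with the ports it used.  Again just
a sketch: the sample lines come from the excerpt above, and the function name
ports_by_ip is made up for illustration.

```python
import re
from collections import defaultdict

# Lines copied from the startd log excerpt quoted in this thread.
SAMPLE_LOG = """\
9/17 06:55:46 vm1: Received match <170.140.151.126:38590>#1126713130#988
9/17 06:55:47 DaemonCore: Command received via UDP from host <170.140.151.110:35503>
9/17 06:55:47 vm2: Received match <170.140.151.126:38590>#1126713130#989
9/17 06:55:52 DaemonCore: Command received via UDP from host <170.140.151.110:35618>
"""

# Addresses appear in the excerpt as "<ip:port>".
ADDRESS = re.compile(r"<(\d+\.\d+\.\d+\.\d+):(\d+)>")


def ports_by_ip(log_text):
    """Map each IP address found in the log to the sorted list of ports seen with it."""
    seen = defaultdict(set)
    for ip, port in ADDRESS.findall(log_text):
        seen[ip].add(int(port))
    return {ip: sorted(ports) for ip, ports in seen.items()}


if __name__ == "__main__":
    for ip, ports in ports_by_ip(SAMPLE_LOG).items():
        print(ip, ports)
```

Passing each address to socket.gethostbyaddr would then give the host names,
assuming reverse DNS is set up for those machines.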

> All the daemons mentioned in ... are running:
> 
> I'm probably missing something obvious, but I have no idea what it
> might be... :(

It's not obvious yet. :-)

What interesting values have you modified in your condor_config file(s),
if any?

Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com