
Re: [Condor-users] Weird behaviour of condor



> I first submitted a job that started a lot of trouble in my pool...
> 
> --> Now I changed this file, ...
> 
> But Now, here is the BIG problem : it's impossible to have condor_q, or
> even submit any new job :
> guiot@chagall:$ condor_q
> 
> -- Failed to fetch ads from: <193.49.27.24:35171> : chagall.galaxy.ibpc.fr
> 
> guiot@chagall$
> 
> If I try to submit a job, it keeps telling "Submitting job(s)", but
> nothing happens.
> 
> I tried to restart condor on the submit machine, nothing happens...
> 
> Any idea to get me out of this s*** ?
> Some log files, if it can help (job 123 is THE job that started all the
> problems...):

What time did you try this, so that we can match it up with your schedd
logs?

> 10/24 17:17:27 (pid:3971) DaemonCore: Command received via TCP from host <193.49.27.24:59523>
> 10/24 17:17:27 (pid:3971) DaemonCore: received command 478 (ACT_ON_JOBS), calling handler (actOnJobs)
> 10/24 18:15:28 (pid:14639) ******************************************************
> 10/24 18:15:28 (pid:14639) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/24 18:15:28 (pid:14639) ** /ibpc/io/condor/sbin/condor_schedd
> 10/24 18:15:28 (pid:14639) ** $CondorVersion: 6.7.10 Aug  3 2005 $
> 10/24 18:15:28 (pid:14639) ** $CondorPlatform: I386-LINUX_RH9 $
> 10/24 18:15:28 (pid:14639) ** PID = 14639
> 10/24 18:15:28 (pid:14639) ******************************************************
> 10/24 18:15:28 (pid:14639) Using config file: /ibpc/io/condor/etc/condor_config
> 10/24 18:15:28 (pid:14639) Using local config files: /scratch/condor/condor_config.local
> 10/24 18:15:28 (pid:14639) DaemonCore: Command Socket at <193.49.27.24:59931>
> 10/25 11:15:55 (pid:2479) ******************************************************
> 10/25 11:15:55 (pid:2479) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/25 11:15:55 (pid:2479) ** /ibpc/io/condor/sbin/condor_schedd
> 10/25 11:15:55 (pid:2479) ** $CondorVersion: 6.7.10 Aug  3 2005 $
> 10/25 11:15:55 (pid:2479) ** $CondorPlatform: I386-LINUX_RH9 $
> 10/25 11:15:55 (pid:2479) ** PID = 2479
> 10/25 11:15:55 (pid:2479) ******************************************************
> 10/25 11:15:55 (pid:2479) Using config file: /ibpc/io/condor/etc/condor_config
> 10/25 11:15:55 (pid:2479) Using local config files: /scratch/condor/condor_config.local
> 10/25 11:15:55 (pid:2479) DaemonCore: Command Socket at <193.49.27.24:35171>

Your schedd is crashing, and you were probably running condor_q and
trying to submit while it was down.  (Hence the failures and timeouts.)
I find it very odd, though, that there is such a long gap between the
last log message before each crash and the next startup banner: 10/24
17:17:27 to 10/24 18:15:28, and then 10/24 18:15:28 to 10/25 11:15:55.
The condor_master should restart the schedd automatically when it dies
like this.  What does your MasterLog say?  What does 'ps -ef | grep
condor' show?
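
If it helps, the gaps I mentioned can be pulled out of the schedd log
with a small filter.  This is only a sketch: the function name
show_restarts is made up for illustration, and it assumes the log uses
the same banner format as the excerpt you posted.  For each startup
banner it prints the last ordinary log line seen before it, so the two
timestamps can be compared by eye.

```shell
# Hypothetical helper, not part of Condor: list each "STARTING UP"
# banner in a daemon log together with the last line logged before it.
show_restarts() {
    awk '
        /STARTING UP/ {
            if (last != "") print "before: " last
            print "start:  " $0
            next
        }
        # Skip the rows of asterisks that frame the banner.
        !/\*\*\*\*/ { last = $0 }
    ' "$1"
}

# Example use (assumes the schedd log is called SchedLog and lives in
# the directory reported by condor_config_val LOG):
#   show_restarts "$(condor_config_val LOG)/SchedLog"
```

The same filter can be pointed at the MasterLog to see whether the
master noticed the schedd exiting.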

As to why your schedd is crashing...I don't know.  If it crashes every
time upon startup, then what's happening is that the schedd is reading
its job_queue.log file to rebuild the job queue, hitting some error, and
dying. Since the job_queue.log file isn't changing, the schedd will die
every time.

The (somewhat drastic) solution is:
- Turn off condor 
- Move the job_queue.log file (it's in your spool directory) to another
file, like job_queue.log.crashing.  Note that you'll completely lose the
current state of the job queue.
- Start up condor.  
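
For the middle step, something like the following would do.  Treat it
as a sketch under assumptions: quarantine_job_queue is a name I made up,
and it only handles the rename; stopping and starting Condor is left to
whatever method your site normally uses.  It refuses to overwrite an
earlier saved copy, so the crashing log is not lost by accident.

```shell
# Hypothetical helper, not part of Condor: set aside a suspect
# job_queue.log.  Run it only while Condor is stopped on the submit
# machine.  Argument: the spool directory.
quarantine_job_queue() {
    spool_dir="$1"
    queue_log="$spool_dir/job_queue.log"

    if [ ! -f "$queue_log" ]; then
        echo "no job_queue.log in $spool_dir" >&2
        return 1
    fi

    # Do not clobber a copy saved by an earlier attempt.
    if [ -e "$queue_log.crashing" ]; then
        echo "$queue_log.crashing already exists" >&2
        return 1
    fi

    mv "$queue_log" "$queue_log.crashing"
}

# Example use (not run here):
#   1. stop Condor on the submit machine
#   2. quarantine_job_queue "$(condor_config_val SPOOL)"
#   3. start Condor again
```

condor_config_val SPOOL should print the spool directory if you are not
sure where it is.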

If it starts up as normal (creating a new job_queue.log file), you may
want to send the old log file to the condor team so that they can
reproduce your problem.

Best of luck,
Mike Yoder
Principal Member of Technical Staff
Ask Mike: http://docs.optena.com
Direct  : +1.408.321.9000
Fax     : +1.408.321.9030
Mobile  : +1.408.497.7597
yoderm@xxxxxxxxxx

Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
http://www.optena.com