
RE: [Condor-users] schedd problems?



Hi,
OK, it's working again (pardon the vagueness of that message, I still 
don't understand what went wrong).

From what I can tell, condor_master thinks it restarted the schedd several 
days ago on this submit machine, except that the schedd never really 
started (or at least it didn't do anything once it had).  All I did today 
was stop, wait (apparently not long enough for all processes to clean up), 
and start, several times over, and it finally worked.

Does anyone have any ideas on how I can figure out: A. why it stopped and 
B. why it started (after several restarts)?
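
One rough way I could start on both questions is to pull every startup 
banner out of SchedLog and see when the schedd really came up and how many 
jobs it marked IDLE each time.  Below is a minimal sketch in Python.  It 
assumes the log lines look exactly like the excerpt quoted further down 
(timestamp, "** condor_schedd (CONDOR_SCHEDD) STARTING UP", "** PID = N", 
"Marked job X as IDLE"); the function name find_startups is my own, not 
anything shipped with Condor.

```python
import re

# Patterns are modelled on the SchedLog excerpt in this thread; other
# Condor versions or debug levels may format these lines differently.
BANNER = re.compile(r"^(\d+/\d+ \d+:\d+:\d+) \*\* (\S+) \(\S+\) STARTING UP")
PID = re.compile(r"^\d+/\d+ \d+:\d+:\d+ \*\* PID = (\d+)")
IDLE = re.compile(r"^\d+/\d+ \d+:\d+:\d+ Marked job \S+ as IDLE")


def find_startups(lines):
    """Return one dict per startup banner found in an iterable of log lines.

    Each dict records the banner timestamp, the daemon name, the PID line
    that follows the banner (if any), and how many "Marked job ... as IDLE"
    lines appear before the next banner.
    """
    startups = []
    for line in lines:
        banner = BANNER.match(line)
        if banner:
            startups.append({
                "time": banner.group(1),
                "daemon": banner.group(2),
                "pid": None,
                "idle_marked": 0,
            })
            continue
        if not startups:
            # Ignore anything logged before the first banner we see.
            continue
        pid = PID.match(line)
        if pid and startups[-1]["pid"] is None:
            startups[-1]["pid"] = int(pid.group(1))
        elif IDLE.match(line):
            startups[-1]["idle_marked"] += 1
    return startups
```

Calling find_startups(open(path)) on the SchedLog (wherever it lives on 
the submit machine) gives one entry per restart.  Comparing those 
timestamps against the times condor_master believes it restarted the 
schedd should at least show whether the two agree.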

Thanks!
Paul


On Thu, 24 Feb 2005, Paul Armor wrote:

> Hi,
> 
> On Thu, 24 Feb 2005, Ian Chesal wrote:
> > > Hi,
> > > I've got a strange problem (aren't they all?), and could use 
> > > guidance on how to figure out what's wrong.  I have a submit 
> > > machine that can no longer tell what jobs are in its own 
> > > queue.  I upgraded condor to 6.7.3 (from 6.6.7) on Feb 10; 
> > > yesterday (Feb 23), it was noticed that condor_q would return:
> > > 
> > > -- Failed to fetch ads from: <129.89.201.232:38456> : 
> > > hydra.phys.uwm.edu
> > > 
> > > SchedLog doesn't seem to show anything interesting...
> > > 
> > > How can I debug what's failing?
> > 
> > Hi Paul,
> > 
> > We've seen similar messages when a single schedd instance has LOTS of
> > ports open in the 6.7.3 builds. Can you check the number of open network
> > connections on the machine?
> 
> Nothing out of the ordinary...
> 
> > Is the schedd currently preempting a lot of
> > startd machines in your cluster?
> 
> No, schedd seems to be permanently out to lunch at the moment...
> 
> It appears to have been restarted a few days ago, after which it 
> immediately marked a bunch of jobs (I'm assuming "a bunch" = all) as 
> IDLE, and since then it has just sat there.  I've tried restarting 
> condor on this submit machine and got this:
> 
> 2/24 10:07:07 ******************************************************
> 2/24 10:07:07 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 2/24 10:07:07 ** /opt/condor/sbin/condor_schedd
> 2/24 10:07:07 ** $CondorVersion: 6.7.3 Dec 28 2004 $
> 2/24 10:07:07 ** $CondorPlatform: I386-LINUX_RH9 $
> 2/24 10:07:07 ** PID = 16027
> 2/24 10:07:07 ******************************************************
> 2/24 10:07:07 Using config file: /etc/condor/condor_config
> 2/24 10:07:07 Using local config files: 
> /opt/condor/home/condor_config.local
> 2/24 10:07:07 DaemonCore: Command Socket at <129.89.201.232:38456>
> 2/24 10:07:07 SEC_DEFAULT_SESSION_DURATION is undefined, using default 
> value of 3600
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value 
> of 0
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value 
> of 0
> 2/24 10:07:07 Will use UDP to update collector condor.medusa.phys.uwm.edu 
> <129.89.201.238:9618>
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value 
> of 0
> 2/24 10:07:07 Using name: hydra.phys.uwm.edu
> 2/24 10:07:07 No Accountant host specified in config file
> 2/24 10:07:07 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
> 2/24 10:07:07 JOB_START_COUNT is undefined, using default value of 1
> 2/24 10:07:07 MAX_JOBS_SUBMITTED is undefined, using default value of 
> 2147483647
> 2/24 10:07:07 STARTD_CONTACT_TIMEOUT is undefined, using default value of 
> 45
> 2/24 10:07:07 Queue Management Super Users:
> 2/24 10:07:07   root
> 2/24 10:07:07   condor
> 2/24 10:07:13 About to truncate log /opt/condor/home/spool/job_queue.log
> 2/24 10:07:14 Marked job 104860.0 as IDLE
> 2/24 10:07:14 Marked job 104806.0 as IDLE
> 2/24 10:07:14 Marked job 104761.0 as IDLE
> 2/24 10:07:14 Marked job 105652.0 as IDLE
> 
> 
> 
> > 
> > - Ian
> > 
> > 
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > 
> 
> 

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator        parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462                                                            +
+ U. of W. - Milwaukee                                                   +
+ PO Box 413                                                414-229-2677 +
+ Milwaukee, WI 53201                                   fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++