RE: [Condor-users] schedd problems?
- Date: Thu, 24 Feb 2005 14:00:36 -0600 (CST)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: RE: [Condor-users] schedd problems?
Hi,
OK, it's working again (pardon the vagueness of that message, I still
don't understand what went wrong).
From what I can tell, condor_master thinks it restarted schedd several
days ago on this submit machine, but schedd never really started (or at
least it did nothing once it had). All I did today was stop, wait
(apparently not long enough for all processes to clean up), and start
again several times, and it finally worked.
Does anyone have any ideas on how I can figure out: A. why it stopped and
B. why it started (after several restarts)?
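
One thing I may try for (A) is to scan the schedd log for every startup
banner and see which restarts actually got as far as touching the job
queue. A rough Python sketch, assuming the log lines look like the
SchedLog excerpt quoted below; the function name and the log path passed
on the command line are my own placeholders, not anything Condor provides:

```python
import re
import sys

# Patterns are modeled on the SchedLog excerpt quoted in this thread, e.g.
#   2/24 10:07:07 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
#   2/24 10:07:07 ** PID = 16027
#   2/24 10:07:14 Marked job 104860.0 as IDLE
# Other Condor versions may format these lines differently (assumption).
STAMP = r"\d+/\d+ \d+:\d+:\d+"
START_RE = re.compile(r"^(" + STAMP + r") \*\* (\S+) \(\S+\) STARTING UP")
PID_RE = re.compile(r"^" + STAMP + r" \*\* PID = (\d+)")
IDLE_RE = re.compile(r"^" + STAMP + r" Marked job \S+ as IDLE")


def summarize_startups(lines):
    """Return one dict per startup banner, with its PID and IDLE-mark count."""
    startups = []
    for line in lines:
        line = line.rstrip("\n")
        m = START_RE.match(line)
        if m:
            startups.append(
                {"time": m.group(1), "daemon": m.group(2), "pid": None, "idle_jobs": 0}
            )
            continue
        if not startups:
            continue  # ignore anything before the first banner
        m = PID_RE.match(line)
        if m:
            startups[-1]["pid"] = int(m.group(1))
        elif IDLE_RE.match(line):
            startups[-1]["idle_jobs"] += 1
    return startups


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python startups.py /path/to/SchedLog
    with open(sys.argv[1]) as log:
        for entry in summarize_startups(log):
            print(entry)
```

A restart that shows a banner and PID but nothing after it would at least
narrow down which start attempts stalled, though it would not say why.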
Thanks!
Paul
On Thu, 24 Feb 2005, Paul Armor wrote:
> Hi,
>
> On Thu, 24 Feb 2005, Ian Chesal wrote:
> > > Hi,
> > > I've got a strange problem (aren't they all?), and could use
> > > guidance on how to figure out what's wrong. I have a submit
> > > machine that can no longer tell what jobs are in its own
> > > queue. I upgraded condor to 6.7.3 (from 6.6.7) on Feb 10;
> > > yesterday (Feb 23), it was noticed that condor_q would return:
> > >
> > > -- Failed to fetch ads from: <129.89.201.232:38456> :
> > > hydra.phys.uwm.edu
> > >
> > > SchedLog doesn't seem to show anything interesting...
> > >
> > > How can I debug what's failing?
> >
> > Hi Paul,
> >
> > We've seen similar messages when a single schedd instance has LOTS of
> > ports open in the 6.7.3 builds. Can you check the number of open network
> > connections on the machine?
>
> Nothing out of the ordinary...
>
> > Is the schedd currently preempting a lot of
> > startd machines in your cluster?
>
> No, schedd seems to be permanently out to lunch at the moment...
>
> It appears to have been restarted a few days ago, after which it
> immediately marked a bunch (I'm assuming "a bunch" = all) jobs as IDLE,
> and since then just sits there. I've tried restarting condor on this
> submit machine and got this:
>
> 2/24 10:07:07 ******************************************************
> 2/24 10:07:07 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 2/24 10:07:07 ** /opt/condor/sbin/condor_schedd
> 2/24 10:07:07 ** $CondorVersion: 6.7.3 Dec 28 2004 $
> 2/24 10:07:07 ** $CondorPlatform: I386-LINUX_RH9 $
> 2/24 10:07:07 ** PID = 16027
> 2/24 10:07:07 ******************************************************
> 2/24 10:07:07 Using config file: /etc/condor/condor_config
> 2/24 10:07:07 Using local config files:
> /opt/condor/home/condor_config.local
> 2/24 10:07:07 DaemonCore: Command Socket at <129.89.201.232:38456>
> 2/24 10:07:07 SEC_DEFAULT_SESSION_DURATION is undefined, using default
> value of 3600
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value
> of 0
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value
> of 0
> 2/24 10:07:07 Will use UDP to update collector condor.medusa.phys.uwm.edu
> <129.89.201.238:9618>
> 2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value
> of 0
> 2/24 10:07:07 Using name: hydra.phys.uwm.edu
> 2/24 10:07:07 No Accountant host specified in config file
> 2/24 10:07:07 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
> 2/24 10:07:07 JOB_START_COUNT is undefined, using default value of 1
> 2/24 10:07:07 MAX_JOBS_SUBMITTED is undefined, using default value of
> 2147483647
> 2/24 10:07:07 STARTD_CONTACT_TIMEOUT is undefined, using default value of
> 45
> 2/24 10:07:07 Queue Management Super Users:
> 2/24 10:07:07 root
> 2/24 10:07:07 condor
> 2/24 10:07:13 About to truncate log /opt/condor/home/spool/job_queue.log
> 2/24 10:07:14 Marked job 104860.0 as IDLE
> 2/24 10:07:14 Marked job 104806.0 as IDLE
> 2/24 10:07:14 Marked job 104761.0 as IDLE
> 2/24 10:07:14 Marked job 105652.0 as IDLE
>
>
>
> >
> > - Ian
--
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462 +
+ U. of W. - Milwaukee +
+ PO Box 413 414-229-2677 +
+ Milwaukee, WI 53201 fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++