RE: [Condor-users] schedd problems?
- Date: Thu, 24 Feb 2005 11:01:16 -0600 (CST)
- From: Paul Armor <parmor@xxxxxxxxxxxxxxxxxxxx>
- Subject: RE: [Condor-users] schedd problems?
Hi,
On Thu, 24 Feb 2005, Ian Chesal wrote:
> > Hi,
> > I've got a strange problem (aren't they all?), and could use
> > guidance on how to figure out what's wrong. I have a submit
> > machine that can no longer tell what jobs are in its own
> > queue. I upgraded condor to 6.7.3 (from 6.6.7) on Feb 10;
> > yesterday (Feb 23), it was noticed that condor_q would return:
> >
> > -- Failed to fetch ads from: <129.89.201.232:38456> : hydra.phys.uwm.edu
> >
> > SchedLog doesn't seem to show anything interesting...
> >
> > How can I debug what's failing?
>
> Hi Paul,
>
> We've seen similar messages when a single schedd instance has LOTS of
> ports open in the 6.7.3 builds. Can you check the number of open network
> connections on the machine?
Nothing out of the ordinary...
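
For anyone who wants to repeat the check, here is a rough sketch of one way to tally TCP sockets by state on a Linux box. It assumes the usual /proc/net/tcp layout, where the fourth whitespace-separated field is the state as a hex code. The function names and the small state table are my own, not anything that ships with Condor.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as they appear (in hex) in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Tally sockets by state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        state = fields[3].upper()
        # Unknown codes are kept as their raw hex value.
        counts[TCP_STATES.get(state, state)] += 1
    return counts


def read_proc_net_tcp(path="/proc/net/tcp"):
    """Read the kernel's TCP socket table (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling count_tcp_states(read_proc_net_tcp()) on the submit machine gives a per-state count; an unusually large ESTABLISHED or TIME_WAIT figure would be the thing to look for.
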
> Is the schedd currently preempting a lot of
> startd machines in your cluster?
No, the schedd seems to be permanently out to lunch at the moment...
It appears to have been restarted a few days ago, after which it
immediately marked a bunch of jobs (I'm assuming "a bunch" = all) as IDLE,
and since then it has just sat there. I've tried restarting Condor on this
submit machine and got this:
2/24 10:07:07 ******************************************************
2/24 10:07:07 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
2/24 10:07:07 ** /opt/condor/sbin/condor_schedd
2/24 10:07:07 ** $CondorVersion: 6.7.3 Dec 28 2004 $
2/24 10:07:07 ** $CondorPlatform: I386-LINUX_RH9 $
2/24 10:07:07 ** PID = 16027
2/24 10:07:07 ******************************************************
2/24 10:07:07 Using config file: /etc/condor/condor_config
2/24 10:07:07 Using local config files: /opt/condor/home/condor_config.local
2/24 10:07:07 DaemonCore: Command Socket at <129.89.201.232:38456>
2/24 10:07:07 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 Will use UDP to update collector condor.medusa.phys.uwm.edu <129.89.201.238:9618>
2/24 10:07:07 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
2/24 10:07:07 Using name: hydra.phys.uwm.edu
2/24 10:07:07 No Accountant host specified in config file
2/24 10:07:07 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
2/24 10:07:07 JOB_START_COUNT is undefined, using default value of 1
2/24 10:07:07 MAX_JOBS_SUBMITTED is undefined, using default value of 2147483647
2/24 10:07:07 STARTD_CONTACT_TIMEOUT is undefined, using default value of 45
2/24 10:07:07 Queue Management Super Users:
2/24 10:07:07 root
2/24 10:07:07 condor
2/24 10:07:13 About to truncate log /opt/condor/home/spool/job_queue.log
2/24 10:07:14 Marked job 104860.0 as IDLE
2/24 10:07:14 Marked job 104806.0 as IDLE
2/24 10:07:14 Marked job 104761.0 as IDLE
2/24 10:07:14 Marked job 105652.0 as IDLE
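
To test my guess that "a bunch" really means all of them, something like the following could count the "Marked job ... as IDLE" lines in SchedLog and list the job ids, for comparison against what was submitted. This is only a sketch: the regex is inferred from the four lines above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "2/24 10:07:14 Marked job 104860.0 as IDLE".
# The pattern is inferred from the log excerpt, not from Condor documentation.
MARKED = re.compile(r"Marked job (\d+)\.(\d+) as (\w+)")


def summarize_marked_jobs(log_text):
    """Return (count per status, list of 'cluster.proc' ids) for 'Marked job' lines."""
    counts = Counter()
    job_ids = []
    for line in log_text.splitlines():
        m = MARKED.search(line)
        if m:
            cluster, proc, status = m.groups()
            counts[status] += 1
            job_ids.append(f"{cluster}.{proc}")
    return counts, job_ids
```

Feeding it the contents of SchedLog returns the totals per status plus the ids in the order they were logged.
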
>
> - Ian
--
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ UWM-LSC Group Systems Administrator parmor@xxxxxxxxxxxxxxxxxxxx +
+ Physics 462 +
+ U. of W. - Milwaukee +
+ PO Box 413 414-229-2677 +
+ Milwaukee, WI 53201 fax 414-229-5589 +
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++