[Condor-users] condor_collector died (11) or exited (4)
- Date: Fri, 10 Jul 2009 12:15:22 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: [Condor-users] condor_collector died (11) or exited (4)
Hi All,

We've recently had two of our Central Managers periodically (every few hours?) restarting condor_collector after the daemon either dies with signal 11 or exits with signal 4. All six of our CMs are running condor-7.2.3. Five are running as VMs on ESX servers and one is a physical Dell PowerEdge 750.

Three of the VMs are running 64-bit SLES10, two are running 32-bit RHES4, and the one physical machine is running 32-bit RHES4. All RH and SLES machines have been cloned from original setups (with Condor already installed).

Only two of the machines are having these problems: one VM running 64-bit SLES10 and one VM running 32-bit RHES4. We have restarted Condor on these machines, as well as rebooting the machines themselves, all to no avail.

Extracts from the logs follow.

This is for the SLES10 machine, which died with signal 11:

7/10 11:21:16 StartdAd : Inserting ** "< CLW-FZ6JY1S-GW.nexus.csiro.au , 144.110.17.30 >"
7/10 11:21:16 StartdPvtAd : Inserting ** "< CLW-FZ6JY1S-GW.nexus.csiro.au , 144.110.17.30 >"
7/10 11:21:32 Got INVALIDATE_STARTD_ADS
7/10 11:21:32 **** Removing stale ad: "< 210087-NT.nexus.csiro.au , 130.155.34.194 >"
7/10 11:21:32 **** Removing stale ad: "< PORTER-BE.nexus.csiro.au , 152.83.192.199 >"
7/10 11:21:32 (Invalidated 1 ads)
Stack dump for process 1862 at timestamp 1247196092 (17 frames)
condor_collector(dprintf_dump_stack+0xb3)[0x5096e7]
condor_collector(_Z18linux_sig_coredumpi+0x28)[0x4ff36c]
/lib64/libc.so.6[0x2b95518c8e20]
condor_collector(_ZNK9HashTableI10YourStringP12AttrListElemE6lookupERKS0_RS2_+0x18)[0x570f30]
condor_collector(_ZNK8AttrList6LookupEPKc+0x7b)[0x56d60d]
condor_collector(_ZNK8AttrList13LookupIntegerEPKcRi+0x21)[0x56db31]
condor_collector(_ZN15CollectorEngine14cleanHashTableER9HashTableI13AdNameHashKeyP7ClassAdElPFbRS1_S3_P11sockaddr_inE+0x67)[0x4e424d]
condor_collector(_ZN15CollectorEngine17invokeHousekeeperE7AdTypes+0xf2)[0x4e1a22]
condor_collector(_ZN15CollectorDaemon20process_invalidationE7AdTypesR7ClassAdP6Stream+0x7c)[0x4cf940]
condor_collector(_ZN15CollectorDaemon20receive_invalidationEP7ServiceiP6Stream+0x32e)[0x4cf0c2]
condor_collector(_ZN10DaemonCore9HandleReqEP6Stream+0x36db)[0x4f2f1d]
condor_collector(_ZN10DaemonCore9HandleReqEi+0x36)[0x4ef840]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x2b3)[0x4ef2e9]
condor_collector(_ZN10DaemonCore6DriverEv+0x1463)[0x4eef21]
condor_collector(main+0x183f)[0x501d2f]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x2b95518b6164]
condor_collector(__strtoll_internal+0x5a)[0x4b716a]
7/10 11:21:42 ******************************************************
7/10 11:21:42 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/10 11:21:42 ** /usr/local/condor/sbin/condor_collector
7/10 11:21:42 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
7/10 11:21:42 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
7/10 11:21:42 ** $CondorVersion: 7.2.3 May 11 2009 BuildID: 151729 $
7/10 11:21:42 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
7/10 11:21:42 ** PID = 22874
7/10 11:21:42 ** Log last touched 7/10 11:21:32
7/10 11:21:42 ******************************************************
7/10 11:21:42 Using config source: /home/condor/condor_config
7/10 11:21:42 Using local config sources:
7/10 11:21:42 /home/condor/condor_config.local

This is for the RHES4 machine, which died with signal 11:

7/10 13:28:30 (Sending 876 ads in response to query)
7/10 13:28:31 Got QUERY_STARTD_PVT_ADS
7/10 13:28:31 (Sending 435 ads in response to query)
7/10 13:28:37 NegotiatorAd : Inserting ** "< condor-nsw.riverside.csiro.au >"
7/10 13:28:51 StartdAd : Inserting ** "< MILFORD-LN.tip.csiro.au , 192.168.0.1 >"
Stack dump for process 3209 at timestamp 1247196531 (17 frames)
condor_collector(dprintf_dump_stack+0xda)[0x81314b7]
condor_collector(_Z18linux_sig_coredumpi+0x23)[0x81275e3]
/lib/tls/libc.so.6[0x9c5918]
condor_collector(_ZNK8AttrList6LookupEPKc+0x70)[0x8190446]
condor_collector(_ZNK8AttrList13LookupIntegerEPKcRi+0x14)[0x819093a]
condor_collector(_ZN14CollectorStats6updateEPKcP7ClassAdS3_+0xa5)[0x8107073]
condor_collector(_ZN15CollectorEngine13updateClassAdER9HashTableI13AdNameHashKeyP7ClassAdEPKcS7_S3_RS1_RK8MyStringRiPK11sockaddr_in+0x19b)[0x810db0f]
condor_collector(_ZN15CollectorEngine7collectEiP7ClassAdP11sockaddr_inRiP4Sock+0x396)[0x810c8e4]
condor_collector(_ZN15CollectorEngine7collectEiP4SockP11sockaddr_inRi+0x13d)[0x810c115]
condor_collector(_ZN15CollectorDaemon14receive_updateEP7ServiceiP6Stream+0x79)[0x80f9a99]
condor_collector(_ZN10DaemonCore9HandleReqEP6Stream+0x37f7)[0x811bcb9]
condor_collector(_ZN10DaemonCore9HandleReqEi+0x2d)[0x81184bd]
condor_collector(_ZN10DaemonCore17CallSocketHandlerERib+0x280)[0x8117f5c]
condor_collector(_ZN10DaemonCore6DriverEv+0x1352)[0x8117bce]
condor_collector(main+0x1829)[0x812a03d]
/lib/tls/libc.so.6(__libc_start_main+0xd3)[0x9b2df3]
condor_collector(ldexp+0x59)[0x80e4db1]
7/10 13:29:02 ******************************************************
7/10 13:29:02 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
7/10 13:29:02 ** /usr/local/condor/sbin/condor_collector
7/10 13:29:02 ** SubsystemInfo: name=COLLECTOR type=COLLECTOR(3) class=DAEMON(1)
7/10 13:29:02 ** Configuration: subsystem:COLLECTOR local:<NONE> class:DAEMON
7/10 13:29:02 ** $CondorVersion: 7.2.3 May 11 2009 BuildID: 151729 $
7/10 13:29:02 ** $CondorPlatform: I386-LINUX_RHEL3 $
7/10 13:29:02 ** PID = 10448
7/10 13:29:02 ** Log last touched 7/10 13:28:51
7/10 13:29:02 ******************************************************
7/10 13:29:02 Using config source: /home/condor/condor_config
7/10 13:29:02 Using local config sources:
7/10 13:29:02 /home/condor/condor_config.local
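
The mangled C++ names in the two stack dumps above are hard to compare by eye. A minimal sketch of one way to split each frame into its parts and demangle the symbol is given here. It assumes the frames are read from the log file itself, one frame per line (not from a re-wrapped email), and that the binutils tool c++filt may or may not be installed. The names FRAME_RE, parse_frame and demangle are made up for this illustration and are not part of Condor.

```python
import re
import shutil
import subprocess

# Matches frames with a symbol, e.g.
#   condor_collector(_ZNK8AttrList6LookupEPKc+0x7b)[0x56d60d]
# and frames without one, e.g.
#   /lib64/libc.so.6[0x2b95518c8e20]
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-fA-F]+))?\))?"
    r"\[(?P<address>0x[0-9a-fA-F]+)\]$"
)


def parse_frame(line):
    """Return module/symbol/offset/address for a frame line, else None."""
    match = FRAME_RE.match(line.strip())
    return match.groupdict() if match else None


def demangle(symbol):
    """Demangle with c++filt when it is on PATH; otherwise return as-is."""
    tool = shutil.which("c++filt")
    if tool is None or not symbol:
        return symbol
    result = subprocess.run(
        [tool, symbol], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or symbol


if __name__ == "__main__":
    frame = parse_frame(
        "condor_collector(_ZNK8AttrList6LookupEPKc+0x7b)[0x56d60d]"
    )
    print(frame["module"], demangle(frame["symbol"]), frame["offset"])
```

Running each frame of both dumps through this should make it easier to see that the two crashes share the AttrList lookup frames near the top, even though they are reached from different collector code paths.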

When they "exit" rather than "die", both give a line like the one below when exiting with signal 4:

7/9 16:36:35 ERROR "Assertion ERROR on (hash)" at line 1073 in file attrlist.cpp

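
To put a number on the "every few hours?" estimate, a rough sketch like the following could pull the time of every "STARTING UP" banner out of a collector log and report the gaps between them. It assumes each log record sits on a single line in the file, and, because the log timestamps carry no year, it takes the year as a parameter. The names STAMP_RE, startup_times and restart_gaps are hypothetical helpers for this illustration only.

```python
import re
from datetime import datetime

# Log records start with "month/day hh:mm:ss ", e.g. "7/10 11:21:42 ".
STAMP_RE = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) ")


def startup_times(lines, year=2009):
    """Return a datetime for every 'STARTING UP' banner line.

    The year is an assumption supplied by the caller, since the log
    timestamps do not include one.
    """
    times = []
    for line in lines:
        if "STARTING UP" not in line:
            continue
        match = STAMP_RE.match(line)
        if match:
            month, day, hour, minute, second = map(int, match.groups())
            times.append(datetime(year, month, day, hour, minute, second))
    return times


def restart_gaps(times):
    """Return the time elapsed between each pair of consecutive startups."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    import sys

    # Usage: python restart_gaps.py /path/to/CollectorLog
    with open(sys.argv[1], errors="replace") as log:
        for gap in restart_gaps(startup_times(log)):
            print(gap)
```

Comparing the gaps on the two affected machines with the four healthy ones would show whether the restarts are regular or tied to particular updates and invalidations.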
As mentioned, this is only happening on 2 of the 6 servers, all of which "should" be identical.

Thanks for any help.

Cheers
Greg