Hi,
Occasionally, I see problems in our cluster where a node or two drops out of the pool and I am unable to reconnect it with condor_restart. This is what I see in the CollectorLog on my Condor host when I issue a condor_restart to one of the dropped-out nodes (192.168.56.104, or srv03.hpc-dev.spookfish.com):
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:51449>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:51449> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:35 Got INVALIDATE_SCHEDD_ADS
10/20/16 08:02:35 **** Removed(1) ad(s): "< srv03.hpc-dev.spookfish.com , 192.168.56.104 >"
10/20/16 08:02:35 (Invalidated 1 ads)
10/20/16 08:02:35 In OfflineCollectorPlugin::update ( 14 )
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:37344>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:37344> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:59649>
10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:59649> in non-blocking mode
10/20/16 08:02:35 IO: EOF reading packet header
10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)
10/20/16 08:02:39 StartdAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:02:39 StartdPvtAd : Inserting ** "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:02:39 In OfflineCollectorPlugin::update ( 0 )
10/20/16 08:02:39 Registered TCP socket from <192.168.56.104:34800> for updates.
10/20/16 08:02:40 MasterAd : Updating ... "< srv03.hpc-dev.spookfish.com >"
10/20/16 08:02:40 In OfflineCollectorPlugin::update ( 2 )
10/20/16 08:02:40 Registered TCP socket from <192.168.56.104:52630> for updates.
10/20/16 08:03:00 Got QUERY_STARTD_PVT_ADS
10/20/16 08:03:00 ForkWorker::Fork: New child of 14255 = 14459
10/20/16 08:03:00 Number of Active Workers 0
10/20/16 08:03:00 (Sending 4 ads in response to query)
10/20/16 08:03:00 Query info: matched=4; skipped=0; query_time=0.000969; send_time=0.000492; type=MachinePrivate; requirements={true}; peer=<192.168.56.100:51532>; projection={}
10/20/16 08:03:00 ForkWork: Child 14459 done, status 0
10/20/16 08:03:00 DaemonCore: No more children processes to reap.
10/20/16 08:03:00 Got QUERY_ANY_ADS
10/20/16 08:03:00 ForkWorker::Fork: New child of 14255 = 14460
10/20/16 08:03:00 Number of Active Workers 0
10/20/16 08:03:00 (Sending 7 ads in response to query)
10/20/16 08:03:00 Query info: matched=7; skipped=7; query_time=0.001015; send_time=0.002891; type=Any; requirements={( ( ( MyType == "Scheduler" ) || ( MyType == "Submitter" ) ) || ( ( MyType == "Machine" ) ) )}; peer=<192.168.56.100:42004>; projection={}
10/20/16 08:03:00 ForkWork: Child 14460 done, status 0
10/20/16 08:03:00 DaemonCore: No more children processes to reap.
10/20/16 08:03:02 StartdAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:03:02 StartdPvtAd : Updating ... "< slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxx , 192.168.56.104 >"
10/20/16 08:03:02 In OfflineCollectorPlugin::update ( 0 )
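In case it helps anyone look through a longer CollectorLog for the same pattern, below is a minimal Python sketch that pulls out the lines mentioning one node and counts the repeated messages. It is only a helper of my own and not part of HTCondor: the function names (lines_for_node, summarize) are hypothetical, and it assumes every line starts with the "MM/DD/YY HH:MM:SS" timestamp seen in the excerpt above.

```python
import re
from collections import Counter

# Assumes the "MM/DD/YY HH:MM:SS message" layout seen in the excerpt above.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def lines_for_node(log_lines, needles):
    """Yield (timestamp, message) for lines mentioning any of the needles."""
    for line in log_lines:
        m = LINE_RE.match(line.strip())
        if m and any(n in m.group(2) for n in needles):
            yield m.group(1), m.group(2)


def summarize(log_lines, needles):
    """Count messages for a node, masking port numbers so repeats group together."""
    counts = Counter()
    for _, msg in lines_for_node(log_lines, needles):
        # "<192.168.56.104:51449>" and "<192.168.56.104:37344>" become the same key.
        counts[re.sub(r":\d+>", ":PORT>", msg)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpt above; replace with the real log file.
    sample = [
        "10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:51449>",
        "10/20/16 08:02:35 DaemonCore: Can't receive command request from 192.168.56.104 (perhaps a timeout?)",
        "10/20/16 08:02:35 condor_read(): Socket closed when trying to read 5 bytes from <192.168.56.104:37344>",
        "10/20/16 08:03:00 Got QUERY_ANY_ADS",
    ]
    for msg, n in summarize(sample, ["192.168.56.104"]).most_common():
        print(n, msg)
```

To run it against a real log, pass an open file object (for example open() on the CollectorLog path for your install) in place of the sample list, and include both the IP address and the hostname in the needles list.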
Sometimes, I can make the node reconnect by killing all the condor processes, then restarting condor_master on that node.
What's going on here?
Many thanks for anyone's help.
Kind Regards
Jason