HTCondor Project List Archives




[Condor-devel] serious bug in head of 6_7-branch - EXCEPT() kills other daemons



A misconfigured quill++ hit an EXCEPT() this morning, but the strange
thing was that all of the other daemons exited as well, each on a SIGTERM.

The same behaviour happens on the head of the V6_7-branch as well. The
quill++ is from V6_7-db_logs_nonblocking-4-branch, which was created from
V6_7-branch-2006-5-16, so the bug has been in the code for at least a month.

To reproduce, I modified the schedd to just EXCEPT near the top of
Scheduler::timeout() (excerpt):

void
Scheduler::timeout()
{
    static bool min_interval_timer_set = false;
    static time_t next_timeout = 0;
    time_t right_now;

    EXCEPT("Error: world is round!\n");
    right_now = time(NULL);
    if ( right_now < next_timeout ) {
        if (!min_interval_timer_set) {
            daemonCore->Reset_Timer(timeoutid,next_timeout - right_now,1);


and started up a personal Condor. Here's the master log:

6/20 12:21:32 ******************************************************
6/20 12:21:32 ** condor_master (CONDOR_MASTER) STARTING UP
6/20 12:21:32 ** /scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_master
6/20 12:21:32 ** $CondorVersion: 6.7.21 Jun 20 2006 PRE-RELEASE-UWCS $
6/20 12:21:32 ** $CondorPlatform: I386-LINUX_CENTOS43 $
6/20 12:21:32 ** PID = 22036
6/20 12:21:32 ** Log last touched time unavailable (No such file or directory)
6/20 12:21:32 ******************************************************
6/20 12:21:32 Using config source: /scratch.1/epaulson/V67-build/src/runtime/condor_config
6/20 12:21:32 DaemonCore: Command Socket at <128.105.121.52:33196>
6/20 12:21:32 Started DaemonCore process "/scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_collector", pid and pgroup = 22037
6/20 12:21:32 Started DaemonCore process "/scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_negotiator", pid and pgroup = 22038
6/20 12:21:32 Started DaemonCore process "/scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_startd", pid and pgroup = 22039
6/20 12:21:32 Started DaemonCore process "/scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_schedd", pid and pgroup = 22040
6/20 12:21:32 The SCHEDD (pid 22040) exited with status 4
6/20 12:21:32 Sending obituary for "/scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_schedd"
6/20 12:21:32 restarting /scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_schedd in 10 seconds
6/20 12:21:32 The COLLECTOR (pid 22037) exited with status 0
6/20 12:21:32 restarting /scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_collector in 10 seconds
6/20 12:21:32 attempt to connect to <128.105.121.52:33207> failed
6/20 12:21:33 condor_write(): Socket closed when trying to write buffer, fd is 9, errno=107
6/20 12:21:33 Buf::write(): condor_write() failed
6/20 12:21:33 SECMAN: failed to end classad message
6/20 12:21:33 ERROR: SECMAN:2004:Failed to start a session to <128.105.121.52:32884> with TCP|SECMAN:2007:Failed to end classad message
6/20 12:21:33 Failed to start non-blocking update to <128.105.121.52:32884>.
6/20 12:21:33 The NEGOTIATOR (pid 22038) exited with status 0
6/20 12:21:33 restarting /scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_negotiator in 10 seconds
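
To make it easier to see which daemons went down and with what status, here
is a minimal C++17 sketch that pulls the "exited with status" events out of a
master log like the one above. The names DaemonExit, parse_exit_line and
collect_exits are made up for illustration, and the line pattern is inferred
only from the few log lines shown here, not from the Condor source, so treat
it as an assumption.

```cpp
#include <istream>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// One "daemon exited" event taken from a condor_master log line.
// (Hypothetical helper type, not part of Condor.)
struct DaemonExit {
    std::string timestamp;  // e.g. "6/20 12:21:32"
    std::string daemon;     // e.g. "SCHEDD"
    int pid;
    int status;
};

// Returns the parsed event if `line` has the layout
//   "<date> <time> The <NAME> (pid <N>) exited with status <S>"
// and std::nullopt for every other line. The layout is an assumption
// based on the log excerpt above.
std::optional<DaemonExit> parse_exit_line(const std::string& line) {
    static const std::regex pattern(
        R"re((\d+/\d+ \d+:\d+:\d+) The (\w+) \(pid (\d+)\) exited with status (-?\d+)\s*)re");
    std::smatch m;
    if (!std::regex_match(line, m, pattern)) {
        return std::nullopt;
    }
    return DaemonExit{m[1].str(), m[2].str(),
                      std::stoi(m[3].str()), std::stoi(m[4].str())};
}

// Reads a whole log from `in` and returns the exit events in log order.
std::vector<DaemonExit> collect_exits(std::istream& in) {
    std::vector<DaemonExit> exits;
    std::string line;
    while (std::getline(in, line)) {
        if (auto e = parse_exit_line(line)) {
            exits.push_back(*e);
        }
    }
    return exits;
}
```

Run over the master log above, this would report the SCHEDD exiting with
status 4 (the EXCEPT) and the COLLECTOR and NEGOTIATOR exiting with status 0.
That fits the collector having done a graceful shutdown on SIGTERM.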


Here's the collector log (not that it's interesting):

6/20 12:21:32 ******************************************************
6/20 12:21:32 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
6/20 12:21:32 ** /scratch.1/epaulson/V67-build/src/release_dir/sbin/condor_collector
6/20 12:21:32 ** $CondorVersion: 6.7.21 Jun 20 2006 PRE-RELEASE-UWCS $
6/20 12:21:32 ** $CondorPlatform: I386-LINUX_CENTOS43 $
6/20 12:21:32 ** PID = 22037
6/20 12:21:32 ** Log last touched time unavailable (No such file or directory)
6/20 12:21:32 ******************************************************
6/20 12:21:32 Using config source: /scratch.1/epaulson/V67-build/src/runtime/condor_config
6/20 12:21:32 DaemonCore: Command Socket at <128.105.121.52:21002>
6/20 12:21:32 In ViewServer::Init()
6/20 12:21:32 In CollectorDaemon::Init()
6/20 12:21:32 In ViewServer::Config()
6/20 12:21:32 In CollectorDaemon::Config()
6/20 12:21:32 enable: Creating stats hash table
6/20 12:21:32 (Sending 0 ads in response to query)
6/20 12:21:32 Got QUERY_STARTD_PVT_ADS
6/20 12:21:32 (Sending 0 ads in response to query)
6/20 12:21:32 NegotiatorAd  : Inserting ** "< epaulson@xxxxxxxxxxxxxxxxxxxxx >"
6/20 12:21:32 stats: Inserting new hashent for 'Negotiator':'epaulson@xxxxxxxxxxxxxxxxxxxxx':'128.105.121.52'
6/20 12:21:32 Got SIGTERM. Performing graceful shutdown.
6/20 12:21:32 **** condor_collector (condor_COLLECTOR) EXITING WITH STATUS 0