HTCondor Project List Archives




[Condor-devel] BUG Report - condor_starter fails if DNS reverse lookup fails



Hi Folks,
We ran into a startd bug because our DNS server was down.

Jobs matched to run on machines other than the submit host would fail. The StarterLog had this -

3/29 11:45:21 Communicating with shadow <192.168.25.196:36783>
3/29 11:45:21 Shadow version: $CondorVersion: 6.7.3 Feb 17 2005 $
3/29 11:45:21 Submitting machine is "(null)"
3/29 11:45:22 ShouldTransferFiles is "NO", NOT transfering files
3/29 11:45:22 ERROR "Assertion ERROR on (shadow->name())" at line 1185 in file jic_shadow.C


Mike (Yoder) and I went through the logs and found this in the StartdLog

3/29 12:01:47 vm4: About to Create_Process "condor_starter -f -a vm4 <192.168.25.190:41755>"
3/29 12:01:47 vm4: Got RemoteUser (raj@xxxxxxxxxx) from request classad


We then looked at the source and found this in condor_startd.V6/command.C (line 812). Note that a sinful string is stored as the client host instead of a hostname. This happened because our DNS server was down, so sin_to_hostname() failed.

if( ! (tmp = sin_to_hostname(sock->endpoint(), NULL)) ) {
    rip->dprintf( D_FULLDEBUG,
                  "Can't find hostname of client machine\n" );
    rip->r_cur->client()->sethost( sin_to_string(sock->endpoint()) );
} else {


This, in turn, triggered a condor_starter execution with a sinful string as argument, instead of the hostname.
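
To make the fallback concrete, here is a minimal, self-contained sketch of the pattern in the startd fragment above: try a reverse lookup, and if it fails, identify the client by its sinful string instead. None of these names come from the Condor source. ReverseLookup, ClientId, to_sinful() and identify_client() are hypothetical, and the resolver is injected so the "DNS is down" case can be exercised without a network.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical resolver type: returns true and fills `host` on success.
// It stands in for sin_to_hostname(); it is not the Condor API.
using ReverseLookup =
    std::function<bool(const std::string& ip, std::string& host)>;

// Build "<ip:port>", the sinful-string form seen in the logs.
inline std::string to_sinful(const std::string& ip, int port) {
    assert(port >= 0 && port <= 65535);
    return "<" + ip + ":" + std::to_string(port) + ">";
}

// What the caller ends up with: a hostname, or a sinful string as fallback.
struct ClientId {
    std::string value;
    bool is_hostname;
};

// Mirrors the branch above: a failed lookup yields the sinful string.
inline ClientId identify_client(const std::string& ip, int port,
                                const ReverseLookup& lookup) {
    std::string host;
    if (lookup && lookup(ip, host) && !host.empty()) {
        return ClientId{host, true};
    }
    return ClientId{to_sinful(ip, port), false};
}
```

Carrying the is_hostname flag alongside the value is one way to keep later consumers from mistaking the fallback string for a hostname, which is the confusion described in the rest of this report.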

In condor_starter.V6.1/starter_v61_main.C, line 317, we found

    if( opt[0] != '-' ) {
            // this must be a hostname...
        shadow_host = strdup( opt );
        continue;
    }

The shadow_host is set to the sinful string and not the hostname. The shadow_host is passed to the JICShadow constructor as an argument, which creates a DCShadow object with the sinful_string as argument. DCShadow is derived from Daemon and its constructor sets _addr, but not _name when a sinful string is passed as argument. This means _name is NULL and unavailable.

Line 90 in condor_daemon_client/daemon.C sets _addr and not _name if the argument is a sinful string.

if( name && name[0] ) {
    if( is_valid_sinful(name) ) {
        _addr = strnewp( name );
    } else {
        _name = strnewp( name );
    }
}


If _name is not set, Daemon::locate() is called to find the name. However, we found that locate() is overridden in the DCShadow class as below -

bool
DCShadow::locate( void )
{
       return is_initialized;
}

As a result, DCShadow::name() returns NULL, which triggers the assertion that we encountered. Phew!!
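
The whole chain can be reproduced in a toy model. This is a sketch under stated assumptions, not the Condor code: ToyDaemon, ToyShadow, looks_like_sinful() and starter_requires_name() are hypothetical stand-ins, and looks_like_sinful() is far looser than the real is_valid_sinful().

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Loose stand-in for is_valid_sinful(): anything shaped like "<...>".
// The real check is stricter; this is an assumption for the sketch.
static bool looks_like_sinful(const char* s) {
    if (!s) return false;
    const std::size_t n = std::strlen(s);
    return n >= 2 && s[0] == '<' && s[n - 1] == '>';
}

// Toy model of the Daemon constructor and name() behavior described above.
class ToyDaemon {
public:
    explicit ToyDaemon(const char* id) {
        if (id && id[0]) {
            if (looks_like_sinful(id)) addr_ = id;  // address only, no name
            else name_ = id;
        }
    }
    virtual ~ToyDaemon() {}

    // If no name is known, ask locate() to find one, then report the result.
    const char* name() {
        if (name_.empty()) locate();
        return name_.empty() ? nullptr : name_.c_str();
    }
    const char* addr() const {
        return addr_.empty() ? nullptr : addr_.c_str();
    }

protected:
    virtual bool locate() { return false; }
    std::string name_;
    std::string addr_;
};

// Like DCShadow::locate(), this override reports success without ever
// filling in the name (the real one returns is_initialized).
class ToyShadow : public ToyDaemon {
public:
    explicit ToyShadow(const char* id) : ToyDaemon(id) {}
protected:
    bool locate() override { return true; }
};

// Mirrors the starter's assertion on shadow->name(); it aborts when the
// shadow was constructed from a sinful string.
inline void starter_requires_name(ToyShadow& shadow) {
    assert(shadow.name() != nullptr);
}
```

Constructing ToyShadow("<192.168.25.196:36783>") leaves name() returning a null pointer while addr() is set, so starter_requires_name() would abort, whereas a ToyShadow built from a plain hostname passes.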

We are not sure how many other daemons are affected. Possible solutions include reviewing the Daemon class code to make sure that it works even if the hostname is unknown - perhaps name() could return _addr if _name is NULL. Alternatively, the Starter could be fixed so that it doesn't assume that name() is always valid.
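
Both suggestions can be sketched in a few lines. This is only an illustration of the idea, not a patch against the Condor source: FallbackDaemon, require_name() and submitter_label() are hypothetical names, and the sinful-string check is again a loose "<...>" test.

```cpp
#include <cassert>
#include <string>

// Option 1: a daemon whose name() falls back to the address.
class FallbackDaemon {
public:
    explicit FallbackDaemon(const std::string& id) {
        const bool sinful =
            id.size() >= 2 && id.front() == '<' && id.back() == '>';
        if (sinful) addr_ = id;
        else name_ = id;
    }

    // Returns the name if known, otherwise the address, otherwise null.
    const char* name() const {
        if (!name_.empty()) return name_.c_str();
        if (!addr_.empty()) return addr_.c_str();
        return nullptr;
    }
    const char* addr() const {
        return addr_.empty() ? nullptr : addr_.c_str();
    }

private:
    std::string name_;
    std::string addr_;
};

// The starter-style assertion now holds for any non-empty identifier.
inline void require_name(const FallbackDaemon& d) {
    assert(d.name() != nullptr);
}

// Option 2: the caller tolerates a missing name instead of asserting.
inline std::string submitter_label(const char* name, const char* addr) {
    if (name && name[0]) return name;
    if (addr && addr[0]) return addr;
    return "(unknown)";
}
```

The first option changes the meaning of name() for every caller, so code that feeds the result back into a hostname lookup would need a review. The second option is more local but has to be repeated wherever name() is consumed.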

Thanks,
Raj & Mike

--
Rajesh Rajamani
Senior Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.650.218.9131
raj@xxxxxxxxxx


Optena Corporation 2860 Zanker Road, Suite 201 San Jose, CA 95134 www.optena.com

