[Condor-devel] BUG Report - condor_starter fails if DNS reverse lookup fails
- Date: Tue, 29 Mar 2005 13:53:04 -0800
- From: Rajesh Rajamani <raj@xxxxxxxxxx>
- Subject: [Condor-devel] BUG Report - condor_starter fails if DNS reverse lookup fails
Hi Folks,
We ran into a startd bug because our DNS server was down.
Jobs matched to run on machines other than the submit host would fail.
The StarterLog had this -
3/29 11:45:21 Communicating with shadow <192.168.25.196:36783>
3/29 11:45:21 Shadow version: $CondorVersion: 6.7.3 Feb 17 2005 $
3/29 11:45:21 Submitting machine is "(null)"
3/29 11:45:22 ShouldTransferFiles is "NO", NOT transfering files
3/29 11:45:22 ERROR "Assertion ERROR on (shadow->name())" at line 1185 in file jic_shadow.C
Mike (Yoder) and I went through the logs and found this in the StartdLog
3/29 12:01:47 vm4: About to Create_Process "condor_starter -f -a vm4 <192.168.25.190:41755>"
3/29 12:01:47 vm4: Got RemoteUser (raj@xxxxxxxxxx) from request classad
We then looked at the source and found this in condor_startd.V6/command.C
(line 812). Note that a sinful string is passed instead of a
hostname. This happened because our DNS server was down, so
sin_to_hostname() failed.
    if( ! (tmp = sin_to_hostname(sock->endpoint(), NULL)) ) {
        rip->dprintf( D_FULLDEBUG,
                      "Can't find hostname of client machine\n" );
        rip->r_cur->client()->sethost(
            sin_to_string(sock->endpoint()) );
    } else {
This, in turn, caused condor_starter to be executed with a sinful string
as its argument instead of the hostname.
In condor_starter.V6.1/starter_v61_main.C, line 317, we found
    if( opt[0] != '-' ) {
        // this must be a hostname...
        shadow_host = strdup( opt );
        continue;
    }
So shadow_host is set to the sinful string and not the hostname. The
shadow_host is passed to the JICShadow constructor, which creates a
DCShadow object with the sinful string as its argument. DCShadow
is derived from Daemon, and the Daemon constructor sets _addr, but not
_name, when it is given a sinful string. This means _name is NULL
and unavailable.
The relevant code is at line 90 in condor_daemon_client/daemon.C:
    if( name && name[0] ) {
        if( is_valid_sinful(name) ) {
            _addr = strnewp( name );
        } else {
            _name = strnewp( name );
        }
    }
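
To make the consequence of that branch concrete, here is a minimal C++ sketch of the same decision. It is not the actual Condor code: looks_like_sinful() is a simplified, hypothetical stand-in for is_valid_sinful() that only accepts "<host:port>", and MiniDaemon is an invented class that uses empty strings where the real code uses NULL pointers.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical, simplified stand-in for is_valid_sinful():
// accepts only "<host:port>" with a non-empty host and a numeric port.
bool looks_like_sinful(const std::string& s) {
    if (s.size() < 5 || s.front() != '<' || s.back() != '>') return false;
    const std::size_t colon = s.rfind(':');
    // Need at least one host character before the colon and one port digit after it.
    if (colon == std::string::npos || colon < 2 || colon + 2 >= s.size()) return false;
    for (std::size_t i = colon + 1; i + 1 < s.size(); ++i) {
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    }
    return true;
}

// Invented class mirroring the constructor branch quoted above.
// An empty string plays the role of a NULL _name or _addr.
struct MiniDaemon {
    std::string name;
    std::string addr;

    explicit MiniDaemon(const std::string& arg) {
        if (arg.empty()) return;
        if (looks_like_sinful(arg)) {
            addr = arg;   // sinful string: only the address is recorded
        } else {
            name = arg;   // anything else is treated as a hostname
        }
    }
};
```

With the argument from our StartdLog, "<192.168.25.190:41755>", the sketch ends up with addr set and name empty, which is the state that later trips the assertion.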
If _name is not set, Daemon::locate() is called to find the name. However,
we found that locate() is overridden in the DCShadow class as below:
    bool
    DCShadow::locate( void )
    {
        return is_initialized;
    }
As a result, DCShadow::name() returns NULL, which triggers the
assertion failure that we encountered. Phew!!
We are not sure how many other daemons are affected. Possible solutions
would include reviewing the Daemon class code to make sure that it can
work even if the hostname is unknown - perhaps name() could return _addr
if _name is NULL. Alternatively, the Starter could be fixed so that it
doesn't assume that name() is always valid.
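
Both suggestions could look roughly like the C++ sketch below. This is only an illustration under our own assumptions: DaemonSketch and printable_name() are hypothetical names, and the real Daemon class has many more members and its own memory management.

```cpp
#include <string>

// Hypothetical miniature of a Daemon-like class; not the Condor implementation.
class DaemonSketch {
public:
    // Either pointer may be NULL, as happens when only a sinful string is known.
    DaemonSketch(const char* name, const char* addr)
        : name_(name ? name : ""), addr_(addr ? addr : "") {}

    // First suggestion: fall back to the address when the name is unknown.
    // Returns nullptr only if neither the name nor the address is set.
    const char* name() const {
        if (!name_.empty()) return name_.c_str();
        if (!addr_.empty()) return addr_.c_str();
        return nullptr;
    }

private:
    std::string name_;
    std::string addr_;
};

// Second suggestion: a caller-side guard so that code using name()
// never dereferences or asserts on a NULL pointer.
const char* printable_name(const char* n) {
    return n ? n : "(unknown)";
}
```

With the fallback, a DaemonSketch built from only "<192.168.25.196:36783>" reports that string as its name, so an assertion like the one in jic_shadow.C would not fire. The guard covers the remaining case where neither value is available.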
Thanks,
Raj & Mike
--
Rajesh Rajamani
Senior Member of Technical Staff
Direct : +1.408.321.9000
Fax : +1.408.904.5992
Mobile : +1.650.218.9131
raj@xxxxxxxxxx
Optena Corporation
2860 Zanker Road, Suite 201
San Jose, CA 95134
www.optena.com