HTCondor Project List Archives




Re: [Condor-devel] [Condor-team] <SUBSYSTEM>_ADDRESS_FILE



On Tue, Aug 29, 2006 at 05:10:30PM -0500, Erik Paulson wrote:
> 
> If a daemon defines <SUBSYSTEM>_ADDRESS_FILE in its config file, DaemonCore
> automatically writes out the IP/Port that the daemon is listening on, along
> with the Platform and Version strings for that daemon:
> 
> cat .schedd_address:
> <128.105.121.52:38574>
> $CondorVersion: 6.7.20 Jun 20 2006 V6_7-db_logs_nonblocking-4-UWCS $
> $CondorPlatform: I386-LINUX_CENTOS43 $
> 
> 
> This file lets tools make assumptions about which daemon they're contacting
> and take a fast path to it instead of going to the collector. The best
> example of this is 'condor_q', where a user can just say 
> 
> 'condor_q' 
> 
> on a host, and condor_q will assume the user wants the queue of the schedd
> running on the local machine - it also assumes that there is only one 
> schedd running there, and there is no need to ask for it by name or even know
> the name of the schedd, just connect up to the daemon running at 
> <128.105.121.52:38574>
> 
> condor_q using quillpp will need to know the name of the schedd, because
> the key for each job entry is a <scheddname, cluster, proc> tuple.
> Would there be any objections if I added another line to the address file
> with the daemon's ATTR_NAME?
> 
> -Erik
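For illustration, here is a minimal C++17 sketch of how a client tool could parse the three-line address file quoted above; only the first line is needed to connect. `AddressFileInfo` and `parseAddressFile` are hypothetical names, not Condor's API (the real client code is `Daemon::readAddressFile()` in `daemon.C`), and matching the version and platform lines by their `$CondorVersion:`/`$CondorPlatform:` prefixes is an assumption based only on the example file.

```cpp
#include <optional>
#include <sstream>
#include <string>

// Hypothetical holder for the lines DaemonCore writes to a
// <SUBSYSTEM>_ADDRESS_FILE, following the example in the quoted message.
struct AddressFileInfo {
    std::string sinful;    // e.g. "<128.105.121.52:38574>"
    std::string version;   // the "$CondorVersion: ... $" line, if present
    std::string platform;  // the "$CondorPlatform: ... $" line, if present
};

// Parse address-file contents. Returns std::nullopt if the first line does
// not look like an "<ip:port>" string; version/platform stay empty if absent.
std::optional<AddressFileInfo> parseAddressFile(const std::string& contents) {
    std::istringstream in(contents);
    AddressFileInfo info;
    if (!std::getline(in, info.sinful) || info.sinful.size() < 3 ||
        info.sinful.front() != '<' || info.sinful.back() != '>') {
        return std::nullopt;
    }
    std::string line;
    while (std::getline(in, line)) {
        // rfind(prefix, 0) == 0 is the usual "starts with" idiom.
        if (line.rfind("$CondorVersion:", 0) == 0) {
            info.version = line;
        } else if (line.rfind("$CondorPlatform:", 0) == 0) {
            info.platform = line;
        }
    }
    return info;
}
```

A tool taking the fast path would connect straight to `info->sinful` and never contact the collector.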

The feedback I got was to make this more general-purpose than just sticking
another line into the address file. The ideal seemed to be making the daemon's
whole ClassAd available locally, so that's what I did.

However, a daemon's ad is sort of divorced from DaemonCore - the daemon
itself is responsible for creating its ad and sending it to the collector, so
DaemonCore does not directly know where a daemon's ad is located.

Rather than risk breaking backwards compatibility of the address file, I
created a new file, the <SUBSYSTEM>_DAEMON_AD_FILE. A daemon can ask
DaemonCore to update that file from time to time. I had imagined doing it at
reconfig, but it might also be useful to do whenever the ClassAd is sent off
to the collector. For now, I'm only interested in static data from the schedd,
so the file is only updated at reconfig.
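`UpdateLocalAd()` in the patch rewrites the file in place with `fopen(..., "w")`, so a client reading it at the wrong moment could see a truncated ad; that becomes more likely if the file is rewritten on every collector update. One common remedy, sketched below in C++17 as an assumption rather than what the patch does, is to write a temporary file and `rename()` it into place. `AdAttrs` and `writeDaemonAdFile` are hypothetical names, and a plain map of attribute name to expression text stands in for a real ClassAd.

```cpp
#include <cstdio>
#include <fstream>
#include <map>
#include <string>

// Hypothetical stand-in for a daemon ClassAd: attribute name -> expression
// text, printed one "Name = Expr" per line.
using AdAttrs = std::map<std::string, std::string>;

// Write the ad to "<path>.tmp", then rename it over `path`, so readers see
// either the old file or the complete new one. On POSIX, rename() atomically
// replaces an existing destination; on Windows std::rename fails if `path`
// already exists, so a port there would need extra handling.
bool writeDaemonAdFile(const std::string& path, const AdAttrs& ad) {
    const std::string tmp = path + ".tmp";
    bool ok = false;
    {
        std::ofstream out(tmp, std::ios::trunc);
        for (const auto& [name, expr] : ad) {
            out << name << " = " << expr << '\n';
        }
        out.flush();
        ok = static_cast<bool>(out);  // false if open or any write failed
    }  // the stream is closed here, before the rename
    if (!ok || std::rename(tmp.c_str(), path.c_str()) != 0) {
        std::remove(tmp.c_str());  // best-effort cleanup of the temp file
        return false;
    }
    return true;
}
```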

The attached patch adds a new method to DaemonCore that writes the daemon's
ClassAd to that file, has DaemonCore automatically remove the ad file at
shutdown time, and changes the daemon client to locate a daemon using the
ClassAd file instead of the address file when it can find one, falling back
to the address file otherwise.
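The client-side lookup order can be sketched as follows. This C++17 sketch is hypothetical: it works on file contents already read into strings, assumes the ad is printed as `Name = Value` lines, and only stands in for the patch's `readLocalClassAd()`/`readAddressFile()` pair. It also shows why building the attribute name as `"%sIpAddr"` from an upper-case subsystem such as `SCHEDD` still finds `ScheddIpAddr`: ClassAd attribute names are case-insensitive.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <sstream>
#include <string>

// Case-insensitive comparison, since ClassAd attribute names are not
// case sensitive ("SCHEDDIpAddr" must match "ScheddIpAddr").
static bool iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(), [](char x, char y) {
               return std::tolower(static_cast<unsigned char>(x)) ==
                      std::tolower(static_cast<unsigned char>(y));
           });
}

// Hypothetical helper: find "<subsys>IpAddr" in "Name = Value" text and
// strip the surrounding double quotes from the string value.
std::optional<std::string> ipAddrFromAdText(const std::string& adText,
                                            const std::string& subsys) {
    const std::string wanted = subsys + "IpAddr";
    std::istringstream in(adText);
    std::string line;
    while (std::getline(in, line)) {
        const auto eq = line.find(" = ");
        if (eq == std::string::npos || !iequals(line.substr(0, eq), wanted)) {
            continue;
        }
        std::string value = line.substr(eq + 3);
        if (value.size() >= 2 && value.front() == '"' && value.back() == '"') {
            value = value.substr(1, value.size() - 2);
        }
        return value;
    }
    return std::nullopt;
}

// Prefer the daemon ad; fall back to the first line of the address file.
std::optional<std::string> locateLocalDaemon(const std::string& subsys,
                                             const std::string& adText,
                                             const std::string& addressFileText) {
    if (auto addr = ipAddrFromAdText(adText, subsys)) {
        return addr;
    }
    std::istringstream in(addressFileText);
    std::string first;
    if (std::getline(in, first) && !first.empty()) {
        return first;
    }
    return std::nullopt;
}
```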

I'm not terribly thrilled about just shoving a pointer into daemon core as 
a public member, but I didn't want to have a daemon core member function
rely on a global variable. 
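One way to avoid both a public pointer and a global is to give the path a single owner that cleans up after itself. The C++17 sketch below is hypothetical (`LocalAdFile` is not a DaemonCore class): the path lives in a private member, and the destructor removes the file, standing in for the explicit `unlink()` the patch adds to `daemon_core_main.C`.

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical RAII owner for the daemon ad file path. The file named by
// the stored path is deleted when the owner is destroyed (e.g. at shutdown).
class LocalAdFile {
public:
    LocalAdFile() = default;
    LocalAdFile(const LocalAdFile&) = delete;             // single owner
    LocalAdFile& operator=(const LocalAdFile&) = delete;
    ~LocalAdFile() { removeFile(); }

    // Remember where the ad was last written (e.g. after each reconfig).
    void setPath(std::string path) { path_ = std::move(path); }
    const std::string& path() const { return path_; }

    // Delete the file if a path was recorded; safe to call more than once.
    bool removeFile() {
        if (path_.empty()) {
            return false;
        }
        const bool removed = std::remove(path_.c_str()) == 0;
        path_.clear();
        return removed;
    }

private:
    std::string path_;
};
```

Other code can still read the path through `path()`, but only the owner can change it or delete the file.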

I'm not set on any of the config file names, so suggestions for better ones
are welcome. The patch is against the trunk. If you want to see it in context,
see /p/condor/workspaces/epaulson.1/src_trees/V69

-Erik

Index: condor_daemon_client/daemon.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_daemon_client/daemon.C,v
retrieving revision 1.7
diff -r1.7 daemon.C
1285c1285,1288
< 		readAddressFile( subsys );
---
> 		bool foundLocalAd = readLocalClassAd( subsys );
> 		if(!foundLocalAd) {
> 			readAddressFile( subsys );
> 		}
1800a1804,1865
> bool
> Daemon::readLocalClassAd( const char* subsys )
> {
> 	char* addr_file;
> 	FILE* addr_fp;
> 	ClassAd *adFromFile;
> 	MyString param_name;
> 	MyString buf;
> 	bool rval = false;
> 
> 	param_name.sprintf( "%s_DAEMON_AD_FILE", subsys );
> 	addr_file = param( param_name.Value() );
> 	if( ! addr_file ) {
> 		return false;
> 	}
> 
> 	dprintf( D_HOSTNAME, "Finding classad for local daemon, "
> 			 "%s is \"%s\"\n", param_name.Value(), addr_file );
> 
> 	if( ! (addr_fp = fopen(addr_file, "r")) ) {
> 		dprintf( D_HOSTNAME,
> 				 "Failed to open classad file %s: %s (errno %d)\n",
> 				 addr_file, strerror(errno), errno );
> 		free( addr_file );
> 		return false;
> 	}
> 		// now that we've got a FILE*, we should free this so we don't
> 		// leak it.
> 	free( addr_file );
> 	addr_file = NULL;
> 
> 	int adIsEOF = 0, errorReadingAd = 0, adEmpty = 0;
> 	adFromFile = new ClassAd(addr_fp, "...", adIsEOF, errorReadingAd, adEmpty);
> 	fclose(addr_fp);
> 
> 	if(errorReadingAd) {
> 		delete adFromFile; return false;	// don't leak the partial ad
> 	}
> 
> 	// construct the IP_ADDR attribute
> 	buf.sprintf( "%sIpAddr", subsys );
> 	if( initStringFromAd(adFromFile, buf.Value(), &_addr) ) {
> 		_tried_locate = true;
> 	} else { delete adFromFile; return false; }
> 
> 	if( initStringFromAd( adFromFile, ATTR_VERSION, &_version ) ) {
> 		_tried_init_version = true;
> 	} else { delete adFromFile; return false; }
> 
> 	initStringFromAd( adFromFile, ATTR_PLATFORM, &_platform );
> 
> 	initStringFromAd( adFromFile, ATTR_NAME, &_name );
> 
> 	if( initStringFromAd( adFromFile, ATTR_MACHINE, &_full_hostname ) ) {
> 		initHostnameFromFull();
> 		_tried_init_hostname = false;
> 	} else { delete adFromFile; return false; }
> 
> 	delete adFromFile;
> 	return true;
> }
> 
1824c1889
< 	dprintf( D_HOSTNAME, "Found %s in ClassAd from collector, "
---
> 	dprintf( D_HOSTNAME, "Found %s in ClassAd, "
Index: condor_daemon_client/daemon.h
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_daemon_client/daemon.h,v
retrieving revision 1.6
diff -r1.6 daemon.h
536a537,542
> 		/** Code for parsing a locally-written classad, which should
> 			contain everything about the daemon
> 			@return true if we found everything in the ad, false if not
> 		*/
> 	bool readLocalClassAd( const char* subsys );
> 
Index: condor_daemon_core.V6/condor_daemon_core.h
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_daemon_core.V6/condor_daemon_core.h,v
retrieving revision 1.48
diff -r1.48 condor_daemon_core.h
1036a1037,1039
> 	char 	*localAdFile;	// path of the last daemon ad file written, or NULL
> 	void	UpdateLocalAd(ClassAd *daemonAd);	// write ad to <SUBSYS>_DAEMON_AD_FILE
> 
Index: condor_daemon_core.V6/daemon_core.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_daemon_core.V6/daemon_core.C,v
retrieving revision 1.95
diff -r1.95 daemon_core.C
356a357,358
> 	
> 	localAdFile = NULL;
469a472,478
> 
> 		
> 	if(localAdFile) {
> 		free(localAdFile);
> 		localAdFile = NULL;
> 	}
> 	
8222a8232,8259
> 
> 
> void
> DaemonCore::UpdateLocalAd(ClassAd *daemonAd) 
> {
>     FILE    *AD_FILE;
>     MyString localAd_path;
> 
>     localAd_path.sprintf( "%s_DAEMON_AD_FILE", mySubSystem );
> 
> 	// localAdFile is a public DaemonCore member; it is freed at shutdown
>     if( localAdFile ) {
>         free( localAdFile );
>     }
>     localAdFile = param( localAd_path.Value() );
> 
>     if( localAdFile ) {
>         if( (AD_FILE = fopen(localAdFile, "w")) ) {
>             daemonAd->fPrint(AD_FILE);
>             fclose( AD_FILE );
>         } else {
>             dprintf( D_ALWAYS,
>                      "DaemonCore: ERROR: Can't open daemon ad file %s\n",
>                      localAdFile );
>         }
>     }
> }
> 
Index: condor_daemon_core.V6/daemon_core_main.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_daemon_core.V6/daemon_core_main.C,v
retrieving revision 1.70
diff -r1.70 daemon_core_main.C
193a194,211
> 	
> 	if(daemonCore) {
> 		if( daemonCore->localAdFile ) {
> 			if( unlink(daemonCore->localAdFile) < 0 ) {
> 				dprintf( D_ALWAYS, 
> 						 "DaemonCore: ERROR: Can't delete classad file %s\n",
> 						 daemonCore->localAdFile );
> 			} else {
> 				if( DebugFlags & (D_FULLDEBUG | D_DAEMONCORE) ) {
> 					dprintf( D_DAEMONCORE, "Removed local classad file %s\n", 
> 							 daemonCore->localAdFile );
> 				}
> 			}
> 			free( daemonCore->localAdFile );
> 			daemonCore->localAdFile = NULL;
> 		}
> 	}
> 
268d285
< 
Index: condor_schedd.V6/schedd.C
===================================================================
RCS file: /p/condor/repository/CONDOR_SRC/src/condor_schedd.V6/schedd.C,v
retrieving revision 1.246
diff -r1.246 schedd.C
9791a9792,9798
> 	// This is foul, but a SCHEDD_ADTYPE _MUST_ have a NUM_USERS attribute
> 	// (see condor_classad/classad.C).
> 	// Since we don't know how many users there are yet, just say 0; it will
> 	// get fixed in count_job() -Erik 12/18/2006
> 	sprintf(expr, "%s = %d", ATTR_NUM_USERS, 0);
>     ad->Insert(expr);
> 
9868a9876
> 	daemonCore->UpdateLocalAd(ad);
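In `readLocalClassAd()` the ad is allocated with `new`, so every early `return false` has to remember to `delete` it. In modern C++ the usual alternative is to hold the ad in a `std::unique_ptr`, which frees it on every return path. The C++17 sketch below is hypothetical: `MiniAd` is a map standing in for `ClassAd`, `LocalDaemonInfo` and `infoFromAd` are illustrative names, and the key strings assume `ATTR_VERSION`, `ATTR_PLATFORM`, `ATTR_MACHINE` and `ATTR_NAME` expand to `CondorVersion`, `CondorPlatform`, `Machine` and `Name`. It keeps the patch's split between required and best-effort attributes.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a parsed ClassAd.
using MiniAd = std::map<std::string, std::string>;

struct LocalDaemonInfo {
    std::string addr, version, platform, name, machine;
};

// Address, version and machine are required; platform and name are
// best-effort, as in readLocalClassAd(). The ad is owned by a unique_ptr,
// so it is freed on every return path without explicit deletes.
std::optional<LocalDaemonInfo> infoFromAd(std::unique_ptr<MiniAd> ad,
                                          const std::string& ipAttr) {
    if (!ad) {
        return std::nullopt;  // parse error: nothing to free
    }
    auto get = [&ad](const std::string& key, std::string& out) {
        const auto it = ad->find(key);
        if (it == ad->end()) return false;
        out = it->second;
        return true;
    };
    LocalDaemonInfo info;
    if (!get(ipAttr, info.addr) || !get("CondorVersion", info.version) ||
        !get("Machine", info.machine)) {
        return std::nullopt;  // the unique_ptr deletes the ad here
    }
    get("CondorPlatform", info.platform);
    get("Name", info.name);
    return info;              // ...and here
}
```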