HTCondor Project List Archives



Re: [Condor-devel] [Condor-users] Antwort: Re: Fault Behaviour of Condor



On 8/11/06, Nomura Kohei <kh-nomura@xxxxxxxxx> wrote:
> incidentally did you snip a bunch of stuff from the logs after the second
> line?
No, I didn't snip.

> This is starting to smell like it might be a bug but the condor guys
> would probably be much better at debugging it from here...
OK, I will contact the Condor team.

Just did a **really** quick look at this, reading mainly the comments
rather than the code (always a bad idea, but it takes at least half an
hour before my brain can read C with any kind of utility these days).

condor_schedd.V6.C line 10611

/*
 * go through match records and send alive messages to all the startds.
*/

bool
sendAlive( match_rec* mrec )
{
	SafeSock	sock;
	char		*id = NULL;
	
	sock.timeout(STARTD_CONTACT_TIMEOUT);
	sock.encode();

	DCStartd d( mrec->peer );
	id = mrec->id;

	dprintf (D_PROTOCOL,"## 6. Sending alive msg to %s\n", id);

	if( !sock.connect(mrec->peer) || !d.startCommand ( ALIVE, &sock) ||
	    !sock.code(id) || !sock.end_of_message()) {
			// UDP transport out of buffer space!
		dprintf(D_ALWAYS, "\t(Can't send alive message to %s)\n",
				mrec->peer);
		return false;
	}
		/* TODO: Someday, especially once the accountant is done,
		   the startd should send a keepalive ACK back to the schedd.
		   If there is no shadow to this machine, and we have not
		   had a startd keepalive ACK in X amount of time, then we
		   should relinquish the match.  Since the accountant is
		   not done and we are in fire mode, leave this
		   for V6.1.  :^) -Todd 9/97
		*/
	return true;
}

This would indicate that (assuming the keep-alive timer was running)
the schedd believed the remote machine was still there; otherwise there
would have been a log statement to that effect. Based on this:

	while (matches->iterate(mrec) == 1) {
		if( mrec->status == M_ACTIVE ) {
			SetAttributeInt( mrec->cluster, mrec->proc,
							 ATTR_LAST_JOB_LEASE_RENEWAL, now );
		}
	}

The last lease renewal would have been set but...

8/3 15:41:37 (290.0) (372): LastJobLeaseRenewal: 1154580099 Thu Aug 03 13:41:39 2006

This would indicate it was still using the value from when the job was
started (or from a set that happened just before the NIC was pulled).
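To put a number on how stale that value is, here is a rough sketch of the arithmetic on the log line above. It assumes the log timestamp (15:41:37) and the printed lease time (13:41:39) are on the same day and in the same local time zone; the function names are made up for illustration and are not part of Condor.

```cpp
#include <cassert>
#include <cstdint>

// Seconds since midnight for a wall-clock time h:m:s.
inline std::int64_t secondsOfDay(int h, int m, int s) {
    assert(h >= 0 && h < 24 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return static_cast<std::int64_t>(h) * 3600 + m * 60 + s;
}

// Seconds since midnight UTC for a non-negative Unix timestamp.
inline std::int64_t utcSecondsOfDay(std::int64_t epoch) {
    assert(epoch >= 0);
    return epoch % 86400;
}

// Age of the lease at the moment the log line was written.
// Assumption: both times fall on the same day, in the same time zone.
inline std::int64_t leaseAgeSeconds(std::int64_t logSecondsOfDay,
                                    std::int64_t leaseSecondsOfDay) {
    return logSecondsOfDay - leaseSecondsOfDay;
}
```

With the values from the log, leaseAgeSeconds(secondsOfDay(15, 41, 37), secondsOfDay(13, 41, 39)) comes to 7198, so the lease had gone nearly two hours without a renewal. As a sanity check on the raw value, utcSecondsOfDay(1154580099) is 16899 (04:41:39 UTC), which matches the printed 13:41:39 if the machine is at UTC+9.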

This seems to indicate that either the job lease timer wasn't firing
or this particular claim was not in the loop, right?
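The second possibility can be shown with a toy model of the renewal loop quoted above. This is only a sketch: Status, Match and renewLeases are hypothetical stand-ins, not the schedd's real match_rec or SetAttributeInt, and the model assumes nothing beyond "only active matches get their lease timestamp updated".

```cpp
#include <cassert>
#include <ctime>
#include <vector>

// Hypothetical stand-in for the match states; only "active" matters here.
enum class Status { Claimed, Active };

// Hypothetical stand-in for a match record plus its job's lease attribute.
struct Match {
    int cluster;
    int proc;
    Status status;
    std::time_t lastLeaseRenewal;
};

// Mirrors the shape of the quoted loop: renew the lease only for matches
// whose status is active. Returns how many leases were renewed.
int renewLeases(std::vector<Match>& matches, std::time_t now) {
    int renewed = 0;
    for (auto& m : matches) {
        if (m.status == Status::Active) {
            assert(now >= m.lastLeaseRenewal);  // time should not run backwards
            m.lastLeaseRenewal = now;
            ++renewed;
        }
    }
    return renewed;
}
```

In this model a match that is not in the active state keeps whatever timestamp it was given at start, however often the timer fires. That is the same symptom as the timer never firing at all, so the LastJobLeaseRenewal value on its own cannot tell the two cases apart.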

Do you think I should suggest he enable D_PROTOCOL debugging, or just
leave it to you lot? :)

Matt