Re: [Condor-users] Fault Behaviour of Condor
- Date: Mon, 14 Aug 2006 17:31:31 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Fault Behaviour of Condor
At 10:57 AM 8/2/2006, Matt Hope wrote:
> On 8/2/06, thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx
> <thomas.t.hoppe@xxxxxxxxxxxxxxxxxxx> wrote:
> > 3.) Shutting down the NIC on the executor (I assume same as pulling the
> > plug)
> > Outcome: Condor hangs, a shadow process is existing all the time
> > I even cannot remove the job with condor_rm!
> > Maybe a bug? what can I do?
> condor_rm -forcex may get rid of it (you may need to kill off the
> shadow by hand, it should eventually timeout though, how long did you
> give it?).
Here is the story:
Condor sends regular "pings" from the submit machine (schedd) to the
execute machine (startd). Thus the execute machine knows relatively
quickly if a submit machine disappears (because it will not receive a
ping within the anticipated timeframe). You can configure how often
these pings happen, and how quickly the Condor execute machine will
give up on its current job, by tweaking the job lease parameter in the
submit file.
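To illustrate the lease idea in general terms, here is a minimal Python sketch of how the receiving side of periodic pings can decide that its peer has disappeared. This is not Condor's code: the class name LeaseMonitor, its methods, and the lease_seconds parameter are all hypothetical and exist only for this example.

```python
import time


class LeaseMonitor:
    """Hypothetical lease tracker for the side that receives periodic pings."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self._clock = clock
        # Treat creation time as the first ping so the lease starts fresh.
        self._last_ping = clock()

    def record_ping(self):
        # Every ping from the peer renews the lease.
        self._last_ping = self._clock()

    def expired(self):
        # True once no ping has arrived within the lease window.
        return self._clock() - self._last_ping > self.lease_seconds
```

In this sketch the execute side would call record_ping() whenever a ping arrives and check expired() periodically. A shorter lease means a vanished submit machine is noticed sooner, at the cost of being less tolerant of brief network hiccups.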
However, going the other way is a different story. There is currently
no way to configure how often "pings" happen from the execute machine
back to the submit machine. Thus, there is no way to configure how
long it takes before the submit machine (the condor_shadow) notices
that an execute machine has fallen off the network. Have no fear,
however, as Condor most definitely *will* notice it eventually, but
you may need to be patient. The socket created by the condor_shadow
has TCP's keepalive option (SO_KEEPALIVE) enabled. However, the TCP
standard (RFC 1122) says the default keepalive interval must be no
less than two hours. So in the worst case, the shadow may take roughly
two hours to notice that the execute machine has fallen off the
network without a trace. This is an issue we'd like to improve.
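For readers unfamiliar with TCP keepalive, the following Python sketch shows how the option is enabled on a socket and why detection takes so long with default settings. It is an illustration only and not the condor_shadow's code. The helper names enable_keepalive and worst_case_detection_seconds are hypothetical, and the per-socket tuning options (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) are platform-specific, so the sketch skips any the platform does not provide.

```python
import socket


def enable_keepalive(sock, idle=None, interval=None, count=None):
    """Turn on TCP keepalive; optionally tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)  # missing on some platforms
        if value is not None and option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


def worst_case_detection_seconds(idle, interval, count):
    # Idle period, then `count` probes spaced `interval` seconds apart.
    return idle + interval * count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
    # Common Linux defaults: 7200 s idle, 75 s interval, 9 probes.
    print(worst_case_detection_seconds(7200, 75, 9))
```

With the common Linux defaults used above, the idle period alone is two hours and the probes add a few more minutes, which matches the long wait described here. Lowering the idle time per socket is one way an application could notice a dead peer sooner, assuming the platform supports those options.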
Hope this helps clarify things,
Todd
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257
http://www.cs.wisc.edu/~tannenba Madison, WI 53706-1685
Phone: (608) 263-7132 FAX: (608) 262-9777