
Re: [Condor-users] condor fault tolerance



Hi Lans,

Thanks for the info.

I'm running 7.4.2 and added that line to the condor_config on my
startds and schedds, but it doesn't seem to help. Do I need 7.5.4 or
later for this feature (that appears to be when the default changed)?
There are no firewalls or similar between these systems.

With that change, what is the maximum interval I should expect to see
before the system adjusts?

Thanks,
Paul


On Mon, Oct 11, 2010 at 4:19 PM, Lans Carstensen
<Lans.Carstensen@xxxxxxxxxxxxxx> wrote:
> Paul Marshall wrote:
>>
>> Hello,
>>
>> I haven't been able to find any more up-to-date information on this issue:
>>
>> https://www-auth.cs.wisc.edu/lists/condor-users/2007-March/msg00026.shtml
>>
>> Could someone point me in the right direction? What is the best way to
>> decrease the time that it takes Condor to recognize a node has failed
>> and drop it from the system?
>
> There's work going on to reverse the keepalive message direction:
>
> https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=671
>
> You can experiment with that on your own by setting the following on both
> your startd's and schedd's:
>
> STARTD_SENDS_ALIVES=true
>
> -- Lans Carstensen