HTCondor Project List Archives




Re: [Condor-devel] STARTD_SENDS_ALIVES = TRUE is known safe?



Matthew Farrellee wrote:
Recently the default for STARTD_SENDS_ALIVES changed from FALSE to TRUE.

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=671
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1420


Are these tickets dups of each other, and should one be abandoned?

A few questions -

o What happens in the case of a pre-7.5.4 Schedd configured with STARTD_SENDS_ALIVES=TRUE and a post-7.5.4 Startd?

This is a concern. Currently the code requires the startd and schedd to be in agreement (or at least, this is how it was before Jaime's patch). This knob should be centralized into just the schedd, and the startd does whichever it is told.
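
To make the agreement requirement concrete, here is a minimal sketch that compares the STARTD_SENDS_ALIVES setting seen by a schedd and a startd. The helper names (read_bool_knob, alives_settings_agree) are hypothetical and not part of HTCondor, the parser only handles simple KEY = VALUE lines, and the per-daemon defaults are passed in by the caller because they depend on the version each daemon runs.

```python
def read_bool_knob(config_text, name, default):
    """Return the last boolean assigned to `name` in simple KEY = VALUE text.

    Hypothetical helper: ignores macros, includes, and anything beyond
    plain assignments. Knob names and TRUE/FALSE are matched case-insensitively.
    """
    value = default
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip().upper() == name.upper():
            value = val.strip().upper() == "TRUE"
    return value


def alives_settings_agree(schedd_cfg, startd_cfg, schedd_default, startd_default):
    """True when both daemons would use the same STARTD_SENDS_ALIVES value."""
    name = "STARTD_SENDS_ALIVES"
    schedd_value = read_bool_knob(schedd_cfg, name, schedd_default)
    startd_value = read_bool_knob(startd_cfg, name, startd_default)
    return schedd_value == startd_value
```

For the mixed-version case asked about above, an older schedd (default FALSE) that sets the knob to TRUE explicitly agrees with a newer startd (default TRUE), while the same pair with no explicit setting on either side does not.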

o We can currently disconnect a Schedd and it will reconnect with running jobs when it returns.
  . Is this still possible with the Startd sending the alives?

Yes.

  . What is the impact on the Startd when the Schedd is not accessible?

None.  All communication is non-blocking.

  . Is a test being written for shadow-starter reconnect?

An orthogonal issue, but definitely a good idea.

o A benefit of STARTD_SENDS_ALIVES is that it is TCP and ACKd.
  . What other configuration changes must be made to a Schedd that is managing 10Ks of jobs?

I don't see what you are driving at with regard to having the startd send alives.

  . #671 suggests offloading work from the Schedd. What's the impact on Schedd performance in responding to ALIVES? What extra resources are used?

Arguably, the impact may be less than how it worked before. Before, the schedd sprayed out packets to all startds in one big blocking loop. Now the individual alives come in and can be "time-shared" across other daemon core activities, e.g. timers and other commands can run in between responding to different alives.
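
A toy model of that difference, as a sketch only: it is not HTCondor code, the function names are made up, and the strict round-robin interleaving is an assumption chosen for illustration (real ordering would depend on when each alive arrives).

```python
from collections import deque


def blocking_loop(startds, other_work):
    """Old model: one loop sends every keepalive before anything else runs."""
    order = [f"send-alive:{s}" for s in startds]
    order.extend(other_work)  # timers and commands wait for the whole loop
    return order


def interleaved(startds, other_work):
    """New model: each incoming alive is its own event, so queued timers
    and commands can be serviced between alives."""
    alives = deque(f"ack-alive:{s}" for s in startds)
    others = deque(other_work)
    order = []
    while alives or others:
        if alives:
            order.append(alives.popleft())
        if others:
            order.append(others.popleft())
    return order
```

With two startds and one pending timer, the blocking version runs the timer last, while the interleaved version runs it after the first alive has been handled.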

The above is just theory.... We do have some concrete evidence of the STARTD_SENDS_ALIVES mechanism being scalable, however, as both CMS and ATLAS glideins have used this option for years, and these are some of the largest pools we know about.

o Before, if we wanted to renew all leases when a Schedd is going down we could with a small change to the Schedd. Is this still possible, or must a new protocol be created between Schedd and Startd?


Probably no longer possible, although we haven't taken advantage of this capability to date....

regards,
Todd

Best,


matt


--
Todd Tannenbaum                       University of Wisconsin-Madison
Center for High Throughput Computing  Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685