Dan,

I noticed in the 6.7.15 release notes that negotiation is improved, and since we've got interest in some 6.7 features anyway, I've upgraded our pools to 6.7.17 (we were on 6.6.10).

Haven't seen any timeouts since, but then again, I've not had quite such huge queues as we did on the day I first hit this...

-Preston

On Mar 17, 2006, at 11:42 AM, Dan Bradley wrote:
Preston,

Any word on the schedd scaling issues?

I just realized that I described the meaning of SCHEDD_TIMEOUT_MULTIPLIER backwards from how it actually is. This setting increases the timeouts used by the schedd when communicating with others. In general, <SUBSYS>_TIMEOUT_MULTIPLIER increases the network timeouts used by a particular subsystem of Condor. Therefore, if you are seeing timeouts in the shadow logs, you should try setting SHADOW_TIMEOUT_MULTIPLIER to some integer value greater than 1.

Also, if your negotiator logs show evidence that the schedd is not requesting claims in time for the next negotiation cycle, you may want to increase NEGOTIATOR_CYCLE_DELAY. The log message that would indicate this sort of problem is this:

3/6 10:14:15 Resource vm3@xxxxxxxxxxxx@<nnn.nnn.nnn.nnn:34558> was not claimed by user@xxxxxxxxxxx - removing match

--Dan

Preston Smith wrote:

Dan,

I'd read about NEGOTIATOR_TIMEOUT and turned it up to 60, but it wasn't enough. Are there any formulas, so to speak, for setting a good value for it on a busy schedd? Don't want to set it too high.

Didn't know about SCHEDD_TIMEOUT_MULTIPLIER, though; I'll try that, too.

Thanks,
-Preston

On Mar 2, 2006, at 3:21 PM, Dan Bradley wrote:

Preston,

I haven't looked at all of your reports in detail, but I'm guessing you may need to adjust some of the following timeouts if the schedd is not responding quickly enough to queries:

NEGOTIATOR_TIMEOUT
Sets the timeout that the negotiator uses on its network connections to the condor_schedd and condor_startds. It is defined in seconds and defaults to 30.

SCHEDD_TIMEOUT_MULTIPLIER
Set this to some integer (e.g. 2 or 10) to increase the timeouts that are used when communicating with the schedd.

--Dan

On Mar 2, 2006, at 1:44 PM, Preston Smith wrote:

On Mar 1, 2006, at 1:12 PM, Maxim Kovgan wrote:

Hi, Preston. Qs:

* Are you using host-based firewalls?
No.

* Can you look at /var/log/messages too?
Nothing syslogged besides gridftp connections.

* Are you using good equipment (routers/switches)?
Yea. All my condor gear is directly connected into a Cisco 6509 core switch. Cluster nodes are all on Cisco 4948 leaf switches with 10 Gbit links back to said core switch.

* What is the topology of your network?
See above.

* I suspect the problem is either with OS or network, anyway, not condor related.
This schedd has been humming along busily for weeks, right up until it got to about 3000 jobs queued up. The problem goes away when I hold half or so of the jobs in this schedd. Now, with a large chunk of the queue held, condor's negotiated and started hundreds of jobs like it should. I've got the queue drained by now, though, just by holding a big chunk and periodically releasing 600-700 jobs. So while I never really solved the problem, I've worked around it.
-Preston
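Pulling the settings mentioned in this thread together, here is a rough sketch of what the relevant condor_config entries could look like. The values are only illustrative (taken from or near the numbers discussed above, not tuned recommendations), and the negotiator knobs would presumably live in the central manager's config while the multipliers belong on the submit machine:

  # Timeout (seconds) the negotiator uses on its connections to the
  # condor_schedd and condor_startds; Condor's default is 30.
  NEGOTIATOR_TIMEOUT = 60

  # Extra breathing room between negotiation cycles, so a busy schedd
  # can finish requesting claims before the next cycle starts.
  # ("... was not claimed by ... - removing match" in the NegotiatorLog
  # is the symptom Dan describes.)
  NEGOTIATOR_CYCLE_DELAY = 120

  # <SUBSYS>_TIMEOUT_MULTIPLIER scales the network timeouts a given
  # Condor subsystem uses when talking to others: the schedd's own
  # timeouts here, and the shadows' timeouts below (useful if the
  # ShadowLog is full of timeouts).
  SCHEDD_TIMEOUT_MULTIPLIER = 2
  SHADOW_TIMEOUT_MULTIPLIER = 2

A condor_reconfig (or a restart of the affected daemons) should be enough to pick these up.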
--
Preston Smith <psmith@xxxxxxxxxx>
Systems Research Engineer
Rosen Center for Advanced Computing, Purdue University