Hi,

I've been running various performance tests on HTCondor for six days, and during that period my negotiator was killed and restarted from time to time, with this message:

"/usr/sbin/condor_negotiator" on "condormaster1" was killed because it was no longer responding. Condor will automatically restart this process in 10 seconds.

The last log lines before each kill are completely ordinary, with nothing suspicious in them. I can't see any obvious bottleneck, except for a peak on the networking plots: ~450 sent packets/sec, ~0.4 sent MBytes/sec, ~900 received packets/sec, ~1.4 received MBytes/sec.

I have 100 subcollectors running on 2 machines (50 on each). One of the machines also runs the main collector, and the other runs the negotiator. I have ~700 worker nodes with 33,600 job slots (I turn them on and off during the tests), and during the tests I repeatedly submitted batches of roughly 80,000 to 400,000 jobs spread among 10 schedds. So I could say that I've done everything to keep the negotiator really busy.

I attach a weekly graph covering the testing period.

[root@condormaster1 ~]# condor_version
$CondorVersion: 8.1.2 Oct 19 2013 BuildID: 189797 $
$CondorPlatform: x86_64_RedHat6 $

Thanks,
Daniel
Attachment: graph.png (PNG image)