Hi!
So I've now ruled out I/O bottlenecks by putting /var/lib/condor and
/var/log/condor on tmpfs, and I set transfer_executable = False and
should_transfer_files = NO to rule out network transfer overhead as
well.
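For reference, the tmpfs mounts look roughly like this in /etc/fstab
(the sizes are just what I picked; adjust as needed):

```
tmpfs  /var/lib/condor  tmpfs  size=2G,mode=0755  0 0
tmpfs  /var/log/condor  tmpfs  size=1G,mode=0755  0 0
```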
Now I suspect the bottleneck is the number of context switches:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 10 1325 102 217 0 0 1 25 9 11 3 2 93 2 0
0 0 10 1324 102 217 0 0 0 8 1945 3087 10 8 79 3 0
0 0 10 1323 102 217 0 0 0 0 1964 3105 13 8 80 0 0
0 0 10 1322 102 217 0 0 0 0 2267 3608 12 9 79 0 0
0 0 10 1321 102 217 0 0 0 34 1502 2395 8 6 86 0 0
0 0 10 1320 102 217 0 0 0 0 1969 3088 13 8 79 0 1
0 0 10 1319 102 217 0 0 0 84 2291 3654 12 9 76 4 0
1 0 10 1318 102 217 0 0 0 0 2083 3089 23 10 67 0 0
0 0 10 1317 102 218 0 0 0 0 2070 3303 10 9 81 0 1
0 0 10 1316 102 218 0 0 0 54 1257 1994 6 5 88 2 0
0 0 10 1315 102 218 0 0 0 0 1975 3146 12 8 80 0 0
0 0 10 1314 102 218 0 0 0 0 2375 3810 12 10 79 0 0
0 0 10 1313 102 218 0 0 0 0 2017 3158 13 8 78 0 1
~3800 context switches/sec seems a bit high. Any idea how I can tune
HTCondor or Linux to bring this down?
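In case it helps, this is a quick sketch of the script I use to see
which processes the switches are coming from. It just reads the
counters in /proc, so it is Linux-only; pidstat -w from the sysstat
package reports the same per-process numbers.

```python
# List the processes with the most context switches, by reading the
# voluntary/nonvoluntary counters from /proc/<pid>/status (Linux only).
import os


def ctxt_switches(pid):
    """Return (voluntary, nonvoluntary) context-switch counts for a PID."""
    vol = nonvol = 0
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("voluntary_ctxt_switches:"):
                    vol = int(line.split()[1])
                elif line.startswith("nonvoluntary_ctxt_switches:"):
                    nonvol = int(line.split()[1])
    except OSError:
        pass  # process exited between listing and reading
    return vol, nonvol


def top_switchers(n=10):
    """Return the n (total_switches, pid) pairs with the highest totals."""
    totals = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            vol, nonvol = ctxt_switches(int(entry))
            totals.append((vol + nonvol, int(entry)))
    return sorted(totals, reverse=True)[:n]


if __name__ == "__main__":
    for total, pid in top_switchers():
        print(pid, total)
```

Note these are cumulative counters since process start, so to get a
rate you'd sample twice and diff.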
Thanks,
Daniel
2013/8/2 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> On 8/2/2013 9:09 AM, Dan Bradley wrote:
>>
>>
>> Be aware that turning off fsync in the condor_schedd can lead to loss of
>> job state in the event of power loss or other sudden death of the
>> schedd. This could result in jobs that were submitted shortly before
>> the outage disappearing from the queue without being run. It could also
>> result in jobs being run twice.
>>
>> If that is acceptable for your purposes, then your problem is solved. If
>> it is not acceptable, then focus on improving the performance of the
>> filesystem containing $(SPOOL).
>>
>
> FWIW, on our busy submit nodes (dozens of users with typically thousands of
> running jobs), we put $(SPOOL) on a solid-state drive (SSD). Specifically,
> we mount the SSD on /ssd and then put in condor_config:
> JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log
> The above puts job_queue.log on the SSD - this is the schedd's job
> queue, the file that gets a lot of fsyncs on transaction boundaries.
> By using JOB_QUEUE_LOG, the SSD can be small and cheap, since it does
> not have to hold the entire contents of the $(SPOOL) directory.
>
> Performance is greatly improved and the risks Dan outlines above are
> avoided.
>
> -Todd
>
>
>> --Dan
>>
>> On 8/2/13 8:14 AM, Pek Daniel wrote:
>>>
>>> Thanks, the FSYNC trick solved the issue! :)
>>>
>>>
>>> 2013/8/1 Dan Bradley <dan@xxxxxxxxxxxx>:
>>>
>>>
>>>
>>> Are you timing just condor_submit, or are you also timing job
>>> run/completion rates?
>>>
>>> Job submissions cause the schedd to commit a transaction to
>>> $(SPOOL)/job_queue.log. If the disk containing that is slow,
>>> submissions will be slow. One way to verify if this is the
>>> limiting factor is to add the following to your configuration:
>>>
>>> CONDOR_FSYNC = FALSE
>>>
>>> Another thing to keep in mind is that if you can batch submissions
>>> of many jobs into a single submit file, there will be fewer
>>> transactions.
>>>
>>> --Dan
>>>
>>>
>>> On 8/1/13 10:17 AM, Pek Daniel wrote:
>>>
>>> Hi!
>>>
>>> I'm experimenting with condor: I'm trying to submit a lot of
>>> dummy
>>> jobs with condor_submit from multiple submission hosts
>>> simultaneously.
>>> I have only a single schedd, which I'm trying to stress-test.
>>> These jobs are in the vanilla universe.
>>>
>>> The problem is that I can't get more than 4-6 submissions/sec,
>>> which seems a little low. I can't see any real bottleneck on the
>>> machine, so I suspect some default configuration value is
>>> throttling submission requests.
>>>
>>> Any idea how to solve this?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>> with a subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>
>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing Department of Computer Sciences
> HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132 Madison, WI 53706-1685
>