Hi!
So I've now ruled out I/O bottlenecks by putting /var/lib/condor and
/var/log/condor on tmpfs, and I set transfer_executable = False and
should_transfer_files = NO to rule out network transfer overhead as
well.
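For reference, the tmpfs mounts look roughly like this in /etc/fstab
(the sizes are just what I picked; adjust as needed):

```
tmpfs  /var/lib/condor  tmpfs  size=2G,mode=0755  0 0
tmpfs  /var/log/condor  tmpfs  size=1G,mode=0755  0 0
```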
Now I suspect the bottleneck is the number of context switches:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 10 1325 102 217 0 0 1 25 9 11 3 2 93 2 0
0 0 10 1324 102 217 0 0 0 8 1945 3087 10 8 79 3 0
0 0 10 1323 102 217 0 0 0 0 1964 3105 13 8 80 0 0
0 0 10 1322 102 217 0 0 0 0 2267 3608 12 9 79 0 0
0 0 10 1321 102 217 0 0 0 34 1502 2395 8 6 86 0 0
0 0 10 1320 102 217 0 0 0 0 1969 3088 13 8 79 0 1
0 0 10 1319 102 217 0 0 0 84 2291 3654 12 9 76 4 0
1 0 10 1318 102 217 0 0 0 0 2083 3089 23 10 67 0 0
0 0 10 1317 102 218 0 0 0 0 2070 3303 10 9 81 0 1
0 0 10 1316 102 218 0 0 0 54 1257 1994 6 5 88 2 0
0 0 10 1315 102 218 0 0 0 0 1975 3146 12 8 80 0 0
0 0 10 1314 102 218 0 0 0 0 2375 3810 12 10 79 0 0
0 0 10 1313 102 218 0 0 0 0 2017 3158 13 8 78 0 1
~3800 context switches/sec seems a bit high. Any idea how I can tune
HTCondor or Linux to bring this down?
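In case it helps, this is a quick sketch of the script I use to see
which processes the switches are coming from. It just reads the
counters in /proc, so it is Linux-only; pidstat -w from the sysstat
package reports the same per-process numbers.

```python
# List the processes with the most context switches, by reading the
# voluntary/nonvoluntary counters from /proc/<pid>/status (Linux only).
import os


def ctxt_switches(pid):
    """Return (voluntary, nonvoluntary) context-switch counts for a PID."""
    vol = nonvol = 0
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("voluntary_ctxt_switches:"):
                    vol = int(line.split()[1])
                elif line.startswith("nonvoluntary_ctxt_switches:"):
                    nonvol = int(line.split()[1])
    except OSError:
        pass  # process exited between listing and reading
    return vol, nonvol


def top_switchers(n=10):
    """Return the n (total_switches, pid) pairs with the highest totals."""
    totals = []
    for entry in os.listdir("/proc"):
        if entry.isdigit():
            vol, nonvol = ctxt_switches(int(entry))
            totals.append((vol + nonvol, int(entry)))
    return sorted(totals, reverse=True)[:n]


if __name__ == "__main__":
    for total, pid in top_switchers():
        print(pid, total)
```

Note these are cumulative counters since process start, so to get a
rate you'd sample twice and diff.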
Thanks,
Daniel
2013/8/2 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> On 8/2/2013 9:09 AM, Dan Bradley wrote:
>>
>>
>> Be aware that turning off fsync in the condor_schedd can lead to loss of
>> job state in the event of power loss or other sudden death of the
>> schedd. This could result in jobs that were submitted shortly before
>> the outage disappearing from the queue without being run. It could also
>> result in jobs being run twice.
>>
>> If that is acceptable for your purposes, then your problem is solved. If
>> it is not acceptable, then focus on improving the performance of the
>> filesystem containing $(SPOOL).
>>
>
> FWIW, on our busy submit nodes (dozens of users with typically thousands of
> running jobs), we put $(SPOOL) on a solid-state drive (SSD). Specifically,
> we mount the SSD on /ssd and then put in condor_config:
> JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log
> The above puts job_queue.log on the SSD - this is the schedd's job
> queue, the file that gets a lot of fsyncs on transaction boundaries.
> By using JOB_QUEUE_LOG, the SSD can be small and cheap, since it does
> not have to hold the entire contents of the $(SPOOL) directory.
>
> Performance is greatly improved and the risks Dan outlines above are
> avoided.
>
> -Todd
>
>
>> --Dan
>>
>> On 8/2/13 8:14 AM, Pek Daniel wrote:
>>>
>>> Thanks, the FSYNC trick solved the issue! :)
>>>
>>>
>>> 2013/8/1 Dan Bradley <dan@xxxxxxxxxxxx>:
>>>
>>>
>>>
>>> Are you timing just condor_submit, or are you also timing job
>>> run/completion rates?
>>>
>>> Job submissions cause the schedd to commit a transaction to
>>> $(SPOOL)/job_queue.log. If the disk containing that is slow,
>>> submissions will be slow. One way to verify if this is the
>>> limiting factor is to add the following to your configuration:
>>>
>>> CONDOR_FSYNC = FALSE
>>>
>>> Another thing to keep in mind is that if you can batch submissions
>>> of many jobs into a single submit file, there will be fewer
>>> transactions.
>>>
>>> --Dan
>>>
>>>
>>> On 8/1/13 10:17 AM, Pek Daniel wrote:
>>>
>>> Hi!
>>>
>>> I'm experimenting with condor: I'm trying to submit a lot of
>>> dummy
>>> jobs with condor_submit from multiple submission hosts
>>> simultaneously.
>>> I have only a single schedd, which I'm trying to stress-test.
>>> These jobs are in the vanilla universe.
>>>
>>> The problem is that I can't get more than 4-6 submissions/sec,
>>> which seems a little low. I can't see any real bottleneck on the
>>> machine, so I suspect some default configuration value is
>>> throttling submission requests.
>>>
>>> Any idea how to solve this?
>>>
>>> Thanks,
>>> Daniel
>>>
>>>
>>> _______________________________________________
>>> HTCondor-users mailing list
>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx
>>> with a subject: Unsubscribe
>>> You can also unsubscribe by visiting
>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>
>>> The archives can be found at:
>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>
>>
>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
> Center for High Throughput Computing Department of Computer Sciences
> HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132 Madison, WI 53706-1685
>