Should I expect moving the $(SPOOL) directory (or just the queue log)
to SSD to increase the responsiveness of condor_q, in addition to
potentially increasing job submission rates?
I am in the regime Todd mentioned: dozens of users with 2,000-3,000
jobs per submit host. I want to upgrade the disks anyway and will
probably switch to SSD for the Condor partitions, but I'd like to set
my expectations (and cost justifications) appropriately.
I tried turning off fsync and running condor_reconfig and did not see
an obvious change. I don't intend to run in this state (or on tmpfs)
anyhow. The condor_q latency, as measured by "time condor_q", can be
anywhere from 0.5 seconds to 10 seconds, with no obvious correlation
to the number of jobs in the queue or to how many are idle/running.
Those on the LIGO list may recall I've had some NFS-related slowness
on our cluster lately, so there may be other reasons for the latency
beyond the older disks.
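For concreteness, a minimal sketch of how the latency can be sampled
(this assumes bash and GNU time; the interval and sample count are
arbitrary):

  # Time condor_q 20 times, 30 seconds apart, then print the five
  # slowest wall-clock latencies in seconds.
  for i in $(seq 1 20); do
      /usr/bin/time -f "%e" condor_q > /dev/null
      sleep 30
  done 2>&1 | sort -n | tail -5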
FYI: the man page for condor_reconfig does not successfully convert
the section and page numbers from the PDF manual into text at the
point where it tells you which variables do not get properly reset
under condor_reconfig.
--
Tom Downes
Associate Scientist and Data Center Manager
Center for Gravitation, Cosmology and Astrophysics
University of Wisconsin-Milwaukee
414.229.2678
On Wed, Aug 7, 2013 at 6:11 AM, Pek Daniel <pekdaniel@xxxxxxxxx> wrote:
Hi!
So I finally eliminated the possibility of I/O bottlenecks by putting
/var/lib/condor and /var/log/condor onto tmpfs. I also set
transfer_executable = False and should_transfer_files = NO (to rule
out a network bottleneck).
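For reference, a minimal submit description with those settings would
look something like this (/bin/sleep and the job count here are
placeholders, not the exact values from my test):

  # Illustrative stress-test submit file; transfer is disabled, so
  # the executable must already exist on the execute machines.
  universe              = vanilla
  executable            = /bin/sleep
  arguments             = 1
  transfer_executable   = False
  should_transfer_files = NO
  queue 1000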
Now I suspect the bottleneck is the number of context switches:
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b  swpd  free  buff  cache   si   so    bi    bo    in    cs  us sy id wa st
 3  0    10  1325   102    217    0    0     1    25     9    11   3  2 93  2  0
 0  0    10  1324   102    217    0    0     0     8  1945  3087  10  8 79  3  0
 0  0    10  1323   102    217    0    0     0     0  1964  3105  13  8 80  0  0
 0  0    10  1322   102    217    0    0     0     0  2267  3608  12  9 79  0  0
 0  0    10  1321   102    217    0    0     0    34  1502  2395   8  6 86  0  0
 0  0    10  1320   102    217    0    0     0     0  1969  3088  13  8 79  0  1
 0  0    10  1319   102    217    0    0     0    84  2291  3654  12  9 76  4  0
 1  0    10  1318   102    217    0    0     0     0  2083  3089  23 10 67  0  0
 0  0    10  1317   102    218    0    0     0     0  2070  3303  10  9 81  0  1
 0  0    10  1316   102    218    0    0     0    54  1257  1994   6  5 88  2  0
 0  0    10  1315   102    218    0    0     0     0  1975  3146  12  8 80  0  0
 0  0    10  1314   102    218    0    0     0     0  2375  3810  12 10 79  0  0
 0  0    10  1313   102    218    0    0     0     0  2017  3158  13  8 78  0  1
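(The table above is vmstat output, from something like "vmstat -S M 1";
the relevant columns are "in", interrupts/sec, and "cs", context
switches/sec.)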
3800 context switches/sec seems a little too high. Any idea how I can
tune Condor or Linux to bring this down?
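In case it's useful, here is a sketch of how the switches could be
attributed per process (this assumes the sysstat package is installed
and that the schedd runs under its stock process name):

  # Report voluntary (cswch/s) and involuntary (nvcswch/s) context
  # switches for the schedd, sampled every 5 seconds.
  pidstat -w -p $(pgrep -x condor_schedd) 5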
Thanks,
Daniel
2013/8/2 Todd Tannenbaum <tannenba@xxxxxxxxxxx>:
> On 8/2/2013 9:09 AM, Dan Bradley wrote:
>>
>>
>> Be aware that turning off fsync in the condor_schedd can lead to
>> loss of job state in the event of power loss or other sudden death
>> of the schedd. This could result in jobs that were submitted shortly
>> before the outage disappearing from the queue without being run. It
>> could also result in jobs being run twice.
>>
>> If that is acceptable for your purposes, then your problem is
>> solved. If it is not acceptable, then focus on improving the
>> performance of the filesystem containing $(SPOOL).
>>
>
> FWIW, on our busy submit nodes (dozens of users with typically
> thousands of running jobs), we put $(SPOOL) on a solid-state drive
> (SSD). Specifically, we mount the SSD on /ssd and then put in
> condor_config:
>
>   JOB_QUEUE_LOG = /ssd/condor_spool/job_queue.log
>
> The above allows us to put the job_queue.log onto the SSD - this is
> the schedd's job queue and the file that gets a lot of fsyncs on
> transaction boundaries. By using JOB_QUEUE_LOG, we can use a
> small/cheap SSD that does not have to be large enough to hold the
> entire contents of the $(SPOOL) directory.
>
> Performance is greatly improved and the risks Dan outlines above are
> avoided.
>
> -Todd
>
>
>> --Dan
>>
>> On 8/2/13 8:14 AM, Pek Daniel wrote:
>>>
>>> Thanks, the FSYNC trick solved the issue! :)
>>>
>>>
>>> 2013/8/1 Dan Bradley <dan@xxxxxxxxxxxx>:
>>>
>>>
>>>
>>> Are you timing just condor_submit, or are you also timing job
>>> run/completion rates?
>>>
>>> Job submissions cause the schedd to commit a transaction to
>>> $(SPOOL)/job_queue.log. If the disk containing that is slow,
>>> submissions will be slow. One way to verify whether this is the
>>> limiting factor is to add the following to your configuration:
>>>
>>> CONDOR_FSYNC = FALSE
>>>
>>> Another thing to keep in mind is that if you can batch submissions
>>> of many jobs into a single submit file, there will be fewer
>>> transactions.
>>>
>>> --Dan
>>>
>>>
>>> On 8/1/13 10:17 AM, Pek Daniel wrote:
>>>
>>> Hi!
>>>
>>> I'm experimenting with Condor: I'm trying to submit a lot of
>>> dummy jobs with condor_submit from multiple submission hosts
>>> simultaneously. I have only a single schedd, which I'm trying to
>>> stress-test. These jobs are in the vanilla universe.
>>>
>>> The problem is that I couldn't get a better result than 4-6
>>> submissions/sec, which seems a little low. I can't see any real
>>> bottleneck on the machine, so I suspect it's because of some
>>> default value of a configuration option that throttles the
>>> submission requests.
>>>
>>> Any idea how to solve this?
>>>
>>> Thanks,
>>> Daniel
>>>
>>
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
> Center for High Throughput Computing    Department of Computer Sciences
> HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
> Phone: (608) 263-7132                   Madison, WI 53706-1685
>
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/