Re: [HTCondor-users] Access point scale
- Date: Fri, 26 Jan 2024 10:30:23 -0600
- From: Greg Thain <gthain@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Access point scale
On 1/26/24 04:36, Dudu Handelman wrote:
Hi David:
At some point, we'll just need to profile the schedd with
bpftrace/strace to know for certain what is going on. Until then,
though, here are a couple of things to check, which you probably already
know about. The first indication that the schedd is overloaded is that
its RecentDaemonCoreDutyCycle attribute is approaching 1.0. I assume
your schedd is in this neighborhood?
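
If it helps, here is a rough Python sketch of that duty-cycle check. It
assumes condor_status is on the PATH and that
"condor_status -schedd -af Name RecentDaemonCoreDutyCycle" prints one
"name value" pair per line. The 0.95 cutoff and the function names are
my own placeholders, not anything HTCondor defines.

```python
import subprocess

# Assumed cutoff: the message only says "approaching 1.0".
BUSY_THRESHOLD = 0.95


def parse_duty_cycles(text):
    """Parse lines of '<schedd name> <duty cycle>' into a dict."""
    cycles = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        name, value = parts
        try:
            cycles[name] = float(value)
        except ValueError:
            continue  # e.g. 'undefined' when the attribute is missing
    return cycles


def overloaded(cycles, threshold=BUSY_THRESHOLD):
    """Return the sorted names of schedds at or above the threshold."""
    return sorted(name for name, value in cycles.items() if value >= threshold)


def query_schedds():
    """Ask the collector for each schedd's recent duty cycle."""
    result = subprocess.run(
        ["condor_status", "-schedd", "-af", "Name", "RecentDaemonCoreDutyCycle"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_duty_cycles(result.stdout)
```

On a machine with HTCondor installed, print(overloaded(query_schedds()))
would list the access points that look busy by this measure.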
o) As you mentioned, the most important file to put on SSD/NVMe is the
job_queue.log, but the schedd also writes the users' job event logs to
disk, so you might want to double check that those event logs are not on
a slow disk.
o) Make sure the schedd and shadow do not have D_FULLDEBUG or other very
verbose flags in their DEBUG levels.
o) What version of HTCondor are you running? 23.2 has an improvement in
the speed of the schedd when running with a large fd limit:
https://github.com/htcondor/htcondor/pull/1907
o) When there are a lot of jobs in the queue, condor_q can eat a lot of
time out of the schedd. condor_watch_q can show much of the same
information as condor_q, but without bothering the schedd, because it
reads the job event log instead of querying the schedd.
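
For the debug-level item above, a small Python sketch like the following
could flag verbose settings. It assumes condor_config_val is on the PATH
and that SCHEDD_DEBUG and SHADOW_DEBUG hold space- or comma-separated
flags. Treating D_FULLDEBUG, D_ALL, and any "FLAG:2" entry as "very
verbose" is my own assumption, so adjust the list to taste.

```python
import re
import subprocess

# Assumed set of flags to treat as "very verbose".
VERBOSE_FLAGS = {"D_FULLDEBUG", "D_ALL"}


def verbose_flags(debug_value):
    """Return the tokens in a *_DEBUG value that look very verbose."""
    found = []
    for token in re.split(r"[,\s]+", debug_value.strip()):
        if not token:
            continue
        name, _, level = token.upper().partition(":")
        # Assumption: a ':2' suffix requests the verbose level of a category.
        if name in VERBOSE_FLAGS or level == "2":
            found.append(token)
    return found


def read_debug_setting(knob):
    """Return the configured value of a knob, or '' if it is not defined."""
    result = subprocess.run(
        ["condor_config_val", knob],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is treated here as "not defined"
    )
    return result.stdout.strip() if result.returncode == 0 else ""


def check_debug_levels(knobs=("SCHEDD_DEBUG", "SHADOW_DEBUG")):
    """Map each knob to the verbose flags found in its current value."""
    return {knob: verbose_flags(read_debug_setting(knob)) for knob in knobs}
```

Calling check_debug_levels() on the access point returns a dict; any
non-empty list in it is worth a second look.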
-greg