[HTCondor-users] Access point scale
- Date: Fri, 26 Jan 2024 10:36:18 +0000
- From: Dudu Handelman <duduhandelman@xxxxxxxxxxx>
- Subject: [HTCondor-users] Access point scale
Hi All.
We have just added some cores to our cluster, so a single-user access point might now have 40k jobs in a running state. The jobs are short:
probably 20 minutes, some less.
I know the basics:
No swap
No limit on file descriptors
Use a physical server
Use NVMe/SSD
Sufficient cores and RAM
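For anyone wanting to verify the first two items on a Linux access point, here is a minimal sketch. It only inspects the limits of the process that runs it, not those of the HTCondor daemons themselves, and the helper names (fd_limits, swap_total_kb, describe) are made up for this example.

```python
import resource
from pathlib import Path
from typing import Optional


def fd_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def swap_total_kb(meminfo: str = "/proc/meminfo") -> Optional[int]:
    """Return configured swap in kB, or None if it cannot be determined."""
    path = Path(meminfo)
    if not path.exists():
        return None  # not Linux, or /proc is not mounted
    for line in path.read_text().splitlines():
        if line.startswith("SwapTotal:"):
            return int(line.split()[1])
    return None


def describe(limit: int) -> str:
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = fd_limits()
print(f"open files: soft={describe(soft)} hard={describe(hard)}")
print(f"swap total (kB): {swap_total_kb()}")
```

To check a running daemon rather than the current shell, the same numbers can be read from /proc/<pid>/limits for that daemon's PID.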
I'm using the shared port daemon, which in some cases complains that the server was too busy to answer.
Sometimes condor_q is not responding.
But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k jobs have a slot.
That means the remaining resources are claimed but not running a job yet.
Because the jobs are short, the access point never uses all the resources it has claimed.
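As a rough back-of-the-envelope sketch of why short jobs make this hard: by Little's law, keeping N jobs running with a mean runtime of T seconds requires about N / T job starts per second at steady state. The snippet below plugs in the figures from this message; the function names are made up for the example, and it takes my reading of the two condor_q counts at face value.

```python
def required_start_rate(running_jobs: int, mean_runtime_s: float) -> float:
    """Job starts per second needed to keep `running_jobs` busy (Little's law)."""
    if mean_runtime_s <= 0:
        raise ValueError("mean_runtime_s must be positive")
    return running_jobs / mean_runtime_s


def claimed_but_idle(reported_running: int, with_slot: int) -> int:
    """Gap between jobs reported as running and jobs that actually show a slot."""
    return reported_running - with_slot


# 40k running jobs at about 20 minutes each (figures from this message).
rate = required_start_rate(40_000, 20 * 60)
print(f"{rate:.1f} job starts per second")  # 33.3 job starts per second

# 25k reported by condor_q vs 15k reported by condor_q -run.
print(claimed_but_idle(25_000, 15_000))  # 10000
```

So the access point would need to sustain roughly 33 job starts per second, and jobs shorter than 20 minutes push that number higher.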
Some extra information:
Using docker universe
Using shared storage
Trying to minimize file transfers
Not streaming output or error
What would improve performance? Please share your experience.
Many thanks
David