Hi Dudu,
I am probably not much help, but I can tell you that a powerful schedd can handle this number of jobs, although the condor design is not optimized for it.
In my experience, the main bottleneck on the AP is the job queue transaction log ('JOB_QUEUE_LOG'). If you have not done so already, put it on a fast SSD; it helps a lot.
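To compare candidate locations for that file, a rough probe like the one below can help. It is only a sketch: it assumes the queue log workload looks like many small appends that are each flushed to disk, which is an assumption and not something measured here. The function name fsync_latency and its parameters are made up for this example.

```python
import os
import statistics
import tempfile
import time


def fsync_latency(directory, writes=50, payload=b"x" * 256):
    """Return the median seconds per small append + fsync in `directory`.

    Assumption: small synchronous appends are a fair stand-in for the
    job queue log workload. The probe file is removed afterwards.
    """
    samples = []
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync_probe_")
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage
            samples.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return statistics.median(samples)


# Example (directories are placeholders):
# for d in ("/var/lib/condor/spool", "/mnt/fast_ssd"):
#     print(d, f"{fsync_latency(d) * 1000:.2f} ms")
```

A clearly lower number on the SSD directory would support moving the file there; the absolute values depend on the hardware and filesystem.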
Shared storage is also usually a nuisance, especially for the job log files, which the shadows write to constantly. Every running job has a shadow that keeps an open file handle on that job's individual log file.
If that location is on a shared filesystem, it will cause grief!
We ended up running native GPFS on the schedds in order to get decent responsiveness and overall performance, as most of our users use it as their log file location.
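To find out which job logs actually sit on shared storage, something like the following sketch could be used. It matches each log path against /proc/mounts-style text (Linux only). The set SHARED_FS_TYPES and all function names are assumptions for illustration; adjust the set to whatever counts as "shared" at your site.

```python
import os

# Assumed list of filesystem types to treat as shared; adjust per site.
SHARED_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "lustre"}


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs.

    Escaped characters in mount points (e.g. \\040 for a space) are not
    decoded in this sketch.
    """
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing `path`."""
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        prefix = point.rstrip("/") + "/"
        inside = path == point or path.startswith(prefix)
        # >= lets a later mount on the same point shadow an earlier one
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


def shared_logs(log_paths, mounts):
    """Return the paths whose filesystem type is in SHARED_FS_TYPES."""
    return [p for p in log_paths if fs_type_for(p, mounts) in SHARED_FS_TYPES]
```

On the AP, mounts could come from parse_mounts(open("/proc/mounts").read()), and the log paths from the jobs' UserLog attribute, for example via condor_q -af UserLog.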
Maybe this helps a little. I would be interested in anything you learn about this too!
Best
christoph
--
Christoph Beyer
DESY Hamburg
IT-Department
Notkestr. 85
Building 02b, Room 009
22607 Hamburg
phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx
From: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Friday, 26 January 2024 11:36:18
Subject: [HTCondor-users] Access point scale
Hi All.
We have just added some cores to our cluster, so a single-user access point might now have 40k jobs in a running state. The jobs are short,
probably 20 minutes; some are less.
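For a rough sense of scale, the numbers above imply a steady job start rate. This is a back-of-the-envelope sketch using Little's law (rate = concurrent jobs / mean runtime); it assumes a steady state and a 20-minute mean runtime, and the function name start_rate is made up for this example.

```python
def start_rate(running_jobs, mean_runtime_s):
    """Job starts per second needed to keep `running_jobs` slots busy.

    Little's law in steady state: throughput = concurrency / duration.
    """
    return running_jobs / mean_runtime_s


rate = start_rate(40_000, 20 * 60)
print(f"{rate:.1f} job starts per second")  # prints "33.3 job starts per second"
```

Shorter jobs push that rate higher, since the same number of slots has to be refilled more often.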
I know the basics:
No swap
File descriptors without a limit
Use a physical server.
Use nvme/ssd
Sufficient cores and ram.
I'm using the shared port daemon, which in some cases complains that the server was too busy to answer.
Sometimes condor_q is not responding.
But the main issue is that while condor_q shows 25k running jobs, condor_q -run shows that only 15k jobs have a slot.
This means that resources are claimed but not running a job yet.
Because the jobs are short-running, the access point never uses all the resources it has claimed.
Some extra information
Using docker universe
Using shared storage
Try to minimize file transfers.
Not streaming output or error
What will improve performance? Please share your experience.
Many thanks
David