[HTCondor-users] Does HTC limit CPU usage on macOS?
- Date: Mon, 28 Jul 2025 12:45:01 +1200
- From: Neil Clayton <neil@xxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Does HTC limit CPU usage on macOS?
I've been learning HTC over the weekend, and have a small cluster working across 3 machines: a linux box (50+ cores), an M4 Mac Pro, and some other linux thing.
The macOS box (named speedy4) has condor executor installed manually, as root.
An executor linux box (50+ cores) exists, named "happy".
The central manager + submit + negotiator (everything else, *not* an executor) is on a separate linux box (named: containery).
I can queue + run jobs over the machines in the cluster fine (it's a java universe btw). Using dynamic slots, jobs are allocated as I'd expect.
I am queuing directly from "containery" (a submit node).
What I'm seeing though is that a task executing on the Mac M4 runs ... slowly.
Those that are executing on "happy" do what I expect - take up 100% CPU. Let's ignore that linux box for now, as it seems to be working quite "happily".
What I am doing, and observing:
The job is a highly multi-threaded java process and executes with something like 50 threads internally (I'm aware this is more than the 14 cores in the M4 Pro).
When I run this from the command line on the M4, it almost immediately takes all CPU and will do so for the next 2 hrs quite happily. So, while the thread count of the job might not be optimal for the number of cores, I can observe the CPU is heavily utilized, to the point where the fans go to 100% within 10s of starting the job and stay there.
However, if I submit the job (just 1 "queue", and I have reduced the cluster to just the M4), while it executes on the cluster, it seems ... lethargic.
CPU meters (htop) show close to 100% utilization, but it feels as if it's being throttled somehow, because after a few minutes CPU usage drops to about 80%. The fans soon stop, and the job takes about 2-3x the time to run (it spits out measurements: run standalone, it executes its first round in ~700s; run on the cluster, it takes ~1800-2200s).
So, I condor_rm that cluster.
If I then grab the "java" command line, and remove the "chirp" related classes / args, and run it as root, it spins to 100% almost immediately. Fans kick in soon after, and stay that way. This leads me to think my problem is not to do with the java command, nor how it is being run.
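(For reference, this is roughly how I grab that command line on speedy4; the grep pattern is just whatever matches my job, so adjust as needed:)

  # print the full java invocation of the running job, without matching the grep itself
  ps ax -o command= | grep '[S]impleEvolve'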
So:
1) Does Condor limit actual CPU usage when a slot is running?
2) Is there something I might look for in the logs that might help me discover what is happening? (Where I've been looking so far is sketched below.)
3) What else might I be missing here?
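Re (2), this is roughly where I've been poking on speedy4 so far; the log location and the attribute names are my best guesses, so tell me if these are the wrong places to look:

  # where does this install keep its logs, and what does the starter say about my job?
  condor_config_val LOG
  tail -n 100 "$(condor_config_val LOG)"/StarterLog.slot1_1

  # what CPU count did the dynamic slot actually get? (hostname may need to be the FQDN)
  condor_status speedy4 -af Name Cpus State Activity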
What I've tried so far:
- forcing the condor executor on the M4 to run the job as a separate, newly created user (instead of nobody).
- double-checked the process has sufficient VM space (it runs fine with the same params from the command line), so it's not slowing down due to JVM GC thrashing.
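For the GC part, I watched it with roughly the following while the job was running under condor (the pid is the condor-launched java process on speedy4; these tools ship with the Zulu JDK, though exact output varies by JDK):

  # confirm the heap flags the condor-launched JVM actually got
  jcmd <pid> VM.flags

  # sample GC activity every 5s; if old gen and GC time stay flat, it isn't heap pressure
  jstat -gcutil <pid> 5000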
----
Regards,
Neil Clayton
Other probably important background:
- The installed condor is x86 (so Rosetta emulation; a rough check of the daemon binaries is below). The JVM, however, is definitely arm64:
sh-3.2# file /Library/Java/JavaVirtualMachines/zulu-21.jdk/Contents/Home/bin/java
/Library/Java/JavaVirtualMachines/zulu-21.jdk/Contents/Home/bin/java: Mach-O 64-bit executable arm64
- The M4 is the machine I'm using the terminal on (my main machine); it's what I'm using to queue jobs and talk to other hosts. It's not doing much else at the time, though I am obviously using its keyboard.
- The linux box (named: happy) is an executor installed with the standard "curl -fsSL https://get.htcondor.org | sudo GET_HTCONDOR_PASSWORD="$htcondor_password" /bin/bash -s -- --no-dry-run --execute $central_manager_name"
- All linux = ubuntu24.
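The daemon-architecture check mentioned above is just something like this (it assumes condor_config_val is on root's PATH and that SBIN points at the manual install's sbin directory, which may not hold for this install):

  # which architecture are the condor daemons themselves built for?
  file "$(condor_config_val SBIN)"/condor_master
  file "$(condor_config_val SBIN)"/condor_startd
  file "$(condor_config_val SBIN)"/condor_starter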
==========
Here is the submit file, for reference:
# Train the thing
ALGO = o2-12m34m
DATASET = GENO-S1
EVALTHREADS = 50
# Because windows screws up all the output
requirements = OpSys != "WINDOWS"
executable = simulate-train.jar
arguments = simulate.learning.gp.SimpleEvolve -file learning/$(ALGO)/train.params -dataset $(DATASET) -algo $(ALGO) -seed $(Process) -p evalthreads=$(EVALTHREADS) -p stat.file=$(ALGO)/$(DATASET)/0/job.$(Process).out.stat
universe = java
jar_files = simulate-train.jar
java_vm_args = -Xmx30g
output = logs/$(ClusterId)/training.out_$(Process).txt
error = logs/$(ClusterId)/training.err_$(Process).txt
log = logs/training-summary.log
max_retries = 1
# This, from https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToUseAllCpus
request_cpus = Cpus >= 50 ? 50 : max({10, Cpus})
# Prefer machines with higher cpus installed
rank = Cpus
request_memory = 33G
request_disk = 2000MB
should_transfer_files = yes
preserve_relative_paths = true
# gphhemd_data/output/empty.txt is transferred so that the /output/ folder is created on the executor node
transfer_input_files = gphhemd_data/parameters/,gphhemd_data/datasets/emd/,gphhemd_data/emd-ors-config.json,gphhemd_data/osm_files/wgtn.osm.gz,gphhemd_data/output/empty.txt
when_to_transfer_output = on_exit
transfer_output_files = gphhemd_data/output
queue 1
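For what it's worth, this is roughly how I've been comparing what the job asked for with what the machine advertises once it's running (attribute names are my best guess at the relevant ones):

  # what did request_cpus / request_memory evaluate to for the job?
  condor_q <cluster>.0 -af RequestCpus RequestMemory

  # what is the Mac's startd advertising? (hostname may need to be the FQDN)
  condor_status speedy4 -af Name Cpus Memory State Activity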