
[HTCondor-users] Condor Messing with Performance Counters



Good day,

We are planning to use LIKWID with its perfctr tool to monitor job performance 
on our HTCondor 9.10 GPU cluster (Docker universe only).
Our execution nodes are on Ubuntu 20.04, i.e. Linux 5.4 LTS.

However, once a job has been run via HTCondor on a machine, the performance
counters report implausible values.
E.g. when running the following command (measuring double-precision FLOPS
for 10 seconds), I get tens of teraflops reported on some cores, which they
are obviously not capable of:
> likwid-perfctr -f -g FLOPS_DP -S 10s
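
To make "obviously not capable of" concrete, here is a rough plausibility
check as a minimal sketch. The clock speed and the number of double-precision
FLOPs per cycle are assumptions for illustration (16 would correspond to a
core with two AVX2 FMA units), not measurements from our nodes, and the
function names are hypothetical:

```python
def peak_dp_mflops(clock_mhz: float, dp_flops_per_cycle: int) -> float:
    """Theoretical per-core peak in MFLOP/s: MHz times FLOPs per cycle."""
    return clock_mhz * dp_flops_per_cycle


def is_plausible(measured_mflops: float, clock_mhz: float,
                 dp_flops_per_cycle: int, tolerance: float = 1.1) -> bool:
    """Accept a reading only if it is non-negative and within a small
    margin of the theoretical peak (the margin allows for turbo clocks)."""
    peak = peak_dp_mflops(clock_mhz, dp_flops_per_cycle)
    return 0.0 <= measured_mflops <= tolerance * peak


if __name__ == "__main__":
    # Assumed values: a 3.5 GHz core doing 16 DP FLOPs per cycle.
    clock_mhz, flops_per_cycle = 3500.0, 16
    # A reading of 20 TFLOP/s on a single core, expressed in MFLOP/s.
    reading = 20_000_000.0
    print(peak_dp_mflops(clock_mhz, flops_per_cycle))
    print(is_plausible(reading, clock_mhz, flops_per_cycle))
```

Under these assumptions a single core peaks in the tens of GFLOP/s, so a
per-core reading in the teraflop range is off by several orders of magnitude.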

The effect sometimes persists even when no jobs are running on the machine
anymore, i.e. an idle machine reports teraflops. It lasts until HTCondor is
stopped:
> systemctl stop condor

Profiling an actually FLOP-intensive application does return sane results when
condor is stopped. E.g.:
> likwid-perfctr -m -f -g FLOPS_DP likwid-bench -t peakflops_sse -w S0:12800kB -w S1:12800kB

After restarting HTCondor, I have only observed the faulty behaviour once I
ran another HTCondor job on the machine.

I would assume this might be due to the use of performance counters for 
instruction counting in HTCondor interfering with other performance counting 
applications.
Is there any way to simply disable HTCondor's "CPUInstructions" counting?

Can you think of any other ways in which running a job via HTCondor might
affect performance counting applications?

Some additional information: I have not been able to reproduce this by just
running a Docker container on the same machine while measuring.
Also, the job does not need to run any special application; executing sleep
in a plain ubuntu:latest Docker container has the same effect.

Thanks in advance for any suggestions on how to deal with this!

Best,
- Joachim Meyer