|
One nice feature of HTCondor is that the Shadow updates the job’s RemoteUserCpu and RemoteSysCpu attributes periodically while the job is running. Using a few ClassAd expressions, you can calculate the job’s current CPU utilization percentage with a five-minute window, the interval between shadow updates to the job’s ClassAd. Here’s what I have from my old config. It dates back to early 2016, before the ternary operator, so please double-check the syntax.

First, a TotalExecutingTime that shows how long the job has been running, minus any time spent suspended, when the slot is allocated but the job is not accruing any CPU time. Suspension handling is probably not too important for most cases, so simplify it if you wish.

    TotalExecutingTime = \
        ( ifThenElse(! isUndefined(RemoteWallClockTime), \
                     RemoteWallClockTime, 0) - \
          ifThenElse(! isUndefined(CumulativeSuspensionTime), \
                     CumulativeSuspensionTime, 0) \
        ) + \
        ( ifThenElse(JobStatus == 2, \
                     CurrentTime - JobCurrentStartDate, 0) \
        ) + \
        ( ifThenElse(JobStatus == 7, \
                     LastSuspensionTime - JobCurrentStartDate, 0) \
        )

Next, a RemoteCpuUtilizationPercent that covers the user plus system time, normalized to the number of CPUs the job requested.

    RemoteCpuUtilizationPercent = \
        ifThenElse(! isUndefined(TotalExecutingTime) && TotalExecutingTime > 0, \
                   ((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / TotalExecutingTime * 100, \
                   UNDEFINED)

Ideally, the RemoteUserCpu plus RemoteSysCpu time accrued will be as close as possible to the number of CPUs requested times the wallclock time the job has been actively running, and this expression turns that ratio into a percentage.
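As a quick sanity check of the arithmetic in RemoteCpuUtilizationPercent, here is a minimal Python sketch of the same formula. The function name and the sample numbers are made up for illustration and are not part of HTCondor; None stands in for the ClassAd UNDEFINED value.

```python
def cpu_utilization_percent(user_cpu, sys_cpu, request_cpus, executing_time):
    """Illustrative mirror of the RemoteCpuUtilizationPercent expression.

    All times are in seconds. Returns None (standing in for UNDEFINED)
    when the executing time is missing or not positive.
    """
    if executing_time is None or executing_time <= 0:
        return None
    return ((sys_cpu + user_cpu) / request_cpus) / executing_time * 100


# Hypothetical job: 8 CPUs requested, 2 hours of executing time,
# 4 CPU-hours of user time and 0.5 CPU-hours of system time.
pct = cpu_utilization_percent(
    user_cpu=4 * 3600, sys_cpu=1800, request_cpus=8, executing_time=2 * 3600
)
print(pct)  # 28.125
```

A fully busy 8-CPU job would accrue 16 CPU-hours over those two hours, so 4.5 CPU-hours works out to roughly 28 percent.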
A job that requested 32 CPUs, and thus 32 CPU-hours per hour, should have a User+Sys CPU time as close to 32 hours per hour as possible. The following two expressions allow you to separate out the user and system CPU time, if you’re interested in that.

    RemoteUserCpuUtilizationPercent = \
        ifThenElse(! isUndefined(TotalExecutingTime) && TotalExecutingTime > 0, \
                   (RemoteUserCpu / RequestCpus) / TotalExecutingTime * 100, \
                   UNDEFINED)

    RemoteSysCpuUtilizationPercent = \
        RemoteCpuUtilizationPercent - RemoteUserCpuUtilizationPercent

And then, of course, they need to get added to job submissions’ attributes at submit time:

    SUBMIT_EXPRS = $(SUBMIT_EXPRS) TotalExecutingTime RemoteCpuUtilizationPercent \
        RemoteUserCpuUtilizationPercent RemoteSysCpuUtilizationPercent

So for the job policy, you might look for a job that’s been running at least an hour or so (TotalExecutingTime >= 3600), and if its RemoteCpuUtilizationPercent is still in the single digits (RemoteCpuUtilizationPercent < 10), flag it for further action.

Michael V Pelletier
Principal Technologist

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx>
On Behalf Of CMV

Hello everyone,
Every now and then someone requests a huge amount of resources and then
leaves them unused because the job can't / won't make use of all those
resources. Needless to say this will prevent other users from using the
cluster. Any suggestion(s) about how to detect this particular type of
usage pattern?
|