Re: [HTCondor-users] Avoiding CPU wastage


On Mon, May 13, 2019 at 3:30 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:

Hello Team,

Following up on the old email threads [1] and [2], I am testing in a lab how to take action on jobs that have been running for more than 180 s with either high or low CPU utilization.

- In the following submit file I generate load using stress; if CPU utilization goes above 0.4 (that is, 40%), the job is put on hold and then released using periodic_release so that it can be scheduled on another node. Strangely, the job goes into hold status within 11 s of running, according to remotewallclocktime. The parameters used to evaluate the expression should return a value greater than 0.4, so I am not sure why the hold reason says the condition is UNDEFINED.

~~~
$ cat stress.sh
#!/bin/bash
stress --cpu 1 -t 360

$ cat stress.sub
executable              = stress.sh
log                     = stress.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) > 40)
periodic_hold_reason = "Using cpu more than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue

$ condor_q 2564.0

-- Schedd: testmachine : <IPaddress:9618?... @ 05/13/19 05:32:39
OWNER     BATCH_NAME       SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
vaggarwal CMD: stress.sh  5/13 05:30      _      _      _      1      1 2564.0

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

$ condor_q 2566.0 -af holdreason holdreasoncode
The job attribute PeriodicHold expression '( ( ( ( RemoteSysCpu + RemoteUserCpu ) / RequestCpus ) / ifthenelse(JobStatus == 2 && ( time() - JobCurrentStartDate ) > 300,( time() - JobCurrentStartDate ),0) * 100 ) > 40 )' evaluated to UNDEFINED 5

$ condor_q 2564.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 8.0 1 11.0
~~~
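
For reference, the arithmetic of the hold expression can be traced by hand with the values reported above. The Python sketch below only mirrors the submit-file formula; it is not HTCondor's ClassAd evaluator. It assumes the runtime is roughly equal to the reported remotewallclocktime (11 s), and returning `None` for a zero denominator is a modelling choice of this sketch, not a statement about how ClassAds treat that case. The function names are hypothetical.

```python
from typing import Optional


def total_executing_time(job_status: int, runtime: float) -> float:
    """Mirror ifthenelse(JobStatus == 2 && runtime > 300, runtime, 0)."""
    return runtime if (job_status == 2 and runtime > 300) else 0.0


def cpu_utilization_percent(sys_cpu: float, user_cpu: float, request_cpus: int,
                            job_status: int, runtime: float) -> Optional[float]:
    """Mirror RemoteCpuUtilizationPercent from the submit file.

    Returns None when the denominator is zero (assumption of this sketch:
    the expression then has no usable numeric value).
    """
    denominator = total_executing_time(job_status, runtime)
    if denominator == 0:
        return None
    return ((sys_cpu + user_cpu) / request_cpus) / denominator * 100


# Values from the condor_q output above: 0.0 8.0 1 11.0, job running (status 2).
early = cpu_utilization_percent(0.0, 8.0, 1, job_status=2, runtime=11.0)

# Hypothetical later sample, past the 300 s guard, with the same CPU totals.
late = cpu_utilization_percent(0.0, 8.0, 1, job_status=2, runtime=301.0)

print("utilization at 11 s:", early)
print("utilization at 301 s:", late)
```

Under these assumptions, the denominator is zero for the first 300 s of a run, so the comparison against 40 has no number to work with during that window.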

- When I run the same scenario with a sleep job, it works as expected. The job went into hold status 3 times; the last time, because the PeriodicRelease condition evaluated to false, it remained on hold.

~~~
$ cat sleep.sh
#!/bin/bash
# file name: sleep.sh

TIMETOWAIT="1020"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

$ cat sleep.sub
executable              = sleep.sh
log                     = sleep.log
output                  = outfile$(Process).txt
error                   = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) < 20)
periodic_hold_reason = "Using cpu less than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue

$ condor_q 2555.0 -af holdreason holdreasoncode
Using cpu less than threshold 3

$ condor_q 2555.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 0.0 1 301.0
~~~
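
The hold and release decisions for the sleep job can be traced the same way. This is a minimal Python sketch of the two conditions from sleep.sub, using the values reported above (0.0 0.0 1 301.0, hold reason code 3). It assumes that `$(periodic_hold_subcode)` is expanded as a submit macro to the literal 30, so that part of PeriodicRelease reduces to `30 == 30`. The function names and the JobRunCount values are hypothetical.

```python
def utilization_percent(sys_cpu: float, user_cpu: float,
                        request_cpus: int, runtime: float) -> float:
    """CPU seconds per requested core, as a percentage of runtime (runtime > 0)."""
    return ((sys_cpu + user_cpu) / request_cpus) / runtime * 100


def should_hold_low_cpu(percent: float, threshold: float = 20.0) -> bool:
    """Mirror periodic_hold = (RemoteCpuUtilizationPercent < 20)."""
    return percent < threshold


def should_release(job_run_count: int, hold_reason_code: int,
                   subcode_literal: int = 30) -> bool:
    """Mirror PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && 30 == 30).

    subcode_literal stands in for the expanded $(periodic_hold_subcode) macro
    (assumption of this sketch), so the last clause is always true.
    """
    return job_run_count < 5 and hold_reason_code == 3 and subcode_literal == 30


# Values from the condor_q output above: 0.0 0.0 1 301.0
utilization = utilization_percent(0.0, 0.0, 1, 301.0)
hold = should_hold_low_cpu(utilization)

# Hypothetical run counts: an early attempt and one at the limit of 5.
release_first = should_release(job_run_count=1, hold_reason_code=3)
release_fifth = should_release(job_run_count=5, hold_reason_code=3)

print("utilization:", utilization, "hold:", hold)
print("release after run 1:", release_first, "release after run 5:", release_fifth)
```

In this model the release decision depends only on JobRunCount and HoldReasonCode, since the subcode comparison is between two literals.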

My objective is to hold a job if it is not doing any work, which seems to be working fine, but I want to confirm the opposite case as well to make sure the behavior is as expected.

Some questions related to RemoteSysCpu and RemoteUserCpu:

- During testing I observed that RemoteSysCpu and RemoteUserCpu change only when the job status changes; otherwise they remain 0.
- Also, are these attributes cumulative, like remotewallclocktime?

[1] https://www-auth.cs.wisc.edu/lists/htcondor-users/2017-April/msg00103.shtml
[2] https://www-auth.cs.wisc.edu/lists/htcondor-users/2018-September/msg00092.sht

Thanks & Regards,

Vikrant Aggarwal
