Hello Team,
Following the earlier email threads [1] and [2], I am testing in the lab how to take action on jobs that have been running for more than 180s with either high or low CPU utilization.
- In the following submit file I generate load with stress; if CPU utilization goes above 0.4, the job is put on hold and then released via periodic_release so that it can be scheduled on another node. Strangely, the job goes on hold within 11s of running, according to remotewallclocktime. The attributes used in the expression should yield a value greater than 0.4, so I am not sure why the hold reason says the expression evaluated to UNDEFINED.
~~~
$ cat stress.sh
#!/bin/bash
stress --cpu 1 -t 360
$ cat stress.sub
executable       = stress.sh
log              = stress.log
output           = outfile$(Process).txt
error            = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) > 40)
periodic_hold_reason = "Using cpu more than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue
$ condor_q 2564.0
-- Schedd: testmachine : <IPaddress:9618?... @ 05/13/19 05:32:39
OWNER     BATCH_NAME       SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
vaggarwal CMD: stress.sh  5/13 05:30      _      _      _      1      1 2564.0
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
$ condor_q 2566.0 -af holdreason holdreasoncode
The job attribute PeriodicHold expression '( ( ( ( RemoteSysCpu + RemoteUserCpu ) / RequestCpus ) / ifthenelse(JobStatus == 2 && ( time() - JobCurrentStartDate ) > 300,( time() - JobCurrentStartDate ),0) * 100 ) > 40 )' evaluated to UNDEFINED 5
$ condor_q 2564.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 8.0 1 11.0
~~~
- When I run the same scenario with a sleep job, it works as expected. The job went on hold 3 times, and the last time it stayed on hold because the PeriodicRelease condition evaluated to false.
~~~
$ cat sleep.sh
#!/bin/bash
# file name: sleep.sh
TIMETOWAIT="1020"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT
$ cat sleep.sub
executable       = sleep.sh
log              = sleep.log
output           = outfile$(Process).txt
error            = errors$(Process).txt
runtime = (time() - JobCurrentStartDate)
TotalExecutingTime = ifthenelse(JobStatus == 2 && $(runtime) > 300, $(runtime), 0)
RemoteCpuUtilizationPercent = (((RemoteSysCpu + RemoteUserCpu) / RequestCpus) / $(TotalExecutingTime) * 100)
periodic_hold = ($(RemoteCpuUtilizationPercent) < 20)
periodic_hold_reason = "Using cpu less than threshold"
periodic_hold_subcode = 30
PeriodicRelease = (JobRunCount < 5 && HoldReasonCode == 3 && $(periodic_hold_subcode) == 30)
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
Initialdir = dir$(Process)
queue
$ condor_q 2555.0 -af holdreason holdreasoncode
Using cpu less than threshold 3
$ condor_q 2555.0 -af RemoteSysCpu RemoteUserCpu RequestCpus remotewallclocktime
0.0 0.0 1 301.0
~~~
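To sanity-check the arithmetic outside HTCondor, the formula from both submit files can be reproduced in a few lines of Python. This is only a rough sketch under assumptions: it mirrors the arithmetic, not ClassAd evaluation rules; the function names are made up for illustration; JobStatus is taken to be 2 (running); and remotewallclocktime stands in for time() - JobCurrentStartDate.

```python
from typing import Optional

RUNTIME_GATE_SECONDS = 300  # same threshold as in the TotalExecutingTime macro


def total_executing_time(job_status: int, runtime: float) -> float:
    """Mirror of TotalExecutingTime: the runtime once past the gate, else 0."""
    if job_status == 2 and runtime > RUNTIME_GATE_SECONDS:
        return runtime
    return 0.0


def cpu_utilization_percent(sys_cpu: float, user_cpu: float, request_cpus: int,
                            job_status: int, runtime: float) -> Optional[float]:
    """Mirror of RemoteCpuUtilizationPercent (hypothetical helper).

    Returns None when the denominator is 0. How HTCondor itself treats that
    case is NOT modelled here; this only shows where the division breaks down.
    """
    denominator = total_executing_time(job_status, runtime)
    if denominator == 0:
        return None
    return ((sys_cpu + user_cpu) / request_cpus) / denominator * 100


# Values copied from the two condor_q -af outputs; JobStatus == 2 is assumed.
stress_percent = cpu_utilization_percent(0.0, 8.0, 1, 2, 11.0)
sleep_percent = cpu_utilization_percent(0.0, 0.0, 1, 2, 301.0)

print("stress job:", stress_percent)
print("sleep job:", sleep_percent)
```

With the stress job's values the 300s gate has not been passed yet, so TotalExecutingTime is 0 and the division has a zero denominator. With the sleep job's values the result is 0%, which is below the 20% threshold.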
My objective is to hold a job when it is not doing any work, which seems to be working fine, but I also want to confirm the opposite case to make sure the behavior is as expected.
Some questions about RemoteSysCpu and RemoteUserCpu:
- During testing I observed that RemoteSysCpu and RemoteUserCpu only change when the job status changes; otherwise they remain 0.
- Also, are these attributes cumulative, like remotewallclocktime?
Thanks & Regards,
Vikrant Aggarwal