
Re: [HTCondor-users] Missing `CpusUsage` line in the history



Hi Thomas and Max,
Thank you so much for the comments.
1. As for CpusUsage, I understand that a task can use more CPU than it requested when more is available. That is probably the case for us, as we have no policy in our HTCondor configuration that restricts CPU usage and we request `exclusive` nodes from SLURM.
2. For MemoryUsage, however, we do enforce policies on the HTCondor execute nodes, shown below. We therefore expected that a task could not use that much memory. Please correct me if I'm missing something.

```
CGROUP_MEMORY_LIMIT_POLICY=none
MEMORY_EXCEEDED_SLOT = False
MEMORY_EXCEEDED_REQ = False
VIRTUAL_MEMORY_AVAILABLE_MB = 0.0
MEMORY_EXCEEDED_VM = False

# Check jobs if they are using more than 10% over memory assigned to the slot
MEMORY_EXCEEDED_SLOT = ( isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory )
# Check jobs if they are using more than 10% over the requested memory
MEMORY_EXCEEDED_REQ = ( isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1 )
# Check jobs if they are using too much vm
VIRTUAL_MEMORY_AVAILABLE_MB = ( isDefined(VirtualMemory) && VirtualMemory*0.9 )
MEMORY_EXCEEDED_VM = ( isDefined(ImageSize) && ImageSize/1024 > $(VIRTUAL_MEMORY_AVAILABLE_MB) )

MEMORY_EXCEEDED = ($(MEMORY_EXCEEDED_SLOT)) || ($(MEMORY_EXCEEDED_REQ)) || ($(MEMORY_EXCEEDED_VM))
# Return 137 for memory overuse
use POLICY : WANT_HOLD_IF(MEMORY_EXCEEDED, 137, memory usage exceeded the requested memory)

# If memory is exceeded, evict the job
PREEMPT = ($(PREEMPT)) || ($(MEMORY_EXCEEDED))

# Suspend and hold the job
WANT_SUSPEND = ($(WANT_SUSPEND)) && ($(MEMORY_EXCEEDED)) =!= TRUE
WANT_HOLD = ($(MEMORY_EXCEEDED))
# Message to Job's owner
WANT_HOLD_REASON = ifThenElse( $(WANT_HOLD), "Job exceeded available memory.", undefined )
```
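
To make the thresholds in the configuration above easier to check by hand, here is a minimal Python sketch that mirrors the two memory expressions as plain functions. It is only an illustration, not how the condor_startd evaluates ClassAd expressions, and the function names are hypothetical. Note that, as written, the slot check compares `MemoryUsage*1.1` against `Memory`, so it fires once usage passes roughly 91% of the slot memory rather than 110%.

```python
def exceeded_slot(memory_usage, memory):
    """Mirror: isDefined(MemoryUsage) && isDefined(Memory) && MemoryUsage*1.1 > Memory."""
    if memory_usage is None or memory is None:
        return False  # an undefined attribute makes the check false
    return memory_usage * 1.1 > memory


def exceeded_req(memory_usage, request_memory):
    """Mirror: isDefined(MemoryUsage) && isDefined(RequestMemory) && MemoryUsage > RequestMemory*1.1."""
    if memory_usage is None or request_memory is None:
        return False
    return memory_usage > request_memory * 1.1


def memory_exceeded(memory_usage, memory, request_memory):
    """Combine the slot and request checks (the VM check is left out of this sketch)."""
    return exceeded_slot(memory_usage, memory) or exceeded_req(memory_usage, request_memory)


# Values in MB; 7325 and 2048 are the numbers quoted later in this thread,
# while the slot size of 8192 is a made-up example value.
print(memory_exceeded(7325, 8192, 2048))
```

With these inputs the request check is true, so the policy would be expected to trigger, assuming the startd actually sees this MemoryUsage value at the moment it evaluates the expressions.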


Thank you!

Best,
Seung-Jin

On Thu, Jul 11, 2024 at 1:25 AM Kühn, Max (SCC) <max.fischer@xxxxxxx> wrote:
Hi Seung-Jin,

The CpusUsage *should* be lower than or equal to RequestCpus. However, this only holds if the job actually uses just the resources it requested!
By default, there is no mechanism to restrict a job to just the resources it requested. This is true for both Memory and CPU - if the machine has more resources available, the job can use them.

HTCondor ships with some optional mechanisms to restrict resource usage. For example, CGROUP_MEMORY_LIMIT_POLICY and related settings are effective for setting memory boundaries.
You can also use the various SYSTEM_PERIODIC_* settings to create your own rules for stopping misbehaving jobs. For example, we put jobs on hold when their CpusUsage exceeds their RequestCpus by a small grace factor.

Cheers,
Max

On 11. Jul 2024, at 00:40, Seung-Jin Sul <ssul@xxxxxxx> wrote:

Hi Jaime,

Can I ask questions?

1. We are using the formula you suggested to estimate CpusUsage (i.e. CPU utilization).
The task requests 2 CPUs, and we compute CpusUsage as follows:

RequestCpus = 2
(RemoteSysCpu + RemoteUserCpu) / CommittedTime
(180.0 + 212.0) / 25 = 15.68

How should we interpret this value? How can (RemoteSysCpu + RemoteUserCpu) be larger than `CommittedTime`?
Is CpusUsage supposed to be lower than or equal to `RequestCpus`?

2. We have another question on `MemoryUsage`.
We have the following numbers:

condor_history -l 1129923 | grep ResidentSetSize
MemoryUsage = ((ResidentSetSize + 1023) / 1024)
ResidentSetSize = 7500000
RequestMemory = 2048MB
((7500000 + 1023) / 1024) = 7325.2 MB
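
To make the arithmetic above concrete, here is a minimal Python sketch. It assumes that ResidentSetSize is reported in KiB and that the MemoryUsage expression rounds up to whole MiB through integer division; `memory_usage_mb` is a hypothetical helper name, not an HTCondor function.

```python
def memory_usage_mb(resident_set_size_kb):
    """Mirror ((ResidentSetSize + 1023) / 1024) with integer division (rounds up)."""
    return (resident_set_size_kb + 1023) // 1024


usage_mb = memory_usage_mb(7_500_000)   # ResidentSetSize from the history record
request_mb = 2048                       # RequestMemory from the history record

print(usage_mb, "MB used vs", request_mb, "MB requested")
print("ratio: %.2f" % (usage_mb / request_mb))
```

Adding 1023 before dividing means any partial MiB counts as a full one, which is why the result is a whole number rather than 7325.2.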

The question is how a task can use 7325 MB even though `RequestMemory` is 2048 MB.
Please note that we also have a policy like this:

```
Requirements = (isUndefined(TARGET.AliveUntil) ? true : TARGET.AliveUntil > time() + (0 * 60)) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
```

Any comments will be appreciated.

Thank you.

Best regards,




Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456 | ssul@xxxxxxx
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab


On Mon, Apr 1, 2024 at 8:27 AM Seung-Jin Sul <ssul@xxxxxxx> wrote:
Thank you so much for the info, Jaime.

On Mon, Apr 1, 2024, 8:17 AM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
CpusUsage is calculated periodically by the condor_startd and communicated to the condor_starter, which includes it in updates to the condor_shadow on the Access Point. I believe it can be undefined for very short jobs.
RemoteSysCpu and RemoteUserCpu are obtained from the OS's rusage data when the job exits, and thus are more reliably reported.

For a completed job that had one execution attempt, the value (RemoteSysCpu + RemoteUserCpu) / CommittedTime should be close to CpusUsage.

Two things to note that can throw off any computations:
* CommittedTime and RemoteWallClockTime include the time to transfer input files before the job starts.
* Some of these attributes (e.g. RemoteWallClockTime) are a total across multiple execution attempts.
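
As a rough illustration of the fallback described above, here is a minimal Python sketch. It parses simple `Attribute = value` lines, as printed by `condor_history -l`, and returns CpusUsage when it is present, otherwise (RemoteSysCpu + RemoteUserCpu) / CommittedTime. The helper names `parse_classad_text` and `cpu_utilization` are hypothetical, and the parser deliberately ignores anything that is not a plain number.

```python
def parse_classad_text(text):
    """Parse 'Name = value' lines into a dict of floats, skipping non-numeric values."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not an attribute line
        try:
            ad[name.strip()] = float(value.strip())
        except ValueError:
            continue  # strings and expressions are ignored in this sketch
    return ad


def cpu_utilization(ad):
    """Return the average number of cores used, or None if it cannot be computed."""
    if "CpusUsage" in ad:
        return ad["CpusUsage"]
    committed = ad.get("CommittedTime", 0.0)
    if committed <= 0:
        return None  # avoid dividing by zero for jobs with no committed time
    cpu_seconds = ad.get("RemoteSysCpu", 0.0) + ad.get("RemoteUserCpu", 0.0)
    return cpu_seconds / committed


# Sample record from this thread with the CpusUsage line left out.
sample = """\
CommittedTime = 1071
RemoteSysCpu = 1242.0
RemoteUserCpu = 4793.0
RequestCpus = 16
"""
ad = parse_classad_text(sample)
print(round(cpu_utilization(ad), 2), "of", int(ad["RequestCpus"]), "requested cores")
```

Because CommittedTime includes input file transfer time, the fallback value will tend to understate utilization for jobs with large inputs, and it is only meaningful for jobs with a single execution attempt.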

 - Jaime

> On Mar 26, 2024, at 2:01 PM, Seung-Jin Sul <ssul@xxxxxxx> wrote:
>
> Hi,
> We are parsing the history log file to extract CPU usage data. We think the `CpusUsage` value is very useful, but we found that the `CpusUsage` line is missing for some tasks. Could someone explain why the line is missing in some cases?
>
> And what would be the best way to calculate CPU utilization if we can't get `CpusUsage`? The values we can use are (again, `CpusUsage` may not exist):
> ```
> CommittedSuspensionTime = 0
> CommittedTime = 1071
> (CpusUsage = 1.185198573018748)
> RemoteSysCpu = 1242.0
> RemoteUserCpu = 4793.0
> RemoteWallClockTime = 1071.0
> RequestCpus = 16
> ```
>
> Thank you!
> Seung
>


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/