
Re: [HTCondor-users] Missing `CpusUsage` line in the history



Hi Seung-Jin,

just a quick idea, but how full are your execution points?
If a worker has unclaimed CPU cycles, a job running more threads than its requested core count (weight) may be picking up the free CPU time.
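
To make the arithmetic concrete, here is a minimal Python sketch using the numbers from your mail. The function name `average_cpus_used` is only an illustration, not an HTCondor API. The ratio is CPU seconds per wall-clock second, so a value above RequestCpus would mean the job kept more cores busy on average than it asked for.

```python
def average_cpus_used(remote_sys_cpu, remote_user_cpu, committed_time):
    """Average number of cores kept busy: CPU seconds per wall-clock second."""
    if committed_time <= 0:
        raise ValueError("committed_time must be positive")
    return (remote_sys_cpu + remote_user_cpu) / committed_time

# Values quoted in the question (RequestCpus = 2).
request_cpus = 2
usage = average_cpus_used(180.0, 212.0, 25)
oversubscribed = usage > request_cpus
print(f"average cores used: {usage:.2f}, above RequestCpus: {oversubscribed}")
```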

For the RSS, I would check whether the job has a lot of shared pages (or perhaps cached pages). If your cluster does not enforce strict memory limits, then a job may be allowed to go beyond its requested memory (provided that sufficient unclaimed memory is available), at the risk of being killed when memory gets tight.
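
For reference, a small Python sketch of the MemoryUsage arithmetic with your numbers. It assumes ResidentSetSize is reported in KiB and that the ClassAd expression divides as integers (so it yields 7325 rather than 7325.2); the helper name `memory_usage_mb` is made up for this example.

```python
def memory_usage_mb(resident_set_size_kib):
    """Mirror ((ResidentSetSize + 1023) / 1024) with integer division: KiB rounded up to MB."""
    return (resident_set_size_kib + 1023) // 1024

# Values quoted in the question (RequestMemory = 2048 MB).
request_memory_mb = 2048
usage_mb = memory_usage_mb(7_500_000)
print(usage_mb, "MB used,", round(usage_mb / request_memory_mb, 2), "x the request")
```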

Cheers,
  Thomas

On 11/07/2024 00.40, Seung-Jin Sul wrote:
Hi Jaime,

Can I ask questions?

1. We are using the equation you suggested to calculate the CpusUsage (=cpu utilization).
The task requests 2 cpus and we can calculate the CpusUsage like


    RequestCpus = 2
    (RemoteSysCpu + RemoteUserCpu) / CommittedTime
    (180.0 + 212.0) /25 =15.68


How can we interpret the values? How come (RemoteSysCpu + RemoteUserCpu) is bigger than `CommittedTime`?
Is CpusUsage supposed to be equal to or lower than `RequestCpus`?

2. We have another question on `MemoryUsage`.
We have the below numbers

    condor_history -l 1129923 | grep ResidentSetSize
    MemoryUsage = ((ResidentSetSize + 1023) / 1024)
    ResidentSetSize = 7500000
    RequestMemory = 2048MB
    ((7500000 + 1023) / 1024) = 7325.2 MB


The question is how a task can use 7325 MB even though `RequestMemory` is 2048 MB.
Please note that we also have a policy like

```
Requirements = (isUndefined(TARGET.AliveUntil) ? true : TARGET.AliveUntil > time() + (0 * 60)) && (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && ((TARGET.FileSystemDomain == MY.FileSystemDomain) || (TARGET.HasFileTransfer))
```

Any comments will be appreciated.

Thank you.

Best regards,




Seung-Jin Sul, Ph.D.
pronouns: he / his
510-495-8456 | ssul@xxxxxxx <mailto:ssul@xxxxxxx>
Staff Software Developer, Tech Lead
Advanced Analysis Group
DOE Joint Genome Institute
Lawrence Berkeley National Lab


On Mon, Apr 1, 2024 at 8:27 AM Seung-Jin Sul <ssul@xxxxxxx <mailto:ssul@xxxxxxx>> wrote:

    Thank you so much for the info, Jaime.

    On Mon, Apr 1, 2024, 8:17 AM Jaime Frey via HTCondor-users
    <htcondor-users@xxxxxxxxxxx <mailto:htcondor-users@xxxxxxxxxxx>> wrote:

        CpusUsage is calculated periodically by the condor_startd and
        communicated to the condor_starter, which includes it in updates
        to the condor_shadow on the Access Point. I believe it can be
        undefined for very short jobs.
        RemoteSysCpu and RemoteUserCpu are obtained from the OS's rusage
        data when the job exits, and thus are more reliably reported.

        For a completed job that had one execution attempt, the value
        (RemoteSysCpu + RemoteUserCpu) / CommittedTime should be close
        to CpusUsage.

        Two things to note that can throw off any computations:
        * CommittedTime and RemoteWallClockTime include the time to
        transfer input files before the job starts.
        * Some of these attributes (e.g. RemoteWallClockTime) are a
        total across multiple execution attempts.

         - Jaime

         > On Mar 26, 2024, at 2:01 PM, Seung-Jin Sul <ssul@xxxxxxx
        <mailto:ssul@xxxxxxx>> wrote:
         >
         > Hi,
         > We are parsing the history log file to extract the CPU
        usage data. We think the `CpusUsage` value is very useful, but we
        found that the `CpusUsage` line is missing for some tasks. Could
        someone explain why the line is missing in some cases?
         >
         > And what would be the best way to calculate the CPU
        utilization if we can't get `CpusUsage`? The values we can
        use are (again, `CpusUsage` may not exist):
         > ```
         > CommittedSuspensionTime = 0
         > CommittedTime = 1071
         > (CpusUsage = 1.185198573018748)
         > RemoteSysCpu = 1242.0
         > RemoteUserCpu = 4793.0
         > RemoteWallClockTime = 1071.0
         > RequestCpus = 16
         > ```
         >
         > Thank you!
         > Seung
         >


        _______________________________________________
        HTCondor-users mailing list
        To unsubscribe, send a message to
        htcondor-users-request@xxxxxxxxxxx
        <mailto:htcondor-users-request@xxxxxxxxxxx> with a
        subject: Unsubscribe
        You can also unsubscribe by visiting
        https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
        <https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users>

        The archives can be found at:
        https://lists.cs.wisc.edu/archive/htcondor-users/
        <https://lists.cs.wisc.edu/archive/htcondor-users/>

