Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu
- Date: Thu, 12 Sep 2024 17:35:01 +0200
- From: Ben Jones <ben.dylan.jones@xxxxxxxxx>
- Subject: Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu
Different version of condor, but I've seen something similar on EPs running 23.3.0 and maybe 23.0.1 on rhel9, with the StarterLog.slot* file having messages like:
08/21/24 00:57:34 (pid:1791323) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_pool_condor_slot1_10@xxxxxxxxxxxxxxxxxx: Device or resource busy
...in this case for a job that had:
"cumulativeremotesyscpu": 53752,
"cumulativeremoteusercpu": 319097,
"cumulativeslottime": 1308,
No containers involved. Not sure if it's possible that the cpu.stat file might have values from previous jobs if the tree isn't trimmed as per the error, or if it's completely unrelated.
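For a rough sanity check on numbers like these, here is a minimal Python sketch. The function names are illustrative and not part of HTCondor. It assumes the standard cgroup v2 cpu.stat layout (one "key value" pair per line, with user_usec and system_usec in microseconds); whether HTCondor reads exactly those fields is an assumption. The sample cpu.stat text is made up; the job figures are the ones quoted in this thread.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def implied_cores(user_cpu_s, sys_cpu_s, wall_s):
    """Average number of cores needed to accumulate this CPU time in wall_s seconds."""
    return (user_cpu_s + sys_cpu_s) / wall_s


def exceeds_cpu_limit(user_cpu_s, sys_cpu_s, request_cpus):
    """CPU clause of the PeriodicRemove expression quoted in the original report."""
    return sys_cpu_s + user_cpu_s > request_cpus * SECONDS_PER_DAY


def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content ('key value' per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Job from this message: cumulative CPU seconds vs. cumulativeslottime.
print(implied_cores(319097, 53752, 1308))

# Job from the original report: RemoteUserCpu, RemoteSysCpu, RemoteWallClockTime.
print(implied_cores(161746, 12643, 10))
print(exceeds_cpu_limit(161746, 12643, request_cpus=1))

# Hypothetical cpu.stat content, for illustration only.
sample = "usage_usec 1000000\nuser_usec 800000\nsystem_usec 200000\n"
stats = parse_cpu_stat(sample)
print(stats["user_usec"] / 1e6, stats["system_usec"] / 1e6)
```

Both jobs come out at hundreds or thousands of implied cores, far more than a single slot could plausibly provide, which would fit with the counters including usage that did not come from the job itself.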
> On 12 Sep 2024, at 17:03, Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> Sorry for the lack of followup. Our expert on this section of the code is traveling and I've run out of good ideas for debugging this in the meantime. He'll be back next week and I'll poke him to take a look.
>
> I can tell you that when HTCondor launches the job directly (vs asking Docker to run it), the cpu usage numbers for the job are taken from the cgroups file cpu.stat.
>
> - Jaime
>
>> On Aug 29, 2024, at 2:49 AM, Mark Slater <mark.slater@xxxxxxx> wrote:
>>
>> Hi Jaime,
>>
>> Thanks a lot for replying! Just asked at Manchester and the answers are:
>>> Can you tell us the following about your Execution Points:
>>>
>>> + Condor version
>> 23.0.10-1
>>> + linux distro
>> RHEL 9 based, Alma I believe
>>> + is Condor running as root
>> Yes
>>> + whether jobs are run with docker
>> Condor doesn't start a container itself, but the jobs will through singularity (i.e. the job script starts and soon into the run will start the actual payload using a singularity command).
>> Hope that helps!
>>
>> Thanks,
>> Mark
>>
>>
>>>
>>
>>
>>> These affect how HTCondor collects the cpu usage data for the job.
>>>
>>> - Jaime
>>>
>>>
>>>> On Aug 27, 2024, at 12:11 PM, Mark Slater <mark.slater@xxxxxxx> wrote:
>>>>
>>>> Hiya,
>>>>
>>>> In LHCb we've been seeing some of our pilots killed at a couple of sites for using too many CPU cores. Specifically, the output says:
>>>>
>>>>
>>>> scan-condor-job: RemoteWallClockTime=10
>>>> scan-condor-job: RemoteUserCpu=161746
>>>> scan-condor-job: RemoteSysCpu=12643
>>>> scan-condor-job: ImageSize=75000000
>>>> scan-condor-job: ExitStatus=0
>>>> scan-condor-job: JobStatus=3
>>>> scan-condor-job: ExitSignal=15
>>>> scan-condor-job: RemoveReason=The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || (RemoteWallClockTime > 24 * 60 * 60) || (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)' evaluated to TRUE
>>>> scan-condor-job: JobCurrentStartDate=1724656973
>>>> scan-condor-job: EnteredCurrentStatus=1724656983
>>>> scan-condor-job: ExitReason=died on signal 15 (Terminated)
>>>> scan-condor-job: RequestCpus=1
>>>> scan-condor-job: -----------------------------------------------------
>>>> scan-condor-job: LRMSStartTime=20240826072253Z
>>>> scan-condor-job: LRMSEndTime=20240826072303Z
>>>> scan-condor-job: ExitCode not found in condor log and condor_history
>>>> scan-condor-job: ---- Requested resources specified in grami file ----
>>>> scan-condor-job: -----------------------------------------------------
>>>> scan-condor-job: PeriodicRemove evaluated to TRUE
>>>> 2024-08-26T07:24:03Z Job state change INLRMS -> FINISHING Reason: Job failure detected
>>>> 2024-08-26T07:24:03Z Job state change FINISHING -> FINISHED Reason: Job failure detected
>>>>
>>>>
>>>> So the condition (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60) does evaluate to TRUE given the numbers above, but those numbers seem rather crazy, specifically:
>>>>
>>>> scan-condor-job: RemoteWallClockTime=10
>>>> scan-condor-job: RemoteUserCpu=161746
>>>> scan-condor-job: RemoteSysCpu=12643
>>>>
>>>> Unless I'm misunderstanding something, I don't see how a job with a wall time of 10s can use ~170000s of CPU! Has anyone seen this before or have any idea what could be causing it? A few extra (possibly relevant) bits of info:
>>>>
>>>> * This seems to happen randomly and at a low level. The jobs that fail are retried and work fine.
>>>>
>>>> * The values work out to using 2 CPUs over 24 hours (24 * 60 * 60 * 2 CPUs) - not sure if that's significant :)
>>>>
>>>> * This has been seen at 2 sites so far but as it's a low level it could be happening elsewhere as well and is being masked by other things
>>>>
>>>> * I don't know the versions of Condor being run but I would assume *relatively* recent as one of the sites is a large Tier 1
>>>>
>>>>
>>>> Happy to provide more info if possible but we are basically a user in this situation and the admins don't know what's causing it either. Any help would be greatly appreciated!
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mark
>>>>
>>>>
>>>> _______________________________________________
>>>> HTCondor-users mailing list
>>>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>>>> subject: Unsubscribe
>>>> You can also unsubscribe by visiting
>>>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>>>
>>>> The archives can be found at:
>>>> https://lists.cs.wisc.edu/archive/htcondor-users/
>>>>