
Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu



We observe what is probably the same issue on machines with a stuck CVMFS mount. It seems that HTCondor is not able to destroy cgroup subgroups that contain processes which can't be killed even with SIGKILL, e.g.
[root@mff1701 ~]# cat /sys/fs/cgroup/system.slice/htcondor/condor_scratch_condor_slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx/cgroup.procs | head -1
3937544
[root@mff1701 ~]# ps -ef | grep 3937544
root     2910331 2104404  0 12:41 pts/95   00:00:00 grep --color=auto 3937544
atlassg+ 3937544       1  0 Aug22 ?        00:00:00 ls /cvmfs/alice.cern.ch
[root@mff1701 ~]# kill -9 3937544
[root@mff1701 ~]# ps -ef | grep 3937544
root     2910340 2104404  0 12:41 pts/95   00:00:00 grep --color=auto 3937544
atlassg+ 3937544       1  0 Aug22 ?        00:00:00 ls /cvmfs/alice.cern.ch
[root@mff1701 ~]# cat /var/log/condor/StarterLog.slot1_2
...
09/14/24 05:57:04 (pid:2644825) About to exec /scratch/condor/dir_2644825/condorjob.sh 
09/14/24 05:57:04 (pid:2644832) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup system.slice/htcondor/condor_scratch_condor_slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx: Device or resource busy
09/14/24 05:57:04 (pid:2644832) Successfully moved procid 2644832 to cgroup /sys/fs/cgroup/system.slice/htcondor/condor_scratch_condor_slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx/cgroup.procs
09/14/24 05:57:04 (pid:2644825) Create_Process succeeded, pid=2644832
...
Even though HTCondor is not able to clean up this cgroup subgroup for slot1_2, the subgroup is still reused, and it seems to me that its accumulated time is accounted to the new job => jobs that reuse slot1_2 get RemoteSysCpu and RemoteUserCpu from all jobs that ran there since HTCondor started.
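
A quick way to check an EP for this state (a sketch, assuming the cgroup v2 layout from the transcript above; the system.slice/htcondor path may differ with other configurations):

for cg in /sys/fs/cgroup/system.slice/htcondor/condor_*; do
    grep -q . "$cg/cgroup.procs" 2>/dev/null || continue  # skip cleaned-up cgroups
    echo "== leftover processes in $cg =="
    while read -r pid; do
        # STAT 'D' = uninterruptible sleep; these survive SIGKILL (e.g. stuck CVMFS I/O)
        ps -o pid=,stat=,cmd= -p "$pid"
    done < "$cg/cgroup.procs"
    grep '^usage_usec' "$cg/cpu.stat"  # CPU time a job reusing this cgroup would inherit
done

Any cgroup this prints is one that trimCgroupTree could not remove and that a new job may inherit, counters and all.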

I think HTCondor should never reuse an existing cgroup slot subgroup...

Petr

On 9/12/24 18:15, Jaime Frey via HTCondor-users wrote:
This looks very suspicious. We do this trim operation to clear out all resource usage numbers for a cgroup before reusing it for a new job.

 - Jaime

On Sep 12, 2024, at 10:35 AM, Ben Jones <ben.dylan.jones@xxxxxxxxx> wrote:

Different version of condor, but I've seen something similar on EPs running 23.3.0 and maybe 23.0.1 on rhel9, with the StarterLog.slot* file having messages like:

08/21/24 00:57:34 (pid:1791323) ProcFamilyDirectCgroupV2::trimCgroupTree error removing cgroup htcondor/condor_pool_condor_slot1_10@xxxxxxxxxxxxxxxxxx: Device or resource busy

..in this case a job that had:

"cumulativeremotesyscpu": 53752,
"cumulativeremoteusercpu": 319097,
"cumulativeslottime": 1308,

No containers involved. Not sure if it's possible that the cpu.stat file might have values from previous jobs if the tree isn't trimmed as per the error, or if it's completely unrelated.
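
The ratio alone makes the carryover theory plausible:

cumulativeremoteusercpu + cumulativeremotesyscpu = 319097 + 53752 = 372849 s
372849 s / cumulativeslottime (1308 s) ~= 285 cores' worth of CPU

No single job on a normal EP could burn ~285 core-equivalents for its whole slot time.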

On 12 Sep 2024, at 17:03, Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Sorry for the lack of followup. Our expert on this section of the code is traveling and I've run out of good ideas for debugging this in the meantime. He'll be back next week and I'll poke him to take a look.

I can tell you that when HTCondor launches the job directly (vs asking Docker to run it), the cpu usage numbers for the job are taken from the cgroups file cpu.stat.
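
For reference, cpu.stat in a cgroup v2 slot directory looks roughly like this (path taken from the transcripts elsewhere in this thread; the values are illustrative, back-computed from the numbers Mark reported):

# grep _usec /sys/fs/cgroup/system.slice/htcondor/condor_scratch_condor_slot1_2@xxxxxxxxxxxxxxxxxxxxxxxx/cpu.stat
usage_usec 174389000000
user_usec 161746000000
system_usec 12643000000

All counters are microseconds since the cgroup was created, so presumably user_usec / 10^6 feeds RemoteUserCpu (161746 s) and system_usec / 10^6 feeds RemoteSysCpu (12643 s). If the cgroup was never trimmed, those counters still include every earlier job that ran in the same slot subgroup.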

- Jaime

On Aug 29, 2024, at 2:49 AM, Mark Slater <mark.slater@xxxxxxx> wrote:

Hi Jaime,

Thanks a lot for replying! Just asked at Manchester and the answers are:
Can you tell us the following about your Execution Points:
+ Condor version
23.0.10-1
+ linux distro
RHEL 9 based, Alma I believe
+ is Condor running as root
Yes
+ whether jobs are run with docker
Condor doesn't start a container itself, but the jobs do through Singularity (i.e. the job script starts and soon into the run launches the actual payload using a singularity command).
Hope that helps!

Thanks,
Mark




These affect how HTCondor collects the cpu usage data for the job.

- Jaime


On Aug 27, 2024, at 12:11 PM, Mark Slater <mark.slater@xxxxxxx> wrote:

Hiya,

In LHCb we've been seeing some of our pilots killed at a couple of sites for using too many CPU cores. Specifically, the output says:


scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643
scan-condor-job: ImageSize=75000000
scan-condor-job: ExitStatus=0
scan-condor-job: JobStatus=3
scan-condor-job: ExitSignal=15
scan-condor-job: RemoveReason=The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || (RemoteWallClockTime > 24 * 60 * 60) || (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)' evaluated to TRUE
scan-condor-job: JobCurrentStartDate=1724656973
scan-condor-job: EnteredCurrentStatus=1724656983
scan-condor-job: ExitReason=died on signal 15 (Terminated)
scan-condor-job: RequestCpus=1
scan-condor-job: -----------------------------------------------------
scan-condor-job: LRMSStartTime=20240826072253Z
scan-condor-job: LRMSEndTime=20240826072303Z
scan-condor-job: ExitCode not found in condor log and condor_history
scan-condor-job: ---- Requested resources specified in grami file ----
scan-condor-job: -----------------------------------------------------
scan-condor-job: PeriodicRemove evaluated to TRUE
2024-08-26T07:24:03Z Job state change INLRMS -> FINISHING Reason: Job failure detected
2024-08-26T07:24:03Z Job state change FINISHING -> FINISHED Reason: Job failure detected


So the condition (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60) does evaluate to TRUE given the numbers above, but those numbers seem rather crazy, specifically:

scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643
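
Spelled out with RequestCpus=1, the expression trips exactly as reported:

RemoteSysCpu + RemoteUserCpu = 12643 + 161746 = 174389 s
RequestCpus * 24 * 60 * 60   = 1 * 86400      =  86400 s
174389 > 86400  =>  PeriodicRemove evaluates to TRUE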

Unless I'm misunderstanding something, I don't see how a job with a wall time of 10s can use ~170000s of CPU! Has anyone seen this before or have any idea what could be causing it? A few extra (possibly relevant) bits of info:

* This seems to happen randomly and at a low level. The jobs that fail are retried and work fine.

* The values work out to roughly 2 CPUs fully used for 24 hours (2 * 24 * 60 * 60 = 172800 s, against the 174389 s total) - not sure if that's significant :)

* This has been seen at 2 sites so far, but as it happens at a low rate it could be occurring elsewhere as well and be masked by other things

* I don't know the versions of Condor being run but I would assume *relatively* recent as one of the sites is a large Tier 1


Happy to provide more info if possible but we are basically a user in this situation and the admins don't know what's causing it either. Any help would be greatly appreciated!


Thanks,

Mark



_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/