
[HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu

Date: Tue, 27 Aug 2024 18:11:14 +0100
From: Mark Slater <mark.slater@xxxxxxx>
Subject: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu

Hiya,

In LHCb we've been seeing some of our pilots killed at a couple of sites for using too many CPU cores. Specifically, the output says:



scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643
scan-condor-job: ImageSize=75000000
scan-condor-job: ExitStatus=0
scan-condor-job: JobStatus=3
scan-condor-job: ExitSignal=15

scan-condor-job: RemoveReason=The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || (RemoteWallClockTime > 24 * 60 * 60) || (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)' evaluated to TRUE

scan-condor-job: JobCurrentStartDate=1724656973
scan-condor-job: EnteredCurrentStatus=1724656983
scan-condor-job: ExitReason=died on signal 15 (Terminated)
scan-condor-job: RequestCpus=1
scan-condor-job: -----------------------------------------------------
scan-condor-job: LRMSStartTime=20240826072253Z
scan-condor-job: LRMSEndTime=20240826072303Z
scan-condor-job: ExitCode not found in condor log and condor_history
scan-condor-job: ---- Requested resources specified in grami file ----
scan-condor-job: -----------------------------------------------------
scan-condor-job: PeriodicRemove evaluated to TRUE

2024-08-26T07:24:03Z Job state change INLRMS -> FINISHING Reason: Job failure detected
2024-08-26T07:24:03Z Job state change FINISHING -> FINISHED Reason: Job failure detected

So the condition (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60) does evaluate to TRUE given the numbers above, but those numbers seem rather crazy, specifically:


scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643

Unless I'm misunderstanding something, I don't see how a job with a walltime of 10s can use ~170000s of CPU! Has anyone seen this before, or have any idea what could be causing it? A few extra (possibly relevant) bits of info:
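
As a sanity check, here is a minimal Python sketch that re-evaluates the wall-time and CPU clauses of the PeriodicRemove expression quoted above, using the values from the log. The dictionary and function names are purely illustrative and not part of HTCondor. The "plausible upper bound" assumes the job never used more cores than RequestCpus. The first clause is left out because NumJobStarts is not in the output (and JobStatus was 3, not 1).

```python
# Values copied from the scan-condor-job output above.
job = {
    "RemoteWallClockTime": 10,
    "RemoteUserCpu": 161746,
    "RemoteSysCpu": 12643,
    "RequestCpus": 1,
}

DAY = 24 * 60 * 60  # seconds in 24 hours


def wall_clause(ad):
    """(RemoteWallClockTime > 24 * 60 * 60)"""
    return ad["RemoteWallClockTime"] > DAY


def cpu_clause(ad):
    """(RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)"""
    return ad["RemoteSysCpu"] + ad["RemoteUserCpu"] > ad["RequestCpus"] * DAY


total_cpu = job["RemoteSysCpu"] + job["RemoteUserCpu"]

# Assumption: the job used at most RequestCpus cores for the whole wall time.
max_plausible_cpu = job["RemoteWallClockTime"] * job["RequestCpus"]

print("wall clause:", wall_clause(job))
print("cpu clause:", cpu_clause(job))
print("reported CPU seconds:", total_cpu)
print("plausible upper bound:", max_plausible_cpu)
```

Only the CPU clause fires, and the reported CPU time is several orders of magnitude above what 10 seconds of wall time on one core could produce.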

* This seems to happen randomly and at a low level. The jobs that fail are retried and work fine.

* The values work out to using 2 CPUs over 24 hours (24 * 60 * 60 * 2 CPUs) - not sure if that's significant :)

* This has been seen at 2 sites so far, but as it's at a low level it could be happening elsewhere as well and be masked by other things

* I don't know the versions of Condor being run, but I would assume *relatively* recent as one of the sites is a large Tier 1
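
The "2 CPUs over 24 hours" observation can be checked with a few lines of Python. This is only a sketch of the arithmetic using the two CPU values from the log; the variable names are illustrative.

```python
# CPU values copied from the scan-condor-job output above.
user_cpu = 161746
sys_cpu = 12643

DAY = 24 * 60 * 60  # seconds in 24 hours

total_cpu = user_cpu + sys_cpu

# Express the reported CPU time in units of "one core busy for 24 hours".
cpu_days = total_cpu / DAY

# How far the reported total is from exactly 2 cores * 24 hours.
excess_over_two_cpu_days = total_cpu - 2 * DAY

print("total CPU seconds:", total_cpu)
print("CPU-days:", round(cpu_days, 3))
print("seconds above 2 CPU-days:", excess_over_two_cpu_days)
```

So the reported total is close to, but not exactly, two CPU-days.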

Happy to provide more info if possible, but we are basically a user in this situation and the admins don't know what's causing it either. Any help would be greatly appreciated!



Thanks,

Mark

Follow-Ups:
- Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu
  - From: Jaime Frey
