Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu

Date: Wed, 28 Aug 2024 20:24:57 +0000
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Massive overinflation of RemoteUserCpu & RemoteSysCpu

That does sound like bogus values.
Can you tell us the following about your Execution Points:
+ Condor version
+ linux distro
+ is Condor running as root
+ whether jobs are run with docker

These affect how HTCondor collects the cpu usage data for the job.

 - Jaime

> On Aug 27, 2024, at 12:11 PM, Mark Slater <mark.slater@xxxxxxx> wrote:
> 
> Hiya,
> 
> In LHCb we've been seeing some of our pilots killed at a couple of sites for using too many CPU cores. Specifically, the output says:
> 
> 
> scan-condor-job: RemoteWallClockTime=10
> scan-condor-job: RemoteUserCpu=161746
> scan-condor-job: RemoteSysCpu=12643
> scan-condor-job: ImageSize=75000000
> scan-condor-job: ExitStatus=0
> scan-condor-job: JobStatus=3
> scan-condor-job: ExitSignal=15
> scan-condor-job: RemoveReason=The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || (RemoteWallClockTime > 24 * 60 * 60) || (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)' evaluated to TRUE
> scan-condor-job: JobCurrentStartDate=1724656973
> scan-condor-job: EnteredCurrentStatus=1724656983
> scan-condor-job: ExitReason=died on signal 15 (Terminated)
> scan-condor-job: RequestCpus=1
> scan-condor-job: -----------------------------------------------------
> scan-condor-job: LRMSStartTime=20240826072253Z
> scan-condor-job: LRMSEndTime=20240826072303Z
> scan-condor-job: ExitCode not found in condor log and condor_history
> scan-condor-job: ---- Requested resources specified in grami file ----
> scan-condor-job: -----------------------------------------------------
> scan-condor-job: PeriodicRemove evaluated to TRUE
> 2024-08-26T07:24:03Z Job state change INLRMS -> FINISHING Reason: Job failure detected
> 2024-08-26T07:24:03Z Job state change FINISHING -> FINISHED Reason: Job failure detected
> 
> 
> So the condition (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60) does evaluate to TRUE given the numbers above, but those numbers seem rather crazy, specifically:
> 
> scan-condor-job: RemoteWallClockTime=10
> scan-condor-job: RemoteUserCpu=161746
> scan-condor-job: RemoteSysCpu=12643
> 
> Unless I'm misunderstanding something, I don't see how a job with a wall time of 10s can use ~170000s of CPU! Has anyone seen this before or have any idea what could be causing it? A few extra (possibly relevant) bits of info:
> 
> * This seems to happen randomly and at a low level. The jobs that fail are retried and work fine.
> 
> * The values work out to using 2 CPUs over 24 hours (24 * 60 * 60 * 2 CPUs) - not sure if that's significant :)
> 
> * This has been seen at 2 sites so far but as it's a low level it could be happening elsewhere as well and is being masked by other things
> 
> * I don't know the versions of Condor being run but I would assume *relatively* recent as one of the sites is a large Tier 1
> 
> 
> Happy to provide more info if possible but we are basically a user in this situation and the admins don't know what's causing it either. Any help would be greatly appreciated!
> 
> 
> Thanks,
> 
> Mark
> 
> 

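The arithmetic in the quoted message can be checked with a short sketch. It is plain Python, not an HTCondor API, and the function names `cpu_limit_exceeded` and `implied_cores` are hypothetical helpers introduced only for illustration. The numbers are copied from the scan-condor-job output in the quoted message.

```python
# Values copied from the quoted scan-condor-job output.
remote_wall_clock_time = 10   # seconds
remote_user_cpu = 161746      # seconds
remote_sys_cpu = 12643        # seconds
request_cpus = 1

SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def cpu_limit_exceeded(user_cpu, sys_cpu, cpus):
    """Hypothetical helper: the CPU clause of the quoted PeriodicRemove expression."""
    return sys_cpu + user_cpu > cpus * SECONDS_PER_DAY


def implied_cores(user_cpu, sys_cpu, wall):
    """Hypothetical helper: average cores needed to accrue this CPU time in `wall` seconds."""
    return (user_cpu + sys_cpu) / wall


total_cpu = remote_user_cpu + remote_sys_cpu
print("total CPU seconds:", total_cpu)
print("CPU clause true:", cpu_limit_exceeded(remote_user_cpu, remote_sys_cpu, request_cpus))
print("implied cores:", implied_cores(remote_user_cpu, remote_sys_cpu, remote_wall_clock_time))
print("excess over 2 CPU-days:", total_cpu - 2 * SECONDS_PER_DAY)
```

Under these numbers the CPU clause is true, yet the implied average of more than 17,000 cores over a 10 second wall time is not physically plausible for a 1-core request. The total is also only about 1,600 seconds above two CPU-days (172,800 s), which matches the observation in the quoted message.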