Sorry for the lack of followup. Our expert on this section of the code is traveling and I've run out of good ideas for debugging this in the meantime. He'll be back next week and I'll poke him to take a look.
I can tell you that when HTCondor launches the job directly (vs asking Docker to run it), the cpu usage numbers for the job are taken from the cgroups file cpu.stat.
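For anyone who wants to inspect those counters by hand, here is a minimal sketch of reading a cpu.stat file. It assumes the cgroup v2 layout, where cpu.stat holds "key value" lines with usage_usec, user_usec and system_usec in microseconds. It is not HTCondor's own code; the function names and the sample values are made up for illustration.

```python
def parse_cpu_stat(text):
    """Parse the 'key value' lines of a cgroup v2 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def cpu_seconds(stats):
    """Return (user, system) CPU time in seconds from the microsecond counters."""
    return stats["user_usec"] / 1_000_000, stats["system_usec"] / 1_000_000


# Hypothetical sample contents, not taken from any real job.
SAMPLE = """usage_usec 3000000
user_usec 2500000
system_usec 500000
"""

if __name__ == "__main__":
    user, system = cpu_seconds(parse_cpu_stat(SAMPLE))
    print(f"user={user}s system={system}s")
```

On a live Execution Point the text would come from the cpu.stat file inside the job's cgroup directory, whose path depends on the site's configuration.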
- Jaime
On Aug 29, 2024, at 2:49 AM, Mark Slater <mark.slater@xxxxxxx> wrote:
Hi Jaime,
Thanks a lot for replying! Just asked at Manchester and the answers are:
Can you tell us the following about your Execution Points:
+ Condor version
23.0.10-1
+ linux distro
RHEL 9 based, Alma I believe
+ is Condor running as root
Yes
+ whether jobs are run with docker
Condor doesn't start a container itself, but the jobs will through singularity (i.e. the job script starts and soon into the run will start the actual payload using a singularity command).
Hope that helps!
Thanks,
Mark
These affect how HTCondor collects the cpu usage data for the job.
- Jaime
On Aug 27, 2024, at 12:11 PM, Mark Slater <mark.slater@xxxxxxx> wrote:
Hiya,
In LHCb we've been seeing some of our pilots killed at a couple of sites for using too many CPU cores. Specifically, the output says:
scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643
scan-condor-job: ImageSize=75000000
scan-condor-job: ExitStatus=0
scan-condor-job: JobStatus=3
scan-condor-job: ExitSignal=15
scan-condor-job: RemoveReason=The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || (RemoteWallClockTime > 24 * 60 * 60) || (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60)' evaluated to TRUE
scan-condor-job: JobCurrentStartDate=1724656973
scan-condor-job: EnteredCurrentStatus=1724656983
scan-condor-job: ExitReason=died on signal 15 (Terminated)
scan-condor-job: RequestCpus=1
scan-condor-job: -----------------------------------------------------
scan-condor-job: LRMSStartTime=20240826072253Z
scan-condor-job: LRMSEndTime=20240826072303Z
scan-condor-job: ExitCode not found in condor log and condor_history
scan-condor-job: ---- Requested resources specified in grami file ----
scan-condor-job: -----------------------------------------------------
scan-condor-job: PeriodicRemove evaluated to TRUE
2024-08-26T07:24:03Z Job state change INLRMS -> FINISHING Reason: Job failure detected
2024-08-26T07:24:03Z Job state change FINISHING -> FINISHED Reason: Job failure detected
So the condition (RemoteSysCpu + RemoteUserCpu > RequestCpus * 24 * 60 * 60) does evaluate to TRUE given the numbers above, but those numbers seem rather crazy, specifically:
scan-condor-job: RemoteWallClockTime=10
scan-condor-job: RemoteUserCpu=161746
scan-condor-job: RemoteSysCpu=12643
Unless I'm misunderstanding something, I don't see how a job with a wall time of 10s can use ~170000s of CPU! Has anyone seen this before or have any idea what could be causing it? A few extra (possibly relevant) bits of info:
* This seems to happen randomly and at a low level. The jobs that fail are retried and work fine.
* The values work out to using 2 CPUs over 24 hours (24 * 60 * 60 * 2 CPUs) - not sure if that's significant :)
* This has been seen at 2 sites so far but as it's a low level it could be happening elsewhere as well and is being masked by other things
* I don't know the versions of Condor being run but I would assume *relatively* recent as one of the sites is a large Tier 1
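To make the arithmetic above concrete, here is a small sketch that evaluates only the CPU clause of the PeriodicRemove expression with the reported values. The function names are my own, and this is plain Python rather than the ClassAd evaluation HTCondor performs.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def cpu_limit_exceeded(remote_user_cpu, remote_sys_cpu, request_cpus):
    """CPU clause of the PeriodicRemove expression quoted in the log."""
    return remote_sys_cpu + remote_user_cpu > request_cpus * SECONDS_PER_DAY


def implied_cores(remote_user_cpu, remote_sys_cpu, wall_clock):
    """Average number of cores needed to accumulate that CPU time in the wall time."""
    return (remote_user_cpu + remote_sys_cpu) / wall_clock


# Values reported by scan-condor-job for the failed pilot.
USER_CPU, SYS_CPU, WALL, REQUEST_CPUS = 161746, 12643, 10, 1

if __name__ == "__main__":
    total = USER_CPU + SYS_CPU
    print("total CPU seconds:", total)
    print("limit exceeded:", cpu_limit_exceeded(USER_CPU, SYS_CPU, REQUEST_CPUS))
    print("implied cores:", implied_cores(USER_CPU, SYS_CPU, WALL))
    print("CPU-days:", round(total / SECONDS_PER_DAY, 2))
```

The total is 174389 CPU seconds, about 2.02 CPU-days, which would need roughly 17439 cores running flat out for the 10 s of wall time. That is why the numbers look like a reporting problem rather than real usage.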
Happy to provide more info if possible but we are basically a user in this situation and the admins don't know what's causing it either. Any help would be greatly appreciated!
Thanks,
Mark
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/