
[HTCondor-users] Many job failure for Atlas VOs



Hi there,

For several weeks, I have had an increasing number of Atlas jobs that fail with the following message:

sup, 9000: Diag from worker : LRMS error: (-1) Job was killed by Condor taskbuffer, 300: The worker was failed while the job was running : LRMS error: (-1) Job was killed by Condor


The job logs show the error message below, but I don't see how to debug it. How can I get the values of the variables in the expression, so that I can understand which part of it makes the whole expression evaluate to "true"?


The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
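
One way to narrow this down is to evaluate each of the four OR-ed clauses separately once the attribute values are known (for example from the job ClassAd as printed by `condor_q -l` while the job is still in the queue). The sketch below is plain Python that imitates the clauses; it is not the ClassAd evaluator and ignores ClassAd "undefined" semantics except for the ResidentSetSize guard. The function name and all values in `example_ad` are hypothetical, except ResidentSetSize, which is the last value reported in the log below. In particular, the JobMemoryLimit value and its unit (KB, to match ResidentSetSize) are assumptions.

```python
def explain_periodic_remove(ad):
    """Return the names of the PeriodicRemove clauses that are true for `ad`.

    `ad` is a plain dict of job attributes; the clause names are made up
    for this illustration.
    """
    # Mirrors (ResidentSetSize =!= undefined ? ResidentSetSize : 0)
    rss = ad.get("ResidentSetSize")
    rss = 0 if rss is None else rss

    clauses = {
        "idle_after_start": ad["JobStatus"] == 1 and ad["NumJobStarts"] > 0,
        "cpu_limit": ad["RemoteUserCpu"] + ad["RemoteSysCpu"] > ad["JobCpuLimit"],
        "time_limit": ad["RemoteWallClockTime"] > ad["JobTimeLimit"],
        "memory_limit": rss > ad["JobMemoryLimit"],
    }
    return [name for name, hit in clauses.items() if hit]


# Hypothetical values, except ResidentSetSize (taken from the condor log).
example_ad = {
    "JobStatus": 2,
    "NumJobStarts": 1,
    "RemoteUserCpu": 1000.0,
    "RemoteSysCpu": 50.0,
    "JobCpuLimit": 172800,
    "RemoteWallClockTime": 1200.0,
    "JobTimeLimit": 172800,
    "ResidentSetSize": 12939288,   # KB, last update in the log
    "JobMemoryLimit": 8000000,     # assumed limit, assumed to be in KB
}

print(explain_periodic_remove(example_ad))
```

With the real limits substituted, the returned list shows which clause triggered the removal.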


Below is an extract from the ATLAS log file for one job:

scan-condor-job: condor log is at: /var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log
scan-condor-job: ----- begin condor log (/var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log) -----
000 (3227916.000.000) 2022-11-08 10:04:39 Job submitted from host: <134.158.121.102:9618?addrs=134.158.121.102-9618+[2001-660-5104-134-226-b9ff-fe75-159b]-9618&alias=clrarcce01.in2p3.fr&noUDP&sock=schedd_1328_79dc>
...
040 (3227916.000.000) 2022-11-08 11:50:21 Started transferring input files
	Transferring to host: <134.158.123.73:9618?addrs=134.158.123.73-9618&alias=clrwn259.in2p3.fr&noUDP&sock=slot1_4_2214_b693_7714>
...
040 (3227916.000.000) 2022-11-08 11:50:22 Finished transferring input files
...
001 (3227916.000.000) 2022-11-08 11:50:23 Job executing on host: <134.158.123.73:9618?addrs=134.158.123.73-9618&alias=clrwn259.in2p3.fr&noUDP&sock=startd_1964_b818>
...
006 (3227916.000.000) 2022-11-08 11:50:31 Image size of job updated: 56228
	30  -  MemoryUsage of job (MB)
	29868  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 11:55:31 Image size of job updated: 11393336
	126  -  MemoryUsage of job (MB)
	128852  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:00:32 Image size of job updated: 12436560
	243  -  MemoryUsage of job (MB)
	248360  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:05:33 Image size of job updated: 49021220
	11499  -  MemoryUsage of job (MB)
	11774552  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:10:33 Image size of job updated: 49021220
	12637  -  MemoryUsage of job (MB)
	12939284  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:10:33 Image size of job updated: 49021220
	12637  -  MemoryUsage of job (MB)
	12939288  -  ResidentSetSize of job (KB)
...
040 (3227916.000.000) 2022-11-08 12:10:33 Started transferring output files
...
040 (3227916.000.000) 2022-11-08 12:10:33 Finished transferring output files
...
009 (3227916.000.000) 2022-11-08 12:10:33 Job was aborted.
	The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
...
scan-condor-job: ----- end condor log (/var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log) -----
scan-condor-job: ----- Information extracted from Condor log -----
scan-condor-job: RemoteHost=134.158.123.73
scan-condor-job: ImageSize=49021220
scan-condor-job: -------------------------------------------------
scan-condor-job: scan-condor-job: No condor_history for Condor ID 3227916
scan-condor-job: ExitCode not found in condor log and condor_history
scan-condor-job: Job was killed by Condor
2022-11-08T11:17:54Z Job state change INLRMS -> FINISHING   Reason: Job failure detected
2022-11-08T11:17:54Z Job state change FINISHING -> FINISHED   Reason: Job failure detected
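
To see how memory use developed before the abort, the "Image size of job updated" (006) events in the condor log above can be turned into a timeline. This is a minimal sketch that assumes the log layout shown in the extract; the function name and regular expressions are written for this illustration, and the sample text is copied from two of the events above.

```python
import re

# Header line of a 006 event, e.g.
# "006 (3227916.000.000) 2022-11-08 12:05:33 Image size of job updated: 49021220"
EVENT_RE = re.compile(
    r"^006 \([\d.]+\) (?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Image size of job updated: \d+"
)
# Indented detail line, e.g. "\t11774552  -  ResidentSetSize of job (KB)"
RSS_RE = re.compile(r"^\s*(?P<rss>\d+)\s+-\s+ResidentSetSize of job \(KB\)")


def rss_timeline(log_text):
    """Return a list of (timestamp, ResidentSetSize in KB) from 006 events."""
    timeline = []
    current_ts = None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current_ts = header.group("ts")
            continue
        detail = RSS_RE.match(line)
        if detail and current_ts is not None:
            timeline.append((current_ts, int(detail.group("rss"))))
            current_ts = None
    return timeline


sample = (
    "006 (3227916.000.000) 2022-11-08 12:00:32 Image size of job updated: 12436560\n"
    "\t243  -  MemoryUsage of job (MB)\n"
    "\t248360  -  ResidentSetSize of job (KB)\n"
    "...\n"
    "006 (3227916.000.000) 2022-11-08 12:05:33 Image size of job updated: 49021220\n"
    "\t11499  -  MemoryUsage of job (MB)\n"
    "\t11774552  -  ResidentSetSize of job (KB)\n"
    "...\n"
)

timeline = rss_timeline(sample)
peak_ts, peak_rss = max(timeline, key=lambda item: item[1])
print(peak_ts, peak_rss)
```

Comparing the peak value with JobMemoryLimit (not shown in the log) would indicate whether the memory clause of PeriodicRemove is the one that fired.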


Any ideas are welcome.

Thanks for your help.

Jean-Claude