
Re: [HTCondor-users] Many job failures for Atlas VOs



Hi Jean-Claude,

I suspect you are dealing with two things here. The first is the failure on the worker node, which should be investigated further by reading through the starter/startd logs on the worker node. Other people more fluent in what is going on with ATLAS pilots will be able to help with that part, I suppose.
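To find those logs, something like this should work (the log paths below are just the usual defaults and the slot name is taken from your log extract, so adjust both to your setup):

# ask the local condor where the startd and starter write their logs (run on the worker node)
condor_config_val STARTD_LOG STARTER_LOG

# then look for the job id in the per-slot starter log and the startd log, e.g.
grep 3227916 /var/log/condor/StarterLog.slot1_4
grep 3227916 /var/log/condor/StartLog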

The log entry in the schedd is from the periodic-remove expression that found your job and removed it for some reason. Using condor_history -l <job-id> you will be able to check the job's attributes and see which part of the condition became true. (I suspect it was JobStatus == 1 && NumJobStarts > 0, which is kind of a tight ship you are running ;))
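For example, with the cluster id from your log extract below (and assuming the job is still in the schedd's history):

# full history ClassAd of the removed job
condor_history -l 3227916.0

# or only the attributes used in the PeriodicRemove expression from your log
condor_history 3227916.0 -af JobStatus NumJobStarts RemoteUserCpu RemoteSysCpu \
    JobCpuLimit RemoteWallClockTime JobTimeLimit ResidentSetSize JobMemoryLimit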

For the future, you should put some informative remove reasons into the periodic-remove definition, roughly like this:

# SYSTEM_PERIODIC_REMOVE with reasons
########################

# remove jobs that have been running longer than their retirement time
RemoveReadyJobs = (( JobStatus == 2 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobRetirementTime ))

# remove jobs on hold for longer than 7 days
RemoveHeldJobs = ( (JobStatus==5 && (CurrentTime - EnteredCurrentStatus) > 7 * 24 * 3600) )

# remove jobs with too many job starts or shadow starts
RemoveMultipleRunJobs = ( NumJobStarts >= 10 )

# remove jobs idle for too long 
MaxJobIdleTime = 7 * 24 * 3600
RemoveIdleJobs = (( JobStatus == 1 ) && ( ( CurrentTime - EnteredCurrentStatus ) > MaxJobIdleTime ))

# do it
SYSTEM_PERIODIC_REMOVE = $(RemoveHeldJobs)           || \
                         $(RemoveMultipleRunJobs)    || \
                         $(RemoveIdleJobs)           || \
                         $(RemoveReadyJobs)

# set reason for remove
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
 ifThenElse($(RemoveReadyJobs), "runtime longer than reserved", \
 ifThenElse($(RemoveHeldJobs), "being in hold state for 7 days", \
 ifThenElse($(RemoveMultipleRunJobs), "10 or more job starts", \
 "being in idle state for 7 days"))), ".")

Best
christoph


-- 
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

----- Original Message -----
From: "Jean-Claude CHEVALEYRE" <jean-claude.chevaleyre@xxxxxxxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, 8 November 2022 13:08:40
Subject: [HTCondor-users] Many job failures for Atlas VOs

Hi there,

For several weeks, I have had an increasing number of Atlas jobs that fail with the following message:

sup, 9000: Diag from worker : LRMS error: (-1) Job was killed by Condor taskbuffer, 300: The worker was failed while the job was running : LRMS error: (-1) Job was killed by Condor


If I look in the output of the job logs, I see this kind of error message. However, I don't see how to debug this. How can I get access to the contents of the variables, which would allow me to understand why the expression below evaluated to "true":


The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE


Below is an extract of the Atlas log file for one job:

scan-condor-job: condor log is at: /var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log
scan-condor-job: ----- begin condor log (/var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log) -----
000 (3227916.000.000) 2022-11-08 10:04:39 Job submitted from host: <134.158.121.102:9618?addrs=134.158.121.102-9618+[2001-660-5104-134-226-b9ff-fe75-159b]-9618&alias=clrarcce01.in2p3.fr&noUDP&sock=schedd_1328_79dc>
...
040 (3227916.000.000) 2022-11-08 11:50:21 Started transferring input files
	Transferring to host: <134.158.123.73:9618?addrs=134.158.123.73-9618&alias=clrwn259.in2p3.fr&noUDP&sock=slot1_4_2214_b693_7714>
...
040 (3227916.000.000) 2022-11-08 11:50:22 Finished transferring input files
...
001 (3227916.000.000) 2022-11-08 11:50:23 Job executing on host: <134.158.123.73:9618?addrs=134.158.123.73-9618&alias=clrwn259.in2p3.fr&noUDP&sock=startd_1964_b818>
...
006 (3227916.000.000) 2022-11-08 11:50:31 Image size of job updated: 56228
	30  -  MemoryUsage of job (MB)
	29868  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 11:55:31 Image size of job updated: 11393336
	126  -  MemoryUsage of job (MB)
	128852  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:00:32 Image size of job updated: 12436560
	243  -  MemoryUsage of job (MB)
	248360  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:05:33 Image size of job updated: 49021220
	11499  -  MemoryUsage of job (MB)
	11774552  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:10:33 Image size of job updated: 49021220
	12637  -  MemoryUsage of job (MB)
	12939284  -  ResidentSetSize of job (KB)
...
006 (3227916.000.000) 2022-11-08 12:10:33 Image size of job updated: 49021220
	12637  -  MemoryUsage of job (MB)
	12939288  -  ResidentSetSize of job (KB)
...
040 (3227916.000.000) 2022-11-08 12:10:33 Started transferring output files
...
040 (3227916.000.000) 2022-11-08 12:10:33 Finished transferring output files
...
009 (3227916.000.000) 2022-11-08 12:10:33 Job was aborted.
	The job attribute PeriodicRemove expression '(JobStatus == 1 && NumJobStarts > 0) || RemoteUserCpu + RemoteSysCpu > JobCpuLimit || RemoteWallClockTime > JobTimeLimit || ((ResidentSetSize =!= undefined ? ResidentSetSize : 0) > JobMemoryLimit)' evaluated to TRUE
...
scan-condor-job: ----- end condor log (/var/spool/arc/sessiondir/hEzLDmHiZC2nNlgJ7owi1f5nABFKDmABFKDm8pVKDmnLQKDmbVOzAo/log) -----
scan-condor-job: ----- Information extracted from Condor log -----
scan-condor-job: RemoteHost=134.158.123.73
scan-condor-job: ImageSize=49021220
scan-condor-job: -------------------------------------------------
scan-condor-job: scan-condor-job: No condor_history for Condor ID 3227916
scan-condor-job: ExitCode not found in condor log and condor_history
scan-condor-job: Job was killed by Condor
2022-11-08T11:17:54Z Job state change INLRMS -> FINISHING   Reason: Job failure detected
2022-11-08T11:17:54Z Job state change FINISHING -> FINISHED   Reason: Job failure detected


Any ideas are welcome.

Thanks for your help.

Jean-Claude