Hello,

We're using Condor's Docker universe (HTCondor 8.6.1, Docker 1.27) on Amazon EC2 instances. Jobs are being terminated intermittently (the same job sometimes executes successfully) when the reported MemoryUsage is unreasonably higher than the specified RequestMemory. This was NOT the case earlier, when we ran the same jobs in HTCondor's Standard Universe with our own wrapper to execute docker run.

Any suggestions or help would be appreciated. The job log and the condor submit file settings are pasted below:

000 (112667.000.000) 08/04 20:49:24 Job submitted from host: <10.XXX.X.XXX:9618?addrs=10.XXX.X.XXX-9618+[--1]-9618&noUDP&sock=24501_8a68_3>
    DAG Node: block_0000
...
001 (112667.000.000) 08/04 21:54:23 Job executing on host: <10.XXX.X.XXX:34479?addrs=10.XXX.X.XXX-34479+[--1]-34479>
...
006 (112667.000.000) 08/04 21:54:24 Image size of job updated: 133664575
    133664575 - MemoryUsage of job (MB)
...
005 (112667.000.000) 08/04 21:54:25 Job terminated.
    (0) Abnormal termination (signal 1)
    (0) No core file
        Usr 0 00:00:00, Sys 0 00:00:01 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:01 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    182 - Run Bytes Sent By Job
    9689 - Run Bytes Received By Job
    182 - Total Bytes Sent By Job
    9689 - Total Bytes Received By Job
    Partitionable Resources :     Usage   Request Allocated
       Cpus                 :                   1         1
       Disk (KB)            :        23        10   1047702
       Memory (MB)          : 133664575      1844      1844

UNIVERSE = docker
...
LOG = job.log
JOB_MACHINE_ATTRS = Machine
JOB_MACHINE_ATTRS_HISTORY_LENGTH = 5
JobLeaseDuration = 600
REQUIREMENTS = HAS_DOCKER && HAS_RCP_DFS && (WORKER_TYPE == "SMALL") && target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2
RequestMemory = 1.8G
RequestCpus = 1
PRIORITY = 1201
PERIODIC_REMOVE = ((JobStatus==5) && (CurrentTime - EnteredCurrentStatus) > 300) || \
                  ((JobStatus==2) && (CurrentTime - EnteredCurrentStatus) > 3600)
QUEUE
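
To put the reported number in perspective, here is a quick sanity check on the figures from the log. This is only a minimal Python sketch: the function name usage_summary is made up for illustration, and the two input values are copied from the "Memory (MB)" row of the job log (Usage and Request).

```python
def usage_summary(usage_mb, request_mb):
    """Return (usage/request ratio, usage expressed in TiB).

    Hypothetical helper for illustration only; it is not part of HTCondor.
    Both arguments are assumed to be in MB, as labelled in the job log.
    """
    ratio = usage_mb / request_mb
    usage_tib = usage_mb / (1024 * 1024)  # MB -> GiB -> TiB, treating MB as MiB
    return ratio, usage_tib


# Values copied from the "Memory (MB)" row of the log: Usage and Request.
ratio, usage_tib = usage_summary(133664575, 1844)
print(f"reported usage is {ratio:.0f}x the request, about {usage_tib:.1f} TiB")
```

If the reported MemoryUsage really is in MB, that is far more memory than a SMALL worker could plausibly have, so the value itself looks bogus to us rather than a real measurement of the job.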