I am seeing jobs killed when they exceed their requested memory.
I believed I had shut off all preemption and eviction, but that does not seem to be the case. Below are our condor_local, a typical submit file (we are running under DAGMan), and a submit.log. Note that our request_memory is 24 GB, and the jobs seem to exit prematurely at approximately 24 GB. I believe the process may be trying to use more than that, and these particular nodes have a lot more (unused) memory on them.
CONDOR_HOST = master
COLLECTOR_NAME = GRID
COLLECTOR_HOST = $(CONDOR_HOST):9886?sock=collector
DAEMON_LIST = {{getv "/condor/daemons"}}
# DAEMON_LIST = MASTER, SCHEDD, STARTD
# DAEMON_LIST = MASTER, SCHEDD
## When something goes wrong with condor at your site, who should get
## the email?
CONDOR_ADMIN = admin@xxxxxxxx
#UID_DOMAIN = viqi.org
#TRUST_UID_DOMAIN = TRUE
#SOFT_UID_DOMAIN = TRUE
#FILESYSTEM_DOMAIN = viqi.org
## Do you want to use NFS for file access instead of remote system calls
ALLOW_READ = $(ALLOW_READ), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_NEGOTIATOR = 172.*, 10.*, 128.111.*, {{getv "/condor/allowextra" ""}}
#ALLOW_READ = $(ALLOW_READ), 172.*, 10.*, *.viqi.org
#ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, *.viqi.org
#ALLOW_NEGOTIATOR = 172.*, 10.*, 128.111.*
#ALLOW_ADMINISTRATOR = 172.*, 10.*,128.111.*
#ALLOW_CONFIG = 172.*,10.*,128.111.*
#ALLOW_DAEMON = 172.*,10.*,128.111.*
# Use CCB with shared port so outside units can talk to
USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9886
UPDATE_COLLECTOR_WITH_TCP = TRUE
CCB_ADDRESS = $(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = VIQI
BIND_ALL_INTERFACES = TRUE
SEC_DEFAULT_AUTHENTICATION = NEVER
SEC_DEFAULT_NEGOTIATION = NEVER
# https://lists.cs.wisc.edu/archive/htcondor-users/2016-December/msg00046.shtml
DISCARD_SESSION_KEYRING_ON_STARTUP = false
# Slots for multi-cpu machines
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true
START = True
PREEMPT = False
SUSPEND = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
CONTINUE = True
cat launcher.log
000 (305.000.000) 09/29 19:57:46 Job submitted from host: <
10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (305.000.000) 09/29 19:58:02 Job executing on host: <
10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (305.000.000) 09/29 19:58:11 Image size of job updated: 20472
20 - MemoryUsage of job (MB)
20472 - ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:03:12 Image size of job updated: 20824
21 - MemoryUsage of job (MB)
20824 - ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:08:12 Image size of job updated: 22548
23 - MemoryUsage of job (MB)
22548 - ResidentSetSize of job (KB)
...
005 (305.000.000) 09/29 20:31:19 Job terminated.
(1) Normal termination (return value 137)
Usr 0 00:00:00, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:02 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1487908352 - Run Bytes Sent By Job
932 - Run Bytes Received By Job
1487908352 - Total Bytes Sent By Job
932 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 0.00 1 1
Disk (KB) : 1453026 1 61449
Memory (MB) : 23 24000 24064
...
000 (306.000.000) 09/29 20:31:32 Job submitted from host: <
10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (306.000.000) 09/29 20:31:32 Job executing on host: <
10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (306.000.000) 09/29 20:31:41 Image size of job updated: 20140
20 - MemoryUsage of job (MB)
20140 - ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:36:42 Image size of job updated: 20304
20 - MemoryUsage of job (MB)
20304 - ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:51:43 Image size of job updated: 22100
22 - MemoryUsage of job (MB)
22100 - ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 21:03:48 Image size of job updated: 314076
22 - MemoryUsage of job (MB)
22100 - ResidentSetSize of job (KB)
...
005 (306.000.000) 09/29 21:04:05 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:02 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1490578432 - Run Bytes Sent By Job
932 - Run Bytes Received By Job
1490578432 - Total Bytes Sent By Job
932 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 0.16 1 1
Disk (KB) : 1455635 1 61449
Memory (MB) : 22 24000 24064
...
000 (307.000.000) 09/29 21:04:18 Job submitted from host: <
10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (307.000.000) 09/29 21:04:19 Job executing on host: <
10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (307.000.000) 09/29 21:04:28 Image size of job updated: 20072
20 - MemoryUsage of job (MB)
20072 - ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:09:29 Image size of job updated: 20224
20 - MemoryUsage of job (MB)
20224 - ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:14:30 Image size of job updated: 20340
20 - MemoryUsage of job (MB)
20340 - ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:24:30 Image size of job updated: 22084
22 - MemoryUsage of job (MB)
22084 - ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:34:32 Image size of job updated: 22168
22 - MemoryUsage of job (MB)
22084 - ResidentSetSize of job (KB)
...
005 (307.000.000) 09/29 21:37:06 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:02 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1490579456 - Run Bytes Sent By Job
932 - Run Bytes Received By Job
1490579456 - Total Bytes Sent By Job
932 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 0 1 1
Disk (KB) : 1455636 1 61449
Memory (MB) : 22 24000 24064
...
000 (308.000.000) 09/29 21:37:19 Job submitted from host: <
10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (308.000.000) 09/29 21:37:40 Job executing on host: <
10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (308.000.000) 09/29 21:37:48 Image size of job updated: 22048
22 - MemoryUsage of job (MB)
22048 - ResidentSetSize of job (KB)
...
006 (308.000.000) 09/29 21:42:48 Image size of job updated: 22248
22 - MemoryUsage of job (MB)
22248 - ResidentSetSize of job (KB)
...
005 (308.000.000) 09/29 22:10:33 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:02 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:02 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
1490580608 - Run Bytes Sent By Job
932 - Run Bytes Received By Job
1490580608 - Total Bytes Sent By Job
932 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 0 1 1
Disk (KB) : 1455637 1 61449
Memory (MB) : 22 24000 24064
...
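
A side note on the first termination event above (job 305, return value 137): under the common shell convention that a status above 128 means "128 + signal number", 137 would correspond to signal 9 (SIGKILL), which would be consistent with the process being killed rather than exiting on its own. That reading is an assumption about how the job's wrapper reports its exit status, not something the log states. The sketch below decodes a return value that way and pulls the return values out of a log excerpt in the format shown above; the function names are made up for illustration.

```python
import re
import signal


def describe_return_value(code):
    """Interpret a return value using the shell convention 128 + signal number.

    This is a heuristic: a program may also choose to exit with such a
    status on its own, so treat the result as a hint only.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"likely terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


# "005" is the event number that the log above uses for "Job terminated".
TERMINATED = re.compile(r"^005 \((\d+)\.\d+\.\d+\)")
RETURN_VALUE = re.compile(r"return value (\d+)")


def return_values(log_text):
    """Return (cluster id, return value) pairs for each termination event."""
    results = []
    cluster = None
    for line in log_text.splitlines():
        line = line.strip()
        match = TERMINATED.match(line)
        if match:
            cluster = int(match.group(1))
            continue
        match = RETURN_VALUE.search(line)
        if match and cluster is not None:
            results.append((cluster, int(match.group(1))))
            cluster = None
    return results


if __name__ == "__main__":
    excerpt = (
        "005 (305.000.000) 09/29 19:58:02 Job terminated.\n"
        "(1) Normal termination (return value 137)\n"
    )
    for cluster, code in return_values(excerpt):
        print(cluster, code, describe_return_value(code))
```

Run against the full launcher.log, this would list one pair per job, which makes it easy to see which runs ended with a signal-style status and which returned 0.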