I have a user who reports that his job is being evicted after
running for several hours. This is puzzling, since I have
PREEMPT and PREEMPTION_REQUIREMENTS set to FALSE across my cluster,
i.e. once a user's job starts, it should get to finish. The user
reports (and I confirm) that this same job has been evicted a couple
of times in a row now. Nevertheless, it restarted on another node
after the eviction and is still running now. It doesn't appear to be
a memory problem. This is Condor 6.7.13.
Here's the log of the job:
[root@fnpcsrv1 condor_log]# more lia_fd_05_11_1_4.log
000 (53837.000.000) 01/23 10:49:58 Job submitted from host: <131.225.167.42:39082>
...
001 (53837.000.000) 01/23 10:50:03 Job executing on host: <131.225.167.170:32772>
...
006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
...
006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
...
004 (53837.000.000) 01/23 17:22:45 Job was evicted.
(0) Job was not checkpointed.
Usr 0 04:52:00, Sys 0 00:07:16 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
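
To check how often this has happened, the eviction events can be pulled out of the user log by their event code. Below is a minimal Python sketch, assuming each event starts with a header line of the form shown above (three-digit code, job ID in parentheses, date, time, message). The function names and the sample string are only illustrative and are not part of Condor.

```python
import re

# Header line of a user-log event, e.g.
# "004 (53837.000.000) 01/23 17:22:45 Job was evicted."
# (format assumed from the excerpt above)
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<cluster>\d+)\.(?P<proc>\d+)\.(?P<sub>\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)


def parse_events(log_text):
    """Return one dict per event header line; detail and '...' lines are skipped."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def evictions(events):
    """Keep only the events with code 004, which the log above labels 'Job was evicted.'"""
    return [e for e in events if e["code"] == "004"]


# A few lines copied from the log above, used here as test input.
JOB_LOG_SAMPLE = """\
006 (53837.000.000) 01/23 10:50:11 Image size of job updated: 6360
...
006 (53837.000.000) 01/23 11:10:11 Image size of job updated: 154944
...
004 (53837.000.000) 01/23 17:22:45 Job was evicted.
(0) Job was not checkpointed.
"""

if __name__ == "__main__":
    for event in evictions(parse_events(JOB_LOG_SAMPLE)):
        print(event["date"], event["time"], event["text"])
```

Run against the full log file, this would list every eviction for the job with its timestamp, which can then be matched against the StartLog on each node that ran it.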
--------------------------------------
From the node in question, here's the relevant section of the StartLog:
1/23 10:49:59 DaemonCore: Command received via UDP from host <131.225.167.42:19177>
1/23 10:49:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
1/23 10:49:59 vm1: match_info called
1/23 10:49:59 vm1: Received match <131.225.167.170:32772>#1137606512#551
1/23 10:49:59 vm1: State change: match notification protocol successful
1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
1/23 10:49:59 DaemonCore: Command received via TCP from host <131.225.167.42:17717>
1/23 10:49:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
1/23 10:49:59 vm1: Request accepted.
1/23 10:49:59 vm1: Remote owner is rubin@xxxxxxxx
1/23 10:49:59 vm1: State change: claiming protocol successful
1/23 10:49:59 vm1: Changing state: Matched -> Claimed
1/23 10:50:03 DaemonCore: Command received via TCP from host <131.225.167.42:17723>
1/23 10:50:03 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
1/23 10:50:03 vm1: Got activate_claim request from shadow(<131.225.167.42:17723>)
1/23 10:50:03 vm1: Remote job ID is 53837.0
1/23 10:50:03 vm1: Got universe "VANILLA" (5) from request classad
1/23 10:50:03 vm1: State change: claim-activation protocol successful
1/23 10:50:03 vm1: Changing activity: Idle -> Busy
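
The excerpt above only covers the claim and activation at 10:49:59 to 10:50:03, so it does not yet show what the startd did around the eviction at 17:22:45. A rough Python sketch for pulling just the state and activity transitions out of a longer StartLog follows, assuming the "Changing state:" and "Changing activity:" line format seen above. The function name and sample string are illustrative only.

```python
import re

# Matches lines such as
# "1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched"
# (format assumed from the StartLog excerpt above)
CHANGE_RE = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<slot>\w+): "
    r"Changing (?P<kind>state|activity): (?P<old>\w+) -> (?P<new>\w+)$"
)


def transitions(startlog_text):
    """Return (time, slot, kind, old, new) for each state or activity change."""
    found = []
    for line in startlog_text.splitlines():
        match = CHANGE_RE.match(line.strip())
        if match:
            found.append(
                (match["time"], match["slot"], match["kind"],
                 match["old"], match["new"])
            )
    return found


# Lines copied from the StartLog excerpt above, used here as test input.
STARTLOG_SAMPLE = """\
1/23 10:49:59 vm1: Changing state: Unclaimed -> Matched
1/23 10:49:59 vm1: State change: claiming protocol successful
1/23 10:49:59 vm1: Changing state: Matched -> Claimed
1/23 10:50:03 vm1: Changing activity: Idle -> Busy
"""

if __name__ == "__main__":
    for time, slot, kind, old, new in transitions(STARTLOG_SAMPLE):
        print(f"{time} {slot} {kind}: {old} -> {new}")
```

Applied to the part of the StartLog around 17:22, the transition out of Claimed/Busy and the "State change:" line just before it should indicate which condition the startd acted on.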