HTCondor 8.0.2; the pool is entirely Windows 7 x64. Being a Windows pool, there is no checkpointing, and we do not want eviction or preemption. Therefore in the global config file I have (copied from the manual):

# Disable preemption by machine activity.
PREEMPT = False
# Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
# Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
# Since we are disabling claim preemption, we
# may as well optimize negotiation for this case:
NEGOTIATOR_CONSIDER_PREEMPTION = False
# Without preemption, it is advisable to limit the time during
# which the submit node may keep reusing the same slot for
# more jobs.
CLAIM_WORKLIFE = 3600
UPDATE_INTERVAL = 180
WANT_SUSPEND = TRUE
KILL = FALSE

However, jobs continue to be stopped on one machine and restarted (from scratch, since there is no checkpointing) on the same or another machine. From a job .log file:
000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
...
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
...
006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
	1  -  MemoryUsage of job (MB)
	400  -  ResidentSetSize of job (KB)
...
001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>
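The host hops are easier to see by pulling the "executing" events out of the event log. A minimal sketch, assuming the standard userlog format shown above (job.log stands in for the real .log file, recreated here so the snippet is self-contained):

```shell
# Recreate a stand-in for the job's event log (hypothetical copy).
cat > job.log <<'EOF'
000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>
EOF

# List the timestamp and host of every "executing" (001) event,
# one line per restart, so switches between machines stand out.
grep 'Job executing on host' job.log |
  sed 's/^001 ([^)]*) \(.*\) Job executing on host: <\([^:]*\):.*/\1 \2/'
# prints:
# 08/29 08:06:29 1.2.3.246
# 08/29 08:27:30 1.2.3.102
```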
The job started on host .246, ran 20 minutes, then started over on .102. So finally, my question: how can I examine the details of why HTCondor is doing this machine switching? I've poked around in various log files but don't see anything obvious. Or, what condor_status or condor_q commands would reveal the motive for the switching?
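One sanity check worth doing first: confirm the no-preemption settings are actually in effect on the execute nodes, since a local config file could override the global one. A sketch using condor_config_val (the host name below is just the execute node from the log above):

```shell
# Effective values in the local configuration
condor_config_val PREEMPT PREEMPTION_REQUIREMENTS CLAIM_WORKLIFE

# Effective values as seen by the startd on a remote execute node
condor_config_val -name 1.2.3.246 -startd PREEMPT WANT_SUSPEND KILL
```

If the remote values differ from the global config, the switching may be explained by a stale or overriding local config rather than negotiator policy.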
Thanks,
Ralph Finch
Calif. Dept. of Water Resources