When you say "global" config file, does that mean the above config settings are set not only on all your execute machines, but also upon your central manager?
I ask because some of the above settings are read by the condor_startd running on your execute nodes, but some (like PREEMPTION_REQUIREMENTS) are read by the condor_negotiator running on the central manager.
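For instance (just a sketch with illustrative values, assuming the goal is to turn preemption off entirely): the negotiator-side knob belongs in the central manager's config, while the startd-side knobs belong on the execute nodes:

# central manager (read by the condor_negotiator)
PREEMPTION_REQUIREMENTS = False

# execute nodes (read by the condor_startd)
PREEMPT = False
RANK = 0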
Also, did you remember to do a condor_reconfig -all after setting the above?
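If you want to check that a knob actually took effect in the right daemon after the reconfig, you can ask the running daemons directly with condor_config_val (hostnames below are placeholders for your own machines):

# what does the negotiator on the central manager currently have?
condor_config_val -negotiator -name central.manager.example PREEMPTION_REQUIREMENTS

# and the startd on one of the execute nodes?
condor_config_val -startd -name exec01.example PREEMPT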
Side note: in HTCondor v8.0 and above, you can disable preemption just by setting one config knob, MaxJobRetirementTime (see http://goo.gl/thLqTh).
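For reference, that knob looks something like this (the value is only an example; the idea is a retirement time so large that a job slated for preemption is always allowed to finish first):

# effectively disable preemption by letting claimed jobs retire gracefully
MAXJOBRETIREMENTTIME = 3600 * 24 * 30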
However, jobs continue to be stopped on one machine and restarted (from scratch, since there is no checkpointing) on the same or another machine [from a job .log file]:
000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
...
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
...
006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
1 - MemoryUsage of job (MB)
400 - ResidentSetSize of job (KB)
001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>
Are you "editing" the above .log file? It is pretty strange that there is no event saying the job was evicted from .246 before an execute event for .102 appears.
The job started on host .246, ran 20 minutes, then started over on .102.
Pretty suspicious that it goes for almost exactly 20 minutes, as 20 minutes is the default job lease duration (job_lease_duration)... see
http://goo.gl/ce4Lyg
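Side note: the lease can also be set per job in the submit file; if you want to lengthen it past the 20-minute default while you investigate, something like this in the submit description would do it (a diagnostic knob only, not a fix for whatever is blocking the connection):

# submit file line; value in seconds
job_lease_duration = 3600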
The idea of the job lease is that if the execute machine fails to communicate with the submit machine for 20 minutes, the job will get killed. So perhaps the job is being killed off on the execute machine because it cannot communicate with the condor_schedd on the submit machine... maybe there is a firewall preventing your execute machines from connecting to your submit machine? To test my wild guess, here is something to try. Let's say your submit machine is my.submit.com (doing a condor_status -schedd will show all your submit machine names). Can you log in to an execute machine that kicked off the job, such as .246, and run:
condor_ping -type schedd -name my.submit.com read
This command will say "read...succeeded" or "read...failed" depending upon whether it successfully contacted the schedd on your submit machine. If it says "failed", then we know what is happening, and you'll need to fix your firewall/network issue.

I would also want to see the StartLog on a machine like .246 from the time a job starts until it leaves. In that log you will see the slot going to Claimed/Busy; then look at the messages around where it changes away from Claimed/Busy...
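If you are not sure where the StartLog lives on a machine like .246, something along these lines should locate it and pull out the slot state transitions (on my machines those lines contain "Changing state"; paths and exact wording may vary by version):

condor_config_val STARTD_LOG
grep "Changing state" $(condor_config_val STARTD_LOG)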
So finally, my question: how can I examine the details of why HTCondor is doing
this machine switching? I've poked around in various log files but don't
see anything obvious. Or, what condor_status or condor_q commands would
reveal the motive for the switching?
Hope the above helps,