On 10/15/2016 9:43 PM, Francisco Pereira wrote:
In /var/log/condor/MatchLog, I see "Matched <ID> <user> <IP for head:53694?addrs=IP for head-53694> preempting node2<IP for node:13698?address=IP for node2-13698> slot1@xxxxxxxxxxxxxxxxxxx". Both of these messages recur every minute or so. On node2, only MASTER and STARTD are running, and neither of the respective logs shows any mention of this job (using tail -f to track at the moment of submission). /etc/condor/condor_config is precisely the same between node1 and node2. The only difference between them is that, despite having the same domain in their FQDN (netA.netB.netC), the actual subnets are different (node2 is in ip2.ipB.ipC, whereas node1 and the head node are in ip1.ipB.ipC). /etc/hosts contains <IP> <name> <FQDN> for all three machines in each one of them.
Hi Francisco,
Skimming your post, it looks like the job is being matched to the slot, but the schedd on the submit machine is unable to claim the machine.
Just a quick thought: maybe this is due to your HTCondor authorization settings. Do you see any permission denied messages in the node2 StartLog (e.g. grep -i "permission" StartLog)? Perhaps you are missing one of the subnets in the config knobs ALLOW_WRITE or HOSTALLOW_WRITE.
If you are using FQDNs (e.g. *.wisc.edu) in your [HOST]ALLOW_WRITE, be aware that the proper way to list entries in /etc/hosts on Linux is "<IP> <FQDN> <name>", not "<IP> <name> <FQDN>". See https://is.gd/yXyiDG for a discussion. Most of the time it doesn't matter if DNS is in use, but maybe it is causing you grief; HTCondor is pretty sensitive to how IPs are mapped back to FQDNs.
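If it helps, the /etc/hosts ordering can be checked with a few lines of Python. This is only a rough sketch and not an HTCondor tool: the function name check_hosts_order, the hostnames, and the 192.0.2.x / 198.51.100.x addresses are made up for illustration. The heuristic simply flags non-loopback lines where the first name after the IP has no dot but a later name does, which is the "<IP> <name> <FQDN>" pattern.

```python
def check_hosts_order(text):
    """Return (line_number, line) pairs that look like '<IP> <name> <FQDN>'.

    Heuristic (an assumption, not an HTCondor rule): a line is flagged when
    the first name after the IP contains no dot while a later name does.
    """
    flagged = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        fields = line.split()
        if len(fields) < 3:
            continue  # blank, comment-only, or a single name: nothing to order
        ip, first, aliases = fields[0], fields[1], fields[2:]
        if ip.startswith("127.") or ip == "::1":
            continue  # skip loopback entries such as 'localhost'
        if "." not in first and any("." in alias for alias in aliases):
            flagged.append((lineno, line))
    return flagged


# Hypothetical hosts entries, using documentation-only address ranges.
SAMPLE = """\
# hypothetical entries
192.0.2.10  head   head.example.org
192.0.2.11  node1.example.org  node1
198.51.100.12  node2  node2.example.org  # different subnet
"""

if __name__ == "__main__":
    for lineno, line in check_hosts_order(SAMPLE):
        print(f"line {lineno}: short name listed before FQDN: {line}")
```

To run it against a real machine, pass the contents of /etc/hosts (for example open("/etc/hosts").read()) instead of SAMPLE. A flagged line is only a hint to look closer, since the check knows nothing about DNS or about your ALLOW_WRITE settings.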
Another thought: perhaps there is an issue preempting a previous job on node2. Do you still have problems running on node2 even when node2 is completely idle?
Hope the above helps,
Todd