In case anyone wants to know the solution: the symptom is processes dying after exactly 20 minutes. That's the clue that ALIVEs aren't getting through. Removing the entry in /etc/hosts that mapped "f0.<mydom>.local" to 127.0.0.1 on the schedd machine worked immediately and let the ALIVEs go through. (That machine was also the collector/negotiator, so I'm not sure the problem depends on it being the schedd.)
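If you want to check other machines for the same kind of entry, something like the following might help. It is only a rough sketch: it scans hosts-file text and reports any name mapped to a loopback address that isn't one of the usual localhost aliases. The function name, the list of "expected" aliases, and the sample hostname f0.example.local are my own assumptions for illustration, not anything taken from the setup above.

```python
import ipaddress

# Names that are normally expected to resolve to loopback.
# This list is an assumption; adjust it for your distribution.
EXPECTED_LOOPBACK_NAMES = {
    "localhost", "localhost.localdomain",
    "localhost4", "localhost4.localdomain4",
    "localhost6", "localhost6.localdomain6",
    "ip6-localhost", "ip6-loopback",
}


def find_loopback_hostnames(hosts_text):
    """Return (address, name) pairs for non-localhost names on loopback."""
    suspects = []
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            # Not a parseable IP address; skip the line.
            continue
        if not address.is_loopback:
            continue
        for name in fields[1:]:
            if name.lower() not in EXPECTED_LOOPBACK_NAMES:
                suspects.append((fields[0], name))
    return suspects


# Hypothetical sample; on a real machine, pass in the contents of /etc/hosts.
sample_hosts = """\
127.0.0.1   localhost
127.0.0.1   f0.example.local f0   # the kind of entry that caused trouble
192.168.1.5 node1.example.local
"""

for address, name in find_loopback_hostnames(sample_hosts):
    print(f"{name} is mapped to loopback address {address}")
```

To run it against a real machine, read the file and pass its contents in, e.g. find_loopback_hostnames(open("/etc/hosts").read()). Anything it reports is only a candidate to look at; whether a given entry actually interferes with ALIVEs will depend on your configuration.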