We have a situation where a user submits ~10 jobs, each of which should run for ~5 hours. Most of them are repeatedly evicted after 30 minutes and requeued. Below are the relevant logs from the submit and execute machines for one particular instance. I have tested this myself with different jobs, and the eviction ALWAYS happens a few seconds (20?) under 30 minutes. This line in the START LOG seems to be the relevant one: 3/1 05:57:16 State change: claim timed out (condor_schedd gone?) ALL of the evictions (on different execute machines and for different jobs, but from the same submit machine) occur at 30 minutes.
While a job is running, the schedd periodically sends an alive message to the startd via UDP. If the startd doesn't receive any alive messages for long enough, it will kill the claim (and the job). By default, the schedd sends an alive every 5 minutes and the startd kills the job once it has missed 6 alives. That is 6 x 5 = 30 minutes, which matches what you are seeing.
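To make the arithmetic explicit, here is a small sketch of the timeout calculation. The two values are the defaults described above. As far as I know they correspond to the HTCondor configuration settings ALIVE_INTERVAL and MAX_CLAIM_ALIVES_MISSED, but treat that mapping as an assumption and check the manual for your version. The function name is made up for illustration.

```python
# Sketch: how long a startd waits without alives before killing a claim.
# Values are the defaults described above; the function name is hypothetical.

ALIVE_INTERVAL_S = 5 * 60   # schedd sends an alive every 5 minutes
MAX_ALIVES_MISSED = 6       # startd gives up after this many missed alives


def claim_timeout_seconds(alive_interval_s, max_alives_missed):
    """Return the time without any alive after which the claim is killed."""
    return alive_interval_s * max_alives_missed


timeout_s = claim_timeout_seconds(ALIVE_INTERVAL_S, MAX_ALIVES_MISSED)
print(f"claim times out after {timeout_s} s ({timeout_s / 60:.0f} minutes)")
```

If the evictions tracked some other interval, this calculation would point at a different cause; here it lines up with the 30 minutes reported.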
So it appears that UDP packets aren't making it from your submit machine to your execute machines.
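One way to check this independently of Condor is a plain UDP probe between the two machines. The sketch below is only an illustration: it does not speak Condor's protocol, the helper names and payload are invented, and the port is whatever you choose. A firewall may treat different ports differently, so a probe on an arbitrary port succeeding does not prove the startd's own port is reachable.

```python
import socket

# Generic UDP reachability probe. This is NOT the Condor alive message;
# all function names and the payload are hypothetical.


def open_listener(host="0.0.0.0", port=0, timeout_s=5.0):
    """Bind a UDP socket (port=0 lets the OS choose a free port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(timeout_s)
    return sock


def wait_for_probe(sock):
    """Return (payload, sender_address), or None if nothing arrives in time."""
    try:
        return sock.recvfrom(1024)
    except socket.timeout:
        return None


def send_probe(host, port, payload=b"udp-probe"):
    """Send one UDP datagram to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

To use it, call open_listener() with a fixed port on the execute machine and then wait_for_probe() on the returned socket, while calling send_probe() with the execute machine's address and that port from the submit machine. If wait_for_probe() returns None, the datagram did not arrive within the timeout, which would be consistent with UDP being blocked or dropped somewhere along the path.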
+--------------------------------+-----------------------------------+
| Jaime Frey                     | I used to be a heavy gambler.     |
| jfrey@xxxxxxxxxxx              | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+