I submit jobs to my cluster, but none of them can run because they all get disconnected. Here are the job log and my Condor configuration (I am using Rocks to manage my cluster):
[kyle@imagegrid ~]$ cat showpwd.log
000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.001.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.002.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
000 (002.003.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:24:57 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.001.000) 05/14 23:24:57 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.002.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:24:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:24:58 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:25:06 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.000.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:26:58 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
022 (002.001.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot2@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.002.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot3@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
022 (002.003.000) 05/14 23:26:58 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot4@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.003.000) 05/14 23:26:58 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot4@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.001.000) 05/14 23:27:06 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot2@xxxxxxxxxxxxxxx, rescheduling job
...
024 (002.002.000) 05/14 23:27:06 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot3@xxxxxxxxxxxxxxx, rescheduling job
...
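To make the pattern in the log above easier to see, here is a minimal sketch that tallies event codes per job. It is only an illustration: `tally_events` is a hypothetical helper name (not part of Condor), and it assumes each event starts with a three-digit code followed by `(cluster.proc.subproc)`, as in the log I pasted. The sample text is a short excerpt of that log.

```python
import re
from collections import Counter, defaultdict

# Assumed event header layout, taken from the pasted log:
# "<3-digit code> (<cluster>.<proc>.<subproc>) <date> <time> <message>"
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \S+ \S+ .*$")


def tally_events(log_text):
    """Count event codes per job (keyed as 'cluster.proc') in a user log."""
    counts = defaultdict(Counter)
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # continuation line or "..." separator
        code, cluster, proc = match.group(1), int(match.group(2)), int(match.group(3))
        counts[f"{cluster}.{proc}"][code] += 1
    return counts


# Short excerpt of the log above, used only as sample input.
SAMPLE = """\
000 (002.000.000) 05/14 23:24:57 Job submitted from host: <192.168.1.100:40073>
...
022 (002.000.000) 05/14 23:24:57 Job disconnected, attempting to reconnect
Socket between submit and execute hosts closed unexpectedly
Trying to reconnect to slot1@xxxxxxxxxxxxxxx <10.255.255.254:45256>
...
024 (002.000.000) 05/14 23:24:57 Job reconnection failed
Job not found at execution machine
Can not reconnect to slot1@xxxxxxxxxxxxxxx, rescheduling job
...
"""

for job, codes in sorted(tally_events(SAMPLE).items()):
    print(job, dict(codes))
```

To run it on the whole file, pass the contents of `showpwd.log` to `tally_events` instead of `SAMPLE`. In my log every job only ever shows codes 000, 022 and 024.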
I can see that after submission some slots became Claimed, but after a few seconds they became Unclaimed again.
ALLOW_WRITE = $(HOSTALLOW_WRITE)
AMAZON_GAHP = $(SBIN)/amazon_gahp
AMAZON_GAHP_LOG = /tmp/AmazonGahpLog.$(USERNAME)
COLLECTOR_NAME = Collector at imagegrid.otitan.com
COLLECTOR_SOCKET_CACHE_SIZE = 1000
CONDOR_ADMIN = condor@xxxxxxxxxxxxxxxxxxxx
CONDOR_DEVELOPERS = NONE
CONDOR_DEVELOPERS_COLLECTOR = NONE
CONDOR_HOST = imagegrid.otitan.com
CONDOR_IDS = 407.500
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
CONTINUE = True
DAEMON_LIST = MASTER, STARTD
EMAIL_DOMAIN = $(FULL_HOSTNAME)
FILESYSTEM_DOMAIN = otitan.com
HIGHPORT = 50000
HOSTALLOW_WRITE = imagegrid.otitan.com, *.local, *.local
JAVA = /usr/bin/java
KILL = False
LOCAL_DIR = /var/opt/condor
LOCK = /tmp/condor-lock.$(HOSTNAME)
LOWPORT = 40000
MAIL = /bin/mail
NEGOTIATOR_INTERVAL = 120
NETWORK_INTERFACE = 10.255.255.254
PREEMPT = False
RANK = None
RELEASE_DIR = /opt/condor
SOAP_SSL_CA_FILE = /etc/pki/tls/cert.pem
START = True
STARTD_EXPRS = $(STARTD_EXPRS)
SUSPEND = False
UID_DOMAIN = local
UPDATE_COLLECTOR_WITH_TCP = True
WANT_SUSPEND = False
WANT_VACATE = False
# First set JAVA_MAXHEAP_ARGUMENT to null, to disable the default of max RAM
JAVA_MAXHEAP_ARGUMENT =
JAVA_EXTRA_ARGUMENTS = -Xmx1906m