All,

I am trying to run Condor using 2 Windows machines. The Condor version is 6.8.0.

One machine is the Central Manager and I use it to submit jobs; the other one is just a worker.

Jobs submitted for execution on the Central Manager machine (the same machine that I submit the job from) execute perfectly.

When I submit a job that needs to be executed on the other machine, I get the following in the job log:

001 (042.000.000) 07/26 09:58:29 Job executing on host: <192.168.16.37:4195>
...
022 (042.000.000) 07/26 09:58:29 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm2@xxxxxxxxxxxxxxxxxxx <192.168.16.37:4195>
...

And the job never finishes execution.

The job is a Java job (using the java universe).

I stopped all the firewalls on both machines and set:

ADD_WINDOWS_FIREWALL_EXCEPTION = false

Both machines have full READ and WRITE access to all machines in the subnet.

I also see the following in the log on the machine that needs to execute the job:

7/26 10:09:41 Trying to query collector <192.168.16.23:9618>
7/26 10:09:50 condor_read(): recv() returned -1, errno = 10054, assuming failure.
7/26 10:09:50 IO: EOF reading packet header
7/26 10:09:50 Couldn't fetch ads: communication error
7/26 10:09:50 Aborting negotiation cycle

Any suggestions?

Thanks,
Ronen
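
One way to narrow down errors like the recv() failure above (errno 10054 is the Winsock code WSAECONNRESET, a connection reset by the peer) is to check plain TCP reachability between the two machines outside of Condor. The following is a minimal sketch, assuming Python 3 is available on both hosts. probe_tcp is a hypothetical helper, not part of Condor, and it only tests whether a TCP connection can be opened, not whether Condor itself accepts the connection.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection to host:port and return a short status.

    Hypothetical diagnostic helper (not a Condor tool). It reports only
    whether the TCP handshake succeeds; it sends no Condor protocol data.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all, which often points at packet filtering.
        return "timed out"
    except OSError as exc:
        # Any other network error (unreachable network, reset, and so on).
        return "error: %s" % exc
```

For example, calling probe_tcp("192.168.16.23", 9618) on the worker checks the collector address that appears in the log, and calling probe_tcp("192.168.16.37", 4195) on the Central Manager checks the address reported in the job log. The latter port is likely assigned dynamically and may differ after a restart. A result of "connected" in both directions would suggest the problem lies above the TCP level rather than in basic network filtering.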