We have a pool of 12 Windows machines running Condor 6.6.8 with
one of them as the central manager. They share the same config file
stored on the network.
We're trying to use Condor at work, but I keep running into the same
problem. When we submit, say, 30 jobs, many of them fail. They tend to
fail on remote machines, where, for example, we see the following in the
StartLog files on the remote host:
4/6 13:24:11 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating
4/6 13:24:12 Can't connect to <10.10.30.60:1685>:0, errno = 10061
4/6 13:24:12 Will keep trying for 10 seconds...
4/6 13:24:21 Connect failed for 10 seconds; returning FALSE
4/6 13:24:21 ERROR: SECMAN:2003:TCP connection to <10.10.30.60:1685> failed
4/6 13:23:57 ******************************************************
4/6 13:23:57 ** condor_starter (CONDOR_STARTER) STARTING UP
4/6 13:23:57 ** C:\Condor\bin\condor_starter.exe
4/6 13:23:57 ** $CondorVersion: 6.6.8 Jan 31 2005 $
4/6 13:23:57 ** $CondorPlatform: INTEL-WINNT40 $
4/6 13:23:57 ** PID = 3236
4/6 13:23:57 ******************************************************
4/6 13:23:57 Using config file: //homer/india/condor_config
4/6 13:23:57 Using local config files: C:\Condor/condor_config.local
4/6 13:23:57 DaemonCore: Command Socket at <10.10.30.60:1672>
4/6 13:23:57 Setting resource limits not implemented!
4/6 13:23:58 Starter communicating with condor_shadow <10.10.30.24:4804>
4/6 13:23:58 Submitting machine is "med2.fsca.local"
4/6 13:23:58 DynUser: MultiByteToWideChar() failed error=1113
4/6 13:23:58 ERROR "Unexpected failure in dynuser:update_t" at line 472 in file ..\src\condor_c++_util\dynuser.C
4/6 13:23:58 ShutdownFast all jobs.
Our jobs are submitted via DAGMan.