I have found a potential problem which may be
related. It seems the application was not executing
successfully unless my dataset file was transferred
with it.
So I have added the option:
transfer_input_files = dataset.dat
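For context, here is a minimal sketch of how that line might sit in a submit description file. Only the transfer_input_files line is taken from my actual setup; the executable name (my_app), the output, error and log file names, and the two file-transfer settings are assumptions added for illustration and may not match the real file.

```
# Hypothetical submit description file; only transfer_input_files
# is from the real job, everything else is a placeholder.
universe                = vanilla
executable              = my_app
# Assumed settings: enable Condor's file transfer mechanism and
# copy output back when the job exits.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = dataset.dat
output                  = my_app.out
error                   = my_app.err
log                     = my_app.log
queue
```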
I then submitted my job again. This time it's
crashing something in Condor:
-- Failed to fetch ads from: <192.168.1.1:43028> : thebeast.cluster.int
CEDAR:6001:Failed to connect to <192.168.1.1:43028>
I kill the master and daemons.
I now execute the master again.
The scheduler (condor_schedd) is not starting up:
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_master
condor@thebeast:~/jobs/som-oct-5th> ps -fe | grep condor
root      5532  3653  0 15:56 ?        00:00:00 sshd: condor [priv]
condor    5535  5532  0 15:56 ?        00:00:00 sshd: condor@pts/3
condor    5536  5535  0 15:56 pts/3    00:00:01 -bash
condor    6371     1  0 16:29 ?        00:00:00 /home/condor/condor/sbin/condor_master
condor    6372  6371  0 16:29 ?        00:00:00 condor_collector -f
condor    6373  6371  0 16:29 ?        00:00:01 condor_startd -f
condor    6375  6371  0 16:29 ?        00:00:00 condor_negotiator -f
condor    6389  5536  0 16:29 pts/3    00:00:00 ps -fe
condor    6390  5536  0 16:29 pts/3    00:00:00 grep condor
The ScheddLog reports the following:
10/11 16:13:37 ******************************************************
10/11 16:13:37 ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/11 16:13:37 ** /home/condor/condor/sbin/condor_schedd
10/11 16:13:37 ** $CondorVersion: 6.7.10 Aug 3 2005 $
10/11 16:13:37 ** $CondorPlatform: I386-LINUX_RH9 $
10/11 16:13:37 ** PID = 6412
10/11 16:13:37 ******************************************************
10/11 16:13:37 Using config file: /home/condor/condor_config
10/11 16:13:37 Using local config files: /home/condor/condor/hosts/thebeast/condor_config.local
10/11 16:13:37 DaemonCore: Command Socket at <192.168.1.1:43225>
10/11 16:13:37 SEC_DEFAULT_SESSION_DURATION is undefined, using default value of 3600
10/11 16:13:37 SCHEDD_TIMEOUT_MULTIPLIER is undefined, using default value of 0
10/11 16:13:37 Will use UDP to update collector thebeast.cluster.int <192.168.1.1:9618>
10/11 16:13:37 Using name: thebeast.cluster.int
10/11 16:13:37 No Accountant host specified in config file
10/11 16:13:37 SCHEDD_MIN_INTERVAL is undefined, using default value of 5
10/11 16:13:37 JOB_START_COUNT is undefined, using default value of 1
10/11 16:13:37 MAX_JOBS_SUBMITTED is undefined, using default value of 2147483647
10/11 16:13:37 STARTD_CONTACT_TIMEOUT is undefined, using default value of 45
10/11 16:13:37 initLocalStarterDir: /home/condor/condor/hosts/thebeast/spool/local_univ_execute already exists, deleting old contents
10/11 16:13:37 JOB_IS_FINISHED_INTERVAL is undefined, using default value of 0
10/11 16:13:37 Period for SelfDrainingQueue job_is_finished_queue set to 0
10/11 16:13:37 Queue Management Super Users:
10/11 16:13:37     root
10/11 16:13:37     condor
10/11 16:13:37 CronMgr: Constructing 'schedd'
10/11 16:13:37 CronMgr: Setting name to 'schedd'
10/11 16:13:37 CronMgr: Setting parameter base to 'schedd'
10/11 16:13:37 CronMgr: Doing config (initial)
10/11 16:13:37 About to truncate log /home/condor/condor/hosts/thebeast/spool/job_queue.log
10/11 16:13:37 entering FileTransfer::SimpleInit
That last line is written *before* the schedd dies; nothing
is logged after it.
None of the other logs report anything out
of the ordinary.
So I kill the daemons again,
delete the logs etc. in the central
manager's host directory,
and recreate them:
condor@thebeast:~/jobs/som-oct-5th> /home/condor/condor/sbin/condor_init
/home/condor/condor_config already exists.
Creating /home/condor/condor/hosts/thebeast/log
Creating /home/condor/condor/hosts/thebeast/spool
Creating /home/condor/condor/hosts/thebeast/execute
/home/condor/condor/hosts/thebeast/condor_config.local already exists.
Condor has been initialized, but not started.
I then execute the master again, and the
scheduler now starts up.
Why is this happening?