[Condor-users] startd stuck in a loop can not shut the daemon down
- Date: Tue, 02 Nov 2004 09:22:17 -0800
- From: John Weez <john@xxxxxxxxxx>
- Subject: [Condor-users] startd stuck in a loop can not shut the daemon down
Hi all,
I have a faulty submit script. The Condor starter log says it cannot find
the executable file, and then, according to the log, it does a fast
shutdown of all jobs. When I check the process status, the process now
shows as <defunct> and I cannot shut it down. When I look at the logs,
this is what I see every ten seconds. It seems the starter has crashed and
the startd cannot tell the starter to shut down, so it keeps retrying
every 10 seconds. I still cannot figure out how to kill the processes;
the Linux kill command has no effect. I always have to reboot when this
happens.
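
A <defunct> entry is a zombie: the process has already exited and is only waiting for its parent to reap it, which is why kill has no effect on it. Below is a minimal sketch for listing such processes and their parent PIDs on Linux, assuming /proc is mounted. The function names parse_stat and find_zombies are made up for illustration and are not part of Condor.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen])
    comm = text[lparen + 1:rparen]
    fields = text[rparen + 1:].split()
    state, ppid = fields[0], int(fields[1])
    return pid, comm, state, ppid


def find_zombies(proc_root="/proc"):
    """Return (pid, comm, ppid) for every process whose state is 'Z'."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the line was unreadable
        if state == "Z":
            zombies.append((pid, comm, ppid))
    return zombies


if __name__ == "__main__":
    for pid, comm, ppid in find_zombies():
        print(f"zombie pid={pid} ({comm}) parent={ppid}")
```

The parent PID is the useful part of the output: a zombie disappears only when its parent calls wait() on it or when the parent itself exits.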
Condor 6.7.2 master: RedHat WS 3, Intel
Condor 6.7.2 execute machine: the same machine as the master
CondorStarter output:
11/2 08:51:20 Submitting machine is "Thezorb.atomfx.com"
11/2 08:51:20 File transfer completed successfully.
11/2 08:51:21 Starting a VANILLA universe job with ID: 167.0
11/2 08:51:21 IWD: /opt/condor-6.7.2/local.Thezorb/execute/dir_7640
11/2 08:51:21 Output file:
/opt/condor-6.7.2/local.Thezorb/execute/dir_7640/_condor_stdout_167.0
11/2 08:51:21 Error file:
/opt/condor-6.7.2/local.Thezorb/execute/dir_7640/_condor_stderr_167.0
11/2 08:51:21 About to exec
/opt/condor-6.7.2/local.Thezorb/execute/dir_7640/condor_exec.exe 1 1
/mnt/fileserver/production/shows/sot/comp/jm001/jm001_010/jm001_010_compLinux_v01.shk
11/2 08:51:21 Create_Process: child failed with errno 2 (No such file or
directory) before exec()
11/2 08:51:21 ERROR
"Create_Process(/opt/condor-6.7.2/local.Thezorb/execute/dir_7640/condor_exec.exe,condor_exec.exe
1 1
/mnt/fileserver/production/shows/sot/comp/jm001/jm001_010/jm001_010_compLinux_v01.shk,
...) failed" at line 403 in file os_proc.C
11/2 08:51:21 ShutdownFast all jobs.
Condor StartLog output (loops over and over):
11/2 09:16:03 Connect failed for 10 seconds; returning FALSE
11/2 09:16:03 ERROR: SECMAN:2003:TCP connection to <192.168.0.2:40472>
failed
11/2 09:16:03 Send_Signal: ERROR Connect to <192.168.0.2:40472>
failed.
11/2 09:16:03 Error sending signal to starter, errno = 25
(Inappropriate ioctl for device)
11/2 09:16:03 State change: Error sending signals to starter
11/2 09:16:03 Can't connect to <192.168.0.2:40472>:0, errno = 111
11/2 09:16:03 Will keep trying for 10 seconds...
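
The errno values in the two logs above can be decoded with Python's standard library. This is a small sketch only; describe_errno is a made-up helper name, and the numeric values are those used on Linux, where 2 is ENOENT, 25 is ENOTTY and 111 is ECONNREFUSED.

```python
import errno
import os


def describe_errno(code):
    """Return 'number NAME: message' for an errno value on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


if __name__ == "__main__":
    # Values taken from the starter and startd log excerpts.
    for code in (2, 25, 111):
        print(describe_errno(code))
```

Read this way, errno 111 (connection refused) would mean that nothing is listening any more on the starter's port at 192.168.0.2:40472.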
Thanks for any ideas, JW