[Condor-users] Schedd and Startd crashes
- Date: Sat, 19 May 2007 07:59:42 -0700
- From: "Rick Lan" <Rick.Lan@xxxxxxxxxxxx>
- Subject: [Condor-users] Schedd and Startd crashes
Hello,
Our configuration includes an email address for problem reports. About once a month, a couple of our machines (different ones each time) send consecutive emails about schedd and startd crashes. The email content looks similar from one schedd crash to the next, and the same applies to the startd crashes. Any pointers on how to debug this? All PCs are running Windows XP.
Thanks
Schedd crashes - email chain #1
This is an automated email from the Condor system on machine "nbs60.bbnet.ad". Do not reply.
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file SchedLog:
5/7 00:10:42 (pid:956) -------- Begin starting jobs --------
5/7 00:10:42 (pid:956) -------- Done starting jobs --------
5/7 00:14:15 (pid:956) Getting monitoring info for pid 956
5/7 00:15:42 (pid:956) JobsRunning = 0
5/7 00:15:43 (pid:956) JobsIdle = 0
5/7 00:15:43 (pid:956) JobsHeld = 0
5/7 00:15:43 (pid:956) JobsRemoved = 0
5/7 00:15:43 (pid:956) LocalUniverseJobsRunning = 0
5/7 00:15:43 (pid:956) LocalUniverseJobsIdle = 0
5/7 00:15:43 (pid:956) SchedUniverseJobsRunning = 0
5/7 00:15:43 (pid:956) SchedUniverseJobsIdle = 0
5/7 00:15:43 (pid:956) N_Owners = 0
5/7 00:15:43 (pid:956) MaxJobsRunning = 200
5/7 00:15:43 (pid:956) Trying to update collector <172.26.21.99:9618>
5/7 00:15:43 (pid:956) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/7 00:15:43 (pid:956) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:15:43 (pid:956) ============ Begin clean_shadow_recs =============
5/7 00:15:43 (pid:956) ============ End clean_shadow_recs =============
5/7 00:15:43 (pid:956) -------- Begin starting jobs --------
5/7 00:15:43 (pid:956) -------- Done starting jobs --------
*** End of file SchedLog
Schedd crashes - email chain #2
This is an automated email from the Condor system on machine "nbs60.bbnet.ad". Do not reply.
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 11 seconds.
*** Last 20 line(s) of file SchedLog:
5/7 00:18:45 (pid:4012) JobsHeld = 0
5/7 00:18:45 (pid:4012) JobsRemoved = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) N_Owners = 0
5/7 00:18:45 (pid:4012) MaxJobsRunning = 200
5/7 00:18:45 (pid:4012) Trying to update collector <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) File descriptor limits: max 2000, safe 1600
5/7 00:18:45 (pid:4012) Ignoring file descriptor safety limit (1600), because only 4 sockets are registered (fd is 1776)
5/7 00:18:45 (pid:4012) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:18:45 (pid:4012) ============ Begin clean_shadow_recs =============
5/7 00:18:45 (pid:4012) ============ End clean_shadow_recs =============
5/7 00:18:45 (pid:4012) Getting monitoring info for pid 4012
5/7 00:18:45 (pid:4012) DaemonCore: in SendAliveToParent()
5/7 00:18:45 (pid:4012) DaemonCore: attempting to connect to '<172.26.21.23:1916>'
5/7 00:18:55 (pid:4012) -------- Begin starting jobs --------
5/7 00:18:56 (pid:4012) -------- Done starting jobs --------
*** End of file SchedLog
Schedd crashes - email chain #3
This is an automated email from the Condor system on machine "nbs60.bbnet.ad". Do not reply.
"D:\condor/bin/condor_schedd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 13 seconds.
*** Last 20 line(s) of file SchedLog:
5/7 00:18:45 (pid:4012) JobsHeld = 0
5/7 00:18:45 (pid:4012) JobsRemoved = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) LocalUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsRunning = 0
5/7 00:18:45 (pid:4012) SchedUniverseJobsIdle = 0
5/7 00:18:45 (pid:4012) N_Owners = 0
5/7 00:18:45 (pid:4012) MaxJobsRunning = 200
5/7 00:18:45 (pid:4012) Trying to update collector <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/7 00:18:45 (pid:4012) File descriptor limits: max 2000, safe 1600
5/7 00:18:45 (pid:4012) Ignoring file descriptor safety limit (1600), because only 4 sockets are registered (fd is 1776)
5/7 00:18:45 (pid:4012) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
5/7 00:18:45 (pid:4012) ============ Begin clean_shadow_recs =============
5/7 00:18:45 (pid:4012) ============ End clean_shadow_recs =============
5/7 00:18:45 (pid:4012) Getting monitoring info for pid 4012
5/7 00:18:45 (pid:4012) DaemonCore: in SendAliveToParent()
5/7 00:18:45 (pid:4012) DaemonCore: attempting to connect to '<172.26.21.23:1916>'
5/7 00:18:55 (pid:4012) -------- Begin starting jobs --------
5/7 00:18:56 (pid:4012) -------- Done starting jobs --------
*** End of file SchedLog
Schedd crashes - email chain #4
This is an automated email from the Condor system on machine "nbs60.bbnet.ad". Do not reply.
"D:\condor/bin/condor_startd.exe" on "nbs60.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file StartLog:
5/7 01:41:59 no loadavg samples this minute, maybe thread died???
5/7 01:42:05 loadavg thread died, restarting. (exit code=2)
5/7 01:42:10 no loadavg samples this minute, maybe thread died???
5/7 01:42:16 loadavg thread died, restarting. (exit code=2)
5/7 01:42:21 no loadavg samples this minute, maybe thread died???
5/7 01:42:27 loadavg thread died, restarting. (exit code=2)
5/7 01:42:32 no loadavg samples this minute, maybe thread died???
5/7 01:42:37 loadavg thread died, restarting. (exit code=2)
5/7 01:42:43 no loadavg samples this minute, maybe thread died???
5/7 01:42:48 loadavg thread died, restarting. (exit code=2)
5/7 01:42:54 no loadavg samples this minute, maybe thread died???
5/7 01:42:59 loadavg thread died, restarting. (exit code=2)
5/7 01:43:05 no loadavg samples this minute, maybe thread died???
5/7 01:43:10 loadavg thread died, restarting. (exit code=2)
5/7 01:43:16 no loadavg samples this minute, maybe thread died???
5/7 01:43:21 loadavg thread died, restarting. (exit code=2)
5/7 01:43:27 no loadavg samples this minute, maybe thread died???
5/7 01:43:32 loadavg thread died, restarting. (exit code=2)
5/7 01:43:37 no loadavg samples this minute, maybe thread died???
5/7 01:43:43 loadavg thread died, restarting. (exit code=2)
*** End of file StartLog
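The StartLog excerpt above alternates between the same two messages every 5 to 6 seconds. When comparing excerpts like this across emails, it can help to tally the messages mechanically. The following is only a minimal sketch in Python, assuming each entry starts with an "M/D HH:MM:SS " stamp as in the logs quoted here; the names STAMP and unwrap are made up for illustration and are not part of Condor. It tolerates mail-wrapped text, but it would also split on a timestamp embedded inside a message (for example "Log last touched 5/19 20:02:47" followed by more text).

```python
import re
from collections import Counter

# Matches the "M/D HH:MM:SS " prefix that starts each entry in the logs above.
# The lookbehind requires the stamp to begin at the start of text or after whitespace.
STAMP = re.compile(r"(?<!\S)(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) ")

def unwrap(text):
    """Rejoin mail-wrapped log text into a list of (timestamp, message) pairs."""
    flat = " ".join(text.split())   # collapse every line break into a single space
    parts = STAMP.split(flat)       # ['', stamp, message, stamp, message, ...]
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]

# First four entries of the StartLog excerpt, wrapped the way the mail client left them.
SAMPLE = """5/7
01:41:59 no loadavg samples this minute, maybe thread died???
5/7 01:42:05
loadavg thread died, restarting. (exit code=2)
5/7 01:42:10 no loadavg
samples this minute, maybe thread died???
5/7 01:42:16 loadavg thread died,
restarting. (exit code=2)"""

entries = unwrap(SAMPLE)
counts = Counter(message for _, message in entries)
for message, n in counts.most_common():
    print(n, message)
```

Running the same tally over the full 20-line excerpt would show whether anything other than the two loadavg messages was logged before the startd exited.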
Startd crashes - email chain #1
This is an automated email from the Condor system on machine "nbs50.bbnet.ad". Do not reply.
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" exited with status 44.
Condor will automatically restart this process in 10 seconds.
*** Last 20 line(s) of file StartLog:
5/19 19:32:36 Trying to update collector <172.26.21.99:9618>
5/19 19:32:36 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:32:36 vm2: Sent update to 1 collector(s)
5/19 19:32:42 DaemonCore: Command received via UDP from host <172.26.21.75:1619>
5/19 19:32:42 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
5/19 19:32:42 DaemonCore: Command received via UDP from host <172.26.21.75:1638>
5/19 19:32:42 DaemonCore: received command 441 (ALIVE), calling handler (command_handler)
5/19 19:34:28 DaemonCore: Command received via UDP from host <172.26.21.13:2342>
5/19 19:34:28 DaemonCore: received command 60008 (DC_CHILDALIVE), calling handler (HandleChildAliveCommand)
5/19 19:36:16 Getting monitoring info for pid 1536
5/19 19:37:31 Swap space: 2724296
5/19 19:37:31 Looking up RESERVED_DISK parameter
5/19 19:37:31 Reserving 5120 kbytes for file system
5/19 19:37:31 Disk space: 132614452
5/19 19:37:35 Trying to update collector <172.26.21.99:9618>
5/19 19:37:35 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:37:35 vm1: Sent update to 1 collector(s)
5/19 19:37:36 Trying to update collector <172.26.21.99:9618>
5/19 19:37:36 Attempting to send update via UDP to collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 19:37:36 vm2: Sent update to 1 collector(s)
*** End of file StartLog
Startd crashes - email chain #2
This is an automated email from the Condor system on machine "nbs50.bbnet.ad". Do not reply.
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 11 seconds.
*** Last 20 line(s) of file StartLog:
5/19 20:03:01 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
5/19 20:03:01 ** D:\condor\bin\condor_startd.exe
5/19 20:03:01 ** $CondorVersion: 6.8.4 Feb 1 2007 $
5/19 20:03:01 ** $CondorPlatform: INTEL-WINNT50 $
5/19 20:03:01 ** PID = 4084
5/19 20:03:01 ** Log last touched 5/19 20:02:47
5/19 20:03:01 ******************************************************
5/19 20:03:01 Using config source: D:\Condor\condor_config
5/19 20:03:01 Using local config sources:
5/19 20:03:01 D:\condor/condor_config.local
5/19 20:03:01 DaemonCore: Command Socket at <172.26.21.13:2574>
5/19 20:03:01 Memory: Detected 2038 megs RAM
5/19 20:03:01 my_popen: CreateProcess failed
5/19 20:03:01 Failed to execute D:\condor/bin/condor_starter.pvm.exe, ignoring
5/19 20:03:01 my_popen: CreateProcess failed
5/19 20:03:01 Failed to execute D:\condor/bin/condor_starter.std.exe, ignoring
5/19 20:03:01 Will use UDP to update collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 20:03:01 command_x_event() called.
5/19 20:03:01 Attempting to remove D:\condor\execute\dir_3172 as SuperUser (system)
5/19 20:03:01 my_popen: CreateProcess failed
*** End of file StartLog
*** Last entry in core file core.STARTD.WIN32
========================
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 0048D517 01:0008C517 D:\condor\bin\condor_startd.exe
Registers:
EAX:FFFFFFFF
EBX:00000001
ECX:004E4898
EDX:00C700C0
ESI:00000000
EDI:FFFFFFFF
CS:EIP:001B:0048D517
SS:ESP:0023:0012F224
EBP:0012F240
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010286
Call stack:
Address   Frame     Logical addr   Module
0048D517  0012F240  0001:0008C517  D:\condor\bin\condor_startd.exe
0042C3D4  0012F2B0  0001:0002B3D4  D:\condor\bin\condor_startd.exe
0042D93B  0012F348  0001:0002C93B  D:\condor\bin\condor_startd.exe
0042D91B  0012F390  0001:0002C91B  D:\condor\bin\condor_startd.exe
0042D8A6  0012F3B4  0001:0002C8A6  D:\condor\bin\condor_startd.exe
00417138  0012FDEC  0001:00016138  D:\condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32
Startd crashes - email chain #3
This is an automated email from the Condor system on machine "nbs50.bbnet.ad". Do not reply.
"D:\condor/bin/condor_startd.exe" on "nbs50.bbnet.ad" died due to exception ACCESS_VIOLATION.
Condor will automatically restart this process in 13 seconds.
*** Last 20 line(s) of file StartLog:
5/19 20:03:16 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
5/19 20:03:16 ** D:\condor\bin\condor_startd.exe
5/19 20:03:16 ** $CondorVersion: 6.8.4 Feb 1 2007 $
5/19 20:03:16 ** $CondorPlatform: INTEL-WINNT50 $
5/19 20:03:16 ** PID = 3264
5/19 20:03:16 ** Log last touched 5/19 20:03:01
5/19 20:03:16 ******************************************************
5/19 20:03:16 Using config source: D:\Condor\condor_config
5/19 20:03:16 Using local config sources:
5/19 20:03:16 D:\condor/condor_config.local
5/19 20:03:16 DaemonCore: Command Socket at <172.26.21.13:2581>
5/19 20:03:16 Memory: Detected 2038 megs RAM
5/19 20:03:16 my_popen: CreateProcess failed
5/19 20:03:16 Failed to execute D:\condor/bin/condor_starter.pvm.exe, ignoring
5/19 20:03:16 my_popen: CreateProcess failed
5/19 20:03:16 Failed to execute D:\condor/bin/condor_starter.std.exe, ignoring
5/19 20:03:16 Will use UDP to update collector nbs40.bbnet.ad <172.26.21.99:9618>
5/19 20:03:16 command_x_event() called.
5/19 20:03:16 Attempting to remove D:\condor\execute\dir_3172 as SuperUser (system)
5/19 20:03:16 my_popen: CreateProcess failed
*** End of file StartLog
*** Last entry in core file core.STARTD.WIN32
========================
Exception code: C0000005 ACCESS_VIOLATION
Fault address: 0048D517 01:0008C517 D:\condor\bin\condor_startd.exe
Registers:
EAX:FFFFFFFF
EBX:00000001
ECX:004E4898
EDX:00C700C0
ESI:00000000
EDI:FFFFFFFF
CS:EIP:001B:0048D517
SS:ESP:0023:0012F224
EBP:0012F240
DS:0023 ES:0023 FS:003B GS:0000
Flags:00010286
Call stack:
Address   Frame     Logical addr   Module
0048D517  0012F240  0001:0008C517  D:\condor\bin\condor_startd.exe
0042C3D4  0012F2B0  0001:0002B3D4  D:\condor\bin\condor_startd.exe
0042D93B  0012F348  0001:0002C93B  D:\condor\bin\condor_startd.exe
0042D91B  0012F390  0001:0002C91B  D:\condor\bin\condor_startd.exe
0042D8A6  0012F3B4  0001:0002C8A6  D:\condor\bin\condor_startd.exe
00417138  0012FDEC  0001:00016138  D:\condor\bin\condor_startd.exe
*** End of file core.STARTD.WIN32
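Both core dumps report the same fault address and the same six frames. The "Logical addr" column appears to be a section:offset pair within condor_startd.exe, so subtracting each offset from its absolute address should give one constant if every frame lies in the same section at the same load address. A small sketch in Python that checks this, with the values copied from the call stack above (FRAMES and section_bases are illustrative names only):

```python
# (absolute address, logical offset) pairs copied from the call stack above.
FRAMES = [
    (0x0048D517, 0x0008C517),
    (0x0042C3D4, 0x0002B3D4),
    (0x0042D93B, 0x0002C93B),
    (0x0042D91B, 0x0002C91B),
    (0x0042D8A6, 0x0002C8A6),
    (0x00417138, 0x00016138),
]

def section_bases(frames):
    """Return the set of distinct (address - offset) values across all frames."""
    return {address - offset for address, offset in frames}

bases = section_bases(FRAMES)
# One distinct value means every frame shares the same section base.
print([hex(base) for base in sorted(bases)])
```

The offsets only identify positions inside the executable. Turning them into function names would still need a linker map or debug symbols for this exact 6.8.4 Windows build, which are not part of these emails.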
Best Regards,
Rick