[Condor-users] Strange schedd crash (exit status 44)
- Date: Tue, 23 Nov 2004 14:44:58 -0500
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: [Condor-users] Strange schedd crash (exit status 44)
I get a schedd crash from this user's machine every time he queues up 100
or more jobs. What does exit status 44 indicate?
Thanks!
Ian
-----Original Message-----
From: SYSTEM@xxxxxxxxxx [mailto:SYSTEM@xxxxxxxxxx]
Sent: November 23, 2004 2:32 PM
To: SW TOR Batch System Admins
Subject: [Condor] Problem
This is an automated email from the Condor system on machine
"TTC-GQUAN3.altera.priv.altera.com". Do not reply.
"d:\abc\condor/bin/condor_schedd.exe" on
"TTC-GQUAN3.altera.priv.altera.com" exited with status 44.
Condor will automatically restart this process in 10 seconds.
*** Last 100 line(s) of file SchedLog:
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.180:1047>#1100637096#282" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.182:1151>#1099422886#1224" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.182:1151>#1099422886#1223" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.183:4197>#1099203124#1580" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.183:4197>#1099203124#1579" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.185:1407>#1099202749#1981" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.185:1407>#1099202749#1982" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.177:1213>#1100703290#277" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.186:2147>#1099203682#1256" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.177:1213>#1100703290#276" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.186:2147>#1099203682#1257" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.178:3591>#1099202664#1406" ignored
11/23 14:28:58 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1468" ignored
11/23 14:28:59 attempt to add pre-existing match
"<137.57.176.179:2712>#1099202607#1467" ignored
11/23 14:29:31 DaemonCore: Command received via UDP from host
<137.57.142.51:4119>
11/23 14:29:31 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:29:36 Started shadow for job 19.130 on "<137.57.176.179:2712>",
(shadow pid = 472)
11/23 14:29:36 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:40 DaemonCore: Command received via TCP from host
<137.57.176.179:4906>
11/23 14:29:40 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:29:40 Got VACATE_SERVICE from <137.57.176.179:4906>
11/23 14:29:40 Sent RELEASE_CLAIM to startd on <137.57.176.179:2712>
11/23 14:29:40 Match record (<137.57.176.179:2712>, 19, 130) deleted
11/23 14:29:40 DaemonCore: Command received via UDP from host
<137.57.142.51:4133>
11/23 14:29:40 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:29:40 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:29:40 Null parameter --- match not deleted
11/23 14:29:44 Started shadow for job 19.159 on "<137.57.176.179:2712>",
(shadow pid = 2972)
11/23 14:29:44 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:29:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:29:45 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:02 DaemonCore: Command received via UDP from host
<137.57.142.51:4146>
11/23 14:30:02 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:30:05 condor_read(): recv() returned -1, errno = 10054,
assuming failure.
11/23 14:30:05 Response problem from startd.
11/23 14:30:05 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
11/23 14:30:05 Match record (<137.57.176.182:1151>, 19, 129) deleted
11/23 14:30:07 Started shadow for job 19.130 on "<137.57.176.182:1151>",
(shadow pid = 1036)
11/23 14:30:07 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:07 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:08 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:08 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:13 DaemonCore: Command received via TCP from host
<137.57.176.182:4778>
11/23 14:30:13 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:30:13 Got VACATE_SERVICE from <137.57.176.182:4778>
11/23 14:30:13 Sent RELEASE_CLAIM to startd on <137.57.176.182:1151>
11/23 14:30:13 Match record (<137.57.176.182:1151>, 19, 130) deleted
11/23 14:30:13 DaemonCore: Command received via UDP from host
<137.57.142.51:4176>
11/23 14:30:13 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:30:13 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:30:13 Null parameter --- match not deleted
11/23 14:30:17 Started shadow for job 19.133 on "<137.57.176.182:1151>",
(shadow pid = 2300)
11/23 14:30:17 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:17 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:17 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:42 DaemonCore: Command received via UDP from host
<137.57.142.51:4190>
11/23 14:30:42 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:30:45 Started shadow for job 19.130 on "<137.57.176.180:1047>",
(shadow pid = 3624)
11/23 14:30:45 Sent ad to 1 collectors for gquan@xxxxxxxxxx
11/23 14:30:45 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:46 timed out requesting claim from <137.57.176.180:1047>
11/23 14:30:46 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:52 DaemonCore: Command received via TCP from host
<137.57.176.180:3514>
11/23 14:30:52 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
11/23 14:30:52 Got VACATE_SERVICE from <137.57.176.180:3514>
11/23 14:30:52 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:52 Match record (<137.57.176.180:1047>, 19, 130) deleted
11/23 14:30:52 DaemonCore: Command received via UDP from host
<137.57.142.51:4204>
11/23 14:30:52 DaemonCore: received command 60001 (DC_PROCESSEXIT),
calling handler (HandleProcessExitCommand())
11/23 14:30:52 Scheduler::Relinquish - mrec is NULL, can't relinquish
11/23 14:30:52 Null parameter --- match not deleted
11/23 14:30:55 Response problem from startd.
11/23 14:30:55 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>
11/23 14:30:55 Match record (<137.57.176.180:1047>, 19, 131) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 151) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 147) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.183:4197>
11/23 14:30:56 Match record (<137.57.176.183:4197>, 19, 149) deleted
11/23 14:30:56 Response problem from startd.
11/23 14:30:56 Sent RELEASE_CLAIM to startd on <137.57.176.185:1407>
11/23 14:30:56 Match record (<137.57.176.185:1407>, 19, 150) deleted
11/23 14:30:57 Response problem from startd.
11/23 14:30:57 Sent RELEASE_CLAIM to startd on <137.57.176.186:2147>
11/23 14:30:57 Match record (<137.57.176.186:2147>, 19, 155) deleted
11/23 14:30:57 Started shadow for job 19.130 on "<137.57.176.180:1047>",
(shadow pid = 2692)
*** End of file SchedLog
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Questions about this message or Condor in general?
Email address of the local Condor administrator: swttcabca@xxxxxxxxxx
The Official Condor Homepage is http://www.cs.wisc.edu/condor
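
When triaging an excerpt like the SchedLog above, it can help to tally which startd addresses the schedd keeps timing out against. The following is only a rough sketch, not a Condor tool: the function name `count_claim_timeouts` and the regular expression are assumptions based solely on the line format visible in this message ("timed out requesting claim from <ip:port>"). It does not explain exit status 44; it only summarizes the log.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt in this message:
#   11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>
TIMEOUT_RE = re.compile(r"timed out requesting claim from <([\d.]+:\d+)>")


def count_claim_timeouts(log_text):
    """Return a Counter mapping startd address to its number of claim timeouts."""
    return Counter(TIMEOUT_RE.findall(log_text))


if __name__ == "__main__":
    # A few lines copied from the excerpt, used as sample input.
    sample = (
        "11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>\n"
        "11/23 14:29:36 Sent RELEASE_CLAIM to startd on <137.57.176.180:1047>\n"
        "11/23 14:29:36 timed out requesting claim from <137.57.176.180:1047>\n"
        "11/23 14:30:05 Response problem from startd.\n"
    )
    # Print addresses from most to least frequent.
    for address, hits in count_claim_timeouts(sample).most_common():
        print(address, hits)
```

In the excerpt above, every "timed out requesting claim" line names <137.57.176.180:1047>, so a tally of this kind would single out that one startd as worth a closer look.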