[condor-users] shadows keep dying problem
- Date: Thu, 27 Nov 2003 09:41:06 +0000
- From: Henry Knowles <Henry.Knowles@xxxxxxxxxxxxx>
- Subject: [condor-users] shadows keep dying problem
Hi,
This is starting to frustrate me now, so I'm hoping someone else will
be able to help. This problem has appeared from nowhere, and in
addition no one else using the pool appears to be suffering from it!
The symptoms are that my jobs are submitted and then start to run. A
fraction of them complete OK, but the rest seem to lose contact and
after about an hour (usually) they get cleaned up and restarted,
producing the SchedLog entries shown at the end of this message.
In addition, the ShadowLog is full of lines looking like:
11/27 09:34:54 (525.19) (2088): GlobalGroupMember: artsanybody
where the final 'artsanybody' is one of a selection of usernames from
around the university. We think part of the problem may be that the pool
is quite busy, and so UDP packets may be getting lost. Does anyone
have any suggestions? We're running Condor 6.4.7 on WinXP.
Thanks in advance,
Henry
SchedLog:
========
11/27 09:22:22 DaemonCore: Command received via TCP from host <137.222.189.138:1701>
11/27 09:22:22 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
11/27 09:22:22 QMGR Connection closed
11/27 09:22:50 DaemonCore: Command received via TCP from host <137.222.189.138:1702>
11/27 09:22:50 DaemonCore: received command 1111 (QMGMT_CMD), calling handler (handle_q)
11/27 09:22:50 QMGR Connection closed
11/27 09:22:57 ERROR: Child pid 2724 appears hung! Killing it hard.
11/27 09:22:57 DaemonCore: Command received via UDP from host <137.222.189.138:1704>
11/27 09:22:57 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
11/27 09:22:57 Shadow pid 2724 successfully killed because it was hung.
11/27 09:22:57 Shadow pid 2724 for job 525.23 exited with status 4
11/27 09:22:57 ERROR: Shadow exited with job exception code!
11/27 09:22:57 Match for cluster 525 has had 5 shadow exceptions, relinquishing.
11/27 09:22:57 Called send_vacate( <137.222.97.31:1037>, 443 )
11/27 09:22:57 Sent RELEASE_CLAIM to startd on <137.222.97.31:1037>
11/27 09:22:57 Match record (<137.222.97.31:1037>, 525, 23) deleted
11/27 09:22:57 Capability of deleted match: <137.222.97.31:1037>#2026321860
11/27 09:22:57 Entered delete_shadow_rec( 2724 )
11/27 09:22:57 Deleting shadow rec for PID 2724, job (525.23)
11/27 09:22:58 Entered check_zombie( 2724, 0x8a0ca4, st=2 )
11/27 09:22:58 Marked job 525.23 as IDLE
11/27 09:22:58 Exited check_zombie( 2724, 0x8a0ca4 )
11/27 09:22:58 Shadow does not have a match record, so did not remove it from the match
11/27 09:22:58
11/27 09:22:58 ..................
11/27 09:22:58 .. Shadow Recs (10/10)
11/27 09:22:58 .. 2224, 525.25, F, <137.222.97.47:3224>, cur_hosts=1, status=2
11/27 09:22:58 .. 2516, 525.24, F, <137.222.97.36:1037>, cur_hosts=1, status=2
11/27 09:22:58 .. 1284, 525.15, F, <137.222.97.71:3835>, cur_hosts=1, status=2
11/27 09:22:58 .. 3656, 525.6, F, <137.222.97.84:1085>, cur_hosts=1, status=2
11/27 09:22:58 .. 1732, 525.11, F, <137.222.97.53:3491>, cur_hosts=1, status=2
11/27 09:22:58 .. 3204, 525.14, F, <137.222.97.22:1036>, cur_hosts=1, status=2
11/27 09:22:58 .. 3304, 527.7, F, <137.222.97.21:1032>, cur_hosts=1, status=2
11/27 09:22:58 .. 2480, 525.18, F, <137.222.97.91:1978>, cur_hosts=1, status=2
11/27 09:22:58 .. 2088, 525.19, F, <137.222.97.57:1323>, cur_hosts=1, status=2
11/27 09:22:58 .. 3496, 525.8, F, <137.222.97.101:4174>, cur_hosts=1, status=2
11/27 09:22:58 ..................
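
To see how widespread the hung shadows are, it may help to tally them per job rather than reading the log by eye. The following is only a rough sketch in Python, not a Condor tool: the function names (summarize_schedlog, count_hung_by_job) are made up for illustration, and the two regular expressions assume the log wording is exactly as in the excerpt above ("Child pid N appears hung" and "Shadow pid N for job C.P exited with status S"). Other Condor versions may word these lines differently.

```python
import re
from collections import Counter

# Patterns copied from the wording in the SchedLog excerpt above;
# treat them as assumptions for any other Condor version.
HUNG_RE = re.compile(r"ERROR: Child pid (\d+) appears hung")
EXIT_RE = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def summarize_schedlog(text):
    """Return a list of (job_id, exit_status, was_hung) for each shadow exit."""
    hung_pids = set()
    exits = []
    for line in text.splitlines():
        match = HUNG_RE.search(line)
        if match:
            # Remember the pid so the later exit line can be flagged as hung.
            hung_pids.add(match.group(1))
            continue
        match = EXIT_RE.search(line)
        if match:
            pid, job, status = match.groups()
            exits.append((job, int(status), pid in hung_pids))
    return exits


def count_hung_by_job(exits):
    """Count, per job id, how many shadow exits followed a 'hung' report."""
    return Counter(job for job, _status, hung in exits if hung)


if __name__ == "__main__":
    sample = "\n".join([
        "11/27 09:22:57 ERROR: Child pid 2724 appears hung! Killing it hard.",
        "11/27 09:22:57 Shadow pid 2724 successfully killed because it was hung.",
        "11/27 09:22:57 Shadow pid 2724 for job 525.23 exited with status 4",
    ])
    for job, count in count_hung_by_job(summarize_schedlog(sample)).items():
        print(job, count)
```

Running the same functions over the whole SchedLog, rather than the three sample lines, should show whether the hangs cluster on particular jobs or are spread evenly across the cluster.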
----------------------
Henry Knowles, Electrical & Electronic Engineering