I have a schedd that continually reports
"failed to fetch ads" when asked for its state with condor_q. I looked in the
ScheddLog for the machine and I'm seeing lots and lots of errors. What could
have happened to put the machine in such an awful state?
- Ian
2/4 14:30:14 Tables are consistent
2/4 14:30:14 condor_write(): Socket closed when trying to write buffer
2/4 14:30:14 Buf::write(): condor_write() failed
2/4 14:30:14 Can't send job eom to mgr
2/4 14:30:14 Negotiating for owner: bchan@xxxxxxxxxx
2/4 14:30:14 Checking consistency running and runnable jobs
2/4 14:30:14 Tables are consistent
2/4 14:30:14 condor_write(): Socket closed when trying to write buffer
2/4 14:30:14 Buf::write(): condor_write() failed
2/4 14:30:14 Can't send job eom to mgr
2/4 14:30:14 Shadow pid 17382 for job 11.28 exited with status 4
2/4 14:30:14 ERROR: Shadow exited with job exception code!
2/4 14:30:14 Started shadow for job 25.73 on "<137.57.176.51:4846>", (shadow pid = 18388)
2/4 14:30:21 Sent ad to central manager for bchan@xxxxxxxxxx
2/4 14:30:21 Sent ad to 1 collectors for bchan@xxxxxxxxxx
2/4 14:31:09 condor_read(): recv() returned -1, errno = 104, assuming failure.
2/4 14:31:09 ERROR: Child pid 16415 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17381 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 15970 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17350 appears hung! Killing it hard.
2/4 14:31:09 ERROR: Child pid 17293 appears hung! Killing it hard.
2/4 14:31:09 condor_write(): Socket closed when trying to write buffer
2/4 14:31:09 Buf::write(): condor_write() failed
2/4 14:31:09 AUTHENTICATE: handshake failed!
2/4 14:31:09 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake
2/4 14:31:09 Shadow pid 17381 successfully killed because it was hung.
2/4 14:31:09 Shadow pid 17381 died with signal 4
2/4 14:31:09 Started shadow for job 25.74 on "<137.57.176.42:1998>", (shadow pid = 18497)
2/4 14:31:09 condor_write(): Socket closed when trying to write buffer
2/4 14:31:09 Buf::write(): condor_write() failed
2/4 14:31:09 AUTHENTICATE: handshake failed!
2/4 14:31:09 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake
2/4 14:31:09 Shadow pid 17379 for job 11.26 exited with status 4
2/4 14:31:09 ERROR: Shadow exited with job exception code!
2/4 14:31:11 Started shadow for job 25.81 on "<137.57.176.70:3975>", (shadow pid = 18499)
2/4 14:31:12 condor_write(): Socket closed when trying to write buffer
2/4 14:31:12 Buf::write(): condor_write() failed
2/4 14:31:12 AUTHENTICATE: handshake failed!
2/4 14:31:12 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake
2/4 14:31:12 Shadow pid 17378 for job 14.1 exited with status 4
2/4 14:31:12 ERROR: Shadow exited with job exception code!
2/4 14:31:12 condor_write(): Socket closed when trying to write buffer
2/4 14:31:12 Buf::write(): condor_write() failed
2/4 14:31:12 AUTHENTICATE: handshake failed!
2/4 14:31:12 SCHEDD: authentication failed: AUTHENTICATE:1002:Failure performing handshake
2/4 14:31:12 Shadow pid 17376 for job 11.25 exited with status 4
2/4 14:31:12 ERROR: Shadow exited with job exception code!
2/4 14:31:14 Started shadow for job 25.15 on "<137.57.176.86:3838>", (shadow pid = 18501)
--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer
Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300