Jobs in our Condor queue have been restarted over and over, around 2am
almost every day. Further investigation revealed that the scheduler
(condor_schedd) on the central manager had crashed and been restarted.
Here is what is in MasterLog:
10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored
10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
10/14 02:02:21 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/14 02:02:21 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/14 02:02:31 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 5510
10/14 13:52:48 Preen pid is 11906
10/14 13:52:51 Child 11906 died, but not a daemon -- Ignored
10/15 13:52:48 Preen pid is 3619
10/15 13:52:52 Child 3619 died, but not a daemon -- Ignored
10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4
10/16 02:02:29 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/16 02:02:30 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/16 02:02:40 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 19703
10/16 11:20:15 DaemonCore: Command received via TCP from host <10.10.20.1:41725>
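
To check the timing pattern against a longer MasterLog, something like the following Python sketch could pull out the schedd exit lines and count them by hour. It assumes the log lines look exactly like the excerpt above; the names `EXIT_RE`, `schedd_exits` and `exits_by_hour` are made up for this illustration and are not part of Condor.

```python
import re
from collections import Counter

# Matches MasterLog lines of the form seen in the excerpt above, e.g.
# "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4".
# The format is assumed from that excerpt only.
EXIT_RE = re.compile(
    r"^(\d{2}/\d{2}) (\d{2}):(\d{2}):(\d{2}) "
    r"The SCHEDD \(pid (\d+)\) exited with status (\d+)"
)


def schedd_exits(lines):
    """Return a list of (date, hour, pid, status) for each schedd exit line."""
    exits = []
    for line in lines:
        m = EXIT_RE.match(line.strip())
        if m:
            date, hh, _mm, _ss, pid, status = m.groups()
            exits.append((date, int(hh), int(pid), int(status)))
    return exits


def exits_by_hour(lines):
    """Count schedd exits per hour of day, to expose a recurring time slot."""
    return Counter(hour for _date, hour, _pid, _status in schedd_exits(lines))


# Sample input copied from the MasterLog excerpt above.
sample = [
    "10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored",
    "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4",
    "10/14 13:52:48 Preen pid is 11906",
    "10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4",
]

if __name__ == "__main__":
    for entry in schedd_exits(sample):
        print(entry)
    print(dict(exits_by_hour(sample)))
```

If every exit falls in the same hour, that would point at something scheduled (a cron job, backup, or file server maintenance window) rather than at the jobs themselves.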
And in SchedLog:
10/16 02:02:27 (pid:5510) Sent ad to 1 collectors for cbriscoe@xxxxxxxx
10/16 02:02:29 (pid:5510) ERROR "write to /home2/condor/hosts/master1/spool/job_queue.log failed, errno = 2" at line 150 in file classad_log.C
10/16 02:02:41 (pid:19703) ******************************************************
10/16 02:02:41 (pid:19703) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 02:02:41 (pid:19703) ** /home2/condor/sbin/condor_schedd
10/16 02:02:41 (pid:19703) ** $CondorVersion: 6.7.18 Mar 22 2006 $
10/16 02:02:41 (pid:19703) ** $CondorPlatform: I386-LINUX_RH9 $
10/16 02:02:41 (pid:19703) ** PID = 19703
10/16 02:02:41 (pid:19703) ******************************************************
I don't understand why the write to job_queue.log fails with errno = 2
(ENOENT, "No such file or directory"). The file exists, has the right
permissions, and is not too big. It is on a shared file system.
-rw------- 1 condor condor 836527 Oct 16 2006 /home2/condor/hosts/master1/spool/job_queue.log
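
For reference, the errno value can be decoded, and the path probed from the central manager, with a few lines of Python. This is only a sketch: `describe_errno` and `probe` are hypothetical helper names, not anything Condor provides, and a probe that succeeds now says nothing about whether the path was reachable at 2am.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for a numeric errno."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def probe(path):
    """Return None if the path can be stat'ed, else the symbolic errno name."""
    try:
        os.stat(path)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    # errno 2 as reported in SchedLog.
    print(describe_errno(2))
    # Path taken from the SchedLog error line; run this on the central manager.
    print(probe("/home2/condor/hosts/master1/spool/job_queue.log"))
```

Running `probe` from cron shortly before and after 2am would show whether the file, or one of its parent directories on the shared file system, briefly becomes unavailable.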
Any ideas?
Junjun