
Re: [Condor-users] SCHEDD crash - "write job_queue.log failed, errno = 2"



We found that the Condor schedd crashed during the system backup, when
the disk was simply too busy. Throttling the backup speed resolved the
problem. While the crash is understandable, I still hope Condor can
tolerate slow disk I/O the way other applications do.
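
For anyone who wants to check whether their own schedd exits line up with
a backup window, here is a minimal Python sketch that pulls the SCHEDD exit
entries out of MasterLog lines and counts them by hour of day. The function
names (schedd_exits, exits_by_hour) and the regular expression are my own
assumptions based on the log lines quoted below, not anything shipped with
Condor, and the log format may differ in other versions.

```python
import re
from collections import Counter

# Assumed MasterLog format, taken from the excerpt quoted in this thread:
#   10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
EXIT_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}) (?P<time>(?P<hour>\d{2}):\d{2}:\d{2}) "
    r"The SCHEDD \(pid (?P<pid>\d+)\) exited with status (?P<status>\d+)"
)


def schedd_exits(lines):
    """Return a list of (date, time, pid, status) for each SCHEDD exit line."""
    exits = []
    for line in lines:
        # search() rather than match(), so a leading "> " quote prefix is ignored
        m = EXIT_RE.search(line)
        if m:
            exits.append(
                (m.group("date"), m.group("time"),
                 int(m.group("pid")), int(m.group("status")))
            )
    return exits


def exits_by_hour(exits):
    """Count exits per hour of day; a single dominant hour hints at a scheduled job."""
    return Counter(int(time[:2]) for _date, time, _pid, _status in exits)


# Sample lines copied from the MasterLog excerpt in this thread.
SAMPLE = [
    "10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored",
    "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4",
    "10/14 13:52:48 Preen pid is 11906",
    "10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4",
]

if __name__ == "__main__":
    found = schedd_exits(SAMPLE)
    for date, time, pid, status in found:
        print(date, time, "pid", pid, "status", status)
    print(dict(exits_by_hour(found)))
```

To run it against a real log, pass open("MasterLog") (with whatever path
your installation uses) to schedd_exits() instead of SAMPLE. If nearly all
exits fall in the same hour, compare that hour with cron jobs or backup
schedules on the machine hosting the spool directory.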

Junjun

On Wednesday 01 November 2006 16:21, Junjun Mao wrote:
> Jobs in the Condor queue were restarted over and over around 2am,
> almost every day. Further investigation revealed that the scheduler on
> the central manager had crashed and been restarted.
>
> Here is what is in MasterLog:
> 10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored
> 10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
> 10/14 02:02:21 Sending obituary for "/home2/condor/sbin/condor_schedd"
> 10/14 02:02:21 restarting /home2/condor/sbin/condor_schedd in 10 seconds
> 10/14 02:02:31 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 5510
> 10/14 13:52:48 Preen pid is 11906
> 10/14 13:52:51 Child 11906 died, but not a daemon -- Ignored
> 10/15 13:52:48 Preen pid is 3619
> 10/15 13:52:52 Child 3619 died, but not a daemon -- Ignored
> 10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4
> 10/16 02:02:29 Sending obituary for "/home2/condor/sbin/condor_schedd"
> 10/16 02:02:30 restarting /home2/condor/sbin/condor_schedd in 10 seconds
> 10/16 02:02:40 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 19703
> 10/16 11:20:15 DaemonCore: Command received via TCP from host <10.10.20.1:41725>
>
> In SchedLog:
> 10/16 02:02:27 (pid:5510) Sent ad to 1 collectors for cbriscoe@xxxxxxxx
> 10/16 02:02:29 (pid:5510) ERROR "write to /home2/condor/hosts/master1/spool/job_queue.log failed, errno = 2" at line 150 in file classad_log.C
> 10/16 02:02:41 (pid:19703) ******************************************************
> 10/16 02:02:41 (pid:19703) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
> 10/16 02:02:41 (pid:19703) ** /home2/condor/sbin/condor_schedd
> 10/16 02:02:41 (pid:19703) ** $CondorVersion: 6.7.18 Mar 22 2006 $
> 10/16 02:02:41 (pid:19703) ** $CondorPlatform: I386-LINUX_RH9 $
> 10/16 02:02:41 (pid:19703) ** PID = 19703
> 10/16 02:02:41 (pid:19703) ******************************************************
>
> I don't understand why the write to job_queue.log fails with errno = 2.
> The file has the right permissions and is not too big. It is on a
> shared file system.
> -rw-------    1 condor   condor     836527 Oct 16 2006 /home2/condor/hosts/master1/spool/job_queue.log
>
> Any ideas?
>
> Junjun

-- 
Dr. Junjun Mao
Research Associate

Steinman Hall, #1M-11
Benjamin Levich Institute for Physico-Chemical Hydrodynamics
City College of CUNY
140th Street & Convent Avenue
New York, NY 10031

(212) 650-6845; (212) 650-6835 (fax)