I think there are multiple
compounding problems here, and the disappearance of lock files is just
one part (this may be due to tmpwatch). Looking in the MasterLog file
on the worker node, I have the following lines:

3/31 15:23:51 Started DaemonCore process "/se/app/shared/condor/sbin/condor_startd", pid and pgroup = 26507
3/31 15:26:34 The STARTD (pid 26507) exited with status 4
3/31 15:26:34 restarting /se/app/shared/condor/sbin/condor_startd in 3600 seconds
3/31 16:00:24 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:41217> for command 454 (DAEMONS_OFF), access level ADMINISTRATOR
3/31 16:00:30 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:34627> for command 455 (DAEMONS_ON), access level ADMINISTRATOR

I have pretty much turned off any access control via the HOSTALLOW_* = * method in condor_config, however I heard some people say there is a new shared secret mechanism. Is this enabled by default in Condor v7?

For reference, 10.0.10.41 is the localhost for the worker node where these log file entries are taken from (i.e. it is denying a local user).

It seems the 3600 second delay for restarting STARTD is due to a back-off algorithm -- in the log I can see earlier restarts had a shorter delay.

Cheers,

Ian

Greg Quinn wrote:

Ian,

I've seen problems like this in the past when there is a process running that periodically deletes things in /tmp. Specifically, the Condor ProcD daemon uses the configured LOCK directory to place named pipes over which to communicate. If any of these pipes are externally deleted from under the ProcD, errors like the ones you are seeing can result.

Greg

On Mon, 2008-03-31 at 15:12 -0400, Ian Stokes-Rees wrote:

I am getting a repeated sequence of errors on my worker nodes where STARTD aborts due to a "fatal exception". I only have two worker nodes, and they are both doing these. An extract of StartLog is below. I am running 7.0.1. STARTD on the cluster head nodes does work and jobs run there without a problem.
Suggestions as to why this is happening (appears to be due to "error opening watchdog pipe", but I can't be certain), and how to resolve it would be greatly appreciated.

Cheers,

Ian

3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
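One way to test the "deleted pipe" theory is to pull the pipe path out of the StartLog error line and check whether it still exists as a named pipe. The following is only a rough sketch: the function names `missing_pipes` and `pipe_status` are made up for illustration, and the regular expression assumes the error line has exactly the form shown in the extract above.

```python
import os
import re
import stat

# Matches the "error opening watchdog pipe <path>: <reason>" line from the
# StartLog extract and captures the pipe path (assumed to contain no spaces).
PIPE_ERROR = re.compile(r"error opening watchdog pipe (\S+): ")


def missing_pipes(log_lines):
    """Return the pipe paths named in watchdog errors, in first-seen order."""
    seen = []
    for line in log_lines:
        match = PIPE_ERROR.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def pipe_status(path):
    """Classify a path as 'fifo', 'missing', or 'not-a-fifo'."""
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        # Either the pipe or one of its parent directories is gone.
        return "missing"
    return "fifo" if stat.S_ISFIFO(mode) else "not-a-fifo"
```

Feeding the lines of StartLog to `missing_pipes` and then calling `pipe_status` on each result shows whether the pipe (or its whole lock directory) has vanished. A "missing" result while the daemons are still running would be consistent with something such as tmpwatch cleaning /tmp.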
--
Ian Stokes-Rees                 W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx    T: +1 617 418-4168
SBGrid, Harvard Medical School  F: +1 617 432-5600
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
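On the back-off question raised at the top of the thread: the growing restart delay, ending at 3600 seconds, looks like a capped exponential back-off. The Condor manual describes MASTER_BACKOFF_CONSTANT, MASTER_BACKOFF_FACTOR and MASTER_BACKOFF_CEILING settings for the master's restart behaviour. The sketch below is only an illustration of that kind of scheme: the function name `restart_delay`, the formula, and the default values are assumptions, not values read from this installation.

```python
def restart_delay(restarts, constant=9, factor=2.0, ceiling=3600):
    """Seconds to wait before the next restart after `restarts` failures.

    Assumed formula: constant + factor ** restarts, capped at `ceiling`.
    """
    return min(ceiling, constant + factor ** restarts)


# Delays for the first 13 consecutive failures under the assumed defaults.
delays = [restart_delay(n) for n in range(1, 14)]
```

Under these assumed values the delay starts at a few seconds and reaches the 3600 second cap after about a dozen consecutive failures, which would match seeing shorter delays earlier in MasterLog.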