I think several problems are compounding here, and the disappearing
lock files are just one of them (that part may well be due to
tmpwatch). The MasterLog file on the worker node contains the
following lines:
3/31 15:23:51 Started DaemonCore process "/se/app/shared/condor/sbin/condor_startd", pid and pgroup = 26507
3/31 15:26:34 The STARTD (pid 26507) exited with status 4
3/31 15:26:34 restarting /se/app/shared/condor/sbin/condor_startd in 3600 seconds
3/31 16:00:24 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:41217> for command 454 (DAEMONS_OFF), access level ADMINISTRATOR
3/31 16:00:30 DaemonCore: PERMISSION DENIED to unknown user from host <10.0.10.41:34627> for command 455 (DAEMONS_ON), access level ADMINISTRATOR
I have pretty much turned off all access control by setting
HOSTALLOW_* = * in condor_config. However, I have heard some people
say there is a new shared secret mechanism. Is this enabled by default
in Condor v7? For reference, 10.0.10.41 is the worker node's own
address, and these log entries are taken from that same node (i.e. it
is denying a local user).
It seems the 3600 second delay for restarting STARTD is due to a
back-off algorithm -- in the log I can see earlier restarts had a
shorter delay.
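
As a rough illustration of that kind of back-off, here is a minimal
Python sketch. It assumes the delay grows as a constant plus a factor
raised to the restart count, capped at a ceiling. The function name
restart_delay and the values 9, 2.0 and 3600 are assumptions made for
this sketch, not values read from this cluster's condor_config; only
the 3600 second ceiling matches what the MasterLog shows.

```python
def restart_delay(n, constant=9, factor=2.0, ceiling=3600):
    """Seconds to wait before the n-th restart (n counts from 1).

    Assumed model: constant + factor**n, never more than ceiling.
    All three defaults are illustrative assumptions.
    """
    return min(ceiling, constant + factor ** n)


if __name__ == "__main__":
    # Print the assumed delay for the first few consecutive restarts.
    for n in range(1, 15):
        print(n, restart_delay(n))
```

Under these assumed values the delay reaches the ceiling after about a
dozen consecutive failures, which would be consistent with earlier
restarts in the log having shorter delays.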
Cheers,
Ian
Greg Quinn wrote:
Ian,
I've seen problems like this in the past when there is a process running
that periodically deletes things in /tmp. Specifically, the Condor ProcD
daemon uses the configured LOCK directory to place named pipes over
which to communicate. If any of these pipes are externally deleted from
under the ProcD, errors like the ones you are seeing can result.
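
One way to confirm this is to check whether the named pipes are still
present in the lock directory while the daemons are running. The
following is a minimal Python sketch of such a check; the helper name
missing_pipes is hypothetical, and the directory and pipe name in the
usage part are simply the ones that appear in the StartLog extract
quoted below, so they will differ on other machines.

```python
import os
import stat


def missing_pipes(lock_dir, names):
    """Return the entries of `names` that do not exist as FIFOs in lock_dir."""
    missing = []
    for name in names:
        path = os.path.join(lock_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # The path does not exist (or cannot be read): treat as missing.
            missing.append(name)
            continue
        if not stat.S_ISFIFO(mode):
            # Something is there, but it is not a named pipe.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Directory and pipe name copied from the StartLog error message.
    lock_dir = "/tmp/condor-lock.mackenzie0.0513363986547155"
    print(missing_pipes(lock_dir, ["procd_pipe.STARTD.watchdog"]))
```

If the pipe is reported missing shortly after a cleanup job such as
tmpwatch has run, that points to external deletion as the cause.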
Greg
On Mon, 2008-03-31 at 15:12 -0400, Ian Stokes-Rees wrote:
I am getting a repeated sequence of errors on my worker nodes where
STARTD aborts due to a "fatal exception". I only have two worker
nodes, and both of them are doing this. An extract of StartLog is
below. I am running 7.0.1. STARTD on the cluster head nodes does
work, and jobs run there without a problem.
Suggestions as to why this is happening (appears to be due to "error
opening watchdog pipe", but I can't be certain), and how to resolve it
would be greatly appreciated.
Cheers,
Ian
3/31 14:18:12 slot3: State change: claiming protocol successful
3/31 14:18:12 slot3: Changing state: Matched -> Claimed
3/31 14:18:14 slot3: Got activate_claim request from shadow (<10.0.10.39:55786>)
3/31 14:18:14 slot3: Remote job ID is 1593.0
3/31 14:18:15 error opening watchdog pipe /tmp/condor-lock.mackenzie0.0513363986547155/procd_pipe.STARTD.watchdog: No such file or directory (2)
3/31 14:18:15 ProcFamilyClient: error initializing LocalClient
3/31 14:18:15 ProcFamilyProxy: error initializing ProcFamilyClient
3/31 14:18:15 ERROR "ProcD has failed" at line 590 in file proc_family_proxy.C
3/31 14:18:15 slot3: Changing state and activity: Claimed/Idle -> Preempting/Killing
3/31 14:18:15 slot3: State change: No preempting claim, returning to owner
3/31 14:18:15 slot3: Changing state and activity: Preempting/Killing -> Owner/Idle
3/31 14:18:15 slot3: State change: IS_OWNER is false
3/31 14:18:15 slot3: Changing state: Owner -> Unclaimed
3/31 14:18:15 startd exiting because of fatal exception.
--
Ian Stokes-Rees W: http://sbgrid.org
ijstokes@xxxxxxxxxxxxxxxxxxx T: +1 617 418-4168
SBGrid, Harvard Medical School F: +1 617 432-5600
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/