Re: [Condor-users] Intermittent Condor startd crashes
- Date: Mon, 31 Aug 2009 13:26:47 -0500
- From: Craig Struble <craig.struble@xxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Intermittent Condor startd crashes
Hi Nick,
Sorry about the delay. I had to try to reproduce the situation, which
is both easy and tricky. In the StartLog, I start to receive messages
like:
08/28 16:36:41 slot3: Called deactivate_claim_forcibly()
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:41 slot3: State change: starter exited
08/28 16:36:41 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64136>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 slot3: State change: starter exited
08/28 16:36:42 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64139>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89439 exited with status 1
You'll notice that the first starter exits with status 0, but then
every subsequent starter exits with status 1 almost immediately, which
indicates that the starter for slot3 can no longer be launched. (The
jobs being submitted are identical and run when executed on the other
machines in the pool.) The other slots eventually do the same thing
and are never able to launch jobs.
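
For anyone who wants to check their own StartLog for the same pattern, here is a minimal Python sketch that finds the point where starters begin exiting with a non-zero status. It only assumes the "Starter pid ... exited with status ..." line format shown in the excerpt above. The function names (starter_exits, first_failure) are my own and hypothetical, not part of Condor, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as "08/28 16:36:42 Starter pid 89438 exited with status 1".
# Assumption: the StartLog uses the "MM/DD HH:MM:SS" prefix seen in the excerpt.
STARTER_EXIT = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Starter pid (?P<pid>\d+) exited with status (?P<status>\d+)"
)


def starter_exits(lines):
    """Return a (timestamp, pid, status) tuple for each starter exit line."""
    exits = []
    for line in lines:
        match = STARTER_EXIT.match(line.strip())
        if match:
            exits.append(
                (match["stamp"], int(match["pid"]), int(match["status"]))
            )
    return exits


def first_failure(exits):
    """Return the first exit with a non-zero status, or None if all were 0."""
    for entry in exits:
        if entry[2] != 0:
            return entry
    return None


# Sample lines taken from the StartLog excerpt above.
sample = """\
08/28 16:36:41 slot3: Called deactivate_claim_forcibly()
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 Starter pid 89439 exited with status 1
""".splitlines()

exits = starter_exits(sample)
print(first_failure(exits))
```

To run it against a real log, the sample list could be replaced with the lines read from the StartLog file. The timestamp of the first failure is then a reasonable place to start looking in the MasterLog.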
My visit to the Condor team on Friday was informative, though, and I
think I found a potential source of the issue. I had to start
condor_master by hand on this one machine, and the condor user is
stored in an LDAP directory. I think that what I'm seeing is related
to a ticket (#294). The other machines, by contrast, had Condor
started at boot time using an OS X StartupItem. I'm going to see if I
can fix the problem by starting Condor using launchd instead. If that
works, I'll post the launchd scripts.
Craig
On Aug 26, 2009, at 11:24 AM, Nick LeRoy wrote:
On Wednesday 26 August 2009, Craig Struble wrote:
Craig,
Well, I had hoped that <8 slots would fix things, but after running
Condor longer, even 4 slots fail on this one OS X machine (while the
other 22 with 2 slots each run fine, running the same operating
system and condor binaries).
I'm not sure my problem is directly related, being on OS X. In the
StarterLog.slot1 on my machine, the end looks like:
08/22 10:22:19 Job 26912.0 set to execute immediately
08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
08/22 10:22:19 IWD: /var/condor/execute/dir_94482
08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
08/22 10:22:20 Create_Process succeeded, pid=94490
08/22 11:14:59 Process exited, pid=94490, status=0
08/22 11:14:59 Got SIGQUIT. Performing fast shutdown.
08/22 11:14:59 ShutdownFast all jobs.
08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0
After that, no jobs will run on that slot and running condor_restart
fails to relaunch condor (all daemons except condor_master are killed
but execing new ones fails for some unknown reason).
Just to be clear... The startd has crashed before the starter gets
the QUIT? And, after that, the master can't even exec daemons? Is that
right? Is there anything interesting in the MasterLog or StartLog?
-Nick
--
<<< The Matrix is everywhere. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences
--
Craig A. Struble, Ph.D. | 369 Cudahy Hall | Marquette University
Associate Professor of Computer Science | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx