
Re: [Condor-users] Intermittent Condor startd crashes



Hi Nick,

Sorry about the delay. I had to reproduce the situation, which is both easy and tricky. In the StartLog, I start to receive messages like:

08/28 16:36:41 slot3: Called deactivate_claim_forcibly()
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:41 slot3: State change: starter exited
08/28 16:36:41 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64136>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 slot3: State change: starter exited
08/28 16:36:42 slot3: Changing activity: Busy -> Idle
08/28 16:36:42 slot3: Got activate_claim request from shadow (<192.168.10.16:64139>)
08/28 16:36:42 slot3: Remote job ID is 36563.0
08/28 16:36:42 slot3: Got universe "VANILLA" (5) from request classad
08/28 16:36:42 slot3: State change: claim-activation protocol successful
08/28 16:36:42 slot3: Changing activity: Idle -> Busy
08/28 16:36:42 Starter pid 89439 exited with status 1

You'll notice that the first starter exits with status 0, but every starter after that exits with status 1, which indicates that the starter for slot3 can no longer be launched. (The jobs being submitted are identical and run fine when executed on the other machines in the pool.) The other slots eventually do the same thing and are never able to launch jobs again.
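To spot this pattern without reading the log by eye, something like the following could tally starter exit statuses per slot. It is only a rough sketch: the function name is made up for illustration, the regular expressions are based solely on the StartLog lines quoted above, and it assumes each "Starter pid ... exited" line is followed by that slot's "State change: starter exited" line, which may not hold when several slots log at once.

```python
import re
from collections import defaultdict

# Matches lines such as "08/28 16:36:42 Starter pid 89438 exited with status 1".
EXIT_RE = re.compile(r"Starter pid (\d+) exited with status (\d+)")
# Matches lines such as "08/28 16:36:42 slot3: State change: starter exited".
SLOT_RE = re.compile(r"(slot\d+): State change: starter exited")


def starter_exits(lines):
    """Return {slot: [exit statuses, in log order]}.

    Assumption: the slot name is taken from the first
    "State change: starter exited" line after each exit line.
    """
    exits = defaultdict(list)
    pending = None  # exit status still waiting for its slot line
    for line in lines:
        m = EXIT_RE.search(line)
        if m:
            pending = int(m.group(2))
            continue
        m = SLOT_RE.search(line)
        if m and pending is not None:
            exits[m.group(1)].append(pending)
            pending = None
    return dict(exits)


sample = """\
08/28 16:36:41 Starter pid 85829 exited with status 0
08/28 16:36:41 slot3: State change: starter exited
08/28 16:36:42 Starter pid 89438 exited with status 1
08/28 16:36:42 slot3: State change: starter exited
""".splitlines()

print(starter_exits(sample))
```

A slot whose list ends in a run of nonzero statuses is one where the starter has stopped launching.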

My visit to the Condor team on Friday was informative, though, and I think I found a potential source of the issue. I had to start condor_master by hand on this one machine, and the condor user is stored in an LDAP directory. I think that what I'm seeing is related to a ticket (#294). The other machines had Condor started at boot time using an OS X StartupItem. I'm going to see if I can fix the problem by starting Condor using launchd instead. If that works, I'll post the launchd scripts.
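For reference, a launchd job is described by a property list, and one for condor_master might look roughly like what the sketch below generates. This is not the tested script mentioned above: the label and the path to condor_master are hypothetical placeholders, and the -f (run in the foreground) argument is included on the assumption that launchd needs the master not to daemonize itself.

```python
import plistlib

# All values here are illustrative assumptions, not tested settings.
job = {
    "Label": "edu.example.condor",  # hypothetical job label
    "ProgramArguments": [
        "/usr/local/condor/sbin/condor_master",  # hypothetical install path
        "-f",  # keep the master in the foreground so launchd can supervise it
    ],
    "RunAtLoad": True,  # start the job when launchd loads the plist
    "KeepAlive": True,  # ask launchd to restart the job if it exits
}


def render_plist(job):
    """Serialize the job dictionary to launchd's XML plist format."""
    return plistlib.dumps(job).decode("utf-8")


print(render_plist(job))
```

The output would normally be saved as a .plist file under /Library/LaunchDaemons so that it is loaded at boot.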

    Craig

On Aug 26, 2009, at 11:24 AM, Nick LeRoy wrote:

On Wednesday 26 August 2009, Craig Struble wrote:
Craig,

Well, I had hoped that <8 slots would fix things, but after running
Condor longer, even 4 slots fails on this one OS X machine (while the
other 22 with 2 slots each run fine, running the same operating system
and condor binaries).

I'm not sure my problem is directly related, being on OS X. In the
StarterLog.slot1 on my machine, the end looks like:

08/22 10:22:19 Job 26912.0 set to execute immediately
08/22 10:22:19 Starting a VANILLA universe job with ID: 26912.0
08/22 10:22:19 IWD: /var/condor/execute/dir_94482
08/22 10:22:19 Output file: /var/condor/execute/dir_94482/job_cluster-2.stdout
08/22 10:22:20 About to exec /var/condor/execute/dir_94482/condor_exec.exe cluster_wrapper job_cluster-2.data job- 9 16
08/22 10:22:20 Create_Process succeeded, pid=94490
08/22 11:14:59 Process exited, pid=94490, status=0
08/22 11:14:59 Got SIGQUIT.  Performing fast shutdown.
08/22 11:14:59 ShutdownFast all jobs.
08/22 11:14:59 **** condor_starter (condor_STARTER) pid 94482 EXITING WITH STATUS 0

After that, no jobs will run on that slot and running condor_restart
fails to relaunch condor (all daemons except condor_master are killed
but execing new ones fails for some unknown reason).

Just to be clear... The startd has crashed before the starter gets the QUIT? And, after that, the master can't even exec daemons? Is that right? Is there anything interesting in the MasterLog or StartLog?

-Nick

--
          <<< The Matrix is everywhere. >>>
/`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
\    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences

--
Craig A. Struble, Ph.D. | 369 Cudahy Hall  | Marquette University
Associate Professor of Computer Science    | (414)288-3783
Director, Master of Bioinformatics Program | (414)288-5472 (fax)
http://www.mscs.mu.edu/~cstruble | craig.struble@xxxxxxxxxxxxx