Re: [Condor-users] sched goess offline, kills jobs
- Date: Mon, 05 Mar 2012 15:13:39 -0500
- From: Sarah Williams <saewill@xxxxxxxxx>
- Subject: Re: [Condor-users] sched goess offline, kills jobs
I should have mentioned, the condor head nodes run 7.6.0-1, and the
worker nodes run 7.6.4-1.
On 3/5/12 3:12 PM, Sarah Williams wrote:
> Hello,
>
> We saw an issue on our condor installation on Friday afternoon that
> killed all jobs in the cluster. Details are below. I'm looking to find
> out what happened, why it killed jobs, and how to keep it from happening
> again.
>
> The first symptom was that some of our monitoring software started to
> hang, because condor_q was hanging on queries against the
> condor_schedd.
>
> The NegotiatorLog has several messages like:
> 03/02/12 16:23:24 condor_read(): timeout reading 5 bytes from schedd
> group_atlasprod.usatlas1@xxxxxxxxxxxxxxxx
> 03/02/12 16:23:24 IO: Failed to read packet header
> 03/02/12 16:23:24 Failed to get reply from schedd
> 03/02/12 16:23:24 Error: Ignoring submitter for this cycle
> 03/02/12 16:23:24 negotiateWithGroup resources used scheddAds length
>
> Finally, condor_q started failing altogether with the message:
> Error: Collector has no record of schedd/submitter
>
> At that point, I restarted condor on the gatekeeper, which runs
> condor_master and schedd. I've previously restarted condor on the
> gatekeeper, and even rebooted it, without dropping jobs. However, this
> time it didn't work that way. In worker-node logs I see messages like:
>
> 03/02/12 16:14:08 slot16: Failed to connect to schedd
> <128.135.158.146:39156>
> 03/02/12 16:14:11 slot8: State change: claim lease expired
> (condor_schedd gone?)
> 03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
>
> The SchedLog on osg-gk, oddly, shows nothing unusual during this time.
>
> I put a snapshot of the log files up here, in case anyone wants to
> browse them:
> http://www.mwt2.org/~sarah/condor/
>
> On Saturday, I got an email from condor on the manager saying that the
> condor_negotiator was killed because it was unresponsive. The email
> says the last lines of the NegotiatorLog were:
> 03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
> 03/03/12 16:32:19 Phase 1: Obtaining ads from collector ...
> 03/03/12 16:32:19 Getting all public ads ...
> 03/03/12 16:32:33 Sorting 6473 ads ...
> 03/03/12 16:32:33 Getting startd private ads ...
> 03/03/12 17:12:05 Got ads: 6473 public and 5941 private
> 03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
> 03/03/12 17:12:25 Phase 2: Performing accounting ...
>
> The email was sent at 17:22, which matches the time the MasterLog says
> it killed the negotiator process. I don't see any messages on the
> worker nodes saying that anything went into Preempting/Killing, so I
> assume this event did not kill any jobs.
>
> --Sarah
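
The NegotiatorLog excerpt quoted above jumps from 16:32:33 ("Getting startd private ads ...") to 17:12:05 ("Got ads"), a stall of roughly 40 minutes. One way to spot stalls like this across a whole log file is to scan for large jumps between consecutive timestamps. The following is only a minimal sketch: the function name find_gaps and the 5-minute threshold are assumptions made for illustration, not anything provided by Condor, and the timestamp layout is taken from the excerpts in this message.

```python
from datetime import datetime, timedelta

# Timestamp layout as it appears in the quoted excerpts: "MM/DD/YY HH:MM:SS message"
STAMP_FORMAT = "%m/%d/%y %H:%M:%S"
STAMP_LENGTH = 17  # len("03/03/12 16:32:19")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous_line, next_line, gap) for each pause longer than threshold.

    find_gaps and its default threshold are hypothetical, chosen for this sketch.
    """
    gaps = []
    previous = None  # (timestamp, text) of the last line that carried a timestamp
    for line in lines:
        text = line.strip()
        try:
            stamp = datetime.strptime(text[:STAMP_LENGTH], STAMP_FORMAT)
        except ValueError:
            continue  # wrapped continuation line, or no leading timestamp
        if previous is not None and stamp - previous[0] > threshold:
            gaps.append((previous[1], text, stamp - previous[0]))
        previous = (stamp, text)
    return gaps


# The NegotiatorLog lines quoted in the message above.
sample = [
    "03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------",
    "03/03/12 16:32:19 Phase 1: Obtaining ads from collector ...",
    "03/03/12 16:32:19 Getting all public ads ...",
    "03/03/12 16:32:33 Sorting 6473 ads ...",
    "03/03/12 16:32:33 Getting startd private ads ...",
    "03/03/12 17:12:05 Got ads: 6473 public and 5941 private",
    "03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd",
    "03/03/12 17:12:25 Phase 2: Performing accounting ...",
]

gaps = find_gaps(sample)
for before, after, gap in gaps:
    print(f"{gap} between:\n  {before}\n  {after}")
```

To run it against a real file, pass the open file object instead of the sample list, for example find_gaps(open(path)), where path is wherever the NegotiatorLog lives on the manager. Lines without a leading timestamp, such as wrapped continuation lines, are skipped rather than treated as errors.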