
Re: [Condor-users] sched goess offline, kills jobs

Date: Mon, 05 Mar 2012 15:13:39 -0500
From: Sarah Williams <saewill@xxxxxxxxx>
Subject: Re: [Condor-users] sched goess offline, kills jobs

I should have mentioned, the condor head nodes run 7.6.0-1, and the
worker nodes run 7.6.4-1.

On 3/5/12 3:12 PM, Sarah Williams wrote:
> Hello,
> 
> We saw an issue on our condor installation on Friday afternoon that
> killed all jobs in the cluster.  Details are below. I'm looking to find
> out what happened, why it killed jobs, and how to keep it from happening
> again.
> 
> The first symptom was that some of our monitoring software started to
> hang, because condor_q was hanging on queries against condor_schedd.
> 
> The NegotiatorLog has several messages like:
> 03/02/12 16:23:24 condor_read(): timeout reading 5 bytes from schedd
> group_atlasprod.usatlas1@xxxxxxxxxxxxxxxx
> 03/02/12 16:23:24 IO: Failed to read packet header
> 03/02/12 16:23:24     Failed to get reply from schedd
> 03/02/12 16:23:24   Error: Ignoring submitter for this cycle
> 03/02/12 16:23:24  negotiateWithGroup resources used scheddAds length
> 
> Finally, condor_q started failing altogether with the message:
> Error: Collector has no record of schedd/submitter
> 
> At that point, I restarted condor on the gatekeeper, which runs
> condor_master and schedd.   I've previously restarted condor on the
> gatekeeper, and even rebooted it, without dropping jobs. However, this
> time it didn't work that way. In worker-node logs I see messages like:
> 
> 03/02/12 16:14:08 slot16: Failed to connect to schedd
> <128.135.158.146:39156>
> 03/02/12 16:14:11 slot8: State change: claim lease expired
> (condor_schedd gone?)
> 03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
> 
> The SchedLog on osg-gk, oddly, shows nothing unusual during this time.
> 
> I put a snapshot of the log files up here, in case anyone wants to
> browse them:
> http://www.mwt2.org/~sarah/condor/
> 
> On Saturday, I got an email from condor on the manager saying that the
> condor_negotiator was killed because it was unresponsive.  The email
> says the last lines of the NegotiatorLog were:
> 03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
> 03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...
> 03/03/12 16:32:19   Getting all public ads ...
> 03/03/12 16:32:33   Sorting 6473 ads ...
> 03/03/12 16:32:33   Getting startd private ads ...
> 03/03/12 17:12:05 Got ads: 6473 public and 5941 private
> 03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
> 03/03/12 17:12:25 Phase 2:  Performing accounting ...
> 
> The email was sent at 17:22, which matches the time the MasterLog says
> it killed the negotiator process.  I don't see any messages on the
> worker nodes saying that anything went into Preempting/Killing, so I
> assume this event did not kill any jobs.
> 
> --Sarah
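
For anyone browsing the log snapshot, here is a minimal sketch of one way to
spot long stalls like the 16:32:33 to 17:12:05 pause in the NegotiatorLog
excerpt above. It assumes every log line starts with a "MM/DD/YY HH:MM:SS"
timestamp, as in the quoted excerpts. The function name find_gaps and the
300-second threshold are illustrative choices, not anything Condor defines.

```python
import re
from datetime import datetime

# Matches the "MM/DD/YY HH:MM:SS" prefix seen in the quoted log excerpts.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\s")


def find_gaps(lines, threshold_seconds=300):
    """Return (seconds, line_before, line_after) for each pause longer
    than threshold_seconds between consecutive timestamped lines.

    threshold_seconds is a hypothetical cutoff; tune it to taste.
    """
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip continuation lines that carry no timestamp
        now = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        if prev_time is not None:
            delta = (now - prev_time).total_seconds()
            if delta > threshold_seconds:
                gaps.append((int(delta), prev_line, line.rstrip()))
        prev_time, prev_line = now, line.rstrip()
    return gaps


# The NegotiatorLog lines quoted in the message above.
SAMPLE = [
    "03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------",
    "03/03/12 16:32:19 Phase 1:  Obtaining ads from collector ...",
    "03/03/12 16:32:19   Getting all public ads ...",
    "03/03/12 16:32:33   Sorting 6473 ads ...",
    "03/03/12 16:32:33   Getting startd private ads ...",
    "03/03/12 17:12:05 Got ads: 6473 public and 5941 private",
    "03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd",
    "03/03/12 17:12:25 Phase 2:  Performing accounting ...",
]

for seconds, before, after in find_gaps(SAMPLE):
    print(f"{seconds} s stall between:\n  {before}\n  {after}")
```

To run it against a real file, pass an open file object instead of SAMPLE,
for example find_gaps(open("NegotiatorLog")). The same approach should work
on the worker-node logs, since they use the same timestamp prefix.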
