On 3/5/12 3:12 PM, Sarah Williams wrote:
> Hello,
>
> We saw an issue on our condor installation on Friday afternoon that
> killed all jobs in the cluster. Details are below. I'm looking to find
> out what happened, why it killed jobs, and how to keep it from happening
> again.
>
> The first symptom was that some of our monitoring software started to
> hang, because condor_q was hanging on queries against the condor_schedd.
>
> The NegotiatorLog has several messages like:
> 03/02/12 16:23:24 condor_read(): timeout reading 5 bytes from schedd
> group_atlasprod.usatlas1@xxxxxxxxxxxxxxx.
> 03/02/12 16:23:24 IO: Failed to read packet header
> 03/02/12 16:23:24 Failed to get reply from schedd
> 03/02/12 16:23:24 Error: Ignoring submitter for this cycle
> 03/02/12 16:23:24 negotiateWithGroup resources used scheddAds length
>
> Finally, condor_q started failing altogether with the message:
> Error: Collector has no record of schedd/submitter
>
> At that point, I restarted condor on the gatekeeper, which runs
> condor_master and schedd. I've previously restarted condor on the
> gatekeeper, and even rebooted it, without dropping jobs. However, this
> time it didn't work that way. In worker-node logs I see messages like:
>
> 03/02/12 16:14:08 slot16: Failed to connect to schedd
> <128.135.158.146:39156>
> 03/02/12 16:14:11 slot8: State change: claim lease expired
> (condor_schedd gone?)
> 03/02/12 16:14:11 slot8: Changing state and activity: Claimed/Busy ->
> Preempting/Killing
>
> The SchedLog on osg-gk, oddly, shows nothing unusual during this time.
>
> I put a snapshot of the log files up here, in case anyone wants to
> browse them:
> http://www.mwt2.org/~sarah/condor/
>
> On Saturday, I got an email from condor on the manager saying that the
> condor_negotiator was killed because it was unresponsive. The email
> says the last lines of the NegotiatorLog were:
> 03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
> 03/03/12 16:32:19 Phase 1: Obtaining ads from collector ...
> 03/03/12 16:32:19 Getting all public ads ...
> 03/03/12 16:32:33 Sorting 6473 ads ...
> 03/03/12 16:32:33 Getting startd private ads ...
> 03/03/12 17:12:05 Got ads: 6473 public and 5941 private
> 03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
> 03/03/12 17:12:25 Phase 2: Performing accounting ...
>
> The email was sent at 17:22, which matches the time the MasterLog says
> it killed the negotiator process. I don't see any messages on the
> worker nodes saying that anything went into Preempting/Killing, so I
> assume this event did not kill any jobs.
>
> --Sarah
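
One quick way to see where the negotiator stalled on Saturday is to measure the gaps between consecutive timestamps in the quoted NegotiatorLog excerpt. The following is only a minimal sketch in Python: it assumes the "MM/DD/YY HH:MM:SS" prefix seen in the quoted lines, and the function name largest_gap is made up for illustration (it is not part of Condor).

```python
from datetime import datetime

# Lines copied from the quoted NegotiatorLog excerpt above.
LOG_EXCERPT = """\
03/03/12 16:32:19 ---------- Started Negotiation Cycle ----------
03/03/12 16:32:19 Phase 1: Obtaining ads from collector ...
03/03/12 16:32:19 Getting all public ads ...
03/03/12 16:32:33 Sorting 6473 ads ...
03/03/12 16:32:33 Getting startd private ads ...
03/03/12 17:12:05 Got ads: 6473 public and 5941 private
03/03/12 17:12:05 Public ads include 8 submitter, 5941 startd
03/03/12 17:12:25 Phase 2: Performing accounting ...
"""


def largest_gap(log_text):
    """Return (seconds, earlier_line, later_line) for the biggest gap
    between consecutive timestamped lines, or None if fewer than two."""
    entries = []
    for line in log_text.splitlines():
        try:
            # Assumed prefix format: the first 17 characters are the timestamp.
            stamp = datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S")
        except ValueError:
            continue  # skip continuation lines without a timestamp
        entries.append((stamp, line))

    best = None
    for (t0, line0), (t1, line1) in zip(entries, entries[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, line0, line1)
    return best


if __name__ == "__main__":
    seconds, before, after = largest_gap(LOG_EXCERPT)
    print(f"{seconds / 60:.1f} min between:\n  {before}\n  {after}")
```

On the excerpt as quoted, the largest gap is 2372 seconds (about 39.5 minutes), between "Getting startd private ads ..." and "Got ads: ...", so nearly all of the cycle's time went to fetching the startd private ads from the collector.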