Re: [Condor-users] condor_status stuck
- Date: Thu, 27 Mar 2008 10:35:00 -0500
- From: "Pleat, Andrew C." <andrew.pleat@xxxxxxx>
- Subject: Re: [Condor-users] condor_status stuck
- condor_status on central manager is hanging
- condor_status is hanging on other machines as well
- CollectorLog
- lots of apparently normal messages up until 10:30 and then
silence
- only unusual message is at 10:17:
- can't send UPDATE_COLLECTOR_AD to collector ((nul):
Failed to send UDP update command to collector
- Housekeeper: Ready to clean old ads
- <bunch of 'Cleaning' messages>
- then resume normal messages up until 10:30 silence
- condor_status eventually failed (tens of minutes later):
- SECMAN:2003:TCP connection to <... : 9618> failed
- subsequently CollectorLog shows:
- condor_collector (CONDOR_COLLECTOR) STARTING UP
- this must be the master restarting it (as Steve Timm
indicated)
- reissued 'condor_status' - again stuck
- MasterLog
- at 11:25 shows:
- NEGOTIATOR recovered
- COLLECTOR recovered
- SCHEDD recovered
- the 'condor_restart -subsystem schedd' that I issued initially finally
went through (although I now understand it wasn't the likely
culprit)
- reissued 'condor_q' and same result : Failed to fetch ads ... : 9679
- note the port changed
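
For anyone hitting the same symptom, the "silence" in CollectorLog described above can be located mechanically by scanning a daemon log for long gaps between consecutive timestamps. The following is a minimal sketch, not a Condor tool: it assumes each log line begins with a "M/D HH:MM:SS" stamp (adjust the pattern if your logs differ), and the function names, the 10-minute threshold, and the year default are all illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: "M/D HH:MM:SS" (for example "3/27 10:17:42 ...").
# Change this pattern if your log lines start differently.
STAMP = re.compile(r"^(\d{1,2})/(\d{1,2}) (\d{2}):(\d{2}):(\d{2})\b")


def find_gaps(lines, threshold=timedelta(minutes=10), year=2008):
    """Return (previous, current) timestamp pairs more than `threshold` apart.

    The assumed stamp carries no year, so one is supplied by the caller.
    """
    gaps = []
    prev = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # continuation line or unrecognised format
        month, day, hour, minute, second = map(int, match.groups())
        stamp = datetime(year, month, day, hour, minute, second)
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps


def report(path, **kwargs):
    """Print every silent period found in the log file at `path`."""
    with open(path, errors="replace") as handle:
        for start, end in find_gaps(handle, **kwargs):
            print(f"silent from {start:%m/%d %H:%M:%S} to {end:%m/%d %H:%M:%S}")
```

Calling `report()` on the path to a log file lists each silent stretch. Note that it only finds gaps between two logged lines, so a log that went quiet and never resumed needs a separate check of the last timestamp against the current time.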
thanks for the responses
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Todd Tannenbaum
Sent: Thursday, March 27, 2008 11:15 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_status stuck
Hi Andrew -
Based upon your clues below, everything points to the condor_collector
process not responding. What does the CollectorLog on your central
manager machine have to say for itself? Can you run "condor_status"
on your central manager?
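
As a rough complement to those checks, a plain TCP connection attempt with a timeout shows whether anything is listening on the collector's port at all. This is only a sketch under stated assumptions: the function name is hypothetical, the host is whatever your central manager is called, and 9618 is simply the port that appears in the error message quoted in this thread. A successful connect does not prove the collector is healthy, because the operating system can accept connections on behalf of a process that is wedged.

```python
import socket


def probe_tcp(host, port=9618, timeout=5.0):
    """Try to open a TCP connection and return (ok, detail).

    The default port is the one seen in this thread; pass your own if your
    pool uses a different one.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        return False, f"no answer within {timeout} s"
    except OSError as exc:
        # Covers refused connections, unreachable hosts, name lookup errors.
        return False, f"connection failed: {exc}"
```

A refused connection suggests the daemon is not running or not listening, while a timeout points more toward a network or firewall problem or an overloaded host.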
thanks,
Todd
Pleat, Andrew C. wrote:
>
>
> Condor 6.8.5
>
> Occasionally, there's some sort of lock-up occurring in my cluster.
> The symptoms are:
>
> - condor_status hangs indefinitely
> - condor_q hangs for about a minute and returns 'Failed to fetch ads
> from: <... : 9683> : ..'
> - condor_restart -subsystem schedd hangs
> - I tried this based on looking at condor_users mail
> - condor processes still running (although no apparent activity)
>
> Logs:
> - MasterLog shows normal activity
> - NegotiatorLog seems to have stopped reporting
> - normally it writes messages every 5 minutes
> - the last report was "Getting all public ads ..."
> - SchedLog reports 'Called reschedule_negotiator()' as last message
> - a condor_submit_dag had been performed in the same time frame
> - normally, the next message is "Activity on stashed
> negotiator socket"
> - StartLog has nothing special (although file is still being touched)
> - the only other file still being touched is MasterLog
>
> My conclusion would be the negotiator is somehow stuck.
>
> any ideas
>
> thank you
> andy pleat
>
>
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/condor-users/
--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tannenba@xxxxxxxxxxx 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685