Re: [Condor-users] Condor SOAP hanging schedd.
- Date: Thu, 24 Jun 2010 08:39:29 -0400
- From: Matthew Farrellee <matt@xxxxxxxxxx>
- Subject: Re: [Condor-users] Condor SOAP hanging schedd.
Please capture some pstack output from the schedd when it is hung and report back.
(as root)
while [ 1 ]; do date; pstack $(pidof condor_schedd); sleep 3; done | tee schedd.$(pidof condor_schedd).pstack
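
A slightly more defensive variant of the loop above, as a minimal sketch rather than a tested recipe: it resolves the PID once, so every sample and the output file name refer to the same process, and it stops when that process exits. The function name capture_stacks, the STACK_CMD override, and the optional sample limit are illustrative assumptions, not part of Condor or pstack.

```shell
#!/bin/sh
# Sketch: sample stack traces from one process at a fixed interval.
# STACK_CMD is a hypothetical override (defaults to pstack) so the loop can
# be tried out on a machine without pstack installed.
STACK_CMD=${STACK_CMD:-pstack}

# capture_stacks PID [INTERVAL_SECONDS] [MAX_SAMPLES]
# Prints a timestamp followed by a stack dump for each sample. Stops when
# the process is gone or after MAX_SAMPLES samples (0 means no limit).
capture_stacks() {
    pid=$1
    interval=${2:-3}
    max=${3:-0}
    n=0
    # kill -0 sends no signal; it only checks that the PID can be signalled.
    while kill -0 "$pid" 2>/dev/null; do
        date
        "$STACK_CMD" "$pid"
        n=$((n + 1))
        if [ "$max" -gt 0 ] && [ "$n" -ge "$max" ]; then
            break
        fi
        sleep "$interval"
    done
}

# Example use (as root):
#   pid=$(pidof condor_schedd)
#   capture_stacks "$pid" 3 | tee "schedd.$pid.pstack"
```

Because the PID is captured up front, a schedd that is killed and restarted while the loop runs ends the capture instead of silently switching to the new process.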
Best,
matt
On 06/21/2010 02:19 PM, Patrick Armstrong wrote:
> Has anyone else ever seen this problem? Is there any more information I
> can provide?
>
>
>
> On 16-Jun-10, at 11:41 AM, Patrick Armstrong wrote:
>
>> I've been having some trouble with Condor SOAP queries hanging my
>> schedd. I have Condor 7.5.2 installed, with a pool of about 200
>> workers and about 10000 jobs in my queue. Every ten minutes or so,
>> a script of mine queries the schedd through the SOAP interface.
>> Normally this takes about two minutes and looks like this in the log:
>>
>> 06/16/10 10:39:51 Received HTTP POST connection from <127.0.0.1:59318>
>> 06/16/10 10:39:51 Current Socket bufsize=85k
>> 06/16/10 10:39:51 Current Socket bufsize=49k
>> 06/16/10 10:39:51 About to serve HTTP request...
>> 06/16/10 10:39:51 SOAP entered getJobAds(), transaction: 0
>> 06/16/10 10:39:53 SOAP leaving getJobAds() result=0
>> 06/16/10 10:41:20 Completed servicing HTTP request
>>
>>
>> However, I'll occasionally see the schedd get stuck, and not do
>> anything until I send it SIGKILL. The log looks like this:
>>
>>
>> [root@canfarpool ~]# tail /var/log/condor/SchedLog
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:56:20 Received UDP command 60008 (DC_CHILDALIVE) from <142.104.63.28:48906>, access level DAEMON
>> 06/16/10 10:58:11 Received HTTP POST connection from <127.0.0.1:34416>
>> 06/16/10 10:58:11 Current Socket bufsize=85k
>> 06/16/10 10:58:11 Current Socket bufsize=49k
>> 06/16/10 10:58:11 About to serve HTTP request...
>> 06/16/10 10:58:11 SOAP entered getJobAds(), transaction: 0
>> 06/16/10 10:58:14 SOAP leaving getJobAds() result=0
>> [root@canfarpool ~]# date
>> Wed Jun 16 11:39:57 PDT 2010
>>
>> As you can see, it's been stuck for about 40 minutes.
>>
>>
>> Has anyone else run into this?
>>
>> --patrick
>>
>> _______________________________________________
>> Condor-users mailing list
>> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/condor-users/