Re: [HTCondor-devel] 8.9.8 master getting into an infinite loop on startup


Date: Tue, 21 Jul 2020 14:25:52 -0500
From: Mátyás Selmeci <matyas@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] 8.9.8 master getting into an infinite loop on startup
What's the highest value?  It's not unusable at 8M, it just starts getting slower and slower the higher
you go.  16M is the boundary of my patience but I'd love to see Condor detect that ;)

-Mat
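
For illustration, here is a minimal Python sketch of the two strategies discussed in this thread: trying close() on every descriptor number up to the RLIMIT_NOFILE soft limit, versus listing /proc/self/fd and closing only what is actually open. This is not HTCondor's code; the function names (open_fds_via_proc, close_extra_fds, close_attempts) are hypothetical, and it assumes a Linux system with /proc mounted.

```python
import os
import resource


def open_fds_via_proc(proc_dir="/proc/self/fd"):
    """Return the sorted descriptors that are open right now (Linux only)."""
    fds = []
    for name in os.listdir(proc_dir):
        if not name.isdigit():
            continue
        fd = int(name)
        try:
            os.fstat(fd)
        except OSError:
            # Skip the transient descriptor that listdir() itself used
            # to read the directory; it is already closed again.
            continue
        fds.append(fd)
    return sorted(fds)


def close_extra_fds(keep=(0, 1, 2)):
    """Close every open descriptor not in `keep`; return the ones closed."""
    closed = []
    for fd in open_fds_via_proc():
        if fd in keep:
            continue
        try:
            os.close(fd)
            closed.append(fd)
        except OSError:
            pass  # closed by someone else in the meantime
    return closed


def close_attempts(soft_limit, open_fds, keep=(0, 1, 2)):
    """Count the close() calls each strategy would issue.

    Returns (brute_force, targeted): brute force tries every number
    below soft_limit, targeted only touches descriptors that are open.
    """
    kept_below_limit = len([fd for fd in keep if fd < soft_limit])
    brute_force = max(soft_limit - kept_below_limit, 0)
    targeted = len([fd for fd in open_fds if fd not in keep])
    return brute_force, targeted


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    brute, targeted = close_attempts(soft, open_fds_via_proc())
    print(f"soft limit {soft}: brute force = {brute} close() calls, "
          f"via /proc = {targeted}")
```

With the soft limit at the fs.nr_open value quoted below (1073741816), the brute-force count is over a billion close() calls, nearly all of which would return EBADF, while the /proc listing touches only the handful of descriptors that exist.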

On 7/21/20 2:22 PM, Bockelman, Brian wrote:
> Hi,
> 
> Wasn't Jaime looking at this in terms of using /proc/self/fd to determine the highest value?
> 
> Brian
> 
> Sent from my iPhone
> 
>> On Jul 21, 2020, at 2:19 PM, Mátyás Selmeci <matyas@xxxxxxxxxxx> wrote:
>>
>> Be careful with that -- I was seeing slow startup times and general lack of responsiveness at 8 million.
>> 2 million seems fine.
>>
>> -Mat
>>
>>
>>> On 7/21/20 1:12 PM, Tim Theisen wrote:
>>> Thank you for figuring that out. I guess I'll just include a big number
>>> rather than infinity in the systemd .service file.
>>>
>>> Infinity should have worked everywhere.
>>>
>>> ...Tim
>>>
>>>> On 7/21/20 1:05 PM, Mátyás Selmeci via HTCondor-devel wrote:
>>>> If I run it by hand instead of via systemd, everything works fine.
>>>>
>>>> If I edit the .service file and change LimitNOFILE=infinity to LimitNOFILE=524288, everything works fine.
>>>>
>>>> Interestingly, in a root shell, I get "operation not permitted" when trying to do `ulimit -n unlimited`.
>>>> The highest value I can set it to is 1073741816, which is the value of the fs.nr_open sysctl.
>>>>
>>>> -Mat
>>>>
>>>>
>>>> On 7/21/20 12:45 PM, Bockelman, Brian wrote:
>>>>> Ohh - is that in the middle of "close() all FDs possible" code?
>>>>>
>>>>> Does "strace" show a lot of close() followed by EBADF?  What's the process limit on FDs?
>>>>>
>>>>> Brian
>>>>>
>>>>>> On Jul 21, 2020, at 12:41 PM, Mátyás Selmeci <matyas@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Here's the pstack of the child:
>>>>>>
>>>>>>
>>>>>> #0  0x00007fd5c788fa17 in close () from /usr/lib64/libpthread.so.0
>>>>>> #1  0x00007fd5c83eb03e in CreateProcessForkit::exec() () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #2  0x00007fd5c83eb89c in CreateProcessForkit::fork_exec() () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #3  0x00007fd5c83f85cb in DaemonCore::Create_Process(char const*, ArgList const&, priv_state, int, int, int, Env const*, char const*, FamilyInfo*, Stream**, int*, int*, int, __sigset_t*, int, unsigned long*, int*, char const*, MyString*, FilesystemRemap*, long) () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #4  0x00007fd5c82da64b in ProcFamilyProxy::start_procd() () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #5  0x00007fd5c82db283 in ProcFamilyProxy::ProcFamilyProxy(char const*) () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #6  0x00007fd5c82d9e18 in ProcFamilyInterface::create(char const*) () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #7  0x00007fd5c83f9236 in DaemonCore::Create_Process(char const*, ArgList const&, priv_state, int, int, int, Env const*, char const*, FamilyInfo*, Stream**, int*, int*, int, __sigset_t*, int, unsigned long*, int*, char const*, MyString*, FilesystemRemap*, long) () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #8  0x0000000000416315 in daemon::RealStart() ()
>>>>>> #9  0x0000000000416f3a in Daemons::StartDaemonHere(daemon*) ()
>>>>>> #10 0x0000000000416fe3 in Daemons::StartAllDaemons() ()
>>>>>> #11 0x000000000040ebbe in main_init(int, char**) ()
>>>>>> #12 0x00007fd5c8403468 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_8_9_8.so
>>>>>> #13 0x00007fd5c76d9042 in __libc_start_main () from /usr/lib64/libc.so.6
>>>>>> #14 0x000000000040b90e in _start ()
>>>>>>
>>>>>>
>>>>>> On 7/21/20 12:34 PM, Bockelman, Brian wrote:
>>>>>>> Hi Mat,
>>>>>>>
>>>>>>> Could you do a "pstack" of the child condor_master process?
>>>>>>>
>>>>>>> Unfortunately, from your traceback, it looks like the master is simply waiting for the child to do something (either exec or error out) -- not too much info there.
>>>>>>>
>>>>>>> Brian
>>>>>>>
>>>>>>>> On Jul 21, 2020, at 12:05 PM, Mátyás Selmeci via HTCondor-devel <htcondor-devel@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> I've got a problem running 8.9.8 on my Fedora 32 laptop (I'm using an
>>>>>>>> RPM Tim gave me from an NMI build): when I start condor, the master
>>>>>>>> forks and the child master gets into an infinite loop, eating an entire
>>>>>>>> CPU and not responding to SIGTERM.  The last line in the MasterLog is:
>>>>>>>>
>>>>>>>> 07/21/20 11:46:56 (fd:1) (pid:233863) (D_DAEMONCORE) About to exec "/usr/sbin/condor_procd"
>>>>>>>>
>>>>>>>> SELinux is off.  I attached my MasterLog with D_ALL:2 and
>>>>>>>> condor_config_val -summary (that feature's great).  The traceback
>>>>>>>> at the end of MasterLog is me sending SIGABRT to both
>>>>>>>> condor_master processes.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Mat
>>>>>>>> <MasterLog.txt><summary.txt>