Re: [HTCondor-devel] 8.9.8 master getting into an infinite loop on startup


Date: Tue, 21 Jul 2020 14:24:31 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] 8.9.8 master getting into an infinite loop on startup
Also see my code review question on this at

https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=7650

regards,
Todd

p.s. people should ideally not do a code review on their own code :)


On 7/21/2020 2:22 PM, Bockelman, Brian via HTCondor-devel wrote:
Hi,

Wasn't Jaime looking at this in terms of using /proc/self/fd to determine the highest value?
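
For readers following along, here is a minimal sketch of the /proc/self/fd idea, in Python rather than HTCondor's C++. It is not the HTCondor implementation; the helper names open_fds and highest_open_fd are made up for illustration, and it assumes a Linux /proc filesystem.

```python
import os


def open_fds(proc_dir="/proc/self/fd"):
    """Return a sorted list of descriptors currently open in this process.

    Assumes Linux: each entry in /proc/self/fd is named after an open fd.
    """
    fds = []
    for name in os.listdir(proc_dir):
        if not name.isdigit():
            continue
        fd = int(name)
        try:
            # Reading the directory briefly uses a descriptor of its own,
            # which is already closed again by now; fstat() filters it out.
            os.fstat(fd)
        except OSError:
            continue
        fds.append(fd)
    return sorted(fds)


def highest_open_fd():
    """Highest open descriptor, or -1 if none could be listed."""
    return max(open_fds(), default=-1)


if __name__ == "__main__":
    print("open descriptors:", open_fds())
    print("highest:", highest_open_fd())
```

A child that closes only the descriptors listed this way (or loops only up to the highest one) does a handful of close() calls no matter how large the file descriptor limit is.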

Brian

Sent from my iPhone

On Jul 21, 2020, at 2:19 PM, Mátyás Selmeci <matyas@xxxxxxxxxxx> wrote:

Be careful with that -- I was seeing slow startup times and general lack of responsiveness at 8 million.
2 million seems fine.

-Mat


On 7/21/20 1:12 PM, Tim Theisen wrote:
Thank you for figuring that out. I guess I'll just include a big number
rather than infinity in the systemd .service file.

Infinity should have worked everywhere.

...Tim

On 7/21/20 1:05 PM, Mátyás Selmeci via HTCondor-devel wrote:
If I run it by hand instead of via systemd, everything works fine.

If I edit the .service file and change LimitNOFILE=infinity to LimitNOFILE=524288, everything works fine.

Interestingly, in a root shell, I get "operation not permitted" when trying to do `ulimit -n unlimited`.
The highest value I can set it to is 1073741816, which is the value of the fs.nr_open sysctl.
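
A small sketch for checking the same numbers from inside a process. It uses only Python's standard resource module and the /proc/sys/fs/nr_open file mentioned above; the function names are illustrative. On Linux, setrlimit() refuses with EPERM a hard RLIMIT_NOFILE above fs.nr_open, which matches the "operation not permitted" result.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) RLIMIT_NOFILE values for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def read_nr_open(path="/proc/sys/fs/nr_open"):
    """Return the fs.nr_open sysctl as an int, or None if it is unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def describe(limit):
    """Render a limit the way ulimit does, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print("soft:", describe(soft), "hard:", describe(hard))
    print("fs.nr_open:", read_nr_open())
```

Running this under the systemd unit and again from a shell would show whether the service really inherits a limit near fs.nr_open.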

-Mat


On 7/21/20 12:45 PM, Bockelman, Brian wrote:
Ohh - is that in the middle of "close() all FDs possible" code?

Does "strace" show a lot of close() followed by EBADF?  What's the process limit on FDs?
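
To make the pattern concrete, here is a sketch of a naive "close everything up to the limit" loop. It is written in Python and is not taken from the HTCondor source; close_range_naive and its close parameter are hypothetical names, and the injectable close function exists only so the loop can be tried without touching real descriptors.

```python
import errno
import os


def close_range_naive(first_fd, limit, close=os.close):
    """Try to close every descriptor in [first_fd, limit).

    Returns (closed, ebadf): how many closes succeeded and how many failed
    with EBADF because nothing was open at that number.
    """
    closed = 0
    ebadf = 0
    for fd in range(first_fd, limit):
        try:
            close(fd)
            closed += 1
        except OSError as exc:
            if exc.errno != errno.EBADF:
                raise
            ebadf += 1
    return closed, ebadf


if __name__ == "__main__":
    # Simulated descriptor table: only 5 and 7 are "open".
    open_set = {5, 7}

    def fake_close(fd):
        if fd not in open_set:
            raise OSError(errno.EBADF, "Bad file descriptor")
        open_set.remove(fd)

    print(close_range_naive(3, 10, close=fake_close))
```

The number of iterations depends only on the limit, not on how many descriptors are actually open, so a very large limit means a very long run of close() calls that each fail with EBADF.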

Brian

On Jul 21, 2020, at 12:41 PM, Mátyás Selmeci <matyas@xxxxxxxxxxx> wrote:

Here's the pstack of the child:


#0  0x00007fd5c788fa17 in close () from /usr/lib64/libpthread.so.0
#1  0x00007fd5c83eb03e in CreateProcessForkit::exec() () from /usr/lib64/libcondor_utils_8_9_8.so
#2  0x00007fd5c83eb89c in CreateProcessForkit::fork_exec() () from /usr/lib64/libcondor_utils_8_9_8.so
#3  0x00007fd5c83f85cb in DaemonCore::Create_Process(char const*, ArgList const&, priv_state, int, int, int, Env const*, char const*, FamilyInfo*, Stream**, int*, int*, int, __sigset_t*, int, unsigned long*, int*, char const*, MyString*, FilesystemRemap*, long) () from /usr/lib64/libcondor_utils_8_9_8.so
#4  0x00007fd5c82da64b in ProcFamilyProxy::start_procd() () from /usr/lib64/libcondor_utils_8_9_8.so
#5  0x00007fd5c82db283 in ProcFamilyProxy::ProcFamilyProxy(char const*) () from /usr/lib64/libcondor_utils_8_9_8.so
#6  0x00007fd5c82d9e18 in ProcFamilyInterface::create(char const*) () from /usr/lib64/libcondor_utils_8_9_8.so
#7  0x00007fd5c83f9236 in DaemonCore::Create_Process(char const*, ArgList const&, priv_state, int, int, int, Env const*, char const*, FamilyInfo*, Stream**, int*, int*, int, __sigset_t*, int, unsigned long*, int*, char const*, MyString*, FilesystemRemap*, long) () from /usr/lib64/libcondor_utils_8_9_8.so
#8  0x0000000000416315 in daemon::RealStart() ()
#9  0x0000000000416f3a in Daemons::StartDaemonHere(daemon*) ()
#10 0x0000000000416fe3 in Daemons::StartAllDaemons() ()
#11 0x000000000040ebbe in main_init(int, char**) ()
#12 0x00007fd5c8403468 in dc_main(int, char**) () from /usr/lib64/libcondor_utils_8_9_8.so
#13 0x00007fd5c76d9042 in __libc_start_main () from /usr/lib64/libc.so.6
#14 0x000000000040b90e in _start ()


On 7/21/20 12:34 PM, Bockelman, Brian wrote:
Hi Mat,

Could you do a "pstack" of the child condor_master process?

Unfortunately, from your traceback, it looks like the master is simply waiting for the child to do something (either exec or error out) -- not too much info there.

Brian

On Jul 21, 2020, at 12:05 PM, Mátyás Selmeci via HTCondor-devel <htcondor-devel@xxxxxxxxxxx> wrote:

Hey folks,

I've got a problem running 8.9.8 on my Fedora 32 laptop (I'm using an
RPM Tim gave me from an NMI build): when I start condor, the master
forks and the child master gets into an infinite loop, eating an entire
CPU and not responding to SIGTERM.  The last line in the MasterLog is:

07/21/20 11:46:56 (fd:1) (pid:233863) (D_DAEMONCORE) About to exec "/usr/sbin/condor_procd"

SELinux is off.  I attached my MasterLog with D_ALL:2 and
condor_config_val -summary (that feature's great).  The traceback
at the end of MasterLog is me sending SIGABRT to both
condor_master processes.

Any ideas?

Thanks,
-Mat
<MasterLog.txt><summary.txt>
_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel


--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
