Dear all,

When upgrading one of our systems running RHEL 8 to HTCondor 9.0.0 (we previously ran 8.9.11 without any problems), we found that the condor_master terminates after receiving a SIGABRT. Based on the logs [1], this seems to be related to token authentication: the abort happens when condor tries to request a token from the host running the collector. The affected machine is a worker node running a STARTD, with access to a token with ADVERTISE_STARTD permissions.

We were able to reproduce this behavior on a test machine (there we could only use CentOS 8, not RHEL 8) and traced it to the backtrace in [2], which points to [3] as the place in HTCondor where the abort is triggered. The backtrace shows the bounds assertion in std::vector<char>::operator[] failing for index 0 inside htcondor::generate_client_id.

Since this does not seem to be related to the specific setup of our hosts, has anyone encountered a similar issue?

Thanks,
Rene

[1]
04/26/21 12:15:00 (pid:1293) (D_SECURITY) Trying token request to remote host cloud-htcondor.gridka.de for user (default).
Caught signal 6: si_code=4294967290, si_pid=1293, si_uid=232883, si_addr=0x50D
Stack dump for process 1293 at timestamp 1619432100 (13 frames)
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(dprintf_dump_stack+0x28)[0x147ceb85baf8]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x6d)[0x147ceba8686d]
/lib64/libpthread.so.0(+0x12dd0)[0x147ce9928dd0]
/lib64/libc.so.6(gsignal+0x10f)[0x147ce958b70f]
/lib64/libc.so.6(abort+0x127)[0x147ce9575b25]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN8htcondor18generate_client_idB5cxx11Ev+0x87)[0x147ceb998157]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(+0x3d10b8)[0x147ceba8d0b8]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(+0x3d18af)[0x147ceba8d8af]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN12TimerManager7TimeoutEPiPd+0x3a3)[0x147cebaa1f13]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_ZN10DaemonCore6DriverEv+0x788)[0x147ceba72ed8]
/hkfs/home/project/hk-project-test-hep/scc-sdm-hep-0001/software/condor/condor-9.0.0-1-x86_64_CentOS8-stripped/usr/sbin/../lib64/libcondor_utils_9_0_0.so(_Z7dc_mainiPPc+0x1890)[0x147ceba8b4d0]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x147ce95776a3]
condor_master(_start+0x2e)[0x558f85cedb4e]

[2]
#0  0x00007ffff557499f in raise () from /usr/lib64/libc.so.6
#1  0x00007ffff555ecf5 in abort () from /usr/lib64/libc.so.6
#2  0x00007ffff7979157 in std::__replacement_assert (__condition=0x7ffff7a965a8 "__builtin_expect(__n < this->size(), true)", __function=<synthetic pointer>, __line=932,
    __file=0x7ffff7a965d8 "/usr/include/c++/8/bits/stl_vector.h") at /usr/include/c++/8/x86_64-redhat-linux/bits/c++config.h:2391
#3  std::vector<char, std::allocator<char> >::operator[] (__n=0, this=<synthetic pointer>) at /usr/include/c++/8/bits/stl_vector.h:932
#4  htcondor::generate_client_id[abi:cxx11]() () at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_utils/token_utils.cpp:102
#5  0x00007ffff7a6e0b8 in (anonymous namespace)::TokenRequest::tryTokenRequest (req=...)
    at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:462
#6  0x00007ffff7a6e8af in (anonymous namespace)::TokenRequest::tryTokenRequests () at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:422
#7  0x00007ffff7a82f13 in TimerManager::Timeout (this=0x55555579f290, pNumFired=pNumFired@entry=0x7fffffffdbf4, pruntime=pruntime@entry=0x7fffffffdbf8)
    at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/timer_manager.cpp:473
#8  0x00007ffff7a53ed8 in DaemonCore::Driver (this=0x5555557a03b0) at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core.cpp:3513
#9  0x00007ffff7a6c4d0 in dc_main (argc=1, argv=<optimized out>) at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_daemon_core.V6/daemon_core_main.cpp:4386
#10 0x00007ffff5560873 in __libc_start_main () from /usr/lib64/libc.so.6
#11 0x0000555555560b4e in _start () at /usr/src/debug/condor-9.0.0-1.el8.x86_64/src/condor_utils/dc_service.h:70

[3] https://github.com/htcondor/htcondor/blob/V9_0_0/src/condor_utils/token_utils.cpp#L102

--
Karlsruher Institut für Technologie (KIT)
Steinbuch Centre for Computing (SCC)
Dr. René Caspart
Hermann-von-Helmholtz-Platz 1
76344 Eggenstein-Leopoldshafen, Germany
Phone: +49 721 608-25631
E-mail: Rene.Caspart@xxxxxxx
Registered office: Kaiserstraße 12, 76131 Karlsruhe
KIT - The Research University in the Helmholtz Association
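
For anyone who wants to see this kind of assertion outside of HTCondor, here is a minimal sketch of the pattern that frames #2 and #3 of the backtrace [2] suggest: operator[] being called with index 0 on an empty std::vector<char>. The std::__replacement_assert frame suggests libstdc++ assertions (_GLIBCXX_ASSERTIONS) were enabled in that build. This is an assumption about the general pattern only; it is not copied from token_utils.cpp, and both function names below are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, NOT HTCondor code. Taking the address of element 0
// goes through operator[], which is undefined behaviour on an empty vector.
// When libstdc++ is compiled with -D_GLIBCXX_ASSERTIONS, the check
// `__n < this->size()` fails for an empty vector and the process aborts.
std::string buffer_to_string_unsafe(const std::vector<char> &buf) {
    return std::string(&buf[0], buf.size());
}

// Hypothetical alternative: an iterator range never dereferences an element,
// so an empty buffer simply yields an empty string.
std::string buffer_to_string_safe(const std::vector<char> &buf) {
    return std::string(buf.begin(), buf.end());
}
```

Compiling a small program that calls buffer_to_string_unsafe with an empty vector using g++ -D_GLIBCXX_ASSERTIONS should end in abort() in a similar way, whereas buffer_to_string_safe returns an empty string for the same input. Whether the vector in generate_client_id is empty for the same kind of reason is something we have not checked.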