Re: [HTCondor-users] HTCondor daemons dying on SL7 worker nodes



Hi Andrew,

Admittedly I don't have any SL7 8.5.5 worker nodes to hand yet, apart from a couple of CentOS 7 test nodes, and they haven't been heavily loaded.

As a test, what about changing the WatchdogSec value in the condor unit file to something really high?

That might give a better picture of the order of events.
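
In the meantime, the timestamps you already have can be lined up. This is only a rough sketch: it assumes the "timestamp" in the stack dump header is Unix epoch seconds and that the syslog timestamps are ISO 8601 with a UTC offset, as in the excerpts quoted below. The function names are my own, not anything from HTCondor.

```python
from datetime import datetime, timedelta, timezone

# Lines copied from the /var/log/messages excerpt quoted below.
SYSLOG_LINES = [
    "2016-07-27T20:20:09.881844+01:00 lcg1879 systemd: condor.service watchdog timeout (limit 5s)!",
    "2016-07-27T20:20:10.068139+01:00 lcg1879 systemd: condor.service: main process exited, code=killed, status=6/ABRT",
]
# Value from the "Stack dump for process ... at timestamp ..." lines
# (assumed here to be Unix epoch seconds).
STACK_DUMP_EPOCH = 1469647209


def syslog_time(line):
    """Return the timezone-aware timestamp at the start of a syslog line."""
    return datetime.fromisoformat(line.split(" ", 1)[0])


def dump_window(epoch):
    """The dump time only has one-second resolution, so return [start, end)."""
    start = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return start, start + timedelta(seconds=1)


def describe(line, epoch):
    """Say whether a syslog event is before, inside or after the dump second."""
    when = syslog_time(line)
    start, end = dump_window(epoch)
    if when < start:
        return "before"
    if when >= end:
        return "after"
    return "same second"


if __name__ == "__main__":
    for line in SYSLOG_LINES:
        # Print the verdict followed by the message part of the syslog line.
        print(describe(line, STACK_DUMP_EPOCH), "-", line.split(" ", 3)[3])
```

With the values from your mail, the watchdog timeout falls in the same second as the stack dumps and the "main process exited" message comes after it. The one-second resolution of the dump timestamp can't order events within that second, which is why raising WatchdogSec still seems worth trying.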

Cheers, Iain

> On Jul 27, 2016, at 21:59, andrew.lahiff@xxxxxxxxxx wrote:
> 
> Hi,
> 
> I'm experiencing problems with HTCondor 8.5.5 on SL7 running only Docker universe jobs. Occasionally all HTCondor daemons on a worker node do a stack dump and die, including the master, startd and all starters.
> 
> An interesting side effect of this is that while HTCondor deletes the job sandboxes, the Docker containers actually continue running, but HTCondor seems unaware of this, and therefore eventually starts running another set of jobs. So I end up with twice as many jobs running as there should be on an affected worker node, half of which are no longer under HTCondor's control.
> 
> In /var/log/messages is this (it sometimes happens several times consecutively):
> 
> 2016-07-27T20:20:09.881844+01:00 lcg1879 systemd: condor.service watchdog timeout (limit 5s)!
> 2016-07-27T20:20:10.068139+01:00 lcg1879 systemd: condor.service: main process exited, code=killed, status=6/ABRT
> 2016-07-27T20:20:10.157858+01:00 lcg1879 systemd: Unit condor.service entered failed state.
> 2016-07-27T20:20:10.158120+01:00 lcg1879 systemd: condor.service failed.
> 2016-07-27T20:20:15.381427+01:00 lcg1879 systemd: condor.service holdoff time over, scheduling restart.
> 
> In /var/log/condor/StartLog is this:
> 
> Stack dump for process 27243 at timestamp 1469647209 (9 frames)
> /lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fe4f2118722]
> /lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fe4f2276224]
> /lib64/libpthread.so.0(+0xf100)[0x7fe4f0ab9100]
> /lib64/libc.so.6(__select+0x13)[0x7fe4f07d6993]
> /lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7fe4f2183056]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7fe4f2259ef2]
> /lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fe4f2279824]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fe4f070ab15]
> condor_startd[0x422a79]
> 
> while every StarterLog has something like this:
> 
> Stack dump for process 1211947 at timestamp 1469647209 (9 frames)
> /lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7f94ab25d722]
> /lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7f94ab3bb224]
> /lib64/libpthread.so.0(+0xf100)[0x7f94a9bfe100]
> /lib64/libc.so.6(__select+0x13)[0x7f94a991b993]
> /lib64/libcondor_utils_8_5_5.so(_ZN8Selector7executeEv+0xa6)[0x7f94ab2c8056]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7f94ab39eef2]
> /lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7f94ab3be824]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f94a984fb15]
> condor_starter[0x4220e9]
> 
> and finally the MasterLog:
> 
> Stack dump for process 27213 at timestamp 1469647209 (29 frames)
> /lib64/libcondor_utils_8_5_5.so(dprintf_dump_stack+0x72)[0x7fb1916f0722]
> /lib64/libcondor_utils_8_5_5.so(_Z18linux_sig_coredumpi+0x24)[0x7fb19184e224]
> /lib64/libpthread.so.0(+0xf100)[0x7fb190091100]
> /lib64/libc.so.6(__poll+0x10)[0x7fb18fdacc20]
> /lib64/libresolv.so.2(+0xade7)[0x7fb191b5ade7]
> /lib64/libresolv.so.2(__libc_res_nquery+0x18e)[0x7fb191b58cce]
> /lib64/libresolv.so.2(__libc_res_nsearch+0x350)[0x7fb191b598b0]
> /lib64/libnss_dns.so.2(_nss_dns_gethostbyname4_r+0x103)[0x7fb18f26fc53]
> /lib64/libc.so.6(+0xdc1a8)[0x7fb18fd9d1a8]
> /lib64/libc.so.6(getaddrinfo+0xfd)[0x7fb18fda086d]
> /lib64/libcondor_utils_8_5_5.so(_Z16ipv6_getaddrinfoPKcS0_R17addrinfo_iteratorRK8addrinfo+0x49)[0x7fb191791659]
> /lib64/libcondor_utils_8_5_5.so(_Z20resolve_hostname_rawRK8MyString+0x15a)[0x7fb1916c522a]
> /lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnameRK8MyString+0x93)[0x7fb1916c54f3]
> /lib64/libcondor_utils_8_5_5.so(_Z16resolve_hostnamePKc+0x1f)[0x7fb1916c596f]
> /lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify10fill_tableEPNS_13PermTypeEntryEPcb+0x70c)[0x7fb1917c058c]
> /lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify4InitEv+0x12d)[0x7fb1917c092d]
> /lib64/libcondor_utils_8_5_5.so(_ZN8IpVerify6VerifyE12DCpermissionRK15condor_sockaddrPKcP8MyStringS7_+0x2b8)[0x7fb1917c2248]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6VerifyEPKc12DCpermissionRK15condor_sockaddrS1_+0x81)[0x7fb19181fc71]
> /lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol13VerifyCommandEv+0x45a)[0x7fb19184ac2a]
> /lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol10doProtocolEv+0xbd)[0x7fb19184c2fd]
> /lib64/libcondor_utils_8_5_5.so(_ZN21DaemonCommandProtocol14SocketCallbackEP6Stream+0x7f)[0x7fb19184c48f]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore24CallSocketHandler_workerEibP6Stream+0x694)[0x7fb19182d634]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore35CallSocketHandler_worker_demarshallEPv+0x1d)[0x7fb19182d76d]
> /lib64/libcondor_utils_8_5_5.so(_ZN13CondorThreads8pool_addEPFvPvES0_PiPKc+0x35)[0x7fb191731865]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore17CallSocketHandlerERib+0x14a)[0x7fb191828eea]
> /lib64/libcondor_utils_8_5_5.so(_ZN10DaemonCore6DriverEv+0x1d7b)[0x7fb191832c1b]
> /lib64/libcondor_utils_8_5_5.so(_Z7dc_mainiPPc+0x13a4)[0x7fb191851824]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fb18fce2b15]
> /usr/sbin/condor_master[0x40a5c1]
> 
> Has anyone else seen this? It's not obvious to me from the timestamps in the logs whether systemd killed all the HTCondor daemons because of the watchdog timeout (I guess it's probably this?) or whether everything died first and systemd then noticed. It only seems to happen when a worker node is very busy (i.e. I've never seen this happen on idle SL7 worker nodes).
> 
> Thanks,
> Andrew.
