
Re: [HTCondor-users] dagman Caught signal 11



Hi David,

It appears DAGMan is crashing while submitting one of the jobs it manages. It would help to know whether the crash comes from our Submit Utils (and possibly the job submit description) or from how DAGMan submits jobs directly. If you set the configuration option DAGMAN_USE_DIRECT_SUBMIT = False, does the issue still occur?
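
A minimal sketch of that test, assuming the option is set in a local configuration file on the submit host. The file name below is hypothetical; any configuration file HTCondor reads on that machine should do. DAGMan reads its configuration at startup, so the change should only affect DAGs submitted afterwards:

```
# Hypothetical file name (assumption): /etc/condor/config.d/99-dagman-debug.conf
# Have DAGMan run condor_submit for each node job instead of
# submitting directly to the schedd.
DAGMAN_USE_DIRECT_SUBMIT = False
```

Running condor_config_val DAGMAN_USE_DIRECT_SUBMIT on the submit host should then report False. If the crash goes away, the DAGMan log should no longer show "using direct job submission" for the affected node.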

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, March 5, 2024 8:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] dagman Caught signal 11
 
Hi,
As part of the WLCG migration away from GSI, we upgraded condor from 9.0 to condor-10.9.0-1.el7.x86_64.
Since the upgrade, DAG jobs have been failing with "Caught signal 11" at random stages.


03/05/24 16:43:29 Submitting node 1_rRNA.condor from file 1_rRNA.condor using direct job submission
Caught signal 11: si_code=1, si_pid=4294967280, si_uid=0, si_addr=0xFFFFFFF0
Stack dump for process 3749879 at timestamp 1709649809 (14 frames)
/lib64/libcondor_utils_10_9_0.so(_Z18dprintf_dump_stackv+0x25)[0x7f0cb6c3d595]
/lib64/libcondor_utils_10_9_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x68)[0x7f0cb6e3f168]
/lib64/libpthread.so.0(+0xf630)[0x7f0cb5082630]
/lib64/libc.so.6(+0x16f8e5)[0x7f0cb4e148e5]
condor_scheduniv_exec.1729273.0(_ZN19SubmitStepFromQArgs12next_rowdataEv+0x161)[0x436df1]
condor_scheduniv_exec.1729273.0(_Z20direct_condor_submitRK6DagmanP3JobPKcRKSsS5_S5_R8CondorID+0xbfc)[0x434f0c]
condor_scheduniv_exec.1729273.0(_ZN3Dag13SubmitNodeJobERK6DagmanP3JobR8CondorID+0x1a9)[0x41fc19]
condor_scheduniv_exec.1729273.0(_ZN3Dag15SubmitReadyJobsERK6Dagman+0x230)[0x420500]
condor_scheduniv_exec.1729273.0(_Z18condor_event_timerv+0xf2)[0x428462]
/lib64/libcondor_utils_10_9_0.so(_ZN12TimerManager7TimeoutEPiPd+0x473)[0x7f0cb6e609a3]
/lib64/libcondor_utils_10_9_0.so(_ZN10DaemonCore6DriverEv+0x25f)[0x7f0cb6e2737f]
/lib64/libcondor_utils_10_9_0.so(_Z7dc_mainiPPc+0x18f4)[0x7f0cb6e4bb84]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0cb4cc7555]
condor_scheduniv_exec.1729273.0[0x411874]