
Re: [HTCondor-users] dagman Caught signal 11



Hi Cole,
Setting DAGMAN_USE_DIRECT_SUBMIT = False solved the issue.

Thanks,
David
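
For anyone finding this thread later, here is a minimal sketch of how that setting might be applied on the submit host. It assumes a standard config.d layout; the file name is hypothetical and not taken from this thread.

```
# Hypothetical file: /etc/condor/config.d/99-dagman-direct-submit.conf
# Have DAGMan submit node jobs by running condor_submit instead of
# using direct job submission (the code path seen in the stack dump).
DAGMAN_USE_DIRECT_SUBMIT = False
```

Running condor_config_val DAGMAN_USE_DIRECT_SUBMIT should show the value in effect. DAGMan reads its configuration when it starts, so the change should apply to DAGs submitted after the edit rather than to a DAGMan process that is already running.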



On Tue, Mar 5, 2024 at 5:00 PM Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi David,

It appears that DAGMan is crashing when submitting one of the jobs it manages. It would be nice to know whether the crash is an issue with our Submit Utils (and possibly the job submit description) or an issue with how DAGMan submits jobs directly. If you set the configuration option DAGMAN_USE_DIRECT_SUBMIT = False, does the issue still occur?

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, March 5, 2024 8:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] dagman Caught signal 11
Hi,
As part of the WLCG migration away from GSI, we upgraded condor from 9.0 to condor-10.9.0-1.el7.x86_64.
Since the upgrade, DAG jobs have been failing with "Caught signal 11" at random stages.


03/05/24 16:43:29 Submitting node 1_rRNA.condor from file 1_rRNA.condor using direct job submission
Caught signal 11: si_code=1, si_pid=4294967280, si_uid=0, si_addr=0xFFFFFFF0
Stack dump for process 3749879 at timestamp 1709649809 (14 frames)
/lib64/libcondor_utils_10_9_0.so(_Z18dprintf_dump_stackv+0x25)[0x7f0cb6c3d595]
/lib64/libcondor_utils_10_9_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x68)[0x7f0cb6e3f168]
/lib64/libpthread.so.0(+0xf630)[0x7f0cb5082630]
/lib64/libc.so.6(+0x16f8e5)[0x7f0cb4e148e5]
condor_scheduniv_exec.1729273.0(_ZN19SubmitStepFromQArgs12next_rowdataEv+0x161)[0x436df1]
condor_scheduniv_exec.1729273.0(_Z20direct_condor_submitRK6DagmanP3JobPKcRKSsS5_S5_R8CondorID+0xbfc)[0x434f0c]
condor_scheduniv_exec.1729273.0(_ZN3Dag13SubmitNodeJobERK6DagmanP3JobR8CondorID+0x1a9)[0x41fc19]
condor_scheduniv_exec.1729273.0(_ZN3Dag15SubmitReadyJobsERK6Dagman+0x230)[0x420500]
condor_scheduniv_exec.1729273.0(_Z18condor_event_timerv+0xf2)[0x428462]
/lib64/libcondor_utils_10_9_0.so(_ZN12TimerManager7TimeoutEPiPd+0x473)[0x7f0cb6e609a3]
/lib64/libcondor_utils_10_9_0.so(_ZN10DaemonCore6DriverEv+0x25f)[0x7f0cb6e2737f]
/lib64/libcondor_utils_10_9_0.so(_Z7dc_mainiPPc+0x18f4)[0x7f0cb6e4bb84]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0cb4cc7555]
condor_scheduniv_exec.1729273.0[0x411874]
