Hi David,
I'm glad that not using direct submit prevented the issue, but this is definitely a DAGMan bug. Would you be willing to send the submit description of the particular job that DAGMan crashed on and the queue item data if it is in an external file? You can send this directly offline to me.
-Cole Bollig
From: David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, March 11, 2024 5:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] dagman Caught signal 11
Hi Cole,
Setting DAGMAN_USE_DIRECT_SUBMIT = False solved the issue.
Thanks,
David
Hi David,
It appears DAGMan is crashing when submitting one of the jobs it manages. It would be nice to know whether the crash is an issue with our Submit Utils (and possibly the job submit description) or an issue with how DAGMan submits jobs directly. If you set the configuration option DAGMAN_USE_DIRECT_SUBMIT = False, does the issue still occur?
-Cole Bollig
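For anyone trying the same workaround, here is a minimal sketch of where the setting might go. The file name is an assumption; any configuration file read on the access point where DAGMan runs should work. As far as I know, DAGMan picks up its configuration when it starts, so the change would apply to DAGs submitted afterwards.

```
# /etc/condor/config.d/99-dagman-workaround.conf  (hypothetical file name)
# Have DAGMan run condor_submit for each node instead of submitting directly.
DAGMAN_USE_DIRECT_SUBMIT = False
```

Running condor_config_val DAGMAN_USE_DIRECT_SUBMIT on the access point should then report False.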
Hi,
As part of the WLCG migration away from GSI, we upgraded HTCondor from 9.0 to condor-10.9.0-1.el7.x86_64.
Since the upgrade, DAG jobs have been failing with "Caught signal 11" at random stages.
03/05/24 16:43:29 Submitting node 1_rRNA.condor from file 1_rRNA.condor using direct job submission
Caught signal 11: si_code=1, si_pid=4294967280, si_uid=0, si_addr=0xFFFFFFF0
Stack dump for process 3749879 at timestamp 1709649809 (14 frames)
/lib64/libcondor_utils_10_9_0.so(_Z18dprintf_dump_stackv+0x25)[0x7f0cb6c3d595]
/lib64/libcondor_utils_10_9_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x68)[0x7f0cb6e3f168]
/lib64/libpthread.so.0(+0xf630)[0x7f0cb5082630]
/lib64/libc.so.6(+0x16f8e5)[0x7f0cb4e148e5]
condor_scheduniv_exec.1729273.0(_ZN19SubmitStepFromQArgs12next_rowdataEv+0x161)[0x436df1]
condor_scheduniv_exec.1729273.0(_Z20direct_condor_submitRK6DagmanP3JobPKcRKSsS5_S5_R8CondorID+0xbfc)[0x434f0c]
condor_scheduniv_exec.1729273.0(_ZN3Dag13SubmitNodeJobERK6DagmanP3JobR8CondorID+0x1a9)[0x41fc19]
condor_scheduniv_exec.1729273.0(_ZN3Dag15SubmitReadyJobsERK6Dagman+0x230)[0x420500]
condor_scheduniv_exec.1729273.0(_Z18condor_event_timerv+0xf2)[0x428462]
/lib64/libcondor_utils_10_9_0.so(_ZN12TimerManager7TimeoutEPiPd+0x473)[0x7f0cb6e609a3]
/lib64/libcondor_utils_10_9_0.so(_ZN10DaemonCore6DriverEv+0x25f)[0x7f0cb6e2737f]
/lib64/libcondor_utils_10_9_0.so(_Z7dc_mainiPPc+0x18f4)[0x7f0cb6e4bb84]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0cb4cc7555]
condor_scheduniv_exec.1729273.0[0x411874]