
Re: [HTCondor-users] dagman Caught signal 11



Hi David,

I'm glad that not using direct submit prevented the issue, but this is definitely a DAGMan bug. Would you be willing to send the submit description of the particular job that DAGMan crashed on and the queue item data if it is in an external file? You can send this directly offline to me.

-Cole Bollig

From: David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Monday, March 11, 2024 5:05 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Cole Bollig <cabollig@xxxxxxxx>
Subject: Re: [HTCondor-users] dagman Caught signal 11
 
Hi Cole,
Setting DAGMAN_USE_DIRECT_SUBMIT = False solved the issue.

Thanks,
David



On Tue, Mar 5, 2024 at 5:00 PM Cole Bollig via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Hi David,

It appears that DAGMan is crashing when submitting one of the jobs it manages. It would help to know whether the crash comes from our Submit Utils (and possibly the job submit description) or from how DAGMan submits jobs directly. If you set the configuration option DAGMAN_USE_DIRECT_SUBMIT = False, does the issue still occur?
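
For reference, a minimal sketch of how this option might be set. The file locations and the file name dagman.config are assumptions for illustration, not taken from this thread; only the knob itself is named above. With the option set to False, DAGMan should fall back to running condor_submit for each node instead of submitting directly.

```
# Option 1 (assumed location): a local configuration file on the access
# point, for example a file under /etc/condor/config.d/.
# This affects every DAG submitted from that host.
DAGMAN_USE_DIRECT_SUBMIT = False

# Option 2 (hypothetical file name): put the same line in a per-DAG
# configuration file, e.g. dagman.config, and reference it from the
# DAG file with the CONFIG command:
#
#   CONFIG dagman.config
```

The per-DAG form limits the change to the affected workflow, which may be preferable while the crash is still being investigated.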

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of David Cohen <cdavid@xxxxxxxxxxxxxxxxxxxxxx>
Sent: Tuesday, March 5, 2024 8:51 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] dagman Caught signal 11
 
Hi,
As part of the WLCG migration away from GSI, we upgraded HTCondor from 9.0 to condor-10.9.0-1.el7.x86_64.
Since the upgrade, DAG jobs have been failing with "Caught signal 11" at random stages.


03/05/24 16:43:29 Submitting node 1_rRNA.condor from file 1_rRNA.condor using direct job submission
Caught signal 11: si_code=1, si_pid=4294967280, si_uid=0, si_addr=0xFFFFFFF0
Stack dump for process 3749879 at timestamp 1709649809 (14 frames)
/lib64/libcondor_utils_10_9_0.so(_Z18dprintf_dump_stackv+0x25)[0x7f0cb6c3d595]
/lib64/libcondor_utils_10_9_0.so(_Z17unix_sig_coredumpiP9siginfo_tPv+0x68)[0x7f0cb6e3f168]
/lib64/libpthread.so.0(+0xf630)[0x7f0cb5082630]
/lib64/libc.so.6(+0x16f8e5)[0x7f0cb4e148e5]
condor_scheduniv_exec.1729273.0(_ZN19SubmitStepFromQArgs12next_rowdataEv+0x161)[0x436df1]
condor_scheduniv_exec.1729273.0(_Z20direct_condor_submitRK6DagmanP3JobPKcRKSsS5_S5_R8CondorID+0xbfc)[0x434f0c]
condor_scheduniv_exec.1729273.0(_ZN3Dag13SubmitNodeJobERK6DagmanP3JobR8CondorID+0x1a9)[0x41fc19]
condor_scheduniv_exec.1729273.0(_ZN3Dag15SubmitReadyJobsERK6Dagman+0x230)[0x420500]
condor_scheduniv_exec.1729273.0(_Z18condor_event_timerv+0xf2)[0x428462]
/lib64/libcondor_utils_10_9_0.so(_ZN12TimerManager7TimeoutEPiPd+0x473)[0x7f0cb6e609a3]
/lib64/libcondor_utils_10_9_0.so(_ZN10DaemonCore6DriverEv+0x25f)[0x7f0cb6e2737f]
/lib64/libcondor_utils_10_9_0.so(_Z7dc_mainiPPc+0x18f4)[0x7f0cb6e4bb84]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f0cb4cc7555]
condor_scheduniv_exec.1729273.0[0x411874]
