
[HTCondor-users] condor_dagman not creating jobs



Hello

I've run into an issue where dagman seems to be unable to create jobs because condor_submit segfaults.

.condor_dagman.out contains:
10/27/21 12:52:35 ERROR: submit attempt failed
10/27/21 12:52:35 submit command was: /usr/bin/condor_submit -a dag_node_name' '=' 'job2 -a submit_event_notes' '=' 'DAG' 'Node:' 'job2 -a dagman_log' '=' '/mnt/scratch/tyuan/refit/./refit.prob.dag.nodes.log -a +DAGManNodesMask' '=' '"0,1,2,4,5,7,9,10,11,12,13,16,17,24,27,35,36" -a JOB=job2 -a OUTPUT_DIR' '=' '/data/user/tyuan/studies/tablemaker/refits/prob -a INPUT_DIR' '=' '/data/user/chill/photo-table -a FILE_NAME' '=' 'cascade_halftable_spice_3.2.1_flat_z0_zen100_azi180_nevents40000_0_range.fits -a DAG_STATUS' '=' '2 -a FAILED_COUNT' '=' '1 -a notification' '=' 'never -a +DAGParentNodeNames' '=' '"" refit.prob.sub
10/27/21 12:52:35 Job submit try 1/6 failed, will try again in >= 1 second.

dmesg contains:
[2335469.858471] condor_submit[2260162]: segfault at a ip 00007efd3f70e2cb sp 00007ffd24306b40 error 4 in libglobus_gsi_credential.so.1.6.14[7efd3f707000+9000]
[2335469.864387] Code: 00 48 c7 44 24 08 00 00 00 00 48 85 ff 74 07 e8 9b 93 ff ff 89 c5 4d 85 ff 74 3f 4c 8d 6c 24 08 49 8b 07 4c 89 ee 48 8b 40 20 <48> 8b 78 08 e8 bc 92 ff ff 85 c0 75 78 48 8b 03 48 8b 54 24 08 48
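
In case it helps with debugging, here is a minimal Python sketch that pulls the faulting offset inside the library out of the first dmesg line (instruction pointer minus the mapping base given in the square brackets). The function name fault_offset and the regular expression are my own, for illustration only; the offset is only meaningful for this exact build of libglobus_gsi_credential.so.1.6.14, and resolving it to a function name would additionally need the matching debug symbols.

```python
import re

# First dmesg line from above, copied verbatim.
DMESG_LINE = (
    "[2335469.858471] condor_submit[2260162]: segfault at a "
    "ip 00007efd3f70e2cb sp 00007ffd24306b40 error 4 in "
    "libglobus_gsi_credential.so.1.6.14[7efd3f707000+9000]"
)


def fault_offset(line):
    """Return (library, offset) for a kernel segfault line.

    Hypothetical helper: offset = instruction pointer - mapping base.
    """
    m = re.search(r"ip ([0-9a-f]+) .* in (\S+)\[([0-9a-f]+)\+([0-9a-f]+)\]", line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m.group(1), 16)
    library = m.group(2)
    base = int(m.group(3), 16)
    size = int(m.group(4), 16)
    offset = ip - base
    # Sanity check: the instruction pointer must lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return library, offset


library, offset = fault_offset(DMESG_LINE)
print(library, hex(offset))  # libglobus_gsi_credential.so.1.6.14 0x72cb
```

The "segfault at a" together with "error 4" indicates a user-mode read from address 0xa, which is not mapped, so this looks like a dereference of a near-NULL pointer inside the library rather than a problem in the submit file parsing itself.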

We are running HTCondor version 9.0.6 on CentOS 8.

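My simple test DAGs run fine, so the submission does not always fail. Perhaps it has something to do with sending X.509 proxies with the jobs? The crash is inside libglobus_gsi_credential, which handles those credentials, so that would fit.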

Any help would be appreciated.


Vlad