We are facing a strange issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG fails at random, with no particular frequency, and I am trying to find the cause. The error is:
05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed
The shadow logs show no indication of this job, but the "status in check_zombie" message does appear
for another job of the same DAG. Most of the time I see this zombie message in the sched logs around the time of the issue, but not every time.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
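To check how often the two messages really coincide, something like the following rough sketch could be run over the DAGMan output and the sched log. It assumes both files use the "MM/DD/YY HH:MM:SS" timestamp prefix seen in the excerpts above; the function names, the 5-second window, and the file paths passed on the command line are my own placeholders, not anything provided by HTCondor.

```python
import sys
from datetime import datetime, timedelta

# Timestamp layout assumed from the excerpts above, e.g. "05/08/19 04:01:16".
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LENGTH = 17


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def matching_times(lines, needle):
    """Collect timestamps of all lines that contain the given text."""
    times = []
    for line in lines:
        if needle in line:
            ts = parse_ts(line)
            if ts is not None:
                times.append(ts)
    return times


def correlate(dagman_lines, sched_lines, window_seconds=5):
    """Pair each failed submit with whether a check_zombie error is nearby.

    Returns a list of (failure_time, zombie_seen_within_window) tuples.
    """
    window = timedelta(seconds=window_seconds)
    failures = matching_times(dagman_lines, "ERROR: submit attempt failed")
    zombies = matching_times(sched_lines, "status in check_zombie")
    return [(f, any(abs(f - z) <= window for z in zombies)) for f in failures]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: script.py <dagman output file> <sched log file>
    with open(sys.argv[1]) as dag_file, open(sys.argv[2]) as sched_file:
        for when, zombie_nearby in correlate(dag_file, sched_file):
            print(when, "check_zombie nearby" if zombie_nearby else "no check_zombie nearby")
```

If the failures mostly line up with check_zombie errors, that would at least support the pattern described above; if not, the two messages may be unrelated.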
$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $
Has anyone else seen this issue?