Thanks for your response. No failing disk on the node. There is no SchedLog entry for 04:01:16, when the job submission failed.

05/08/19 04:01:15 (pid:1316698) Shadow pid 1055394 for job 147153.3 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053482 for job 147151.3 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056671 for job 147155.1 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056674 for job 147155.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:17 (pid:1316698) Activity on stashed negotiator socket: <IPaddress:32651>
05/08/19 04:01:17 (pid:1316698) Using negotiation protocol: NEGOTIATE

This is the main logic used in the DAGMan submit file:

remove_kill_sig = SIGUSR1
+OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False

Jobs "147157.4" and "147154.6" are part of the same DAGMan output.

# grep -ir '147154.6' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.5.0) to (147154.6.0)
05/08/19 03:59:28 Event: ULOG_SUBMIT for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:27}
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.6.0) to (147154.7.0)
05/08/19 03:59:45 Event: ULOG_EXECUTE for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:28}
05/08/19 04:01:16 Event: ULOG_JOB_TERMINATED for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 04:00:39}
05/08/19 04:01:16 Node test_21090507_ job proc (147154.6.0) completed successfully.

# grep -ir '147157.4' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.

Thanks & Regards,
Vikrant Aggarwal

On Thu, May 9, 2019 at 11:04 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

It's still not the right time period of the SchedLog. And the error about Zombie checking is for a different job; it's not even the same cluster.
You might check to see if you have a failing disk; that *might* explain both of these problems. I can't think of anything else that could.
-tj
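For the failing-disk check suggested above, a few quick things to look at on the schedd host (a minimal sketch; it asks HTCondor where SPOOL and LOG live via condor_config_val rather than hard-coding paths, and the dmesg pattern is just an example):

# Free space and inodes on the partitions holding SPOOL and LOG
df -h  "$(condor_config_val SPOOL)" "$(condor_config_val LOG)"
df -hi "$(condor_config_val SPOOL)" "$(condor_config_val LOG)"

# Kernel-level I/O errors that would point at failing hardware
dmesg | grep -iE 'i/o error|medium error|ata.*error' | tail -20

A full or read-only filesystem under SPOOL can also make queue transactions fail intermittently, so it is worth ruling that out alongside outright hardware errors.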
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 10:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Random job in DAGman throwing submit ERROR
Sorry for the typo. The snippet was from the SchedLog only, not the ShadowLog.
On Thu, 9 May, 2019, 21:07 John M Knoeller, <johnkn@xxxxxxxxxxx> wrote:
The ShadowLog isn't where to look for this error. When submit fails, there will never be a shadow for that job.
You should look in the SchedLog at time 05/08/19 04:01:16 to see if it has a reason why the submit failed.
-tj
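For reference, one way to pull that window out of the SchedLog (a sketch only; /var/log/condor/SchedLog is an assumed path, use condor_config_val SCHEDD_LOG to find yours, and the relevant lines may already have rotated into SchedLog.old):

# SchedLog lines in the seconds around the failed submit
grep -n '05/08/19 04:01:1[5-7]' /var/log/condor/SchedLog /var/log/condor/SchedLog.old

# Or keep a few lines of context around the exact second
grep -n -C 3 '05/08/19 04:01:16' "$(condor_config_val SCHEDD_LOG)"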
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 3:15 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Random job in DAGman throwing submit ERROR
Hello Team,
We are facing a weird issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG randomly throws an error, at no particular frequency. I am trying to debug the reason for it.
05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed
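When "Failed to queue job" shows up with no obvious cause, one thing worth ruling out is a schedd-side queue limit being hit at that moment. A minimal check (the knob names below are my guess at what to look at, not something the error message itself confirms; condor_config_val simply reports a knob as not defined if it is unset):

# Schedd limits that can cause a queue attempt to be refused
condor_config_val MAX_JOBS_SUBMITTED MAX_JOBS_RUNNING

# Size of the queue around the failure, to compare against those limits
condor_q -totals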
The shadow logs are not showing any indication for this job, but they do show the "status in check_zombie" message for another job of the same DAG. Most of the time I notice this zombie message appearing in the sched logs around the time of the issue, but not every time.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
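For the check_zombie line, it may also help to see what the schedd ultimately recorded for that job once it left the queue. A small sketch (the attributes are standard job-ad attributes; treat the exact invocation as an assumption and adjust to taste):

# What did 147154.6 finish with, according to the history file?
condor_history -constraint 'ClusterId == 147154 && ProcId == 6' -af ClusterId ProcId JobStatus ExitCode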
Condor version details:
$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $
Has anyone else seen this issue?
Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/