It's still not the right time period of the SchedLog, and the error about zombie checking is for a different job; it's not even the same cluster.

You might check to see if you have a failing disk; that *might* explain both of these problems. I can't think of anything else that could.

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 10:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Random job in DAGman throwing submitERROR
Sorry for the typo. The snippet was from the sched logs only, not the shadow logs.
On Thu, 9 May, 2019, 21:07 John M Knoeller, <johnkn@xxxxxxxxxxx> wrote:
The ShadowLog isn't where to look for this error. When submit fails, there will never be a shadow for that job.
You should look in the SchedLog at time 05/08/19 04:01:16 to see if it has a reason why the submit failed.
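
One way to pull out the SchedLog lines around that time is sketched below in Python. It is only a minimal sketch: it assumes every log line starts with a timestamp in the same "MM/DD/YY HH:MM:SS" form as the snippets in this thread, and the function name lines_near and the 5-second window are made up for illustration, not part of HTCondor.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the log snippets in this thread,
# e.g. "05/08/19 04:01:16". Other log configurations may differ.
TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_LEN = 17  # length of "05/08/19 04:01:16"


def lines_near(lines, when, window_seconds=5):
    """Return the lines whose leading timestamp is within
    window_seconds of `when` (a string in TS_FORMAT)."""
    target = datetime.strptime(when, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    matches = []
    for line in lines:
        try:
            stamp = datetime.strptime(line[:TS_LEN], TS_FORMAT)
        except ValueError:
            # No parseable timestamp at the start of the line; skip it.
            continue
        if abs(stamp - target) <= window:
            matches.append(line.rstrip("\n"))
    return matches
```

To use it, open your SchedLog (its path depends on your installation) and pass the file object as `lines`, for example `lines_near(open(path, errors="replace"), "05/08/19 04:01:16")`.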

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 3:15 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Random job in DAGman throwing submitERROR
Hello Team,

We are facing a weird issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG randomly throws an error, at no particular frequency. I am trying to find the reason for this.

05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed

The shadow logs show no indication of this job, but they do show the "status in check_zombie" message for another job of the same DAG. Most of the time I have noticed this zombie message appearing in the sched logs around the time of the issue, but not every time.

05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
Condor version details:

$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $

Has anyone else seen this issue?

Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/