Thanks for your response. No failing disk on the node. There is no SchedLog entry for 04:01:16, when the job submission failed.

05/08/19 04:01:15 (pid:1316698) Shadow pid 1055394 for job 147153.3 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053482 for job 147151.3 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056671 for job 147155.1 exited with status 100
05/08/19 04:01:15 (pid:1316698) Shadow pid 1056674 for job 147155.4 exited with status 100
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:15 (pid:1316698) Number of Active Workers 0
05/08/19 04:01:17 (pid:1316698) Activity on stashed negotiator socket: <IPaddress:32651>
05/08/19 04:01:17 (pid:1316698) Using negotiation protocol: NEGOTIATE

This is the main logic used in the DAGMan submit file:

remove_kill_sig = SIGUSR1
+OtherJobRemoveRequirements = "DAGManJobId =?= $(cluster)"
# Note: default on_exit_remove expression:
# ( ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
# attempts to ensure that DAGMan is automatically
# requeued by the schedd if it exits abnormally or
# is killed (e.g., during a reboot).
on_exit_remove = (ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >=0 && ExitCode <= 2))
copy_to_spool = False

Jobs "147157.4" and "147154.6" are part of the same DAGMan output.

# grep -ir '147154.6' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.5.0) to (147154.6.0)
05/08/19 03:59:28 Event: ULOG_SUBMIT for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:27}
05/08/19 03:59:28 Reassigning the id of job test_21090507_ from (147154.6.0) to (147154.7.0)
05/08/19 03:59:45 Event: ULOG_EXECUTE for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 03:59:28}
05/08/19 04:01:16 Event: ULOG_JOB_TERMINATED for HTCondor Node test_21090507_ (147154.6.0) {05/08/19 04:00:39}
05/08/19 04:01:16 Node test_21090507_ job proc (147154.6.0) completed successfully.

# grep -ir '147157.4' condor_20190507.20190508.033002.618997.dag.dagman.out
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.

Thanks & Regards,
Vikrant Aggarwal

On Thu, May 9, 2019 at 11:04 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

It's still not the right time period of the SchedLog. And the error about Zombie checking is for a different job; it's not even the same cluster.
You might check to see if you have a failing disk; that *might* explain both of these problems. I can't think of anything else that could.
-tj
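For the failing-disk check suggested above, a few quick things to look at on the schedd host (a minimal sketch; it asks HTCondor where SPOOL and LOG live via condor_config_val rather than hard-coding paths, and the dmesg pattern is just an example):

# Free space and inodes on the partitions holding SPOOL and LOG
df -h  "$(condor_config_val SPOOL)" "$(condor_config_val LOG)"
df -hi "$(condor_config_val SPOOL)" "$(condor_config_val LOG)"

# Kernel-level I/O errors that would point at failing hardware
dmesg | grep -iE 'i/o error|medium error|ata.*error' | tail -20

A full or read-only filesystem under SPOOL can also make queue transactions fail intermittently, so it is worth ruling that out alongside outright hardware errors.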
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 10:40 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Random job in DAGman throwing submit ERROR
Sorry for the typo. The snippet was from the SchedLog only, not the ShadowLog.
On Thu, 9 May, 2019, 21:07 John M Knoeller, <johnkn@xxxxxxxxxxx> wrote:
The ShadowLog isn't where to look for this error. When submit fails, there will never be a shadow for that job.
You should look in the SchedLog at time 05/08/19 04:01:16 to see if it has a reason why the submit failed.
-tj
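For reference, one way to pull that window out of the SchedLog (a sketch only; /var/log/condor/SchedLog is an assumed path, use condor_config_val SCHEDD_LOG to find yours, and the relevant lines may already have rotated into SchedLog.old):

# SchedLog lines in the seconds around the failed submit
grep -n '05/08/19 04:01:1[5-7]' /var/log/condor/SchedLog /var/log/condor/SchedLog.old

# Or keep a few lines of context around the exact second
grep -n -C 3 '05/08/19 04:01:16' "$(condor_config_val SCHEDD_LOG)"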
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Vikrant Aggarwal
Sent: Thursday, May 9, 2019 3:15 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Random job in DAGman throwing submit ERROR
Hello Team,
We are facing a weird issue with a DAGMan workflow consisting of 54 jobs. One of the jobs in the DAG randomly throws an error, at no particular frequency. I am trying to debug the reason for it.
05/08/19 04:01:16 From submit: Submitting job(s)....
05/08/19 04:01:16 From submit: ERROR: Failed submission for job 147157.4 - aborting entire submit
05/08/19 04:01:16 From submit:
05/08/19 04:01:16 From submit: ERROR: Failed to queue job.
05/08/19 04:01:16 failed while reading from pipe.
05/08/19 04:01:16 Read so far: Submitting job(s)....ERROR: Failed submission for job 147157.4 - aborting entire submitERROR: Failed to queue job.
05/08/19 04:01:16 ERROR: submit attempt failed
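When "Failed to queue job" shows up with no obvious cause, one thing worth ruling out is a schedd-side queue limit being hit at that moment. A minimal check (the knob names below are my guess at what to look at, not something the error message itself confirms; condor_config_val simply reports a knob as not defined if it is unset):

# Schedd limits that can cause a queue attempt to be refused
condor_config_val MAX_JOBS_SUBMITTED MAX_JOBS_RUNNING

# Size of the queue around the failure, to compare against those limits
condor_q -totals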
The shadow logs are not showing any indication for this job, but they do show the "status in check_zombie" message for another job of the same DAG. Most of the time I notice this zombie message appearing in the sched logs around the time of the issue, but not every time.
05/08/19 04:01:15 (pid:1316698) Shadow pid 1054140 for job 147147.1 reports job exit reason 100.
05/08/19 04:01:15 (pid:1316698) ERROR fetching job (147154.6) status in check_zombie !
05/08/19 04:01:15 (pid:1316698) Shadow pid 1053429 for job 147147.4 exited with status 100
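For the check_zombie line, it may also help to see what the schedd ultimately recorded for that job once it left the queue. A small sketch (the attributes are standard job-ad attributes; treat the exact invocation as an assumption and adjust to taste):

# What did 147154.6 finish with, according to the history file?
condor_history -constraint 'ClusterId == 147154 && ProcId == 6' -af ClusterId ProcId JobStatus ExitCode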
Condor version details:
$CondorVersion: 8.5.8 Dec 13 2016 BuildID: 390781 $
$CondorPlatform: x86_64_RedHat6 $
Has anyone else seen this issue?
Thanks & Regards,
Vikrant Aggarwal
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/