Re: [Condor-users] DAGMan Hangs Near End
- Date: Fri, 28 Sep 2012 17:02:28 -0500
- From: Nathan Panike <nwp@xxxxxxxxxxx>
- Subject: Re: [Condor-users] DAGMan Hangs Near End
Oren:
DAGMan thinks a job was submitted, but never saw it terminate. So it
missed the event in the log. Here is what to do in this case, to
complete the job:
1. Figure out which node is still pending.
2. condor_rm the DAG.
3. Edit the rescue DAG file to mark the pending job as "DONE".
4. Resubmit the DAG with condor_submit_dag.
5. Also, we need to figure out why DAGMan never recognized on its own
that the node was done. To this end, could you send me the .dagman.out
file, along with the userlog files?
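
For step 3, here is a minimal Python sketch of the edit. It assumes a
full-format rescue DAG in which each node is declared on a JOB line
(JOB <name> <submit file> ...), so that appending the DONE keyword to
that line marks the node as finished. Newer "partial" rescue files are
laid out differently, and this sketch does not handle them. The helper
name mark_node_done, the node E_merge, and the .sub file names are
hypothetical.

```python
def mark_node_done(dag_text: str, node_name: str) -> str:
    """Return dag_text with DONE appended to the JOB line of node_name.

    Assumes node declarations of the form: JOB <name> <submit file> [...]
    Lines that are not JOB lines for node_name are passed through unchanged.
    """
    out = []
    for line in dag_text.splitlines():
        tokens = line.split()
        is_target = (
            len(tokens) >= 3
            and tokens[0].upper() == "JOB"
            and tokens[1] == node_name
        )
        # Only append DONE if the line is not already marked.
        if is_target and tokens[-1].upper() != "DONE":
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical DAG content, for illustration only.
    sample = (
        "JOB D_chr10 chr10.sub\n"
        "JOB E_merge merge.sub\n"
        "PARENT D_chr10 CHILD E_merge\n"
    )
    print(mark_node_done(sample, "E_merge"), end="")
```

Keep a copy of the rescue file before overwriting it, and only mark a
node DONE after confirming from its own output that it really finished.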
Nathan Panike
On Fri, Sep 28, 2012 at 01:27:10PM -0500, Oren Livne wrote:
> Dear All,
>
> I have a DAGMan pipeline that starts fine, but never completes,
> because the last few jobs are queued but never run. A down-scaled
> version of it works, so I doubt that it's a programming error. There
> are many available nodes; why won't those jobs run? How can I
> analyze the individual job within the DAGMan that says "Queued"?
>
> Thank you so much,
> Oren
>
> -- Submitter: ibicluster.uchicago.cc : <172.16.0.149:42470> : ibicluster.uchicago.cc
>  ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
>  904.0   livne           9/28 13:09   0+00:15:40 R  0   7.3  condor_dagman -f -
>
> 1 jobs; 0 idle, 1 running, 0 held
> ===================================================================================
>
>               Total Owner Claimed Unclaimed Matched Preempting Backfill
>
>  X86_64/LINUX   728   108       0       620       0          0        0
>
>         Total   728   108       0       620       0          0        0
>
> ===================================================================================
> 9/28 13:23:33 Event: ULOG_EXECUTE for Condor Node D_chr10 (1009.0)
> 9/28 13:23:33 Number of idle job procs: 1
> 9/28 13:23:43 Event: ULOG_JOB_TERMINATED for Condor Node D_chr10 (1009.0)
> 9/28 13:23:43 Node D_chr10 job proc (1009.0) completed successfully.
> 9/28 13:23:43 Node D_chr10 job completed
> 9/28 13:23:43 Number of idle job procs: 1
> 9/28 13:23:43 Of 107 nodes total:
> 9/28 13:23:43  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
> 9/28 13:23:43   ===     ===      ===     ===     ===        ===      ===
> 9/28 13:23:43   104       0        1       0       0          2        0
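
One way to find the node that is still counted as "Queued" is to scan
the .dagman.out file for the "Event: ... for Condor Node ..." lines
shown in the excerpt above and report every node whose most recent
event is not ULOG_JOB_TERMINATED. The sketch below is a hedged Python
example. It relies only on the line format and the two event names
visible in this excerpt, and the function names are hypothetical. Nodes
that ended some other way (for example, by being removed) would also be
listed, so treat the result as a list of candidates to inspect.

```python
import re

# Matches lines such as:
#   9/28 13:23:33 Event: ULOG_EXECUTE for Condor Node D_chr10 (1009.0)
EVENT_RE = re.compile(
    r"Event: (ULOG_\w+) for Condor Node (\S+) \((\d+\.\d+)\)"
)


def last_event_per_node(dagman_out_text: str) -> dict:
    """Map each node name to its most recent (event, job id) pair."""
    last = {}
    for match in EVENT_RE.finditer(dagman_out_text):
        event, node, job_id = match.groups()
        last[node] = (event, job_id)  # later lines overwrite earlier ones
    return last


def unterminated_nodes(dagman_out_text: str) -> dict:
    """Return nodes whose last logged event is not a termination."""
    return {
        node: info
        for node, info in last_event_per_node(dagman_out_text).items()
        if info[0] != "ULOG_JOB_TERMINATED"
    }
```

Calling unterminated_nodes() on the full text of the .dagman.out file
gives a dictionary of node names and job IDs; each job ID can then be
checked against the queue and the user log.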