Dear Nathan,

This worked. The job in question was B_chr21. I am attaching a tarball of the requested log files. The Condor version is old (7.0.5), and I do not have control over its administration.

Thank you, and please let me know what you find.

Oren

On 9/28/2012 5:02 PM, Nathan Panike wrote:
Oren:

DAGMan thinks a job was submitted, but never saw it terminate. So it missed the event in the log. Here is what to do in this case, to complete the job:

1. Figure out which node is still pending.
2. condor_rm the DAG.
3. Edit the rescue dagfile to mark the pending job as "DONE"
4. Resubmit the DAG with condor_submit_dag.
5. Also, we need to figure out why DAGMan never recognized the node was done itself. To this end, could you send the .dagman.out file to me, along with the userlog files?

Nathan Panike

On Fri, Sep 28, 2012 at 01:27:10PM -0500, Oren Livne wrote:

Dear All,

I have a DAGMan pipeline that starts fine, but never completes, because the last few jobs are queued but never run. A down-scaled version of it works, so I doubt that it's a programming error. There are many available nodes; why won't those jobs run? How can I analyze the individual job within the DAGMan that says "Queued"?

Thank you so much,
Oren

-- Submitter: ibicluster.uchicago.cc : <172.16.0.149:42470> : ibicluster.uchicago.cc
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 904.0   livne   9/28 13:09   0+00:15:40 R  0   7.3  condor_dagman -f -

1 jobs; 0 idle, 1 running, 0 held

===================================================================================

              Total Owner Claimed Unclaimed Matched Preempting Backfill
X86_64/LINUX    728   108       0       620       0          0        0
       Total    728   108       0       620       0          0        0

===================================================================================

9/28 13:23:33 Event: ULOG_EXECUTE for Condor Node D_chr10 (1009.0)
9/28 13:23:33 Number of idle job procs: 1
9/28 13:23:43 Event: ULOG_JOB_TERMINATED for Condor Node D_chr10 (1009.0)
9/28 13:23:43 Node D_chr10 job proc (1009.0) completed successfully.
9/28 13:23:43 Node D_chr10 job completed
9/28 13:23:43 Number of idle job procs: 1
9/28 13:23:43 Of 107 nodes total:
9/28 13:23:43  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
9/28 13:23:43   ===     ===      ===     ===     ===        ===      ===
9/28 13:23:43   104       0        1       0       0          2        0

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
-- A person is just about as big as the things that make him angry.
Attachment:
pipeline.tgz
Description: Binary data
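
For anyone finding this thread later: steps 1 and 3 of Nathan's recipe can be scripted. What follows is only a rough sketch, not an official Condor tool. It assumes that the .dagman.out file logs submissions as "Event: ULOG_SUBMIT for Condor Node <name>" (the same shape as the ULOG_EXECUTE and ULOG_JOB_TERMINATED lines quoted above), and that the rescue DAG is a full DAG file in which a finished node is written as "JOB <name> <submit file> DONE". The function names and the sample nodes A and B are made up for illustration. Retried, removed, or aborted nodes are not handled, so check the result by hand before resubmitting.

```python
import re

# Matches lines such as:
#   9/28 13:23:43 Event: ULOG_JOB_TERMINATED for Condor Node D_chr10 (1009.0)
EVENT_RE = re.compile(r"Event: (ULOG_\w+) for Condor Node (\S+)")


def find_pending_nodes(dagman_out_text):
    """Step 1: return nodes that have a submit event but no terminate event.

    Assumes submissions are logged as ULOG_SUBMIT (an assumption; verify
    against your own .dagman.out file).
    """
    submitted = []      # keeps first-seen order
    terminated = set()
    for line in dagman_out_text.splitlines():
        match = EVENT_RE.search(line)
        if match is None:
            continue
        event, node = match.groups()
        if event == "ULOG_SUBMIT" and node not in submitted:
            submitted.append(node)
        elif event == "ULOG_JOB_TERMINATED":
            terminated.add(node)
    return [node for node in submitted if node not in terminated]


def mark_node_done(dag_text, node):
    """Step 3: append DONE to the JOB line of `node`, if not already there.

    Assumes a full DAG file with lines of the form "JOB <name> <submit file>".
    """
    out = []
    for line in dag_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3
                and fields[0].upper() == "JOB"
                and fields[1] == node
                and fields[-1].upper() != "DONE"):
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Made-up sample input, not taken from the attached logs.
    sample_log = (
        "9/28 13:00:00 Event: ULOG_SUBMIT for Condor Node A (1.0)\n"
        "9/28 13:00:05 Event: ULOG_SUBMIT for Condor Node B (2.0)\n"
        "9/28 13:01:00 Event: ULOG_JOB_TERMINATED for Condor Node A (1.0)\n"
    )
    sample_dag = "JOB A a.sub DONE\nJOB B b.sub\nPARENT A CHILD B\n"
    for pending in find_pending_nodes(sample_log):
        sample_dag = mark_node_done(sample_dag, pending)
    print(sample_dag, end="")
```

In practice one would read the real .dagman.out and rescue DAG from disk, condor_rm the DAG first (step 2), write the edited text back, and then resubmit with condor_submit_dag (step 4).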