Re: [HTCondor-users] dagman job won't finish
- Date: Wed, 31 Jan 2018 14:49:37 -0500
- From: Rami Vanguri <rami.vanguri@xxxxxxxxx>
- Subject: Re: [HTCondor-users] dagman job won't finish
Ah okay, my POST script only deletes some tarballs. I have been using
a test job and the problem occurs 100% of the time. I've submitted
this job around 100 times so far. Since it hangs I'm reluctant to
start production.
--Rami
On Wed, Jan 31, 2018 at 2:38 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 1/31/2018 1:10 PM, Rami Vanguri wrote:
>> Hi Todd,
>>
>> Thanks for the reply!
>>
>> condor_version:
>> $CondorVersion: 8.6.8 Oct 31 2017 $
>> $CondorPlatform: X86_64-CentOS_6.9 $
>>
>> OS:
>> Description: CentOS release 6.9 (Final)
>>
>> So your question about editing/removing files might be the answer, my
>> POST script transfers (to hadoop) and removes the resulting tarballs
>> from the preceding steps. I do this because I will be submitting
>> hundreds of these and don't want to keep the output around in the
>> scratch directory. If that is indeed what's causing the issue, how
>> can I remove files safely?
>>
>
> Having your POST script move job output someplace should be fine. I asked because I was concerned that your POST script may actually move (or remove) files that DAGMan itself needs to reference, like your
> .dag.nodes.log and other files created by DAGMan. These files created by DAGMan should not be moved/removed until DAGMan exits.
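
[A minimal sketch of a POST-script cleanup along the lines Todd describes. Every name here (the `output_*.tar.gz` pattern, the scratch layout, the hadoop destination) is an illustrative assumption, not from this thread. The point is that deleting via a glob that matches only the job's own tarballs can never touch DAGMan's `.dag.nodes.log` or `.dagman.out`:]

```shell
#!/bin/sh
# Hypothetical POST-script cleanup sketch. All file names here
# (output_*.tar.gz, workflow.dag.nodes.log) are illustrative
# assumptions, not taken from the thread.
set -eu

workdir=$(mktemp -d)                      # stand-in for the node's scratch dir
touch "$workdir/output_B0.tar.gz"         # pretend job output tarball
touch "$workdir/workflow.dag.nodes.log"   # pretend DAGMan bookkeeping file

# A real script would transfer the output first, e.g. something like:
#   hadoop fs -put "$workdir"/output_*.tar.gz /user/.../results/
# (destination path elided/hypothetical)

# Delete only the tarballs; this glob cannot match DAGMan's own files.
rm -f "$workdir"/output_*.tar.gz

remaining=$(ls "$workdir")                # only the DAGMan log should remain
echo "$remaining"
```

[The safety comes from never using a broad `rm` (e.g. `rm -f "$workdir"/*`) in a directory that also holds files DAGMan still reads.]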
>
> Another question: Does this problem occur consistently (i.e. DAGMan always gets stuck with this workflow), or only occasionally? If the latter, are we talking 1 in every 5 workflow runs, or 1 in 50,000?
>
> Thanks,
> Todd
>
>> --Rami
>>
>> On Wed, Jan 31, 2018 at 2:05 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>> Hi Rami,
>>>
>>> I see what you mean; the output below certainly looks strange to me. I will ask
>>> one of our DAGMan experts here to take a look and report back to the list.
>>>
>>> In the meantime, could you tell us what version of HTCondor you are using
>>> (i.e. output of condor_version on your submit machine), and on what
>>> operating system?
>>>
>>> Were any files in /home/ramiv/nsides_IN/condor/localTEST edited or removed
>>> during the test? Is this subdirectory on a shared file system?
>>>
>>> Thanks
>>> Todd
>>>
>>>
>>> On 1/30/2018 5:47 PM, Rami Vanguri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am running a DAG job that has several components which seem to all
>>>> run fine, but then the actual dag job never stops running even though
>>>> all of the jobs were successful.
>>>>
>>>> Here is an excerpt from the .dag.nodes.log file:
>>>> 005 (69535.000.000) 01/30 14:47:07 Job terminated.
>>>> (1) Normal termination (return value 0)
>>>> Usr 0 00:11:47, Sys 0 00:02:08 - Run Remote Usage
>>>> Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
>>>> Usr 0 00:11:47, Sys 0 00:02:08 - Total Remote Usage
>>>> Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
>>>> 0 - Run Bytes Sent By Job
>>>> 0 - Run Bytes Received By Job
>>>> 0 - Total Bytes Sent By Job
>>>> 0 - Total Bytes Received By Job
>>>> Partitionable Resources : Usage Request Allocated
>>>> Cpus : 1 1
>>>> Disk (KB) : 250000 1 6936266
>>>> Memory (MB) : 583 2048 2048
>>>> ...
>>>> 016 (69538.000.000) 01/30 15:29:24 POST Script terminated.
>>>> (1) Normal termination (return value 0)
>>>> DAG Node: C
>>>>
>>>> ..and here is an excerpt from the .dagman.out file:
>>>> 01/30/18 15:29:22 Node C job proc (69538.0.0) completed successfully.
>>>> 01/30/18 15:29:22 Node C job completed
>>>> 01/30/18 15:29:22 Running POST script of Node C...
>>>> 01/30/18 15:29:22 Warning: mysin has length 0 (ignore if produced by
>>>> DAGMan; see gittrac #4987, #5031)
>>>> 01/30/18 15:29:22 DAG status: 0 (DAG_STATUS_OK)
>>>> 01/30/18 15:29:22 Of 4 nodes total:
>>>> 01/30/18 15:29:22     Done     Pre   Queued    Post   Ready   Un-Ready   Failed
>>>> 01/30/18 15:29:22      ===     ===      ===     ===     ===        ===      ===
>>>> 01/30/18 15:29:22        3       0        0       1       0          0        0
>>>> 01/30/18 15:29:22 0 job proc(s) currently held
>>>> 01/30/18 15:29:24 Initializing user log writer for
>>>> /home/ramiv/nsides_IN/condor/localTEST/./workflow_localTEST.dag.nodes.log,
>>>> (69538.0.0)
>>>> 01/30/18 15:39:23 601 seconds since last log event
>>>> 01/30/18 15:39:23 Pending DAG nodes:
>>>> 01/30/18 15:39:23 Node C, HTCondor ID 69538, status STATUS_POSTRUN
>>>>
>>>> The DAG control file has only 4 jobs, structured like this:
>>>> PARENT B0 B1 B2 CHILD C
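
[For reference, a four-node DAG matching the quoted PARENT/CHILD line might look like the sketch below; only the last line is from the thread, and the submit-file and script names are assumptions:]

```
# Hypothetical workflow_localTEST.dag sketch; submit/script names assumed
JOB B0 B0.sub
JOB B1 B1.sub
JOB B2 B2.sub
JOB C  C.sub
SCRIPT POST C cleanup.sh
PARENT B0 B1 B2 CHILD C
```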
>>>>
>>>> What could cause my job to be stuck in POSTRUN even though it runs
>>>> successfully with the proper exit code?
>>>>
>>>> Thanks for any help.
>>>>
>>>> --Rami
>
>
> --
> Todd Tannenbaum <tannenba@xxxxxxxxxxx>
> HTCondor Technical Lead
> Center for High Throughput Computing, Department of Computer Sciences
> University of Wisconsin-Madison
> 1210 W. Dayton St. Rm #4257, Madison, WI 53706-1685
> Phone: (608) 263-7132