
Re: [HTCondor-users] Job marked as completed however output files not transferred back



Thanks Thomas.

I don't have a shared filesystem on the preemptible cloud nodes, so there is nowhere to store the results of a post script (transferring the post script's output back may meet the same fate as the rest of the generated output files). Unfortunately, this problem is not reproducible at will.

Does my hypothesis about the job being marked as completed without output transfer look plausible?

Hypothesis:

An edge case where the job finished its computation, but the worker node was preempted during the output transfer phase, and because of the ON_EXIT condition the job was still marked as completed.
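
Until the cause is confirmed, one way to at least detect the symptom is a check on the submit side after the job leaves the queue. The following is only a minimal sketch under stated assumptions: the function name missing_outputs and the submit_dir argument are hypothetical, it uses nothing from HTCondor itself, and it assumes the output files land in a known directory on the submit machine under the names listed in TransferOutput.

```python
from pathlib import Path


def missing_outputs(transfer_output: str, submit_dir: str) -> list[str]:
    """Return the names from a TransferOutput-style list that are absent in submit_dir.

    transfer_output is a comma-separated string such as the job's
    TransferOutput value; submit_dir is an assumed location for the
    transferred files on the submit machine.
    """
    # Split the comma-separated list and drop empty entries and stray spaces.
    names = [name.strip() for name in transfer_output.split(",") if name.strip()]
    base = Path(submit_dir)
    # Files and directories both count as present.
    return [name for name in names if not (base / name).exists()]


if __name__ == "__main__":
    expected = "job.93.result,job.93.dir,coverage.93.zip"
    missing = missing_outputs(expected, ".")
    if missing:
        print("missing output:", ", ".join(missing))
    else:
        print("all output present")
```

A non-empty result for a job that history reports as completed would flag it for resubmission. This only detects the problem after the fact; it does not explain or prevent it.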


Thanks & Regards,
Vikrant Aggarwal


On Thu, Feb 16, 2023 at 3:04 PM Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
Hi Vikrant,

just a quick hack - but maybe you can check/debug the status of the
output file transfer with a postcmd script?

Cheers,
  Thomas

On 16/02/2023 08.00, Vikrant Aggarwal wrote:
> Hello Experts,
>
> We are facing a strange situation where the job is marked as completed
> (C) in condor job history but the job has not transferred back the
> output files from worker to submit. Usually in this case the job should
> go into a held state.
>
> TransferOutput = "job.93.result,job.93.dir,coverage.93.zip"
> ShouldTransferFiles = "YES"
> WhenToTransferOutput = "ON_EXIT"
>
> In this case the worker nodes run in the cloud as spot VM(s),
> which trigger the following script on shutdown of the node.
>
> #!/bin/bash
> /usr/sbin/condor_off -daemon master
>
>
> Cloud logs show that the VM (virtual machine) was preempted at
> approximately the same time the job was marked as completed.
> Hypothesis:
>
> An edge case where the job finished its computation, but the worker
> node was preempted during the output transfer phase, and because of
> the ON_EXIT condition the job was still marked as completed.
>
> Could the above assessment be right?
>
> If the suggestion is to use ON_EXIT_OR_EVICT, we can't use it, as it
> may fill up the spool very quickly at our scale.
>
> Any other suggestions to ensure that a job is not marked as completed
> with missing output?
>
>
> Thanks & Regards,
> Vikrant Aggarwal