
Re: [HTCondor-users] Job marked as completed however output files not transferred back



Hi Vikrant,

Would you be willing to send me the job ad from the history for the job, the ShadowLog for that time frame, and possibly the SchedLog for that time frame, so I can look a little deeper into what is happening? You can send these to me directly if you prefer.

As for suggestions to prevent this from happening: I notice that your preempt script calls condor_off on the master daemon. This starts a graceful shutdown, which allows the job to finish running on the EP. Adding the -fast flag, which evicts the jobs right away so that they fail and run again, might avoid the issue.
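
For illustration, a minimal sketch of what the shutdown hook could look like with -fast. The CONDOR_OFF variable and the existence check are additions for this sketch and not part of the original script; whether a fast shutdown actually avoids the problem has not been confirmed in this thread.

```shell
#!/bin/bash
# Sketch of the spot-VM shutdown hook using a fast shutdown.
# CONDOR_OFF is a hypothetical override; the default path is the one
# used in the original preempt script.
CONDOR_OFF="${CONDOR_OFF:-/usr/sbin/condor_off}"

# -fast requests a fast shutdown instead of the default graceful one;
# -daemon master targets the master daemon, as in the original script.
SHUTDOWN_ARGS="-fast -daemon master"

if [ -x "$CONDOR_OFF" ]; then
    # Intentionally unquoted so the arguments are split into words.
    "$CONDOR_OFF" $SHUTDOWN_ARGS
else
    echo "condor_off not found at $CONDOR_OFF; nothing to do" >&2
fi
```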

-Cole Bollig 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Friday, March 3, 2023 1:54 AM
To: Thomas Hartmann <thomas.hartmann@xxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Job marked as completed however output files not transferred back
 
Hello Experts,

Any further input please.


Thanks & Regards,
Vikrant Aggarwal


On Thu, Feb 23, 2023 at 2:57 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Thanks Thomas.

I don't have any shared filesystem on the preemptible cloud nodes, so it will not be possible to store the results of the post script (transferring back the output of the post script may suffer the same fate as the rest of the generated output files). Unfortunately this problem is not reproducible at will.

Does my hypothesis about the job being marked as completed without output transfer look plausible?

Hypothesis:

An edge case where the job finished its computation, but the worker node was preempted during the output transfer phase; because of the ON_EXIT condition, the job was still marked as completed.


Thanks & Regards,
Vikrant Aggarwal


On Thu, Feb 16, 2023 at 3:04 PM Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
Hi Vikrant,

just a quick hack, but maybe you can check/debug the status of the
output file transfer with a postcmd script?
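
A minimal sketch of such a check, assuming the script runs as the job's post command inside the job's scratch directory. The script name check_outputs.sh and the function check_outputs are hypothetical; the file names are the ones listed in TransferOutput in the original report. Presumably it can only confirm that the files exist before the transfer starts, not that the transfer itself succeeded.

```shell
#!/bin/bash
# check_outputs.sh (hypothetical name): report which expected output
# files are present in the current directory.
# Usage: check_outputs.sh FILE...

check_outputs() {
    _missing=0
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "present: $f"
        else
            echo "MISSING: $f"
            _missing=1
        fi
    done
    # Return 1 if at least one file is missing, 0 otherwise.
    return "$_missing"
}

if [ "$#" -gt 0 ]; then
    # Ignore the status here so this diagnostic never alters the
    # script's own exit code.
    check_outputs "$@" || true
fi
```

For the job in the original report this would be called with job.93.result, job.93.dir and coverage.93.zip as arguments. It might be wired in through the +PostCmd and +PostArguments submit attributes, with the script itself added to the input file transfer list; treat that wiring as an assumption to verify against your HTCondor version.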

Cheers,
   Thomas

On 16/02/2023 08.00, Vikrant Aggarwal wrote:
> Hello Experts,
>
> We are facing a strange situation where the job is marked as completed
> (C) in condor job history but the job has not transferred back the
> output files from worker to submit. Usually in this case the job should
> go into a held state.
>
> TransferOutput = "job.93.result,job.93.dir,coverage.93.zip"
> ShouldTransferFiles = "YES"
> WhenToTransferOutput = "ON_EXIT"
>
> In this setup the worker nodes run in the cloud as spot VMs, which
> trigger the following script when the node shuts down.
>
> #!/bin/bash
> /usr/sbin/condor_off -daemon master
>
>
> The cloud logs show that the VM (virtual machine) was preempted at
> approximately the same time the job was marked as completed.
>
> Hypothesis:
>
> An edge case where the job finished its computation, but the worker
> node was preempted during the output transfer phase; because of the
> ON_EXIT condition, the job was still marked as completed.
>
> Could the above assessment be right?
>
> If the suggestion is to use ON_EXIT_OR_EVICT, we can't use it, as it
> may fill up the spool very quickly at our scale.
>
> Is there any other suggestion to ensure that a job is not marked as
> completed when its output is missing?
>
>
> Thanks & Regards,
> Vikrant Aggarwal