
Re: [HTCondor-users] Job marked as completed however output files not transferred back



Hi Vikrant,

Just a quick hack, but maybe you could use a postcmd script to check/debug the status of the output file transfer?
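
A rough sketch of what such a check could look like, not a tested recipe. The function name check_outputs is made up for illustration and is not part of HTCondor. As far as I know, a postcmd runs in the job sandbox after the executable exits and before output transfer starts, so there it can only confirm that the files existed at that point, not that they reached the submit side. The same function could be run on the submit node after the job leaves the queue to catch the missing-output case.

```shell
#!/bin/bash
# Hypothetical helper (not an HTCondor tool): report whether each expected
# output exists. Regular files must be non-empty; directories only need to exist.
# Prints one status line per path and returns 1 if anything is missing or empty.
check_outputs() {
    rc=0
    for path in "$@"; do
        if [ -d "$path" ]; then
            echo "OK dir $path"
        elif [ -s "$path" ]; then
            echo "OK file $path"
        else
            echo "MISSING_OR_EMPTY $path"
            rc=1
        fi
    done
    return "$rc"
}

# Check whatever paths were passed on the command line (no-op without arguments).
check_outputs "$@"
```

With the file names from the original message, the call would be something like `./check_outputs.sh job.93.result job.93.dir coverage.93.zip` (the script name is again just an assumption).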

Cheers,
  Thomas

On 16/02/2023 08.00, Vikrant Aggarwal wrote:
Hello Experts,

We are facing a strange situation where a job is marked as completed (C) in the condor job history, but its output files have not been transferred back from the worker to the submit node. Usually in this case the job should go into a held state.

TransferOutput = "job.93.result,job.93.dir,coverage.93.zip"
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT"

In this setup the worker nodes run in the cloud as spot VMs, which trigger the following script when the node shuts down.

#!/bin/bash
/usr/sbin/condor_off -daemon master


The cloud logs show that the VM (virtual machine) was preempted at approximately the same time the job was marked as completed.
Hypothesis:

An edge case where the job finished its computation but the worker node was preempted during the output transfer phase, and because of the ON_EXIT condition the job was still marked as completed.

Could the above assessment be right?

If the suggestion is to use ON_EXIT_OR_EVICT, we can't use it, as it may fill up the spool very quickly at scale.

Any other suggestions to ensure that a job is not marked as completed with missing output?


Thanks & Regards,
Vikrant Aggarwal

