
Re: [HTCondor-users] Hardening against NFS failure



Justin,

One option would be to write a check that verifies the status of the
NFS mount and run it as a STARTD_CRON job (see
https://research.cs.wisc.edu/htcondor/manual/latest/4_4Hooks.html#SECTION00543000000000000000).
Your START expression could then use the attribute that the check
publishes. For example, if the attribute from the STARTD_CRON job is
nfsCheck_IsGood, then you can set

START = $(START) && nfsCheck_IsGood

That way, if the NFS check fails, those slots won't accept jobs until
the check passes again.
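
As a rough illustration, the check itself could be a small script along
these lines. This is only a sketch under assumptions, not a tested
production check: the sentinel path /mnt/nfs/.nfs_check_sentinel, the
10-second timeout, and the cron job name nfsCheck are all made up for the
example, and the STARTD_CRON knobs shown in the comments should be verified
against the manual for your version. The script runs ls in a child process
so that a hung NFS mount cannot block the check forever, then prints the
attribute for the startd to pick up.

```python
#!/usr/bin/env python3
# Sketch of an NFS health check for a STARTD_CRON job.
#
# Assumed condor_config on the execute node (hypothetical path and period;
# check the knob names against the manual for your HTCondor version):
#   STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) nfsCheck
#   STARTD_CRON_nfsCheck_EXECUTABLE = /usr/local/bin/nfs_check.py
#   STARTD_CRON_nfsCheck_PERIOD = 60s
import subprocess
import sys

# Hypothetical file that exists only while the share is mounted.
SENTINEL = "/mnt/nfs/.nfs_check_sentinel"
TIMEOUT_SECONDS = 10


def nfs_is_good(path, timeout=TIMEOUT_SECONDS):
    """Return True if `ls path` succeeds within `timeout` seconds."""
    try:
        proc = subprocess.Popen(
            ["ls", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return False
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # A hung mount can leave the child stuck; ask for it to be killed
        # but do not wait for it, so this check still returns promptly.
        proc.kill()
        return False


def classad_line(is_good):
    """Format the result as the attribute the START expression refers to."""
    return "nfsCheck_IsGood = {}".format("True" if is_good else "False")


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else SENTINEL
    print(classad_line(nfs_is_good(path)))
```

Checking a sentinel file rather than the mount directory itself is a
deliberate choice here: when a share drops, the empty mount point directory
usually still exists, so listing it would succeed.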

On Mon, Feb 27, 2017 at 11:36 AM, Justin Fisher <justin0419@xxxxxxxxx> wrote:
> Hi.
>
> Is there a way to keep a condor job running if an NFS mount goes down during
> that job?
>
> I'm using v8.6 and my submit file looks like this:
>
> Universe = vanilla
> Requirements = Arch == "X86_64" && TARGET.OpSys == "LINUX"
> Executable = /usr/share/ngspice_2016_08_05/bin/ngspice
> transfer_input_files = $(filename)
> Arguments = -o $Fdb(filename)_$Fn(filename).log $(filename)
> Should_transfer_files = Yes
> When_to_transfer_output = on_exit
> Request_memory = 8 GB
> Request_disk = 50 MB
> Request_cpus = 4
> accounting_group = group_ANALOG
> accounting_group_user = jfisher
> ## Log = log
> Queue filename from /some/file/location/condor.in
>
> The input file is on an NFS share and points to thousands of files on the
> same NFS share. The output is written to another directory on the same NFS
> share.
>
> I find that one of my machines is flaky and the NFS keeps dropping out. When
> that happens, many of the submitted jobs fail with messages saying they can't
> find the input file. Is there a way to divert these files onto a machine
> that is alive? Obviously, I need to find out why this machine keeps crashing
> NFS, but I'm wondering if there is a workaround while I do this?
>
> --
> Kind regards,
>
> Justin Fisher.



-- 
Ben Cotton
Technical Marketing Manager

Cycle Computing
Better Answers. Faster.

http://www.cyclecomputing.com
twitter: @cyclecomputing