Re: [HTCondor-users] Network filesystem failed to initialize logs
- Date: Thu, 29 Aug 2019 17:34:08 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Network filesystem failed to initialize logs
On 8/29/2019 11:55 AM, Christopher Harrison via HTCondor-users wrote:
> Are you using autofs to mount the glusterfs fuse mount? If so, my
> guess is you have a race condition whereby autofs is not mounting before
> the condor jobs show up. This happened to us a lot too (we use
> autofs). The way we got around this is by applying a precondition to
> the job (through a shell script) to touch a file in the directory.
>
> I hope this helps,
>     -C
I am very interested in the below problem and Christopher's wisdom about
how he fixed it...
Please correct me if I am misunderstanding: It sounds like you guys are
saying the first file access into a volume automounted by autofs can
fail. If the first file access is performed by the condor_starter in
order to create/write the job error or log files, the job ends up on
hold. Sounds like Christopher worked around this by having a shell
script run in advance of the job (somehow) which touches a file in the
directory... this touch operation may fail just like it does when the
condor_starter is trying to set up the error/log files, but nobody cares
because the whole point of the touch operation was just to kick autofs
into action. Do I have it right?
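To make sure we are talking about the same thing, here is a minimal sketch of that "poke the directory first" idea, written in Python rather than shell. The function name wake_automount and its attempts/delay parameters are made up for illustration; this is not Christopher's actual script and not anything HTCondor provides. It assumes that simply listing the directory is enough to trigger autofs, and it ignores early failures on purpose:

```python
import os
import time


def wake_automount(directory, attempts=5, delay=1.0):
    """Poke `directory` until it can be listed, or give up.

    Hypothetical helper: the first access may fail while autofs is
    still mounting. That failure is ignored, because the access itself
    is what is assumed to trigger the mount.
    """
    for attempt in range(attempts):
        try:
            os.listdir(directory)  # any access should kick autofs
            return True
        except OSError:
            # Not mounted yet (or a real error); wait and retry.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```

A wrapper would call something like this on the directory holding the error/log files before anything else touches it, and could carry on even if it returns False.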
P.S. A question: is the autofs mount a 'hard' mount (i.e. I/O to glusterfs
should block until it is performed successfully) or a 'soft' mount (i.e.
I/O to glusterfs will not block, but could instead quickly return an
error) ?
Thanks guys
Todd
>
> On 8/29/19 5:26 PM, João Baúto wrote:
>> Hi,
>>
>> Some of my users keep getting their jobs put on hold due to problems
>> with the initialization of the error and logs files. They are setting
>> the path of these files to the network filesystem (glusterfs mount via
>> fuse).
>>
>> The only way to fix this is to force an ls on the target directory and
>> then run condor_release.
>>
>> Any ideas on why this is happening?
>>
>> I'm running HTCondor v.8.8.4 on CentOS 7.6.
>>
>> Thanks!
>> *João Baúto*
>> ---------------
>> *Scientific Computing and Software Platform*
>> Champalimaud Research
>> Champalimaud Center for the Unknown
>> Av. Brasília, Doca de Pedrouços
>> 1400-038 Lisbon, Portugal
>> fchampalimaud.org <https://www.fchampalimaud.org/>
>>
>> _______________________________________________
>> HTCondor-users mailing list
>> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
>> subject: Unsubscribe
>> You can also unsubscribe by visiting
>> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>>
>> The archives can be found at:
>> https://lists.cs.wisc.edu/archive/htcondor-users/
>
>
> --
>
>
> Christopher Harrison
> Systems Engineer
> Department of Biostatistics & Medical Informatics
> University of Wisconsin School of Medicine and Public Health
> Office 240 Warf
> 610 Walnut Street
> Madison, WI 53726
> 608.3476.6967
>
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685