
Re: [HTCondor-users] HTCondor with Fscache





On 2025/11/21 5:02 am, gagan tiwari wrote:
Hi Guys,
           Anyone has any ideas / advice on this? Plz let me know
Thanks,
Gagan

Hi Gagan,

Just my random thoughts ...

At ESAT we've been using fscache (NFS client-side caching) for many years now.
And indeed, the data 'transfer' (not in the HTCondor sense) for the first job to access
larger data sets takes significantly longer, as it primes the cache.  So it would
make sense to raise the RANK of that compute node to attract similar jobs.

However, I've always found this impractical, for several reasons:

- HTCondor has zero knowledge of what data a job using NFS has actually
  read. One of the big disadvantages of using NFS...

- One could somehow collect the mounts and try to make some sense out of them, but ...
  it doesn't make sense.  A job that crashes immediately might have mounted, but never
  primed the cache.  The cache has its own policy for getting stuff out; the cache
  does not (easily) reveal exactly which files are in there... Some jobs mount
  stuff from all over the place for just a small config file. Some jobs
  access only a fraction of bigger data sets, which means the data doesn't end up
  in the cache at all...

- So the only way to infer this knowledge is to use information from the job
  description; that could be done relatively easily, I think, but would
  again be hit and miss...

  In that case, I'd have the user specify some parameter in the JDF, like
  Needs_NFS_XYZZY = True;   The machine would then run such a job and set
  Has_seen_NFS_XYZZY = <seconds ago>.  And then modify the RANK expression
  so that machines with a low value for Has_seen_NFS_XYZZY attract jobs
  with Needs_NFS_XYZZY.

  But again, at least in our situation, it's not worth the effort.  Especially
  as near-identical jobs would be 'pipelined' to certain machines anyhow.
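
To make the Needs_NFS_XYZZY / Has_seen_NFS_XYZZY idea a bit more concrete, here is
a toy model of the ranking logic in plain Python.  It is not HTCondor code and
uses no HTCondor API; the ads are just dicts, and the function name cache_rank,
the 'horizon' cutoff and the scoring formula are all my own assumptions for
illustration.  Only the two attribute names come from the text above.

```python
# Toy model of the ranking idea; NOT HTCondor code.
# Attribute names follow the post; cache_rank, horizon and the
# scoring formula are hypothetical.

def cache_rank(job_ad, machine_ad, dataset="XYZZY", horizon=86400):
    """Return a score; higher means the machine is a better match.

    A machine that saw the data set recently (low 'seconds ago')
    scores close to `horizon`; one that never saw it scores 0.
    """
    if not job_ad.get(f"Needs_NFS_{dataset}", False):
        return 0  # job does not care about this data set
    seconds_ago = machine_ad.get(f"Has_seen_NFS_{dataset}")
    if seconds_ago is None:
        return 0  # machine never ran such a job
    return max(0, horizon - seconds_ago)


job = {"Needs_NFS_XYZZY": True}
machines = {
    "node01": {"Has_seen_NFS_XYZZY": 120},    # primed two minutes ago
    "node02": {"Has_seen_NFS_XYZZY": 50000},  # primed long ago
    "node03": {},                             # never primed
}

# Pick the machine with the highest score for this job.
best = max(machines, key=lambda name: cache_rank(job, machines[name]))
```

In a real pool the same preference would have to be written as a ClassAd
expression, and it would still be a guess about what is actually in the cache.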

But if you only deal with a very well known set of jobs, it might be different...
You could even make a dummy job just to prime the cache, if nothing else can interfere.
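
Such a dummy job could be as simple as reading every file once.  A minimal
sketch, assuming that reading a file in full over the NFS mount is enough to
get it into the cache (which depends on your cache policy and available space);
the function name prime_cache and the chunk size are hypothetical:

```python
import os


def prime_cache(root, chunk_size=1024 * 1024):
    """Read every regular file under `root` once; return total bytes read.

    Assumption: a full sequential read pulls the file into the client cache.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        total += len(chunk)  # data is discarded, only counted
            except OSError:
                continue  # unreadable file: skip it, keep priming the rest
    return total
```

You would call it with the mount point of the data set as `root`; the return
value is only there so the job log shows how much was read.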

Cheers, B.