
Re: [HTCondor-users] Q: BLAH configuration for non-shared submission to Slurm?



To the best of my knowledge, SLURM has no facilities for transferring job files between machines. It assumes you have a shared filesystem for all job files. That's why you don't see any directives in slurm_submit.sh.
The BLAHP doesn't copy job files from a local filesystem to a shared one on the submit machine. It should probably give an error if it detects that job files are on a local filesystem and the batch system can't move them, but that currently doesn't happen.

For your current testing, all of the job files (including the original job script) should be on the shared filesystem. In your ultimate setup, the HTCondor spool directory will need to be on the shared filesystem on your custom Scarf node. Also, submission from the other HTCondor node will have to include spooling of job files (either Condor-C or condor_submit -remote).
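As a rough sketch of what the spooled submission could look like (the schedd name and file names below are placeholders, not your actual setup):

```shell
# Hypothetical submit description file, job.sub:
#   executable = myjob.sh
#   output     = myjob.out
#   error      = myjob.err
#   log        = myjob.log
#   should_transfer_files   = YES
#   when_to_transfer_output = ON_EXIT
#   queue

# Submit from the external HTCondor node to the schedd on the Scarf
# node; -remote implies spooling of the input files to that schedd
# ("scarf-submit.example.org" is a placeholder hostname):
condor_submit -remote scarf-submit.example.org job.sub

# Later, fetch the output files back from the remote schedd's spool:
condor_transfer_data -name scarf-submit.example.org -all
```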

 - Jaime

On Jul 30, 2018, at 11:20 AM, Brian Ritchie - UKRI STFC <brian.ritchie@xxxxxxxxxx> wrote:

I'm trying to use HTCondor to submit jobs to our Scarf HPC. At
present, this uses Platform LSF, and (following initial work by Andrew
Lahiff) I've managed to get this to work (to some extent). However,
Scarf is replacing Platform LSF with Slurm, and I'm having trouble
getting submission to work with Slurm in the case where the jobscript
is in a directory that is not shared with the worker nodes. (I am
submitting from a custom Scarf node that has Condor
installed. Ultimately, jobs will be submitted to this node from an
HTCondor node that is external to Scarf, so sharing won't be an
option.)
 
The problem seems to be that the jobscript that is generated by BLAH's
slurm_submit.sh assumes that the original jobscript has been copied to
a (unique) filename in a sandbox folder, but the copy never happens.
The lsf_submit.sh script generates BSUB directives that (I think)
instruct LSF to perform the initial copy, but I see no equivalent in
slurm_submit.sh.
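For reference, the directives I mean use LSF's "-f" file-transfer option, along these lines (the filenames here are illustrative, not the actual names the script generates):

```shell
#!/bin/bash
# Sketch of the kind of BSUB header lsf_submit.sh emits. LSF's -f
# option copies a file between the submission and execution hosts:
#   ">"  copies local file to the execution host before the job runs
#   "<"  copies the remote file back after the job finishes
#BSUB -f "/home/user/jobscript_XXXX > jobscript_XXXX"
#BSUB -f "/home/user/out_XXXX < out_XXXX"
```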
 
None of this is reflected in the files created by HTCondor: the log
file implies that the job ran OK (but consumed no resources), and the
output and error files are always empty. Only by modifying the blah
scripts to log to somewhere other than /dev/null (and copying the
generated jobscripts to file) was I able to get more information about
what was going wrong!
 
batch_gahp.config has many options for defining which directories are
shared, and for overriding default locations for sandboxes etc. I have
tried numerous permutations, to no avail.
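These are the kinds of settings I have been permuting; the option names below are from my copy of batch_gahp.config (please check against the comments in yours) and the values are only examples:

```shell
# Sketch of batch_gahp.config settings relevant here; values are
# examples only, option names may differ between BLAH versions.

# Colon-separated list of directories assumed to be shared between
# the submit host and the worker nodes:
blah_shared_directories=/home:/scratch

# Where the Slurm commands (sbatch, squeue, scontrol) live:
slurm_binpath=/usr/bin

# Useful for debugging: save the generated submit files here instead
# of discarding them:
blah_debug_save_submit_info=/tmp/blah_debug
```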
 
Is there a better guide to configuration than the comments in batch_gahp.config?
What special considerations are required for Slurm?
 
Thanks,
  Brian