I'm trying to use HTCondor to submit jobs to our Scarf HPC cluster. At present Scarf runs Platform LSF, and (building on initial work by Andrew Lahiff) I've managed to get submission working, at least to some extent. However, Scarf is replacing Platform LSF with Slurm, and I'm having trouble getting submission to work with Slurm when the jobscript is in a directory that is not shared with the worker nodes. (I am submitting from a custom Scarf node that has Condor installed. Ultimately, jobs will be submitted to this node from an HTCondor node external to Scarf, so a shared filesystem won't be an option.)

The problem seems to be that the jobscript generated by BLAH's slurm_submit.sh assumes the original jobscript has been copied to a (unique) filename in a sandbox folder, but that copy never happens. The lsf_submit.sh script generates BSUB directives that (I think) instruct LSF to perform the initial copy, but I see no equivalent in slurm_submit.sh.

None of this is reflected in the files created by HTCondor: the log file implies that the job ran OK (but consumed no resources), and the output and error files are always empty. Only by modifying the blah scripts to log somewhere other than /dev/null (and to keep copies of the generated jobscripts) was I able to get more information about what was going wrong.

batch_gahp.config has many options for defining which directories are shared, and for overriding the default locations of sandboxes and so on. I have tried numerous permutations, to no avail (see the excerpt after my sign-off for the sort of thing I've been changing).

Is there a better guide to configuration than the comments in batch_gahp.config? What special considerations are required for Slurm?

Thanks,
Brian
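P.S. For reference, the relevant part of my batch_gahp.config currently looks roughly like the excerpt below. The values are placeholders (and I've tried many variations of them, so please don't read too much into the specific paths); I'm only including it to show which options I've been experimenting with.

    # Batch system in use and location of the Slurm client commands
    supported_lrms=slurm
    slurm_binpath=/usr/bin

    # Directories that BLAH treats as shared between the submit node and
    # the worker nodes. The directory holding the jobscript is deliberately
    # NOT in this list -- that is the case I cannot get to work.
    blah_shared_directories=/home:/users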