Re: [HTCondor-devel] Advice on appropriate design / use of HT Condor


Date: Wed, 21 Oct 2020 10:56:05 -0500
From: Mátyás Selmeci <matyas@xxxxxxxxxxx>
Subject: Re: [HTCondor-devel] Advice on appropriate design / use of HT Condor

Hello,


The "nobody" user doesn't have a login shell, which is why condor_ssh_to_job fails with "this account is not available".

The "Who Jobs Run As" section of the manual details how HTCondor decides which user a job's processes will run as.

In brief, you have two options:

  1. If you have a way of synchronizing the users between the submit node and the execute nodes, such as with LDAP or Puppet, then you can have the jobs run as the user that submitted them. You can tell HTCondor to do this by setting UID_DOMAIN to the same value on the submit node and the execute nodes. (UID_DOMAIN can be anything but the convention is to use the domain of the hosts, so if the submit host is submit.wisc.edu and the execute host is exec1.wisc.edu, then we'd set UID_DOMAIN to wisc.edu. Again, this is just convention.)
  2. If you do not synchronize users, then you should create slot users, which are dedicated users for running jobs. Create users like slot1, slot2, slot3, etc., making sure they have shells. Then set the following variables in the configuration for your execute nodes:
    DEDICATED_EXECUTE_ACCOUNT_REGEXP = slot[0-9]+
    SLOT1_USER = slot1
    SLOT2_USER = slot2
    SLOT3_USER = slot3
    ...
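
If you have many slots, you may not want to type those lines by hand. Here is a minimal Python sketch that generates them. The function name and the "slot" prefix are only illustrative; it assumes the accounts (slot1, slot2, ...) already exist on the execute node and that you paste the output into that node's configuration yourself.

```python
def slot_user_config(num_slots: int, prefix: str = "slot") -> str:
    """Return config lines that map each slot to a dedicated user account.

    Assumes accounts named <prefix>1 .. <prefix>N already exist and have shells.
    """
    if num_slots < 1:
        raise ValueError("num_slots must be at least 1")
    # One regexp covering every dedicated account, then one line per slot.
    lines = [f"DEDICATED_EXECUTE_ACCOUNT_REGEXP = {prefix}[0-9]+"]
    lines += [f"SLOT{n}_USER = {prefix}{n}" for n in range(1, num_slots + 1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(slot_user_config(3))
```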


HTCondor cannot create directories for you. If you need to do that, I suggest that you look at DAGMan Workflows; declare your job as a node, then write a script that will arrange the inputs as you need and set that as the PRE script for the node.
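
As a rough illustration of the PRE script idea, something like the following could work. All of the names here (prepare_inputs.py, run_app.sub, the node name, the inputs/outputs layout) are made up for the example and are not taken from your setup. Keep in mind that DAGMan runs PRE scripts on the submit node, not on the execute node.

```python
import sys
from pathlib import Path


def prepare_dirs(base, counterparty):
    """Create <base>/<counterparty>/inputs and .../outputs; return the job dir."""
    job_dir = Path(base) / counterparty
    for sub in ("inputs", "outputs"):
        # exist_ok lets the script be re-run safely if the node is retried.
        (job_dir / sub).mkdir(parents=True, exist_ok=True)
    return job_dir


def dag_text(node, submit_file, pre_script, counterparty):
    """Return a one-node DAG description that runs pre_script before the job."""
    return (
        f"JOB {node} {submit_file}\n"
        f"SCRIPT PRE {node} {pre_script} {counterparty}\n"
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation by DAGMan: prepare_inputs.py <counterparty>
    prepare_dirs(Path.cwd(), sys.argv[1])
```

Writing the output of dag_text("A", "run_app.sub", "prepare_inputs.py", "cpty1") to a file and passing that file to condor_submit_dag would run the script before node A's job is submitted.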

DAGMan is very powerful and can let you chain jobs together or run scripts once all the jobs in a node finish. If you don't need all that, you can also just write a script that will arrange the inputs and call condor_submit at the end.
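
For the simpler route, a wrapper along these lines might be enough. This is only a sketch: the function name and the directory arguments are assumptions, and the "run" parameter exists just so the submit step can be swapped out when testing on a machine without HTCondor.

```python
import subprocess
from pathlib import Path


def stage_and_submit(submit_file, input_dirs, run=subprocess.run):
    """Create the input directories, then hand the submit file to condor_submit."""
    for d in input_dirs:
        Path(d).mkdir(parents=True, exist_ok=True)
    # Copy or download your input files into the directories here.
    cmd = ["condor_submit", str(submit_file)]
    # check=True raises CalledProcessError if condor_submit exits non-zero.
    run(cmd, check=True)
    return cmd


# Hypothetical usage:
#   stage_and_submit("run_app.sub", ["cpty1/inputs", "cpty1/outputs"])
```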


Hope this helps. In the future, I suggest that you send mail to htcondor-users@xxxxxxxxxxx instead of htcondor-devel, since you will receive quicker responses from the community, and others in the community that have similar problems can benefit from your experience.


-Mat

On 10/19/20 4:43 PM, htcondor-devel@xxxxxxxxxxx wrote:
Hi there,

This is my first post.

I've been experimenting with a POC of an HTCondor (v8.8.1) cluster on GCP (CentOS 7) to run a C++ application that delivers a Monte Carlo simulation framework for contemporary financial risk analytics and value adjustments.

I have set up the cluster and can run simple tests on it. It utilises a machine image as the base machine for the cluster with the relevant compiled code (as this process takes 1-2 hours to run).

Currently the legacy application conducting this analysis can take up to 8 hours and this POC is aiming to provide a way to dramatically reduce the runtime and also produce additional analyses.

The current data/job flow I have been using doesn't work:
* Submit job (and transfer credentials)
* Download analysis specification, product portfolio and market input files from Google storage per counterparty (via gsutil)
* Run C++ app with initial input files
* Write outputs (e.g. Monte Carlo simulation outputs) to local dir
* Upload back to google storage per counterparty

I have made sure I can run the app on the condor-compute node if I SSH directly to it.

Some questions ... perhaps you can help me understand how HTCondor works?
* Can I specify under which user the jobs run? Currently it's running as the user "nobody" and permissions are at least one of the problems. Can I run as another user with the correct permissions? I haven't been able to find information on this.
* Does HTCondor allow what I am trying to do: create subdirectories to pull inputs from, then write to a directory, then upload to GCP? Most of the examples I've read require passing all of the files at the time of job submission.
* Finally, I tried to debug in interactive mode as per a tutorial but receive the notice "this account is not available" - I couldn't find information on this.

Overall, this is likely both a permission issue with the "nobody" user and perhaps an environment variable issue.

Your help is most appreciated.

Regards,

Forde





_______________________________________________
HTCondor-devel mailing list
HTCondor-devel@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-devel