
Re: [HTCondor-users] can't get flocking to submit jobs



Thank you very much. I'll try it.

Regards

On Mon, Apr 1, 2024, 6:11 PM Jaime Frey via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
I see that our documentation about flocking is confusing and the configuration details are out-of-date. I will need to work on improving those. In the meantime, I will give a better explanation here.

Flocking is a way for an Access Point (i.e. a condor_schedd) to find machines to run all of its jobs in HTCondor pools beyond its local one. It's configured by the administrator; the users don't have to do anything special. Most of your post describes how a user can directly submit individual jobs to an Access Point in a remote pool, which is a different (and usually inferior) process.

For each access point that should flock to another pool, you need to do two things:
1) Tell the schedd where it should flock
2) Give the schedd permission to join the remote pool

In the following example, let's say you want the schedd at machine submit.org1.edu to flock to the pool whose Central Manager is cm.org2.edu.

For step 1, you set FLOCK_TO in the schedd's configuration to name the collector of the remote pool. For example:

 FLOCK_TO = cm.org2.edu

For step 2, the easiest thing to do is create an IDToken at cm.org2.edu, give it to the flocking schedd, and add the IDToken's identity to the ADVERTISE_SCHEDD authorization list.

To create the IDToken, run this command:

 condor_token_create -identity condor@xxxxxxxxxxxxxxx

Then, write the output of the command to a file in /etc/condor/tokens.d/ on the Access Point. This is a secret, so it should not be publicly readable (file should be owned by root with no group or world access permissions).
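
The token-installation step above can be sketched as a small shell function to run on the Access Point. This is only an illustration: the helper name install_idtoken and the token file name in the usage line are hypothetical, and the ownership change is attempted only when the function runs as root.

```shell
#!/bin/sh
# install_idtoken SRC [DEST_DIR]
# Copy a token file into the tokens directory with owner-only permissions.
# DEST_DIR defaults to /etc/condor/tokens.d, the directory named above.
install_idtoken() {
    src=$1
    dest_dir=${2:-/etc/condor/tokens.d}
    dest="$dest_dir/$(basename "$src")"

    # Copy under a restrictive umask so a newly created file is never
    # group- or world-readable, not even briefly.
    ( umask 077 && cp "$src" "$dest" ) || return 1

    # Force owner-only access in case the destination file already existed.
    chmod 600 "$dest" || return 1

    # The token should be owned by root; only root can change ownership.
    if [ "$(id -u)" -eq 0 ]; then
        chown 0:0 "$dest" || return 1
    fi

    # Report where the token was installed.
    printf '%s\n' "$dest"
}
```

For example, after copying the token from cm.org2.edu to the Access Point as ./flock-cm.org2.edu (a made-up file name), run install_idtoken ./flock-cm.org2.edu as root.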

Finally, give the identity of the token permission to join the pool as an Access Point. Add the following line to the configuration files on cm.org2.edu:

 ALLOW_ADVERTISE_SCHEDD = $(ALLOW_ADVERTISE_SCHEDD) condor@xxxxxxxxxxxxxxx

Once everything is done, do a condor_reconfig on both machines.
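
As a rough way to check the result on the Access Point, the reconfig can be followed by a look at the effective FLOCK_TO value and at the schedd ads known to the remote collector. The function name check_flocking_ap is hypothetical and the default host is the example name used above; the sketch only wraps the standard condor_reconfig, condor_config_val and condor_status commands.

```shell
#!/bin/sh
# check_flocking_ap [REMOTE_CM]
# Run on the Access Point (submit.org1.edu in the example above).
check_flocking_ap() {
    remote_cm=${1:-cm.org2.edu}

    # Ask the local daemons to re-read their configuration files.
    condor_reconfig || return 1

    # Show the value the schedd will use; it should name the remote collector.
    flock_to=$(condor_config_val FLOCK_TO) || return 1
    printf 'FLOCK_TO = %s\n' "$flock_to"

    case "$flock_to" in
        *"$remote_cm"*) ;;
        *)
            printf 'warning: %s is not listed in FLOCK_TO\n' "$remote_cm" >&2
            return 1
            ;;
    esac

    # List the schedd ads in the remote pool. The local schedd only shows up
    # here once it has started advertising to that collector.
    condor_status -schedd -pool "$remote_cm"
}
```

If the Access Point never appears in the remote pool's schedd list even with unmatched idle jobs in the queue, the token or the ALLOW_ADVERTISE_SCHEDD entry is the first thing to re-check.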

When the schedd at submit.org1.edu has jobs that can't be matched in its local pool (say because the pool is full running other jobs), it will start advertising to the collector at cm.org2.edu and can start receiving matches for machines in that pool.

I hope that's enough information for you to get flocking working.

 - Jaime

On Mar 29, 2024, at 7:11 AM, mohammed shambakey <shambakey1@xxxxxxxxx> wrote:

Hi

I'm new to htcondor, and I need to set up flocking between 2 htcondor pools (SRTA and ASU).

I tried to follow the instructions at "https://htcondor.readthedocs.io/en/latest/grid-computing/connecting-pools-with-flocking.html#flocking-configuration" to set up flocking. I added all the configuration variables in the previous site.
I also added variables "SCHED_NAME=headnode@" and "SHCED_NAME=asuslrhd@" in the respective local configuration files. I made "ALLOW_WRITE=*" in both configuration files.

I tried to use "condor_fetch_token -remote <remote node> -token <file>" at each site for a test user, but it failed telling me that it couldn't find that daemon. So, I just used "condor_token_create" at each site, and copied the token file to the other site (I'm not sure what I did is right, but I'm trying anything now).

Currently, I can use something like "condor_status -pool <remote pool>", but when I try to submit a job with "condor_submit -remote asuslrhd@ .. ", it fails telling me "ERROR: Can't find address of schedd asuslrhd@".

I tried to look for any tutorial on setup and usage of flocking, but what I found was more of a presentation rather than a detailed step-by-step.

Please help

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/