I see that our documentation about flocking is confusing and the configuration details are out of date. I will need to work on improving those. In the meantime, I will give a better explanation here.
Flocking is a way for an Access Point (i.e. a condor_schedd) to find machines to run all of its jobs in HTCondor pools beyond its local one. It's configured by the administrator; the users don't have to do anything special. Most of your post describes how a user can directly submit individual jobs to an Access Point in a remote pool, which is a different (and usually inferior) process.
For each access point that should flock to another pool, you need to do two things:
1) Tell the schedd where it should flock
2) Give the schedd permission to join the remote pool
In the following example, let's say you want the schedd at machine submit.org1.edu to flock to the pool whose Central Manager is cm.org2.edu.
For step 1, you set FLOCK_TO in the schedd's configuration to name the collector of the remote pool. For example:
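A minimal sketch of that setting, placed in the configuration of submit.org1.edu (which configuration file it goes in is up to the administrator):

```
# On submit.org1.edu: name the collector of the remote pool
FLOCK_TO = cm.org2.edu
```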
For step 2, the easiest thing to do is create an IDToken at cm.org2.edu, give it to the flocking schedd, and add the IDToken's identity to the ADVERTISE_SCHEDD authorization list.
To create the IDToken, run this command:
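A sketch of the command, run as root on cm.org2.edu. The identity name flock-submit.org1.edu@cm.org2.edu is a made-up example; any name will do, as long as the same one is used when granting permission on the Central Manager.

```shell
# On cm.org2.edu, as root. The identity below is a hypothetical example.
condor_token_create -identity flock-submit.org1.edu@cm.org2.edu
```

The token is printed to standard output. condor_token_create also accepts a -token <filename> option if you would rather have it written to a file.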
Then, write the output of the command to a file in /etc/condor/tokens.d/ on the Access Point. This is a secret, so it should not be publicly readable (file should be owned by root with no group or world access permissions).
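One way to put the token in place with the right permissions is sketched below. The helper name install_token and the token file name are hypothetical; only the /etc/condor/tokens.d/ directory comes from the description above.

```shell
# Hypothetical helper: copy a token file into a tokens directory so that
# only the file's owner can read or write it (mode 600).
install_token() {
    src="$1"        # file holding the output of the token creation command
    dest_dir="$2"   # normally /etc/condor/tokens.d on the Access Point
    install -m 600 "$src" "$dest_dir/$(basename "$src")"
}

# Typical use, run as root on submit.org1.edu so that root owns the file
# (the file name is an assumption):
#   install_token ./flock-cm.org2.edu.token /etc/condor/tokens.d
```

Running it as root gives a root-owned file with no group or world access.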
Finally, give the identity of the token permission to join the pool as an Access Point. Add the following line to the configuration files on cm.org2.edu:
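A sketch of that line, assuming the token was created with the hypothetical identity flock-submit.org1.edu@cm.org2.edu (substitute whatever identity your token actually carries):

```
# On cm.org2.edu: allow the token's identity to advertise a schedd.
# The identity is a hypothetical example and must match the token's.
ALLOW_ADVERTISE_SCHEDD = $(ALLOW_ADVERTISE_SCHEDD) flock-submit.org1.edu@cm.org2.edu
```

The $(ALLOW_ADVERTISE_SCHEDD) reference keeps whatever entries are already in that list.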
Once everything is done, do a condor_reconfig on both machines.
When the schedd at submit.org1.edu has jobs that can't be matched in its local pool (say because the pool is full running other jobs), it will start advertising to the collector at cm.org2.edu and can start receiving matches for machines in that pool.
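One way to check that this is happening, assuming submit.org1.edu currently has idle jobs it cannot match locally, is to ask the remote collector which schedds it knows about. This is a suggested check, not a required step:

```shell
# Run from any machine allowed to query cm.org2.edu
condor_status -schedd -pool cm.org2.edu
```

If the schedd from submit.org1.edu appears in the output, it is advertising to the remote pool.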
I hope that's enough information for you to get flocking working.
- Jaime
Hi
I'm new to HTCondor, and I need to set up flocking between two HTCondor pools (SRTA and ASU).
I also added the variables "SCHED_NAME=headnode@" and "SCHED_NAME=asuslrhd@" in the respective local configuration files. I set "ALLOW_WRITE=*" in both configuration files.
I tried to use "condor_fetch_token -remote <remote node> -token <file>" at each site for a test user, but it failed, telling me that it couldn't find that daemon. So I just used "condor_token_create" at each site and copied the token file to the other site (I'm not sure what I did is right, but I'm trying anything now).
Currently, I can use something like "condor_status -pool <remote pool>", but when I try to submit a job with "condor_submit -remote asuslrhd@ .. ", it fails, telling me "ERROR: Can't find address of schedd asuslrhd@".
I tried to look for a tutorial on the setup and usage of flocking, but what I found was more of a presentation than a detailed step-by-step guide.
Please help