
Re: [HTCondor-users] Failed to start DRAIN_JOBS command



You need to copy the token you created onto the machine where you wish to run the condor_drain command. If you run the command as root, then the token should go where

 

condor_config_val SEC_TOKEN_SYSTEM_DIRECTORY

 

is configured. This defaults to /etc/condor/tokens.d for a rootly HTCondor install on Linux.

 

If you do not run the command as root, then you should copy the token into your personal token directory, which is wherever SEC_TOKEN_DIRECTORY is configured, or ~/.condor/tokens.d by default.

 

If the token is in the correct directory, then condor_token_list will show it.  If you forget all of this

 

condor_token_list -help

 

will remind you.
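
As a rough sketch of the lookup described above (not an official procedure), the script below asks condor_config_val for the relevant knob depending on whether it runs as root. The helper names `knob_or_default` and `token_dir_for_uid` are made up for this example, and the only hard-coded fallback is the /etc/condor/tokens.d default mentioned above.

```shell
#!/bin/sh
# Print the value of an HTCondor config knob, or the given fallback when
# condor_config_val is unavailable or the knob is not defined.
# (Hypothetical helper, not part of HTCondor.)
knob_or_default() {
    knob=$1
    fallback=$2
    if command -v condor_config_val >/dev/null 2>&1; then
        if value=$(condor_config_val "$knob" 2>/dev/null) && [ -n "$value" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    fi
    printf '%s\n' "$fallback"
}

# Pick the directory by numeric uid: 0 (root) uses the system token
# directory, any other uid uses the personal one.
token_dir_for_uid() {
    if [ "$1" -eq 0 ]; then
        knob_or_default SEC_TOKEN_SYSTEM_DIRECTORY /etc/condor/tokens.d
    else
        # No fallback here: an empty result means the knob could not be read.
        knob_or_default SEC_TOKEN_DIRECTORY ""
    fi
}

dir=$(token_dir_for_uid "$(id -u)")
if [ -n "$dir" ]; then
    echo "Copy the token into: $dir"
else
    echo "SEC_TOKEN_DIRECTORY could not be resolved on this machine"
fi
```

Once the token file is in the reported directory, condor_token_list should show it.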

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Curtis Spencer via HTCondor-users
Sent: Wednesday, October 26, 2022 5:51 PM
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Failed to start DRAIN_JOBS command

 

Thanks for the clarification! I have copied the token signing key from the central manager to each machine in the pool and have created a token on each machine in the pool using:

 

`condor_token_create -identity condor@condor > /etc/condor/tokens.d/test`

 

However, I'm not sure how to use that token to run the drain command from the central manager. Do I need to copy the token I generated from the signing key back to the central manager? If so, where do I copy it to?

 

Thanks,

 

Curtis

 

On Fri, Oct 14, 2022 at 8:34 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

The condor_drain command requires authentication between the machine where you run the command and the machine that will drain.   get_htcondor will not automatically set things up so that you can use IDTOKEN authentication for this.

 

It will not matter if the machine you are trying to drain has a token. In order to run the drain command you need a token on the machine running the command that was signed by the signing key that the machine you are trying to drain has access to. get_htcondor will set things up so that you can use condor tools to send commands to the central manager, but not to other machines.

 

If you want to use IDTOKEN auth to send drain commands from a central location to all of the machines on your pool, you will need to put a token signing key on each machine in the pool (the same key name and value) and create a token signed by that key to use to run the drain command.
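
A dry-run sketch of that flow, which only prints the commands rather than running them, might look like the following. The key file path, the token file name `drain`, the use of scp, and the host names are all assumptions for illustration; adjust them to your pool. The function name `plan_drain_setup` is likewise made up.

```shell
#!/bin/sh
# Assumed location of the pool signing key; check your own configuration.
KEY_FILE=${KEY_FILE:-/etc/condor/passwords.d/POOL}
# Hypothetical file name for the token used to run condor_drain as root.
TOKEN_FILE=${TOKEN_FILE:-/etc/condor/tokens.d/drain}

# Print (do not execute) the steps for the given execute hosts.
plan_drain_setup() {
    # 1. Same signing key, same name and value, on every machine to be drained.
    for host in "$@"; do
        echo "scp -p $KEY_FILE root@$host:$KEY_FILE"
    done
    # 2. One token signed by that key, stored on the machine that runs condor_drain.
    echo "condor_token_create -identity condor@condor > $TOKEN_FILE"
    # 3. Drain each machine from that central location.
    for host in "$@"; do
        echo "condor_drain $host"
    done
}

plan_drain_setup node1.example.org node2.example.org
```

Printing the plan first makes it easy to review the key distribution before anything touches the pool.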

 

-tj

 

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Curtis Spencer via HTCondor-users
Sent: Thursday, October 13, 2022 1:19 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Failed to start DRAIN_JOBS command

 

I recently created an HTCondor pool using the setup instructions here: https://htcondor.readthedocs.io/en/latest/getting-htcondor/install-linux-as-root.html

 

I am able to run `condor_submit` and run jobs in the cluster but when I run `condor_drain <example.domain>` I get the following error:

 

```

Attempt to send DRAIN_JOBS to startd <192.168.5.111:9618?addrs=192.168.5.111-9618&alias=blade11.ccb&noUDP&sock=startd_1258_763b> failed
Failed to start DRAIN_JOBS command to slot1_4@<example.domain>

```

This happens regardless of which machine I try to drain.

 

I am new to token authentication. I have read the documentation here: https://htcondor.readthedocs.io/en/latest/admin-manual/security.html#token-authentication and verified that the machine I am trying to drain has a token (I'm assuming that was created automatically during the setup) and that the `tokens.d` directory has read/write only for the root user.

 

It appears that the token on the master machine (which I am sending the command from) and the token on the machine I am trying to drain don't match (not sure if they should?):

 

master

```

~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root   31 Aug  3 10:34 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:23 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:09 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:10 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:10 tokens.d

 

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659467401,"iss":"condor","jti":"e2d9e9621119863a1103bfeccfe9e9a5","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor

```

 

drain

```

~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root   31 Aug  3 10:36 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:15 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:42 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:43 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:43 tokens.d

 

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659469380,"iss":"condor","jti":"09196bb17bdb987ed400102d4328c5ab","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor

```

 

I have tried requesting a new token from the machine I am trying to drain:

```

~# condor_token_request
Token request enqueued.  Ask an administrator to please approve request 3059220.

```

 

But when I tried to approve the request from the master machine I got this error:

```

~# condor_token_request_approve -reqid 3059220
Remote daemon did not provide information for request ID 3059220.

```

I feel like something still isn't configured quite right and that the problem with `condor_drain` is a symptom of that, but I'm not sure what or how to fix it.

 

Any help would be appreciated, thanks!