Re: [HTCondor-users] Failed to start DRAIN_JOBS command



Thanks for the clarification! I have copied the token signing key from the central manager to each machine in the pool and have created a token on each machine in the pool using:

`condor_token_create -identity condor@condor > /etc/condor/tokens.d/test`

However, I'm not sure how to use that token to run the drain command from the central manager. Do I need to copy the token I generated from the signing key back to the central manager? If so, where do I copy it to?

Thanks,

Curtis

On Fri, Oct 14, 2022 at 8:34 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:

The condor_drain command requires authentication between the machine where you run the command and the machine that will drain. get_htcondor will not automatically set things up so that you can use IDTOKEN authentication for this.

It will not matter if the machine you are trying to drain has a token. In order to run the drain command you need a token, on the machine running the command, that was signed by a signing key that the machine you are trying to drain has access to. get_htcondor will set things up so that you can use condor tools to send commands to the central manager, but not to other machines.

If you want to use IDTOKEN auth to send drain commands from a central location to all of the machines on your pool, you will need to put a token signing key on each machine in the pool (the same key name and value) and create a token signed by that key to use to run the drain command.
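
As a rough sketch of the two steps above (not part of the original reply), the script below only prints the commands it would run, so nothing is copied or created until they have been reviewed. The host names, the key path `/etc/condor/passwords.d/POOL`, and the token file name `drain` are assumptions for illustration. The key name `POOL` is taken from the `kid` field in the `condor_token_list` output quoted further down, and the redirect into `/etc/condor/tokens.d/` mirrors the `condor_token_create` command at the top of this message.

```shell
#!/bin/sh
# Sketch: print (do not run) the commands for sharing one signing key
# across a pool and creating a token from it on the central manager.

# Assumed values; adjust for the real pool.
KEY_NAME="POOL"                                  # "kid" seen in condor_token_list
KEY_FILE="/etc/condor/passwords.d/${KEY_NAME}"   # assumed signing key location
TOKEN_FILE="/etc/condor/tokens.d/drain"          # hypothetical token file name
HOSTS="node1.example node2.example"              # hypothetical execute machines

plan_commands() {
    # Step 1: give every machine the same key name and value.
    for host in $HOSTS; do
        echo "scp -p ${KEY_FILE} root@${host}:${KEY_FILE}"
    done
    # Step 2: on the central manager, create a token signed by that key.
    echo "condor_token_create -identity condor@condor -key ${KEY_NAME} > ${TOKEN_FILE}"
}

plan_commands
```

The printed commands are meant to be run as root on the central manager, since that is the machine sending the drain command and therefore the one that needs the token.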

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Curtis Spencer via HTCondor-users
Sent: Thursday, October 13, 2022 1:19 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Failed to start DRAIN_JOBS command

I recently created an HTCondor pool using the setup instructions here: https://htcondor.readthedocs.io/en/latest/getting-htcondor/install-linux-as-root.html

I am able to run `condor_submit` and run jobs in the cluster but when I run `condor_drain <example.domain>` I get the following error:

```

Attempt to send DRAIN_JOBS to startd <192.168.5.111:9618?addrs=192.168.5.111-9618&alias=blade11.ccb&noUDP&sock=startd_1258_763b> failed
Failed to start DRAIN_JOBS command to slot1_4@<example.domain>

```

This happens regardless of which machine I try to drain.

I am new to token authentication. I have read the documentation here: https://htcondor.readthedocs.io/en/latest/admin-manual/security.html#token-authentication and verified that the machine I am trying to drain has a token (I'm assuming that was created automatically during the setup) and that the `tokens.d` directory has read/write only for the root user.

It appears that the tokens on the master machine (which I am sending the command from) and on the machine I am trying to drain don't match (not sure if they should?):

master

```

~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root  31 Aug  3 10:34 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:23 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:09 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:10 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:10 tokens.d

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659467401,"iss":"condor","jti":"e2d9e9621119863a1103bfeccfe9e9a5","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor

```

drain

```

~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root  31 Aug  3 10:36 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:15 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:42 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:43 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:43 tokens.d

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659469380,"iss":"condor","jti":"09196bb17bdb987ed400102d4328c5ab","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor

```
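
To make the comparison of the two listings concrete, here is a small sketch that pulls the string fields out of each `condor_token_list` line and reports which ones agree. The helper name `token_field` is hypothetical, and it relies on naive text matching against the one-line format shown above rather than real JSON parsing. Note the limit of what this shows: a matching `kid` only means both tokens name the same signing key, not that the key has the same value on both machines.

```shell
#!/bin/sh
# Sketch: compare string fields from two condor_token_list output lines.

# token_field NAME LINE: print the value of the quoted string field NAME.
# Prints nothing for fields that are absent or not quoted (such as "iat").
token_field() {
    printf '%s\n' "$2" | sed -n "s/.*\"$1\":\"\\([^\"]*\\)\".*/\\1/p"
}

# The two lines quoted in the listings above.
master_line='Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659467401,"iss":"condor","jti":"e2d9e9621119863a1103bfeccfe9e9a5","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor'
drain_line='Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659469380,"iss":"condor","jti":"09196bb17bdb987ed400102d4328c5ab","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor'

compare_tokens() {
    for field in kid iss sub jti; do
        a=$(token_field "$field" "$master_line")
        b=$(token_field "$field" "$drain_line")
        if [ "$a" = "$b" ]; then
            echo "$field: same ($a)"
        else
            echo "$field: differs"
        fi
    done
}

compare_tokens
```

For these two lines, `kid`, `iss`, and `sub` come out the same and only `jti` differs, which is expected for two tokens created separately: `jti` is a per-token identifier.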

I have tried requesting a new token from the machine I am trying to drain:

```

~# condor_token_request
Token request enqueued. Ask an administrator to please approve request 3059220.

```

But when I tried to approve the request from the master machine I got this error:

```

~# condor_token_request_approve -reqid 3059220
Remote daemon did not provide information for request ID 3059220.

```

I feel like something still isn't configured quite right and that the problem with `condor_drain` is a symptom of that, but I'm not sure what or how to fix it.

Any help would be appreciated, thanks!