[HTCondor-users] Failed to start DRAIN_JOBS command



I recently created an HTCondor pool using the setup instructions here: https://htcondor.readthedocs.io/en/latest/getting-htcondor/install-linux-as-root.html

I am able to run `condor_submit` and run jobs in the cluster, but when I run `condor_drain <example.domain>` I get the following error:

```
Attempt to send DRAIN_JOBS to startd <192.168.5.111:9618?addrs=192.168.5.111-9618&alias=blade11.ccb&noUDP&sock=startd_1258_763b> failed
Failed to start DRAIN_JOBS command to slot1_4@<example.domain>
```
This happens regardless of which machine I try to drain.
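
For reference, the startd address in that error can be split into its parts with a few lines of Python. This is only a sketch: `parse_address` is a made-up helper, not part of HTCondor, and it assumes an IPv4 address in the `<host:port?key=value&...>` form shown above.

```python
from urllib.parse import parse_qs

# Address copied verbatim from the error message above.
ADDRESS = "<192.168.5.111:9618?addrs=192.168.5.111-9618&alias=blade11.ccb&noUDP&sock=startd_1258_763b>"


def parse_address(text):
    """Split '<host:port?k=v&flag>' into (host, port, params).

    Hypothetical helper for illustration; assumes an IPv4 host.
    """
    inner = text.strip().lstrip("<").rstrip(">")
    hostport, _, query = inner.partition("?")
    host, _, port = hostport.rpartition(":")
    # keep_blank_values=True keeps bare flags such as 'noUDP' (value '').
    parsed = parse_qs(query, keep_blank_values=True)
    params = {key: values[0] for key, values in parsed.items()}
    return host, int(port), params


if __name__ == "__main__":
    host, port, params = parse_address(ADDRESS)
    print("host:", host)
    print("port:", port)
    for key, value in params.items():
        print(f"{key}: {value!r}")
```

The `host`, `port`, and `sock` values are the ones to look for in the logs on the target machine when tracing where the command was sent.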

I am new to token authentication. I have read the documentation here: https://htcondor.readthedocs.io/en/latest/admin-manual/security.html#token-authentication and verified that the machine I am trying to drain has a token (I'm assuming it was created automatically during setup) and that the `tokens.d` directory is accessible only by the root user.

It appears that the tokens on the master machine (which I am sending the command from) and on the machine I am trying to drain don't match (I'm not sure whether they should):

master
```
~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root  31 Aug  3 10:34 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:23 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:09 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:10 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:10 tokens.d

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659467401,"iss":"condor","jti":"e2d9e9621119863a1103bfeccfe9e9a5","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor
```

drain
```
~# ls -l /etc/condor/
total 24K
-rw-r--r-- 1 root root 4.5K Jul 14 16:36 condor_config
drwxrwxrwx 1 root root  31 Aug  3 10:36 condor_config.local
drwxr-xr-x 2 root root 4.0K Aug 17 16:15 config.d
drwxr-xr-x 2 root root 4.0K Aug  2 12:42 ganglia.d
drwx------ 2 root root 4.0K Aug  2 12:43 passwords.d
drwx------ 2 root root 4.0K Aug  2 12:43 tokens.d

~# condor_token_list
Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659469380,"iss":"condor","jti":"09196bb17bdb987ed400102d4328c5ab","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor
```
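
To see exactly how the two tokens differ, the two `condor_token_list` lines above can be compared field by field. The following is a minimal Python sketch, not an HTCondor tool: `parse_token_line` and `differing_fields` are made-up helper names, and the regular expression assumes the `Header: ... Payload: ... File: ...` layout shown in the listings.

```python
import json
import re
from datetime import datetime, timezone

# Lines copied verbatim from the two condor_token_list outputs above.
MASTER_LINE = 'Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659467401,"iss":"condor","jti":"e2d9e9621119863a1103bfeccfe9e9a5","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor'
DRAIN_LINE = 'Header: {"alg":"HS256","kid":"POOL"} Payload: {"iat":1659469380,"iss":"condor","jti":"09196bb17bdb987ed400102d4328c5ab","sub":"condor@condor"} File: /etc/condor/tokens.d/condor@condor'

# Assumed layout of one output line (hypothetical pattern for illustration).
LINE_RE = re.compile(r"^Header: (\{.*?\}) Payload: (\{.*?\}) File: (.*)$")


def parse_token_line(line):
    """Return (header, payload, path) parsed from one condor_token_list line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("unexpected condor_token_list line format")
    return json.loads(match.group(1)), json.loads(match.group(2)), match.group(3)


def differing_fields(a, b):
    """Return the sorted keys whose values differ between two dicts."""
    return sorted(key for key in set(a) | set(b) if a.get(key) != b.get(key))


if __name__ == "__main__":
    m_header, m_payload, _ = parse_token_line(MASTER_LINE)
    d_header, d_payload, _ = parse_token_line(DRAIN_LINE)
    print("header fields that differ: ", differing_fields(m_header, d_header))
    print("payload fields that differ:", differing_fields(m_payload, d_payload))
    for name, payload in (("master", m_payload), ("drain", d_payload)):
        issued = datetime.fromtimestamp(payload["iat"], tz=timezone.utc)
        print(f"{name} token issued at {issued.isoformat()}")
```

Only `iat` (issue time) and `jti` (token ID) differ between the two tokens; `kid`, `iss`, and `sub` are identical. The listing alone cannot show whether the `POOL` signing key in `passwords.d` is the same on both machines.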

I have tried requesting a new token from the machine I am trying to drain:
```
~# condor_token_request
Token request enqueued. Ask an administrator to please approve request 3059220.
```

But when I tried to approve the request from the master machine, I got this error:
```
~# condor_token_request_approve -reqid 3059220
Remote daemon did not provide information for request ID 3059220.
```
I feel like something still isn't configured quite right and that the problem with `condor_drain` is a symptom of that, but I'm not sure what or how to fix it.

Any help would be appreciated, thanks!