
Re: [HTCondor-users] condor_drain error



The <> around the address and the sock= are part of the address, and are needed. 


condor_drain -debug "condor-execute"

should also work, since "condor-execute" is the value of the Machine attribute in the collector. 
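As a minimal sketch of why the unquoted address in the session quoted below misbehaved: bash treats each `&` as "run in the background" and `<`/`>` as redirections, so the command was split into several jobs before condor_drain ever saw it. Single-quoting the whole MyAddress value passes it through as one argument. The variable name `addr` is just an illustration, and the condor_drain line is left commented out because it needs a live pool.

```shell
# Address copied from the MyAddress column of the condor_status output
# quoted in this thread. Single quotes stop bash from interpreting
# the ?, &, < and > characters inside it.
addr='<192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>'

# Show exactly what a command would receive as its single argument.
printf '%s\n' "$addr"

# The real invocation would then be (not run here, requires the pool):
# sudo condor_drain -debug "$addr"
```

Double-quoting the variable on the command line matters as well, since an unquoted `$addr` would be subject to filename expansion because of the `?`.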


-tj



From: Curtis Spencer
Sent: Friday, April 18, 2025 12:58 PM
To: John M Knoeller
Cc: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor_drain error

I ran that command and tried running `condor_drain` using the entire value of MyAddress, but it hung and I had to exit. I also tried with the IP address and machine alias but got the same errors as before.

```
$ condor_status -af:h Name Machine MyAddress
Name                 Machine        MyAddress                                                                                      
slot1@condor-execute condor-execute <192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>
$ sudo condor_drain -debug 192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337
[1] 41511
[2] 41512
[3] 41513
debian@condor-master:~$ -bash: noUDP: command not found
04/18/25 17:39:47 condor_read(): Socket closed abnormally when trying to read 5 bytes from startd 192.168.9.161:9618?addrs=192.168.9.161-9618, errno=104 Connection reset by peer
04/18/25 17:39:47 SECMAN: no classad from server, failing
04/18/25 17:39:47 ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Attempt to send DRAIN_JOBS to startd <192.168.9.161:9618> failed
Failed to start DRAIN_JOBS command to 192.168.9.161:9618?addrs=192.168.9.161-9618
^C
[1]   Exit 1                  sudo condor_drain -debug 192.168.9.161:9618?addrs=192.168.9.161-9618
[2]-  Done                    alias=condor-execute
[3]+  Exit 127                noUDP
$ sudo condor_drain -debug 192.168.9.161
04/18/25 17:38:45 Can't find address for startd 192.168.9.161
ERROR: Can't find address for startd 192.168.9.161
$ sudo condor_drain -debug condor-execute
ERROR: unknown host condor-execute
```

Thanks,

Curtis

On Fri, Apr 18, 2025 at 7:17 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
The HTCondor tools mostly don't use DNS to locate daemons when they want to send commands. 

Most HTCondor tools, including condor_drain, determine the address of a daemon by querying the collector, so the argument you pass must be the name of the daemon in the collector.

Try running 

condor_status -af:h Name Machine MyAddress

The MyAddress value is what condor_drain needs to look up in order to send the drain command. It will look for entries in the collector where either the Name or the Machine value matches what you passed to condor_drain.
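The matching rule described above can be pictured with a toy model. This is not HTCondor's code and it does not talk to a collector. It only mimics "match on Name or Machine, then use MyAddress" against one row of sample data taken from the condor_status output elsewhere in this thread. The function name `lookup_addr` and the variable `sample` are hypothetical.

```shell
# One row in the shape "Name Machine MyAddress", copied from this thread.
sample='slot1@condor-execute condor-execute <192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>'

# Hypothetical helper: print MyAddress for rows whose Name (field 1)
# or Machine (field 2) equals the argument. Nothing is printed otherwise.
lookup_addr() {
    printf '%s\n' "$sample" |
        awk -v want="$1" '$1 == want || $2 == want { print $3 }'
}

lookup_addr condor-execute         # matches the Machine column
lookup_addr slot1@condor-execute   # matches the Name column
lookup_addr 192.168.9.161          # matches neither column, prints nothing
```

Under this model a bare IP address finds nothing, because it is neither a Name nor a Machine value.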

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, April 17, 2025 3:43 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] condor_drain error
 
I have a test condor pool with two nodes (installed using `get_htcondor`):
* Node A (hostname `condor-master`; IP 192.168.9.7) has `use role:get_htcondor_central_manager` and `use role:get_htcondor_submit` roles
* Node B (hostname `condor-execute`; IP 192.168.9.161) has `use role:get_htcondor_execute`

I ran `systemctl status condor` on both nodes to confirm that the master has condor_collector, condor_negotiator, and condor_schedd running and that the execute node has condor_startd running.

Both nodes have `use security:recommended` in `/etc/condor/config.d/00-security`.

I have run `condor_status` and confirmed that the execute machine has joined the pool and I have run `condor_submit` with a test script and confirmed that the job ran.

I have copied the `/etc/condor/passwords.d/POOL` file from the master node to the execute node and confirmed 0600 permissions for that file on both nodes.

I have run `condor_token_request` on the master node and approved the request on the execute node using `condor_token_request_approve` and stored the returned key in `/etc/condor/tokens.d/admin@condor` on the master node.

I then ran `condor_drain -debug 192.168.9.161` but got this error: "Can't find address for startd 192.168.9.161". I thought this could be related to using the IP address instead of a DNS name, so I added an entry to my /etc/hosts file and ran `condor_drain -debug condor-execute`, but I got this error: "ERROR: unknown host condor-execute".

Conceptually, I feel like I understand how IDTOKEN auth is supposed to work (and to some extent, I think it is working since the execute machine was able to join the pool), but I can't figure out why `condor_drain` won't work.

Thanks,

Curtis