
Re: [HTCondor-users] condor_drain error



I suggested the full slot name as a way to bypass the code that tries to build a fully qualified hostname when you give condor_drain a bare hostname.   

When you pass a bare hostname (one without any . characters) to most of the condor commands,  they will try and turn that into a fully qualified hostname and then look that up in the collector to get the address to send the command to.

HTCondor has a strong bias towards using full hostnames. But the ads in your collector do not have fully qualified names, so when a tool converts a name on the command line to a fully qualified name, the lookup in the collector will not match.

But when you pass a slot name to tools like condor_drain, it will not try and convert that to fully qualified and so that lookup succeeds.
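
As a rough illustration of the rule described above, here is a minimal shell sketch. `classify_target` is a hypothetical helper that only mirrors the behaviour explained in this thread; it is not HTCondor's actual code.

```shell
# Sketch only: mirrors the rule described in this thread, not HTCondor's source.
classify_target() {
  case "$1" in
    *@*) echo "daemon-name" ;;    # e.g. slot1@condor-execute: looked up as given
    *.*) echo "dotted-name" ;;    # contains a '.', so it is not a bare hostname
    *)   echo "bare-hostname" ;;  # tools try to build a fully qualified name first
  esac
}

classify_target "slot1@condor-execute"
classify_target "condor-execute.ltc"
classify_target "condor-execute"
```

Only the last form goes through the hostname-qualification step, which is why the slot name works in a pool whose ads carry unqualified names.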

What the HTCondor tools expect you to do to fix this is to configure DEFAULT_DOMAIN_NAME  with a consistent value across your pool, or to configure DNS to return fully qualified names across your pool. 

-tj


From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Monday, June 2, 2025 3:13 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_drain error

Using "slot1@condor-execute" worked.

root@condor-master:~# condor_drain -debug:D_HOSTNAME,D_FULLDEBUG slot1@condor-execute
Sent request to drain the startd <192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_80729_f52d> with slot1@condor-execute. This only affects the single startd; any other startds running on the same host will not be drained.

I verified that it actually drained the node and that all jobs were marked as idle until I undrained the machine.

Seems like specifying the slot is not typical since it wasn't your first suggestion. AFAIK, I'm only running the one startd on the machine so it would be nice to be able to not have to specify "slot1@". Is there a config change that will make that work?

On Mon, Jun 2, 2025 at 6:40 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
I'm not sure why you would not be seeing more D_HOSTNAME messages.  

Try slot1@condor-execute instead of just condor-execute. 


From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Friday, May 30, 2025 5:27 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_drain error

I don't get any additional information with the additional debug flags. Here's the output with and without DEFAULT_DOMAIN_NAME set on both condor-master and condor-execute:

root@condor-master:~# condor_drain -debug:D_HOSTNAME,D_FULLDEBUG condor-execute
05/30/25 22:24:55 Can't find address for startd condor-execute.ltc
ERROR: Can't find address for startd condor-execute.ltc

root@condor-master:~# condor_drain -debug:D_HOSTNAME,D_FULLDEBUG condor-execute
ERROR: unknown host condor-execute

Thanks,

Curtis

On Fri, May 30, 2025 at 3:22 PM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
You were running

condor_drain -debug:D_HOSTNAME,D_FULLDEBUG before.

D_HOSTNAME is necessary to see the messages I mentioned.


From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Friday, May 30, 2025 2:13 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_drain error
 
The output of `condor_drain -debug condor-execute` (without adding DEFAULT_DOMAIN_NAME) is just "ERROR: unknown host condor-execute".

With "DEFAULT_DOMAIN_NAME = ltc" set on both the master and execute machines, `condor_status` doesn't return anything and the output of `condor_drain -debug condor-execute` and  `condor_drain -debug condor-execute.ltc` is:

05/30/25 19:09:17 Can't find address for startd condor-execute.ltc
ERROR: Can't find address for startd condor-execute.ltc

Thanks,

Curtis

On Fri, May 30, 2025 at 11:50 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
You might be able to fix it by adding 

DEFAULT_DOMAIN_NAME= 

to your configuration.

You have not said what the -debug output of condor_drain is. I would expect to see messages like:

05/29/25 13:21:17.138 Finding proper daemon name for "condor-execute"
05/29/25 13:21:17.138 Daemon name contains no '@', treating as a regular hostname

then it will either print 

05/29/25 13:21:17.138 ipv6_getaddrinfo() could not look up condor-execute: ....

or it will append the default domain name to it and print

05/29/25 13:21:17.138 Returning daemon name: "condor-execute.<default-domain>"

or
 
05/29/25 13:21:17.138 Failed to construct daemon name, returning NULL.


-tj


From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Friday, May 30, 2025 1:26 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_drain error

What steps can I take to fix that? I'm using the Execute Role on the machine I am trying to drain and I am able to run jobs on it, so I'm confused about why condor_drain wouldn't be able to find a STARTD with that machine's name in the collector.

condor_status shows the `condor-execute` machine:

root@condor-master:~# condor_status
Name                 OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@condor-execute LINUX      X86_64 Unclaimed Idle      0.000 3927  0+00:00:00

               Total Owner Claimed Unclaimed Matched Preempting  Drain Backfill BkIdle

  X86_64/LINUX     1     0       0         1       0          0      0        0      0

         Total     1     0       0         1       0          0      0        0      0

I've tried explicitly setting the MACHINE ClassAd to "condor-execute" and have restarted Condor but I get the same error from `condor_drain`.

Thanks,

Curtis

On Fri, May 30, 2025 at 10:27 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
Yes, that message indicates that the condor_drain command is not able to find a STARTD with that name in the collector when it goes to look up the address of the STARTD.

-tj


From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Friday, May 30, 2025 12:09 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_drain error

I'm still unable to run condor_drain using the machine alias. Any suggestions?

On Fri, Apr 25, 2025 at 8:44 AM Curtis Spencer <curtis.spencer@xxxxxxxxxxxx> wrote:
Does that indicate that the failure is from the attempt to get the address from the collector or later on? What else can I do to debug this?

On Tue, Apr 22, 2025 at 9:47 AM Curtis Spencer <curtis.spencer@xxxxxxxxxxxx> wrote:
Running `condor_drain -debug:D_COMMAND,D_HOSTNAME,D_FULLDEBUG "condor-execute"` results in the same error: `ERROR: unknown host condor-execute`.


On Tue, Apr 22, 2025 at 7:43 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
We should try and figure out where the failure is happening. 

Try

condor_drain -debug:D_COMMAND,D_HOSTNAME,D_FULLDEBUG "condor-execute"

to see if the failure is from the attempt to get the address from the collector, or if it is failing later on.

-tj



From: Curtis Spencer
Sent: Friday, April 18, 2025 3:59 PM
To: John M Knoeller
Cc: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor_drain error

Thanks. Running the command with the full address worked:


Running the command using the machine alias returns "ERROR: unknown host condor-execute"

condor_drain -debug "condor-execute"

What can I do to get the machine alias working? That's much easier to use/remember than the full address.

Thanks,

Curtis

On Fri, Apr 18, 2025 at 11:53 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
The <> around the address and the sock= are part of the address, and are needed. 
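
A minimal sketch of passing the address safely, assuming a POSIX shell. In the attempt quoted further down, the angle brackets were dropped and the `&` characters were left unquoted, so bash split the line into background jobs. Single quotes keep `<`, `>`, `?` and `&` away from the shell. `count_args` is a hypothetical helper used only to show how many arguments a command would receive.

```shell
# Address copied from the MyAddress column shown in this thread.
addr='<192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>'

# count_args is a hypothetical helper: it reports how many arguments it was given.
count_args() { echo "$#"; }

count_args "$addr"    # prints 1: the whole address arrives as one argument

# On a pool machine the drain request would then be:
#   condor_drain -debug "$addr"
```

The `sock=` value changes when the startd restarts, so the address has to be copied fresh from `condor_status` each time.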


condor_drain -debug "condor-execute"

should also work, since "condor-execute" is the value of the Machine attribute in the collector. 


-tj



From: Curtis Spencer
Sent: Friday, April 18, 2025 12:58 PM
To: John M Knoeller
Cc: HTCondor-Users Mail List
Subject: Re: [HTCondor-users] condor_drain error

I ran that command and tried running `condor_drain` using the entire value of MyAddress but it hung and I had to exit. I also tried with the IP address and machine alias but got the same errors as before.

```
$ condor_status -af:h Name Machine MyAddress
Name                 Machine        MyAddress                                                                                      
slot1@condor-execute condor-execute <192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>
$ sudo condor_drain -debug 192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337
[1] 41511
[2] 41512
[3] 41513
debian@condor-master:~$ -bash: noUDP: command not found
04/18/25 17:39:47 condor_read(): Socket closed abnormally when trying to read 5 bytes from startd 192.168.9.161:9618?addrs=192.168.9.161-9618, errno=104 Connection reset by peer
04/18/25 17:39:47 SECMAN: no classad from server, failing
04/18/25 17:39:47 ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Attempt to send DRAIN_JOBS to startd <192.168.9.161:9618> failed
Failed to start DRAIN_JOBS command to 192.168.9.161:9618?addrs=192.168.9.161-9618
^C
[1]   Exit 1                  sudo condor_drain -debug 192.168.9.161:9618?addrs=192.168.9.161-9618
[2]-  Done                    alias=condor-execute
[3]+  Exit 127                noUDP
$ sudo condor_drain -debug 192.168.9.161
04/18/25 17:38:45 Can't find address for startd 192.168.9.161
ERROR: Can't find address for startd 192.168.9.161
$ sudo condor_drain -debug condor-execute
ERROR: unknown host condor-execute
```

Thanks,

Curtis

On Fri, Apr 18, 2025 at 7:17 AM John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
The HTCondor tools mostly don't use DNS to locate daemons when they want to send commands. 

The way most HTCondor tools, including condor_drain, determine the address of a daemon is by querying the collector,  so the argument you pass must be the name of the daemon in the collector. 

Try running 

condor_status -af:h Name Machine MyAddress

The MyAddress value is what condor_drain needs to look up in order to send the drain command. It will look for entries in the collector where either the Name or the Machine value matches what you passed to condor_drain.
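
The matching described above can be sketched in plain shell. This is only an illustration over sample rows in the `Name Machine MyAddress` layout; `lookup_address` is a hypothetical helper, not an HTCondor command, and the real lookup happens as a query to the collector.

```shell
# Sample row in the order Name, Machine, MyAddress (values taken from this thread).
ads='slot1@condor-execute condor-execute <192.168.9.161:9618?addrs=192.168.9.161-9618&alias=condor-execute&noUDP&sock=startd_11727_5337>'

# lookup_address is a hypothetical helper: it prints MyAddress for every row
# whose Name or Machine column equals the requested target.
lookup_address() {
  printf '%s\n' "$ads" | awk -v target="$1" '$1 == target || $2 == target { print $3 }'
}

lookup_address "condor-execute"         # matches the Machine column
lookup_address "slot1@condor-execute"   # matches the Name column
lookup_address "condor-execute.ltc"     # no row matches, so nothing is printed
```

The last call shows why a target that has been rewritten to a fully qualified name finds nothing when the ads only carry the short name.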

-tj

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Thursday, April 17, 2025 3:43 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] condor_drain error
 
I have a test condor pool with two nodes (installed using `get_htcondor`):
* Node A (hostname `condor-master`; IP 192.168.9.7) has `use role:get_htcondor_central_manager` and `use role:get_htcondor_submit` roles
* Node B (hostname `condor-execute`; IP 192.168.9.161) has `use role:get_htcondor_execute`

I ran `systemctl status condor` on both nodes to confirm that the master has condor_collector, condor_negotiator, and condor_schedd running and that the execute node has condor_startd running.

Both nodes have `use security:recommended` in `/etc/condor/config.d/00-security`.

I have run `condor_status` and confirmed that the execute machine has joined the pool and I have run `condor_submit` with a test script and confirmed that the job ran.

I have copied the `/etc/condor/passwords.d/POOL` file from the master node to the execute node and confirmed 0600 permissions for that file on both nodes.

I have run `condor_token_request` on the master node and approved the request on the execute node using `condor_token_request_approve` and stored the returned key in `/etc/condor/tokens.d/admin@condor` on the master node.

I then ran `condor_drain -debug 192.168.9.161` but got this error: "Can't find address for startd 192.168.9.161". I thought this could be related to using the IP address instead of a DNS name, so I added an entry to my /etc/hosts file and ran `condor_drain -debug condor-execute`, but I got this error: "ERROR: unknown host condor-execute"

Conceptually, I feel like I understand how IDTOKEN auth is supposed to work (and to some extent, I think it is working since the execute machine was able to join the pool), but I can't figure out why `condor_drain` won't work.

Thanks,

Curtis