
Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster



Hi Jamie,
I think I figured out why I had that weird issue when trying remote_gahp:
the username I got on the remote cluster has a dot in it.
If I run remote_gahp with the username without the dot, the two syntaxes are equivalent; with that dot-less username both give "permission denied", as expected.
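
To illustrate how a dot could trip up naive argument parsing (this is a hypothetical sketch, not the actual remote_gahp code), compare a dot-safe split of a user@host argument with one that only admits word characters:

```shell
# Hypothetical sketch: two ways a script might extract the user from a
# user@host argument. "first.last" stands in for a dotted username.
target="first.last@cluster.example.org"

# Dot-safe: strip everything from the first '@' onward.
safe_user="${target%%@*}"       # -> first.last

# Fragile: a character class without '.' never matches a dotted user.
fragile_user=$(printf '%s' "$target" | sed -n 's/^\([A-Za-z0-9_]*\)@.*/\1/p')
# -> empty string, since "first.last@" does not match [A-Za-z0-9_]*@

echo "safe:    $safe_user"
echo "fragile: $fragile_user"
```

The parameter-expansion form keeps the full dotted username; the character-class pattern silently fails to match it.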

-Vito

From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 15:27
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
 
Hi Jamie,

If I run remote_gahp as you suggested I get an error:

$ remote_gahp   <user>@<remote_cluster> blahpd
Missing remote command
Usage: /usr/sbin/remote_gahp [options] remote_hostname [options] REMOTE_CMD [remote arguments]
 /usr/sbin/remote_gahp [options] remote_hostname [remote options and arguments]
 Options: 
...

In case it matters, this comes from condor-24.0.14-1.el9.x86_64

To test bosco on the remote server I ran the command below; its output follows.
$ condor_remote_cluster -t  <user>@<remote_cluster>
Testing ssh to  <user>@<remote_cluster>...Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
Warning: Permanently added ' <user>@<remote_cluster>' (ED25519) to the list of known hosts.
Passed!
Testing remote submission...Passed!
Submission and log files for this job are in /home/gfactory/bosco-test/boscotest.qRF3i
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
03/03/26 15:14:00 [2227] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:00 [2227] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/03/26 15:14:04 [2225] DaemonKeepAlive: in SendAliveToParent()
03/03/26 15:14:04 [2225] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:04 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success

For the test I use the remote username, which is different from the local username (gfactory).
In the boscotest.qRF3i folder, the log file has the following detail:

...
026 (030.000.000) 2026-03-03 15:13:35 Detected Down Grid Resource
    GridResource: batch slurm <user>@<remote_cluster>
...

As part of the setup I also ran:
./condor_remote_cluster_sdumont  -s <user>@<remote_cluster>

This worked, though I had to modify the get_status() function so that its for loop also calls "squeue --me", to cover the SLURM case where that command is used to get the queue status.
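
For reference, the change was along these lines (a sketch only: the command list and structure here are illustrative, not the actual condor_remote_cluster_sdumont code):

```shell
# Sketch of the modified get_status(): the script loops over the queue
# commands of the batch systems it knows about; "squeue --me" is the
# addition for the SLURM case. The command list is illustrative.
get_status () {
    for status_cmd in "condor_q" "qstat" "squeue --me"; do
        # Only try a command whose binary actually exists on this node.
        if command -v "${status_cmd%% *}" >/dev/null 2>&1; then
            $status_cmd && return 0
        fi
    done
    return 1
}
```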

 Thanks,
Vito

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 14:33
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
 

[EXTERNAL] - This message is from an external sender

The two following commands should be equivalent (where <user> is the account name on the login node):

remote_gahp  --rgahp-user <user> <remote_cluster> blahpd
remote_gahp  <user>@<remote_cluster> blahpd

The output you got indicates that the ssh connection is operating properly.
Looking back at the errors in your GridmanagerLog, I see now that the complaint is about bad arguments to the remote_gahp script when HTCondor invokes it.

How exactly did you test it initially? For the condor_remote_cluster --test command, if the username on the login node is different from your local account, you'll have to include it as part of the hostname/IP address, like so:

condor_remote_cluster --test <user>@<cluster>

 - Jaime

On Mar 3, 2026, at 12:10 PM, Vito Di Benedetto <vito@xxxxxxxx> wrote:

Hi Jaime,
I tried the command you suggested; it seems to work, but I needed to modify it a bit. I ran:

remote_gahp  --rgahp-user <user> <remote_cluster> blahpd

The output I got is:

Agent pid 3213418
Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
quit
S Server\ exiting
Agent pid 3213418 killed


It seems to be the same as in your example.
Is there something else I can check?

Thanks,
Vito

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 11:27
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Vito Di Benedetto <vito@xxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
 

The <user>@<remote cluster> you specify for grid_resource in the submit description should be the same values that you'd use for ssh'ing to the login node.
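
For reference, a minimal submit description using that form would look something like this (executable and file names here are placeholders, not from your setup):

```
universe      = grid
grid_resource = batch slurm <user>@<remote_cluster>

executable    = /bin/hostname
output        = test.out
error         = test.err
log           = test.log

queue
```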

You can try running the same command that HTCondor uses to connect to the login node on the command line:

% remote_gahp <user>@<remote_cluster> blahpd
Agent pid 3946166
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
QUIT
S Server\ exiting
Agent pid 3946166 killed
%

remote_gahp is a shell script that runs ssh with the correct arguments to establish the network connection for HTCondor to use. You can examine exactly what it's doing to determine why the connection is failing.
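
One way to do that is to run it under the shell's trace mode. Demonstrated here on a stand-in script, since I can't run your copy (substitute /usr/sbin/remote_gahp and your own <user>@<remote_cluster> in practice):

```shell
# Stand-in for remote_gahp, just to demonstrate the technique: bash -x
# echoes every command a script executes, so on the real script the
# exact ssh invocation (options, port, username) shows up in the trace.
cat > /tmp/fake_gahp <<'EOF'
#!/bin/sh
ssh_opts="-o BatchMode=yes"
echo "would run: ssh $ssh_opts $1 $2"
EOF
chmod +x /tmp/fake_gahp

# Trace lines start with '+'; on the real script, grep the trace for 'ssh'.
bash -x /tmp/fake_gahp 'user@cluster' blahpd 2>&1 | grep '^+'
```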

 - Jaime

On Mar 2, 2026, at 7:13 PM, Vito Di Benedetto via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear HTCondor development team,
I'm trying to test a remote server where I recently got an account; the remote system is RHEL8 and it is using SLURM.
On this system I have been able to successfully submit and run test jobs interactively using SLURM.
As a next step I prepared a bosco setup using condor_remote_cluster.
However, when I try to test the cluster it looks like there is some issue.
In "/var/log/condor/GridManagerLog..gfactory" I see the following log messages:
[...]
03/02/26 19:03:56 [2227] Trying to update collector <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Attempting to send update via TCP to collector <hostname> <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Gahp Server (pid=2767793) exited with status 1 unexpectedly
03/02/26 19:03:57 [2225] DaemonKeepAlive: in SendAliveToParent()
03/02/26 19:03:57 [2225] Completed DC_CHILDALIVE to daemon at <ip:28881>
03/02/26 19:03:57 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/02/26 19:03:59 [2225] GAHP server pid = 2768017
03/02/26 19:03:59 [2225] GAHP[2768017] (stderr) -> Missing remote command
03/02/26 19:03:59 [2225] Failed to read GAHP server version
03/02/26 19:03:59 [2225] Error starting <remote cluster> GAHP: Missing remote command\nMissing remote command\nMissing remote command\nMissing remote command\n
03/02/26 19:03:59 [2225] resource <user>@<remote cluster> is still down
[...]

where I have redacted hostnames and IPs.

In case it matters, the remote cluster requires a VPN to be accessed.
To run the test I make sure the VPN is active and that I can login to the cluster.
When I log in to the remote cluster, the node's IP address is in the local network (the 172.20 range); I'm not sure whether this can interfere with the bosco test.

Thank you for any help to address this issue.
Vito Di Benedetto

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/