
Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster



Hi Jaime,
I tried the command you suggested. It seems to work, but I needed to modify it slightly; I ran:

remote_gahp  --rgahp-user <user> <remote_cluster> blahpd

The output I got is:

Agent pid 3213418
Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
quit
S Server\ exiting
Agent pid 3213418 killed


It seems to be the same as in your example.
Is there something else I can check?

Thanks,
Vito

From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 11:27
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Vito Di Benedetto <vito@xxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
 


The <user>@<remote cluster> you specify for grid_resource in the submit description should be the same values that you'd use for ssh'ing to the login node.

You can try running the same command that HTCondor uses to connect to the login node on the command line:

% remote_gahp <user>@<remote_cluster> blahpd
Agent pid 3946166
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
QUIT
S Server\ exiting
Agent pid 3946166 killed
%

remote_gahp is a shell script that runs ssh with the correct arguments to establish the network connection for HTCondor to use. You can examine exactly what it's doing to determine why the connection is failing.
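
One way to examine the script is to run it under shell tracing. The sketch below assumes remote_gahp is a bash script available on the PATH; the function name trace_remote_gahp is hypothetical and only for illustration, and <user> and <remote_cluster> are the same placeholders used above.

```shell
# Hypothetical helper: run remote_gahp with bash tracing enabled.
# Assumption: remote_gahp is a bash script that can be found on PATH.
trace_remote_gahp() {
    script=$(command -v remote_gahp) || {
        echo "remote_gahp not found on PATH" >&2
        return 1
    }
    # -x makes bash print each command it executes to stderr, which
    # should include the ssh command line the script builds.
    bash -x "$script" "$@"
}

# Usage, with the placeholders from this thread:
#   trace_remote_gahp <user>@<remote_cluster> blahpd 2> trace.log
```

The trace goes to stderr, so redirecting it to a file keeps it separate from the GAHP output on stdout.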

 - Jaime

On Mar 2, 2026, at 7:13 PM, Vito Di Benedetto via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear HTCondor development Team,
I'm trying to test a remote server where I recently got an account. The remote system runs RHEL8 and uses SLURM.
On this system I have been able to successfully submit and run test jobs interactively using SLURM.
As a next step, I prepared a bosco setup using condor_remote_cluster.
However, when I try to test the cluster, the test does not succeed.
In "/var/log/condor/GridManagerLog..gfactory" I see the following log messages:
[...]
3/02/26 19:03:56 [2227] Trying to update collector <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Attempting to send update via TCP to collectorhostname> <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Gahp Server (pid=2767793) exited with status 1 unexpectedly
03/02/26 19:03:57 [2225] DaemonKeepAlive: in SendAliveToParent()
03/02/26 19:03:57 [2225] Completed DC_CHILDALIVE to daemon at <ip:28881>
03/02/26 19:03:57 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/02/26 19:03:59 [2225] GAHP server pid = 2768017
03/02/26 19:03:59 [2225] GAHP[2768017] (stderr) -> Missing remote command
03/02/26 19:03:59 [2225] Failed to read GAHP server version
03/02/26 19:03:59 [2225] Error starting <remote cluster> GAHP: Missing remote command\nMissing remote command\nMissing remote command\nMissing remote command\n
03/02/26 19:03:59 [2225] resource <user>@<remote cluster> is still down
[...]

where I have redacted hostnames and IPs.
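
To pull out only the GAHP-related lines from a log like the excerpt above, a small filter can help. This is a minimal sketch: the function name gahp_lines and the search patterns are assumptions chosen to match the messages quoted in this thread, not an HTCondor tool.

```shell
# Hypothetical filter: keep only GAHP-related lines from a
# GridManagerLog read on standard input.
# Assumption: the interesting lines mention "gahp" (in any case)
# or report that a resource "is still down", as in the excerpt above.
gahp_lines() {
    grep -iE 'gahp|is still down'
}

# Usage (LOG is a placeholder for the path to your GridManagerLog file):
#   gahp_lines < "$LOG"
```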

In case it matters, the remote cluster requires a VPN to be accessed.
To run the test, I make sure the VPN is active and that I can log in to the cluster.
When I log in to the remote cluster, the node has an IP address on the local network, in the 172.20 range. I'm not sure whether this can interfere with the bosco test.

Thank you for any help to address this issue.
Vito Di Benedetto

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/