Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster

The <user>@<remote cluster> you specify for grid_resource in the submit description should be the same values that you'd use for ssh'ing to the login node.
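
For illustration, here is a minimal sketch of where that value goes in a submit description, wrapped in a small shell helper. The function name print_test_submit, the test executable, and the output file names are hypothetical. The "batch slurm" grid type assumes the SLURM setup described in the original message.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal grid-universe submit description
# for a remote SLURM cluster reached over ssh.
# $1 is the same <user>@<remote cluster> string you would pass to ssh.
print_test_submit() {
    target="$1"
    cat <<EOF
universe      = grid
grid_resource = batch slurm ${target}
executable    = /bin/hostname
output        = test.out
error         = test.err
log           = test.log
queue
EOF
}

# Example (placeholder account and host):
# print_test_submit alice@login.cluster.example > test.sub
# condor_submit test.sub
```

If "ssh alice@login.cluster.example" is what works interactively, that same string should appear after "batch slurm".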

You can try running the same command that HTCondor uses to connect to the login node on the command line:

% remote_gahp <user>@<remote_cluster> blahpd

Agent pid 3946166

$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $

QUIT

S Server\ exiting

Agent pid 3946166 killed

remote_gahp is a shell script that runs ssh with the correct arguments to establish the network connection for HTCondor to use. You can examine exactly what it's doing to determine why the connection is failing.
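
One way to examine it, sketched under some assumptions: remote_gahp is on your PATH (as in the session above), it is a bash script, and bash -x tracing is acceptable on your submit host. The helper names and the example target are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for checking the connection by hand.

# Succeeds only if the argument looks like <user>@<host>,
# the form expected in grid_resource.
valid_target() {
    case "$1" in
        *@*@*|@*|*@|*" "*) return 1 ;;
        *@*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check that plain ssh works without prompting, then trace remote_gahp.
trace_remote_gahp() {
    target="$1"
    valid_target "$target" || { echo "expected <user>@<host>: $target" >&2; return 2; }

    # BatchMode=yes makes ssh fail instead of asking for a password.
    # This uses your own ssh configuration; remote_gahp may set up keys
    # differently, so a pass here is not conclusive.
    ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" true || return 1

    # -x prints each command the script runs, including the final ssh.
    # stdin is /dev/null so the session is expected to end rather than
    # wait for input.
    bash -x "$(command -v remote_gahp)" "$target" blahpd </dev/null
}

# Example (placeholder account and host):
# trace_remote_gahp alice@login.cluster.example
```

The trace should show the ssh command line that remote_gahp builds, which you can compare against the "Missing remote command" error in the GridManager log.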

- Jaime

On Mar 2, 2026, at 7:13 PM, Vito Di Benedetto via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:

Dear HTCondor development Team,

I'm trying to test a remote server where I recently got an account. The remote system runs RHEL8 and uses SLURM.

On this system I have been able to successfully submit and run test jobs interactively using SLURM.

As a next step, I prepared a Bosco setup using condor_remote_cluster.

However, when I try to test the cluster, there appears to be a problem.

In "/var/log/condor/GridManagerLog..gfactory" I see the following log message:

[...]

03/02/26 19:03:56 [2227] Trying to update collector <ip:9618?alias=hostname>

03/02/26 19:03:56 [2227] Attempting to send update via TCP to collector <hostname> <ip:9618?alias=hostname>

03/02/26 19:03:56 [2227] Gahp Server (pid=2767793) exited with status 1 unexpectedly

03/02/26 19:03:57 [2225] DaemonKeepAlive: in SendAliveToParent()

03/02/26 19:03:57 [2225] Completed DC_CHILDALIVE to daemon at <ip:28881>

03/02/26 19:03:57 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success

03/02/26 19:03:59 [2225] GAHP server pid = 2768017

03/02/26 19:03:59 [2225] GAHP[2768017] (stderr) -> Missing remote command

03/02/26 19:03:59 [2225] Failed to read GAHP server version

03/02/26 19:03:59 [2225] Error starting <remote cluster> GAHP: Missing remote command\nMissing remote command\nMissing remote command\nMissing remote command\n

03/02/26 19:03:59 [2225] resource <user>@<remote cluster> is still down

[...]

where I have redacted hostnames and IPs.

In case it matters, the remote cluster requires a VPN to be accessed.

To run the test, I make sure the VPN is active and that I can log in to the cluster.

When I log in to the remote cluster, the node has an IP address on the local network, in the 172.20 range. I'm not sure whether this can interfere with the Bosco test.

Thank you for any help to address this issue.

Vito Di Benedetto

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/