Hi Jaime,
while doing some debugging, I figured out what the issue could be.
On the SLURM worker nodes, the home directory is different from the home directory on the login node.
The directory structure is as follows:
login node:
  homedir: /prj/<project>/<user>
  scratchdir: /scratch/<project>/<user>
worker node:
  homedir: /prj/<project>/<user>, but there the /prj mount point actually mounts the /scratch volume
So the login node and the SLURM worker nodes share the /scratch volume, but not the /prj volume.
I'm wondering if it is possible to use condor_remote_cluster to install bosco in a folder that is different from the homedir,
or otherwise instruct bosco to use a different sandbox folder.
Maybe I can create the sandbox in /scratch and symlink it from /prj?
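If that route is viable, the relocation might look like the sketch below. The helper name and both paths are assumptions for illustration, not anything bosco itself provides:

```shell
# Sketch only: move an existing bosco sandbox onto the shared /scratch
# volume and leave a symlink at its original home-directory location.
# relocate_sandbox and the example paths are assumptions, not bosco's API.
relocate_sandbox() {
    home_sb="$1"      # sandbox under the home volume, e.g. /prj/<project>/<user>/bosco/sandbox
    scratch_sb="$2"   # target on the shared volume, e.g. /scratch/<project>/<user>/bosco/sandbox
    mkdir -p "$scratch_sb"
    # Preserve any existing sandbox contents before replacing the directory.
    if [ -d "$home_sb" ] && [ ! -L "$home_sb" ]; then
        cp -a "$home_sb"/. "$scratch_sb"/ && rm -rf "$home_sb"
    fi
    mkdir -p "$(dirname "$home_sb")"
    ln -sfn "$scratch_sb" "$home_sb"   # symlink home path -> scratch path
}
```

Whether the gridmanager and the worker nodes then resolve the symlink consistently is exactly the open question.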
Thanks,
Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 17:14
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jaime,
thank you for the hint.
Updating remote_gahp as you suggested allows me to use my actual username.
It looks like the condor_remote_cluster test is now sort of working, in that jobs from my local queue now go on hold with a complaint like:
03/03/26 17:05:05 [3338543] HoldReason = "FT_GAHP at 127.0.0.1 failed to send file(s) to <127.0.0.1:45472>:
|Error: 2 total failures: first failure: reading from file /prj/neutrinos/vito.benedetto/bosco/sandbox/8380/8380a2fe/<factory>_9618_<factory>_38.0_1772578786/_condor_stdout: (errno 2) No such file or directory; GRIDMANAGER failed to receive file(s) from <factoryIP:46364>"
Possibly there is something misconfigured.
Could this be an issue with umask? On the remote cluster its value is 0022
and the sandbox permissions are drwx------, while on another system where I used bosco successfully the umask is 0007 and the sandbox permissions are drwxrwx---+.
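For reference, umask only removes permission bits from whatever mode the creating process requests, so the two umask values alone produce different directory modes. A generic illustration with throwaway directories (not bosco's sandbox):

```shell
# umask masks bits out of the 0777 mode that a plain mkdir requests:
cd "$(mktemp -d)"
(umask 0022; mkdir d_0022)   # 0777 & ~0022 = 0755 -> drwxr-xr-x
(umask 0007; mkdir d_0007)   # 0777 & ~0007 = 0770 -> drwxrwx---
ls -ld d_0022 d_0007
```

Note that umask can only tighten permissions; a drwx------ sandbox under umask 0022 would mean the creating code itself requested a more restrictive mode.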
Another thing I noticed is that with condor_remote_cluster -t the -b option to select a different bosco installation doesn't work, though I can use the default bosco folder for the test.
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 16:43
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
[EXTERNAL] - This message is from an external sender
Aha. There's a bug in the remote_gahp script, where it assumes a username won't have a dot in it.
You can fix your copy if you're handy with a text editor. Line 106 should be changed to look like this:
if [[ $1 =~ ^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$ ]]; then
You need to add a dot to the first [A-Za-z0-9_.-] sequence in the pattern.
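A quick way to verify the amended pattern is to exercise it directly in bash; the username and host below are made-up examples, and parse_endpoint is just a throwaway wrapper, not part of remote_gahp:

```shell
# Check that the amended pattern accepts a username containing a dot.
# The regex is the one from the fixed line of remote_gahp.
pattern='^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$'
parse_endpoint() {
    if [[ $1 =~ $pattern ]]; then
        # Groups: 2 = user, 3 = host, 5 = optional port
        echo "user='${BASH_REMATCH[2]}' host='${BASH_REMATCH[3]}' port='${BASH_REMATCH[5]}'"
    else
        echo "no match"
    fi
}
parse_endpoint "first.last@cluster.example.org"       # dotted username now matches
parse_endpoint "first.last@cluster.example.org:2222"  # optional port still parses
```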
Weâll include the fix in an upcoming release.
- Jaime
On Mar 3, 2026, at 4:24 PM, Vito Di Benedetto <vito@xxxxxxxx> wrote:
Hi Jaime,
I think I figured out why I had that weird issue when trying remote_gahp:
the username I got on the remote cluster has a dot in it.
If I run remote_gahp with the dot stripped from the username, the two syntaxes behave the same, except that with the dotless username I get permission denied, as expected.
-Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 15:27
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jaime,
If I run remote_gahp as you suggested I get an error:
$ remote_gahp <user>@<remote_cluster> blahpd
Missing remote command
Usage: /usr/sbin/remote_gahp [options] remote_hostname [options] REMOTE_CMD [remote arguments]
/usr/sbin/remote_gahp [options] remote_hostname [remote options and arguments]
Options:
...
In case it matters, this comes from condor-24.0.14-1.el9.x86_64
To test bosco on the remote server I used the command below with the associated output.
$ condor_remote_cluster -t <user>@<remote_cluster>
Testing ssh to <user>@<remote_cluster>...Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
Warning: Permanently added ' <user>@<remote_cluster>' (ED25519) to the list of known hosts.
Passed!
Testing remote submission...Passed!
Submission and log files for this job are in /home/gfactory/bosco-test/boscotest.qRF3i
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
03/03/26 15:14:00 [2227] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:00 [2227] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/03/26 15:14:04 [2225] DaemonKeepAlive: in SendAliveToParent()
03/03/26 15:14:04 [2225] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:04 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
For the test I use the remote username, which is different from the local username (gfactory).
In the boscotest.qRF3i folder the logfile has the following detail:
...
026 (030.000.000) 2026-03-03 15:13:35 Detected Down Grid Resource
GridResource: batch slurm <user>@<remote_cluster>
...
As part of the setup I also ran:
./condor_remote_cluster_sdumont -s <user>@<remote_cluster>
This worked, though I had to modify it so that the for loop in the get_status() function also tries "squeue --me", to cover the SLURM case where that is the command needed to get the queue status.
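The shape of that change might look like the sketch below. This is illustrative only: the real get_status() in condor_remote_cluster is more involved, and the command list here is an assumption:

```shell
# Illustrative only: try each candidate queue-status command in turn,
# including "squeue --me" for SLURM setups that require it.
get_status() {
    for status_cmd in "$@"; do
        # $status_cmd is intentionally unquoted so multi-word
        # commands like "squeue --me" split into command + flag.
        if $status_cmd >/dev/null 2>&1; then
            echo "$status_cmd"   # report which command answered
            return 0
        fi
    done
    return 1
}
# On a SLURM cluster one might call:
# get_status "condor_q" "qstat" "squeue" "squeue --me"
```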
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 14:33
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
The two following commands should be equivalent (where <user> is the account name on the login node):
remote_gahp --rgahp-user <user> <remote_cluster> blahpd
remote_gahp <user>@<remote_cluster> blahpd
The output you got indicates that the ssh connection is operating properly.
Looking back at the errors in your GridmanagerLog, I see now that the complaint is about bad arguments to the remote_gahp script when HTCondor invokes it.
How exactly did you test it initially? For the condor_remote_cluster --test command, if the username on the login node is different from your local account, you'll have to include it as part of the hostname/ip-address, like so:
condor_remote_cluster --test <user>@<cluster>
- Jaime
On Mar 3, 2026, at 12:10 PM, Vito Di Benedetto <vito@xxxxxxxx> wrote:
Hi Jaime,
I tried the command you suggested and it seems to work, but I needed to modify it a bit, i.e. I ran:
remote_gahp --rgahp-user <user> <remote_cluster> blahpd
The output I got is:
Agent pid 3213418
Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
quit
S Server\ exiting
Agent pid 3213418 killed
It seems to be the same as in your example.
Is there something else I can check?
Thanks,
Vito
[EXTERNAL] - This message is from an external sender
The <user>@<remote cluster> you specify for grid_resource in the submit description should be the same values that you'd use for ssh'ing to the login node.
You can try running the same command that HTCondor uses to connect to the login node on the command line:
% remote_gahp <user>@<remote_cluster> blahpd
Agent pid 3946166
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
QUIT
S Server\ exiting
Agent pid 3946166 killed
%
remote_gahp is a shell script that runs ssh with the correct arguments to establish the network connection for HTCondor to use. You can examine exactly what it's doing to determine why the connection is failing.
- Jaime
On Mar 2, 2026, at 7:13 PM, Vito Di Benedetto via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Dear HTCondor development Team,
I'm trying to test a remote server where I recently got an account; the remote system is RHEL8 and it uses SLURM.
On this system I have been able to successfully submit and run test jobs interactively using SLURM.
As a next step I prepared a bosco setup using condor_remote_cluster.
However, when I try to test the cluster it looks like there is some issue.
In "/var/log/condor/GridManagerLog..gfactory" I see the following log message:
[...]
3/02/26 19:03:56 [2227] Trying to update collector <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Attempting to send update via TCP to collectorhostname> <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Gahp Server (pid=2767793) exited with status 1 unexpectedly
03/02/26 19:03:57 [2225] DaemonKeepAlive: in SendAliveToParent()
03/02/26 19:03:57 [2225] Completed DC_CHILDALIVE to daemon at <ip:28881>
03/02/26 19:03:57 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/02/26 19:03:59 [2225] GAHP server pid = 2768017
03/02/26 19:03:59 [2225] GAHP[2768017] (stderr) -> Missing remote command
03/02/26 19:03:59 [2225] Failed to read GAHP server version
03/02/26 19:03:59 [2225] Error starting <remote cluster> GAHP: Missing remote command\nMissing remote command\nMissing remote command\nMissing remote command\n
03/02/26 19:03:59 [2225] resource <user>@<remote cluster> is still down
[...]
where I have redacted hostnames and IPs.
In case it matters, the remote cluster requires a VPN to be accessed.
To run the test I make sure the VPN is active and that I can login to the cluster.
When I log in to the remote cluster, the node's address is in the local network (the 172.20 range); I'm not sure whether this can interfere with the bosco test.
Thank you for any help to address this issue.
Vito Di Benedetto
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/