Hi Jaime,
If I run remote_gahp as you suggested I get an error:
$ remote_gahp <user>@<remote_cluster> blahpd
Missing remote command
Usage: /usr/sbin/remote_gahp [options] remote_hostname [options] REMOTE_CMD [remote arguments]
/usr/sbin/remote_gahp [options] remote_hostname [remote options and arguments]
Options:
...
In case it matters, this comes from condor-24.0.14-1.el9.x86_64
To test bosco on the remote server I used the command below with the associated output.
$ condor_remote_cluster -t <user>@<remote_cluster>
Testing ssh to <user>@<remote_cluster>...Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
Warning: Permanently added ' <user>@<remote_cluster>' (ED25519) to the list of known hosts.
Passed!
Testing remote submission...Passed!
Submission and log files for this job are in /home/gfactory/bosco-test/boscotest.qRF3i
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
03/03/26 15:14:00 [2227] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:00 [2227] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/03/26 15:14:04 [2225] DaemonKeepAlive: in SendAliveToParent()
03/03/26 15:14:04 [2225] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:04 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
For the test I used the remote username, which is different from the local username (gfactory).
In the boscotest.qRF3i folder, the log file has the following detail:
...
026 (030.000.000) 2026-03-03 15:13:35 Detected Down Grid Resource
GridResource: batch slurm <user>@<remote_cluster>
...
As part of the setup I also ran:
./condor_remote_cluster_sdumont -s <user>@<remote_cluster>
This worked, though I had to modify it so that the for loop in the "get_status()" function also calls "squeue --me", to cover the SLURM case, where that command is used to get the queue status.
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 14:33
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
[EXTERNAL] - This message is from an external sender
The two following commands should be equivalent (where <user> is the account name on the login node):
remote_gahp --rgahp-user <user> <remote_cluster> blahpd
remote_gahp <user>@<remote_cluster> blahpd
The output you got indicates that the ssh connection is operating properly.
Looking back at the errors in your GridmanagerLog, I see now that the complaint is about bad arguments to the remote_gahp script when HTCondor invokes it.
How exactly did you test it initially? For the condor_remote_cluster --test command, if the username on the login node is different than your local account, you'll have to include it as part of the hostname/ip-address, like so:
condor_remote_cluster --test <user>@<cluster>
- Jaime