Hi Jaime,
while doing some debugging, I figured out what the issue could be.
On the SLURM worker nodes, the home directory is different from the home directory on the login node.
The directory structure is as follows:
login node:
  homedir: /prj/<project>/<user>
  scratchdir: /scratch/<project>/<user>
worker node:
  homedir: /prj/<project>/<user>, but there the /prj mount point actually mounts the /scratch volume
So the login node and the SLURM worker nodes share the /scratch volume, but not the /prj volume.
I'm wondering if it is possible to use condor_remote_cluster to install bosco in a folder that is different from the homedir,
or otherwise instruct bosco to use a different sandbox folder.
Maybe I can create the sandbox in /scratch and symlink it from /prj?
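If that route is viable, the relocation might look like the sketch below. The helper name and both paths are assumptions for illustration, not anything bosco itself provides:

```shell
# Sketch only: move an existing bosco sandbox onto the shared /scratch
# volume and leave a symlink at its original home-directory location.
# relocate_sandbox and the example paths are assumptions, not bosco's API.
relocate_sandbox() {
    home_sb="$1"      # sandbox under the home volume, e.g. /prj/<project>/<user>/bosco/sandbox
    scratch_sb="$2"   # target on the shared volume, e.g. /scratch/<project>/<user>/bosco/sandbox
    mkdir -p "$scratch_sb"
    # Preserve any existing sandbox contents before replacing the directory.
    if [ -d "$home_sb" ] && [ ! -L "$home_sb" ]; then
        cp -a "$home_sb"/. "$scratch_sb"/ && rm -rf "$home_sb"
    fi
    mkdir -p "$(dirname "$home_sb")"
    ln -sfn "$scratch_sb" "$home_sb"   # symlink home path -> scratch path
}
```

Whether the gridmanager and the worker nodes then resolve the symlink consistently is exactly the open question.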
Thanks,
Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 17:14
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jaime,
thank you for the hint.
Updating remote_gahp as you suggested allows me to use my actual username.
It looks like the condor_remote_cluster test is now sort of working, in that jobs from my local queue now go on hold with a complaint like:
03/03/26 17:05:05 [3338543] HoldReason = "FT_GAHP at 127.0.0.1 failed to send file(s) to <127.0.0.1:45472>:
|Error: 2 total failures: first failure: reading from file /prj/neutrinos/vito.benedetto/bosco/sandbox/8380/8380a2fe/<factory>_9618_<factory>_38.0_1772578786/_condor_stdout: (errno 2) No such file or directory; GRIDMANAGER failed to receive file(s) from <factoryIP:46364>"
Possibly there is something misconfigured.
Could this be an issue with umask? On the remote cluster its value is 0022
and the sandbox permissions are drwx------, while on another system where I used bosco successfully the umask is 0007 and the sandbox permissions are drwxrwx---+.
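For reference, umask only removes permission bits from whatever mode the creating process requests, so the two umask values alone produce different directory modes. A generic illustration with throwaway directories (not bosco's sandbox):

```shell
# umask masks bits out of the 0777 mode that a plain mkdir requests:
cd "$(mktemp -d)"
(umask 0022; mkdir d_0022)   # 0777 & ~0022 = 0755 -> drwxr-xr-x
(umask 0007; mkdir d_0007)   # 0777 & ~0007 = 0770 -> drwxrwx---
ls -ld d_0022 d_0007
```

Note that umask can only tighten permissions; a drwx------ sandbox under umask 0022 would mean the creating code itself requested a more restrictive mode.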
Another thing I noticed is that with condor_remote_cluster -t the -b option to select a different bosco installation doesn't work, though I can use the default bosco folder for the test.
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 16:43
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
[EXTERNAL] - This message is from an external sender
Aha. There's a bug in the remote_gahp script, where it assumes a username won't have a dot in it.
You can fix your copy if you're handy with a text editor. Line 106 should be changed to look like this:
if [[ $1 =~ ^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$ ]]; then
You need to add a dot to the first [A-Za-z0-9_.-] sequence in the pattern.
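A quick way to verify the amended pattern is to exercise it directly in bash; the username and host below are made-up examples, and parse_endpoint is just a throwaway wrapper, not part of remote_gahp:

```shell
# Check that the amended pattern accepts a username containing a dot.
# The regex is the one from the fixed line of remote_gahp.
pattern='^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$'
parse_endpoint() {
    if [[ $1 =~ $pattern ]]; then
        # Groups: 2 = user, 3 = host, 5 = optional port
        echo "user='${BASH_REMATCH[2]}' host='${BASH_REMATCH[3]}' port='${BASH_REMATCH[5]}'"
    else
        echo "no match"
    fi
}
parse_endpoint "first.last@cluster.example.org"       # dotted username now matches
parse_endpoint "first.last@cluster.example.org:2222"  # optional port still parses
```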
Weâll include the fix in an upcoming release.
- Jaime
On Mar 3, 2026, at 4:24 PM, Vito Di Benedetto <vito@xxxxxxxx> wrote:
Hi Jaime,
I think I figured out why I had that weird issue when trying remote_gahp:
the username I got on the remote cluster has a dot in it.
If I run remote_gahp with the dot stripped from the username, the two syntaxes behave the same, except that with the dotless username I get permission denied, as expected.
-Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 15:27
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jaime,
If I run remote_gahp as you suggested I get an error:
$ remote_gahp <user>@<remote_cluster> blahpd
Missing remote command
Usage: /usr/sbin/remote_gahp [options] remote_hostname [options] REMOTE_CMD [remote arguments]
/usr/sbin/remote_gahp [options] remote_hostname [remote options and arguments]
Options:
...
In case it matters, this comes from condor-24.0.14-1.el9.x86_64
To test bosco on the remote server I used the command below with the associated output.
$ condor_remote_cluster -t <user>@<remote_cluster>
Testing ssh to <user>@<remote_cluster>...Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
Warning: Permanently added ' <user>@<remote_cluster>' (ED25519) to the list of known hosts.
Passed!
Testing remote submission...Passed!
Submission and log files for this job are in /home/gfactory/bosco-test/boscotest.qRF3i
Waiting for jobmanager to accept job...Passed
Checking for submission to remote slurm cluster (could take ~30 seconds)...Failed
Showing last 5 lines of logs:
03/03/26 15:14:00 [2227] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:00 [2227] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/03/26 15:14:04 [2225] DaemonKeepAlive: in SendAliveToParent()
03/03/26 15:14:04 [2225] Completed DC_CHILDALIVE to daemon at <myfactory:28881>
03/03/26 15:14:04 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
For the test I use the remote username, which is different from the local username (gfactory).
In the boscotest.qRF3i folder the logfile has the following detail:
...
026 (030.000.000) 2026-03-03 15:13:35 Detected Down Grid Resource
GridResource: batch slurm <user>@<remote_cluster>
...
As part of the setup I also ran:
./condor_remote_cluster_sdumont -s <user>@<remote_cluster>
This worked, though I had to modify it so that the for loop in the get_status() function also tries "squeue --me", to cover the SLURM case where that is the command needed to get the queue status.
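The shape of that change might look like the sketch below. This is illustrative only: the real get_status() in condor_remote_cluster is more involved, and the command list here is an assumption:

```shell
# Illustrative only: try each candidate queue-status command in turn,
# including "squeue --me" for SLURM setups that require it.
get_status() {
    for status_cmd in "$@"; do
        # $status_cmd is intentionally unquoted so multi-word
        # commands like "squeue --me" split into command + flag.
        if $status_cmd >/dev/null 2>&1; then
            echo "$status_cmd"   # report which command answered
            return 0
        fi
    done
    return 1
}
# On a SLURM cluster one might call:
# get_status "condor_q" "qstat" "squeue" "squeue --me"
```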
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 14:33
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
The two following commands should be equivalent (where <user> is the account name on the login node):
remote_gahp --rgahp-user <user> <remote_cluster> blahpd
remote_gahp <user>@<remote_cluster> blahpd
The output you got indicates that the ssh connection is operating properly.
Looking back at the errors in your GridmanagerLog, I see now that the complaint is about bad arguments to the remote_gahp script when HTCondor invokes it.
How exactly did you test it initially? For the condor_remote_cluster --test command, if the username on the login node is different from your local account, you'll have to include it as part of the hostname/ip-address, like so:
condor_remote_cluster --test <user>@<cluster>
- Jaime
On Mar 3, 2026, at 12:10 PM, Vito Di Benedetto <vito@xxxxxxxx> wrote:
Hi Jaime,
I tried the command you suggested and it seems to work, but I needed to modify it a bit, i.e. I ran:
remote_gahp --rgahp-user <user> <remote_cluster> blahpd
The output I got is:
Agent pid 3213418
Warning: Permanently added '<remote_cluster>' (ED25519) to the list of known hosts.
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
quit
S Server\ exiting
Agent pid 3213418 killed
It seems to be the same as in your example.
Is there something else I can check?
Thanks,
Vito
[EXTERNAL] - This message is from an external sender
The <user>@<remote cluster> you specify for grid_resource in the submit description should be the same values that you'd use for ssh'ing to the login node.
You can try running the same command that HTCondor uses to connect to the login node on the command line:
% remote_gahp <user>@<remote_cluster> blahpd
Agent pid 3946166
$GahpVersion: 1.8.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
QUIT
S Server\ exiting
Agent pid 3946166 killed
%
remote_gahp is a shell script that runs ssh with the correct arguments to establish the network connection for HTCondor to use. You can examine exactly what it's doing to determine why the connection is failing.
- Jaime
On Mar 2, 2026, at 7:13 PM, Vito Di Benedetto via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
Dear HTCondor development Team,
I'm trying to test a remote server where I recently got an account; the remote system is RHEL8 and it uses SLURM.
On this system I have been able to successfully submit and run test jobs interactively using SLURM.
As a next step I prepared a bosco setup using condor_remote_cluster.
However, when I try to test the cluster it looks like there is some issue.
In "/var/log/condor/GridManagerLog..gfactory" I see the following log message:
[...]
3/02/26 19:03:56 [2227] Trying to update collector <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Attempting to send update via TCP to collectorhostname> <ip:9618?alias=hostname>
03/02/26 19:03:56 [2227] Gahp Server (pid=2767793) exited with status 1 unexpectedly
03/02/26 19:03:57 [2225] DaemonKeepAlive: in SendAliveToParent()
03/02/26 19:03:57 [2225] Completed DC_CHILDALIVE to daemon at <ip:28881>
03/02/26 19:03:57 [2225] DaemonKeepAlive: Leaving SendAliveToParent() - success
03/02/26 19:03:59 [2225] GAHP server pid = 2768017
03/02/26 19:03:59 [2225] GAHP[2768017] (stderr) -> Missing remote command
03/02/26 19:03:59 [2225] Failed to read GAHP server version
03/02/26 19:03:59 [2225] Error starting <remote cluster> GAHP: Missing remote command\nMissing remote command\nMissing remote command\nMissing remote command\n
03/02/26 19:03:59 [2225] resource <user>@<remote cluster> is still down
[...]
where I have redacted hostnames and IPs.
In case it matters, the remote cluster requires a VPN to be accessed.
To run the test I make sure the VPN is active and that I can login to the cluster.
When I log in to the remote cluster, the node's address is in the local network (the 172.20 range); I'm not sure whether this can interfere with the bosco test.
Thank you for any help to address this issue.
Vito Di Benedetto
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/