Hi Jamie,
to close the loop:
creating the sandbox folder in /scratch and making a symlink to it in the /prj folder where bosco is installed works.
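For the archives, a minimal sketch of the workaround. The real paths would be /scratch/&lt;project&gt;/&lt;user&gt; and /prj/&lt;project&gt;/&lt;user&gt;; temporary directories stand in for them here so the sketch is self-contained:

```shell
# Stand-ins for the two volumes (illustrative only):
scratch=$(mktemp -d)    # plays the role of the shared /scratch volume
prj=$(mktemp -d)        # plays the role of the node-local /prj volume

# Create the real sandbox on the shared volume...
mkdir -p "$scratch/bosco/sandbox"

# ...and point bosco's expected location at it via a symlink.
ln -s "$scratch/bosco/sandbox" "$prj/bosco-sandbox"

# Files written through the symlink land on the shared volume,
# so both the login node and the worker nodes see them.
touch "$prj/bosco-sandbox/job.out"
ls "$scratch/bosco/sandbox"
```

The symlink makes the path bosco writes to resolve to storage that both node types actually share.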
If you have any advice on how to handle this better, I'll be glad to hear it.
Thank you for your help with this,
Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Wednesday, March 4, 2026 12:00
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jamie,
doing some debug, I figured out what the issue could be.
On the SLURM worker node, the home directory is different from the home directory on the login node.
The directory structure is as follows:
login node:
homedir: /prj/<project>/<user>
scratchdir: /scratch/<project>/<user>
worker node:
homedir: /prj/<project>/<user>, but on the worker nodes the /prj mount point actually mounts /scratch
So the login node and the SLURM worker nodes are sharing the /scratch volume, but not the /prj volume.
I'm wondering if it is possible to use condor_remote_cluster to install bosco in a folder that is different from the homedir,
or otherwise instruct bosco to use a different sandbox folder.
Maybe I can create the sandbox in /scratch and make a symlink of it in /prj?
Thanks,
Vito
From: Vito Di Benedetto <vito@xxxxxxxx>
Sent: Tuesday, March 3, 2026 17:14
To: Jaime Frey <jfrey@xxxxxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Hi Jamie,
thank you for the hint.
Updating remote_gahp as you suggested allows me to use my actual username.
It looks like the condor_remote_cluster test is now partially working: jobs in my local queue now go on hold with a complaint like:
03/03/26 17:05:05 [3338543] HoldReason = "FT_GAHP at 127.0.0.1 failed to send file(s) to
<127.0.0.1:45472>: |Error: 2 total failures: first failure: reading from file /prj/neutrinos/vito.benedetto/bosco/sandbox/8380/8380a2fe/<factory>_9618_<factory>_38.0_1772578786/_condor_stdout: (errno 2) No such file or directory; GRIDMANAGER failed to receive
file(s) from <factoryIP:46364>"
Possibly there is something misconfigured.
Could this be an issue with umask? On the remote cluster its value is 0022,
and the sandbox permissions are drwx------, while on another system where I used bosco successfully I have umask 0007 and sandbox permissions drwxrwx---+.
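To make the umask question concrete, here is a quick illustration of how umask shapes the mode of newly created directories (run in a throwaway directory; it is not tied to the bosco sandbox itself, and the trailing + seen above comes from an ACL, which umask does not control):

```shell
# Work in a temporary directory so nothing real is touched.
tmp=$(mktemp -d); cd "$tmp"

# mkdir creates directories with mode 0777 & ~umask:
(umask 0022; mkdir d22)   # 0777 & ~0022 = 0755
(umask 0007; mkdir d07)   # 0777 & ~0007 = 0770

stat -c %A d22    # drwxr-xr-x
stat -c %A d07    # drwxrwx---
```

So a restrictive umask alone would not produce drwx------; if the sandbox ends up with that mode, something (bosco or the filesystem) is likely setting it explicitly.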
Another thing I noticed: with condor_remote_cluster -t, the -b option to select a different bosco install doesn't work, though I can use the default bosco folder for the test.
Thanks,
Vito
From: Jaime Frey <jfrey@xxxxxxxxxxx>
Sent: Tuesday, March 3, 2026 16:43
To: Vito Di Benedetto <vito@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] condor_remote_cluster fails to test a remote cluster
Aha. There's a bug in the remote_gahp script, where it assumes a username won't have a dot in it.
You can fix your copy if you're handy with a text editor. Line 106 should be changed to look like this:
if [[ $1 =~ ^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$ ]]; then
You need to add a dot to the first [A-Za-z0-9_.-] sequence in the pattern.
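For anyone wanting to verify the fix before patching, the corrected pattern can be exercised directly in bash; the check function and the example host below are just for illustration:

```shell
# Wrap the corrected line-106 regex in a tiny test harness.
check() {
  if [[ $1 =~ ^(([A-Za-z_][A-Za-z0-9_.-]*)@)?([A-Za-z0-9.-]+)(:([0-9]*))?$ ]]; then
    # BASH_REMATCH holds the capture groups: 2=user, 3=host, 5=port.
    echo "user=${BASH_REMATCH[2]} host=${BASH_REMATCH[3]} port=${BASH_REMATCH[5]}"
  else
    echo "no match"
  fi
}

# A dotted username now parses (hypothetical host name):
check "vito.benedetto@cluster.example.org:22"
# -> user=vito.benedetto host=cluster.example.org port=22
```

Without the added dot in the first character class, the same input would fall through to "no match".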
We'll include the fix in an upcoming release.
- Jaime