
Re: [HTCondor-users] Special characters within paths from queueing from file



And... I was wrong. It actually doesn't work. It seemed like it was working at first, but after reviewing the jobs further, you are right. The ClassAds from the parallel universe don't seem to allow us to build the hostfile correctly.

Back to the drawing board it is! 

Martin

-----Original Message-----
From: Beaumont, Martin 
Sent: May 6, 2026 4:22 PM
To: htcondor-users@xxxxxxxxxxx
Subject: RE: [HTCondor-users] Special characters within paths from queueing from file

Hi Zach,

Ah... well... It is working here, after fixing the path names by removing the + signs.

We did try your method, but it kept making condor unstable because of the number of jobs it created (back on version 9; unsure if it would still have that effect with version 25).
We noticed that Condor seems to much prefer job clusters over single jobs, which is why we were trying with a single .sub.

If you're interested, I could explain how it runs here, but in my experience with parallel jobs, no two work the same. So you might have a case where it wouldn't work anyway.
TLDR: We generate all the jobs first (using a script), each in their own folder, then we add all those folders in the jobsList.txt for a single .sub to queue from. All the jobs are working directly within the user's /home over NFS, instead of locally on the diskless execute nodes.
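
The list-building step in the TLDR above could be sketched as follows. This is only an illustration, assuming every immediate subdirectory of one root folder is a job folder; make_jobs_list and its arguments are hypothetical names, not the script actually used here.

```shell
#!/bin/sh
# Hypothetical helper: write each immediate subdirectory of a job root
# to a list file, one absolute or relative path per line.
make_jobs_list() {
    jobs_root=$1
    out_file=${2:-jobsList.txt}
    # -mindepth/-maxdepth are supported by GNU and BSD find.
    # Plain files under the root are ignored; only directories are listed.
    find "$jobs_root" -mindepth 1 -maxdepth 1 -type d | sort > "$out_file"
}
```

Called as make_jobs_list /home/user/mainjob jobsList.txt, it would produce the kind of file a "queue ... from jobsList.txt" line reads.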

As for SELinux, not much to be honest. Other than my improvised .te file, and the condor_tcp_network_connect you already mentioned, I also set nis_enabled to 1. At some point, I also had to let restorecond fix labels in /home, but I don't think that's necessary anymore.
Also, SELinux isn't enforced on the execute nodes, only on the management node, which probably simplifies things a lot.
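
For reference, the two booleans named above could be set persistently along these lines. A minimal sketch: apply_condor_sebools and the DRY_RUN switch are hypothetical names added for illustration, and the real commands need root on a host with SELinux tooling installed.

```shell
#!/bin/sh
# Hypothetical helper: turn on the two SELinux booleans mentioned in this thread.
# DRY_RUN=1 only prints the commands; otherwise they are executed (root required).
apply_condor_sebools() {
    for bool in condor_tcp_network_connect nis_enabled; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "setsebool -P $bool on"
        else
            # -P makes the setting persist across reboots
            setsebool -P "$bool" on
        fi
    done
}
```

Running it once with DRY_RUN=1 shows what would change before anything is applied.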

Martin

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Zach McGrew
Sent: May 6, 2026 12:13 PM
To: htcondor-users@xxxxxxxxxxx
Subject: Re: [HTCondor-users] Special characters within paths from queueing from file

Hi Martin,

I ran into this issue a while back with some of my users here as well. The issue is that you can't queue multiple jobs to the parallel universe from a single submission file. Each "job" that gets queued from the file describes a requirement for one or more EPs and not another job in that batch of jobs. This is mentioned in the MPI section of the documentation [1], but it is easy to miss. I can see the use case they were going for, but I do agree it's not super convenient to submit a ton of parallel universe jobs. The only workaround I've used so far is a tiny shell script that invokes multiple submissions and adjusts the submission variables as needed (using -a to append to, and override, whatever was already in the .sub):

for i in $(seq 1 10) ; do
  condor_submit my_parallel.sub -a arguments="$i"
done
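
The same loop could be driven by a list file such as the jobsList.txt from the original post. This is a rough sketch, not a tested recipe: it assumes each line is a job directory and that passing it as initialdir suits your submit file (neither is stated in this thread), and submit_each and DRY_RUN are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: run one condor_submit per line of a list file.
# Assumption: each line is a job directory, passed to the job as initialdir.
# DRY_RUN=1 prints the commands instead of submitting anything.
submit_each() {
    list_file=$1
    sub_file=${2:-my_parallel.sub}
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r job_dir || [ -n "$job_dir" ]; do
        # Skip blank lines
        [ -z "$job_dir" ] && continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            printf 'condor_submit %s -a initialdir=%s\n' "$sub_file" "$job_dir"
        else
            condor_submit "$sub_file" -a "initialdir=$job_dir"
        fi
    done < "$list_file"
}
```

Each line becomes its own submission and therefore its own cluster, so a long list produces many clusters at once.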

I haven't used the job-sets [2] feature yet, but that might also fit this use case well. If not, I could definitely see writing something with the Python API, which would allow for more flexibility and maintainability.

On the SELinux note, high-five! I do the same. A quick search through my Puppet codebase shows I set the 'condor_tcp_network_connect' bool to yes, then have some custom rules to allow the sshd to work for condor_ssh_to_job, and allow that ssh server to create new connections (socks-proxy; we had a very unique use case on one node), set some fcontexts on the separate scratch disk path, and one more to enable HTCondor to talk to esmtp to send email notifications. What are you setting?

-Zach

Reference URLs:
1. https://htcondor.readthedocs.io/en/latest/users-manual/env-of-job.html#differing-requirements-for-the-machines
2. https://htcondor.readthedocs.io/en/latest/users-manual/job-sets.html

________________________________________
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Beaumont, Martin <Martin.Beaumont@xxxxxxxxxxxxxxx>
Sent: Wednesday, May 6, 2026 7:27 AM
To: htcondor-users@xxxxxxxxxxx
Subject: [HTCondor-users] Special characters within paths from queueing from file

Hello,

This is more of an FYI, but if you think this is a bug or at least would require some better error handling, here's what a user of mine made me lose some more hair over.

This is on HTCondor 25.8.2, which I upgraded from 9.0.17 while trying to fix this.

He was trying to queue several MPI jobs from a single .sub, by using "queue jobsList from jobsList.txt".
Submitting was a success, but when it came to matching, it always ended as "no match found" without more explanation, even with logs in verbose mode.
The compute nodes are configured in DedicatedScheduler with auto partitionable slots, and the headnode has pre-emption configured to accelerate matchmaking. Nothing fancy.

Here's an example of condor_q -better-analyze output:

The Requirements expression for job 63.000 is

    (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) &&
    (TARGET.Memory >= RequestMemory) && (TARGET.Cpus >= RequestCpus) && (TARGET.HasFileTransfer)

    [0]    : TARGET.Arch == "X86_64"
    [1]    : TARGET.OpSys == "LINUX"
    [2]    : [0] && [1]
    [3]    : TARGET.Disk >= RequestDisk
    [4]    : [2] && [3]
    [5]    : TARGET.Memory >= RequestMemory
    [6]    : [4] && [5]
    [7]    : TARGET.Cpus >= RequestCpus
    [8]    : [6] && [7]
    [9]    : TARGET.HasFileTransfer
    [10]   : [8] && [9]

Job 63.000 defines the following attributes:

    RequestCpus = 64
    RequestDisk = MAX({ 1024,(TransferInputSizeMB + 1) * 1.25 }) * 1024 (kb)
    RequestMemory = 65536 (mb)
    TransferInputSizeMB = 4

The Requirements expression for job 63.000 reduces to these conditions:

        Slots
Step   Matched  Condition
----- --------- ---------
[0]          23  TARGET.Arch == "X86_64"
[1]          23  TARGET.OpSys == "LINUX"
[3]          23  TARGET.Disk >= RequestDisk
[5]          23  TARGET.Memory >= RequestMemory
[7]          23  TARGET.Cpus >= RequestCpus
[9]          23  TARGET.HasFileTransfer


063.000:  Run analysis summary ignoring user priority.  Of 23 slots on 23 machines,
      0 slots are rejected by your job's requirements
      0 slots reject your job because of their own requirements
     23 slots match and are willing to run your job

No successful match recorded.
Last failed match: Wed May  6 09:07:26 2026 Reason for last match failure: no match found


The problem was that the paths in jobsList.txt included several "+" characters.
Something like: /home/user/mainjob/job+xconfig+yconfig+zconfig

Unsure if this is normal behavior, but the fact that condor_submit didn't catch it, and that the system's logs didn't say why no match was found, is why I'm writing this email.
But, if there's a way for condor to handle those "+" with some better quoting, let me know.
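
A pre-submit check for such paths could look like the sketch below. Note that this thread does not establish that "+" is the actual root cause, so treat it purely as a way to flag suspect lines; check_plus_paths is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a list file that contains a "+",
# prefixed with its line number, and return non-zero if any were found.
check_plus_paths() {
    list_file=$1
    # -F treats "+" as a literal character rather than a regex operator
    if grep -n -F '+' "$list_file"; then
        echo "warning: paths containing '+' found in $list_file" >&2
        return 1
    fi
    return 0
}
```

Used as "check_plus_paths jobsList.txt && condor_submit ...", it would stop before submission whenever a flagged path is present.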


On a side note, while I upgraded condor, I noticed the file /usr/share/condor/htcondor.pp, which I'm not sure existed back in version 9.
Yes, I have SELinux enabled. I used to make my own .te from testing and checking the prevention notices one by one. (pain) So, as a suggestion, it'd be nice if, during installation or upgrades of the condor package, it would automatically detect whether SELinux is enforcing and apply your .pp.
(That sounded way more wrong than it should...)
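
Done by hand, the suggestion above might look roughly like this. It is a sketch under assumptions: that the shipped htcondor.pp is meant to be loaded with semodule (the thread only says the file exists), and install_condor_policy plus its DRY_RUN switch are hypothetical names.

```shell
#!/bin/sh
# Hypothetical helper: load a policy module only when SELinux is enforcing.
# $1 = path to the .pp file, $2 = mode override (defaults to getenforce output).
# DRY_RUN=1 prints the semodule command instead of running it (root required otherwise).
install_condor_policy() {
    pp_file=${1:-/usr/share/condor/htcondor.pp}
    # Falls back to "Unknown" when getenforce is missing or fails
    mode=${2:-$(getenforce 2>/dev/null || echo Unknown)}
    if [ "$mode" != "Enforcing" ]; then
        echo "SELinux mode is $mode; skipping $pp_file"
        return 0
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "semodule -i $pp_file"
    else
        semodule -i "$pp_file"
    fi
}
```

The optional second argument exists only so the logic can be tried on a machine without SELinux.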


Martin


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe

The archives can be found at: https://www-auth.cs.wisc.edu/lists/htcondor-users/