
Re: [HTCondor-users] v24.0.4 condor_submit only works sometimes



Dear all,
thanks to a lot of effort from Jaime and Cole, we managed to
get to the real underlying issues resulting in that weird behavior!

What I did not write originally was that the submit node was
intended to be used only by an HTCondor CE running on it.

I started checking local job submissions because of errors
encountered by the latter trying to pass on its jobs...

Our debugging exercises led to several relevant findings:
  • A mistake in the Schedd code by which not all relevant
    variables are cleared between subsequent requests.
    Normally that issue would not really be a problem,
    but in this case it was, due to unusual circumstances.

  • A mistake and an omission in my own documentation
    describing how to set up a mini cluster fronted by an
    HTCondor CE. With versions < v24, things worked OK
    nonetheless, which added to the mystery...

  • A mistake in htcondor-ce-condor-24.0.2 (and 24.2.0)
    that needs to be worked around by the addition of an
    extra configuration file:
# cat /etc/condor-ce/config.d/99-fix-client-auth.conf 
SEC_CLIENT_AUTHENTICATION = OPTIONAL
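
For anyone wanting to script that workaround, here is a minimal sketch. The function name write_client_auth_fix is made up for illustration. The directory and file name are the ones shown above, and the optional argument only exists so the function can be tried against a scratch directory first.

```shell
# Hypothetical helper: write the workaround file into a config directory.
# Defaults to the HTCondor-CE config.d path quoted in this message.
write_client_auth_fix() {
    conf_dir="${1:-/etc/condor-ce/config.d}"
    # Single setting, exactly as in the workaround file shown above.
    printf '%s\n' 'SEC_CLIENT_AUTHENTICATION = OPTIONAL' \
        > "$conf_dir/99-fix-client-auth.conf"
}

# Example (as root on the CE host):
#   write_client_auth_fix
```

Assuming the standard HTCondor-CE command-line tools are installed, running condor_ce_reconfig afterwards and then checking the value with condor_ce_config_val SEC_CLIENT_AUTHENTICATION should confirm that the setting was picked up.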

Hopefully this summary can be of benefit to others, cheers!



From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Sunday, February 9, 2025 12:30 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: [HTCondor-users] v24.0.4 condor_submit only works sometimes
 
Dear HTCondor experts,
I have set up a v24.0.4 mini cluster on Alma 9 using the Admin Quick Start Guide:

https://htcondor.readthedocs.io/en/latest/getting-htcondor/admin-quick-start.html

As an unprivileged user on the Submit Node, condor_submit fails as shown:

======================================================================
[alicesgm@htc24s-ce ~]$ cat my-test.jdl 
cmd = my-test.sh
output = my-test.out.$(ClusterId)
error  = my-test.err.$(ClusterId)
log = my-test.log.$(ClusterId)
+MaxMemory = 50
queue 1
[alicesgm@htc24s-ce ~]$ condor_submit my-test.jdl 
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
[alicesgm@htc24s-ce ~]$ 
======================================================================

If I keep trying, though, eventually it works:

======================================================================
[alicesgm@htc24s-ce ~]$ for i in `seq 30`; do condor_submit my-test.jdl &&
 break; sleep 61; done &>> log-$$.txt < /dev/null &
[1] 33484
[alicesgm@htc24s-ce ~]$ tail -f log-$$.txt
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
1 job(s) submitted to cluster 19.
======================================================================

That job then runs fine, while the next job submission will fail again, etc.
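
To make the pattern easier to reproduce, the retry loop above can be wrapped in a small function that also reports which attempt got through. This is only a sketch: the function name retry and its argument order are my own choices, and nothing here is specific to HTCondor.

```shell
# Retry a command up to MAX times, sleeping DELAY seconds after each failure.
# Prints the number of the attempt that succeeded; returns 1 if none did.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if "$@"; then
            echo "succeeded on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
        # Only sleep if another attempt is still to come.
        if [ "$attempt" -le "$max" ]; then
            sleep "$delay"
        fi
    done
    echo "gave up after $max attempts" >&2
    return 1
}

# Example matching the session above (30 tries, 61 seconds apart):
#   retry 30 61 condor_submit my-test.jdl
```

The output of the wrapped command is passed through unchanged, so the ERROR lines still show up in between.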

There appear to be two problems here:

1) The Admin Quick Start Guide gives me a cluster that does not work.

2) Due to some bug, job submissions sometimes get through nonetheless.

Advice would be appreciated, thanks!