Dear all,
thanks to a lot of effort from Jaime and Cole, we managed to
get to the real underlying issues resulting in that weird behavior!
What I did not write originally was that the submit node was
intended to be used only by an HTCondor CE running on it.
I started checking local job submissions because of errors
encountered by the latter trying to pass on its jobs...
Our debugging exercises led to several relevant findings:
# cat /etc/condor-ce/config.d/99-fix-client-auth.conf
SEC_CLIENT_AUTHENTICATION = OPTIONAL
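For anyone applying the same override, a quick way to confirm that the CE really picks it up might look like the sketch below. It assumes the HTCondor-CE tools are installed on the host and that condor_ce_config_val accepts the same -verbose option as condor_config_val. The helper name show_ce_setting is only illustrative.

```shell
# Print the value the CE uses for a configuration knob and, where available,
# the file that defined it. show_ce_setting is a hypothetical helper name.
show_ce_setting() {
    knob="$1"
    if ! command -v condor_ce_config_val >/dev/null 2>&1; then
        # Assumption: the CE tools live in PATH when HTCondor-CE is installed.
        echo "condor_ce_config_val not found; is HTCondor-CE installed on this host?"
        return 0
    fi
    # -verbose also reports where the value was defined
    condor_ce_config_val -verbose "$knob" 2>/dev/null || echo "$knob is not defined"
}

show_ce_setting SEC_CLIENT_AUTHENTICATION
```

After changing a file under /etc/condor-ce/config.d, the CE daemons normally need to re-read their configuration (for example with condor_ce_reconfig) before the new value takes effect.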
Hopefully this summary can be of benefit to others, cheers!
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Maarten Litmaath via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Sunday, February 9, 2025 12:30 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: Maarten Litmaath <Maarten.Litmaath@xxxxxxx>
Subject: [HTCondor-users] v24.0.4 condor_submit only works sometimes
Dear HTCondor experts,
I have set up a v24.0.4 mini cluster on Alma 9 using the Admin Quick Start Guide:
As an unprivileged user on the Submit Node, condor_submit fails as shown:
======================================================================
[alicesgm@htc24s-ce ~]$ cat my-test.jdl
cmd = my-test.sh
output = my-test.out.$(ClusterId)
error = my-test.err.$(ClusterId)
log = my-test.log.$(ClusterId)
+MaxMemory = 50
queue 1
[alicesgm@htc24s-ce ~]$ condor_submit my-test.jdl
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
[alicesgm@htc24s-ce ~]$
======================================================================
If I keep trying, though, eventually it works:
======================================================================
[alicesgm@htc24s-ce ~]$ for i in `seq 30`; do condor_submit my-test.jdl &&
break; sleep 61; done &>> log-$$.txt < /dev/null &
[1] 33484
[alicesgm@htc24s-ce ~]$ tail -f log-$$.txt
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
ERROR: Failed to commit job submission into the queue.
ERROR: Failed to create new User record for condor@xxxxxxxx
Submitting job(s).
1 job(s) submitted to cluster 19.
======================================================================
That job then runs fine, while the next job submission will fail again, etc.
There appear to be two problems here:
1) The Admin Quick Start Guide gives me a cluster that does not work.
2) Due to some bug, job submissions sometimes get through nonetheless.
Advice would be appreciated, thanks!