On Feb 9, 2024, at 4:55 AM, Thomas Hartmann
<thomas.hartmann@xxxxxxx> wrote:
Hi all,
we occasionally get jobs coming in via our EL7 / Condor 9.0.15 CEs from one
of our supported VOs. They recently switched one of their submission
nodes to EL9 / Condor 23, and now jobs coming in from this submit node
sometimes stay idle in our Condor cluster.
So far only jobs from this submission node seem to be affected, but
only a subset of the jobs from this submitter are problematic; other
jobs start without problems. We have not yet found an obvious
difference between the starting/running jobs and the jobs seemingly
stuck in idle, so our first suspicion of an EL7/Condor 9 vs.
EL9/Condor 23 issue did not hold.
Affected jobs all seem to have
 'NumJobMatches == 1 && JobStatus == 1'
in their job ads, i.e., each got matched exactly once.
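For reference, a minimal shell sketch (assuming shell access on the schedd/CE host; the constraint uses the standard NumJobMatches and JobStatus job-ad attributes) to pull out the jobs in this state:

```shell
# Sketch: list jobs that were matched exactly once but are still idle
# (JobStatus == 1 means Idle). Run on the schedd / CE host.
constraint='NumJobMatches == 1 && JobStatus == 1'
if command -v condor_q >/dev/null 2>&1; then
    condor_q -allusers -constraint "$constraint" \
             -af ClusterId ProcId LastMatchTime LastRejMatchReason
else
    # condor_q is only available where HTCondor is installed.
    echo "run on the schedd host: condor_q -allusers -constraint '$constraint'"
fi
```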
We increased the logging on the CE & Condor entry points and the
central managers to `ALL_DEBUG = D_FULLDEBUG`, but so far obvious
hints on why these jobs stay idle and are not re-matched are sparse.
On the active central manager, such a job has a matching attempt
logged like [1], where the target execution point's startd (dynamic
slots) seems to simply reject the job. Afterwards there are no
further matching attempts.
In the rejecting worker's logs there is no trace of the affected
cluster id, so I have no good idea why the worker did not accept the
job (and I am a bit hesitant to increase logging on all execution points).
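Rather than raising the debug level fleet-wide, one option is a drop-in config fragment on the single rejecting startd only (batch0558 in [1]); a sketch, with the file name being just an example:

```
# /etc/condor/config.d/99-temp-debug.conf, on the suspect execution point only
STARTD_DEBUG = D_FULLDEBUG
```

followed by a `condor_reconfig` on that host, and reverted once the rejection has been captured in the StartLog.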
In principle the jobs are matchable, and `condor_q -better-analyze`
looks good [2], with our execution points nominally willing to run them.
Maybe someone has an idea why these once matched & rejected jobs,
i.e., NumJobMatches == 1, are not matched again?
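To see what the schedd itself recorded about the failed match, it may help to dump the match-bookkeeping attributes of one affected job; a minimal sketch (the cluster id is the example job from the logs, and NumJobMatches / LastMatchTime / LastRejMatchTime / LastRejMatchReason are standard job-ad attributes):

```shell
# Sketch: dump the match bookkeeping the schedd keeps for one affected job.
job='19458678.0'   # example cluster.proc taken from the logs above
attrs='NumJobMatches LastMatchTime LastRejMatchTime LastRejMatchReason'
if command -v condor_q >/dev/null 2>&1; then
    condor_q "$job" -af $attrs
else
    # condor_q is only available where HTCondor is installed.
    echo "run on the schedd host: condor_q $job -af $attrs"
fi
```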
Package versions are as in [3a,b] for the CE and the worker (not in
sync due to reasons...).
Cheers,
 Thomas
[1]
NegotiatorLog:02/09/24 08:52:52     Request 19458678.00000:
autocluster 723303 (request count 1 of 2)
NegotiatorLog:02/09/24 08:52:52       Matched 19458678.0
group_ATLAS.atlasprd000@xxxxxxx
<131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>
preempting none
<131.169.161.162:9620?addrs=131.169.161.162-9620+[2001-638-700-10a0--1-1a2]-9620&alias=batch0558.desy.de&noUDP&sock=startd_3590_0516>
slot1@xxxxxxxxxxxxxxxxx
NegotiatorLog:02/09/24 08:52:52     Request 19458678.00000:
autocluster 723303 (request count 2 of 2)
NegotiatorLog:02/09/24 08:52:52       Rejected 19458678.0
group_ATLAS.atlasprd000@xxxxxxx
<131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>:
no match found
MatchLog:02/09/24 08:52:52       Matched 19458678.0
group_ATLAS.atlasprd000@xxxxxxx
<131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>
preempting none
<131.169.161.162:9620?addrs=131.169.161.162-9620+[2001-638-700-10a0--1-1a2]-9620&alias=batch0558.desy.de&noUDP&sock=startd_3590_0516>
slot1@xxxxxxxxxxxxxxxxx
MatchLog:02/09/24 08:52:52       Rejected 19458678.0
group_ATLAS.atlasprd000@xxxxxxx
<131.169.223.129:9620?addrs=131.169.223.129-9620+[2001-638-700-10df--1-81]-9620&alias=grid-htcondorce0.desy.de&noUDP&sock=schedd_1587_20e3>:
no match found
[2]
-- Schedd: grid-htcondorce0.desy.de : <131.169.223.129:4792?...
The Requirements expression for job 19458678.000 is
    NODE_IS_HEALTHY && ifThenElse(x509UserProxyVOName is
"desy",TEST_RESOURCE == true,GRID_RESOURCE == true) && (OpSysAndVer
== "CentOS7") &&
    ifThenElse((x509UserProxyVOName isnt "desy") &&
(x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt
"calice") &&
      (x509UserProxyVOName isnt "belle"),(OLD_RESOURCE ==
false),(OLD_RESOURCE == false) || (OLD_RESOURCE == true)) &&
ifThenElse((x509UserProxyVOName isnt "desy") &&
      (x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt
"belle"),(BELLECALIBRATION_RESOURCE ==
false),(BELLECALIBRATION_RESOURCE is false) ||
      (BELLECALIBRATION_RESOURCE is true))
Job 19458678.000 defines the following attributes:
    x509UserProxyVOName = "atlas"
The Requirements expression for job 19458678.000 reduces to these
conditions:

          Slots
Step   Matched  Condition
-----  -------  ---------
[0]       9634  NODE_IS_HEALTHY
[1]       9634  ifThenElse(x509UserProxyVOName is
"desy",TEST_RESOURCE == true,GRID_RESOURCE == true)
[3]       9634  OpSysAndVer == "CentOS7"
[5]       9634  ifThenElse((x509UserProxyVOName isnt "desy") &&
(x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt
"calice") && (x509UserProxyVOName isnt "belle"),(OLD_RESOURCE ==
false),(OLD_RESOURCE == false) || (OLD_RESOURCE == true))
[7]       9634  ifThenElse((x509UserProxyVOName isnt "desy") &&
(x509UserProxyVOName isnt "ops") && (x509UserProxyVOName isnt
"belle"),(BELLECALIBRATION_RESOURCE ==
false),(BELLECALIBRATION_RESOURCE is false) ||
(BELLECALIBRATION_RESOURCE is true))

19458678.000:  Job has been matched.
Last successful match: Fri Feb 9 08:52:52 2024
19458678.000:  Run analysis summary ignoring user priority. Of 359
machines,
      0 are rejected by your job's requirements
     17 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
    342 are able to run your job
[3.a - CE Entry Point]
condor-9.0.15-1.el7.x86_64
condor-boinc-7.16.16-1.el7.x86_64
condor-classads-9.0.15-1.el7.x86_64
condor-externals-9.0.15-1.el7.x86_64
condor-procd-9.0.15-1.el7.x86_64
htcondor-ce-5.1.5-1.el7.noarch
htcondor-ce-apel-5.1.5-1.el7.noarch
htcondor-ce-bdii-5.1.3-1.el7.noarch
htcondor-ce-client-5.1.5-1.el7.noarch
htcondor-ce-condor-5.1.5-1.el7.noarch
htcondor-ce-view-5.1.5-1.el7.noarch
python2-condor-9.0.15-1.el7.x86_64
python3-condor-9.0.15-1.el7.x86_64
[3.b - Execution Point]
condor-9.0.8-1.el7.x86_64
condor-boinc-7.16.16-1.el7.x86_64
condor-classads-9.0.8-1.el7.x86_64
condor-externals-9.0.8-1.el7.x86_64
condor-procd-9.0.8-1.el7.x86_64
htcondor-ce-client-5.1.3-1.el7.noarch
python2-condor-9.0.8-1.el7.x86_64
python3-condor-9.0.8-1.el7.x86_64
_______________________________________________
HTCondor-users mailing list
Archives: https://lists.cs.wisc.edu/archive/htcondor-users/