Hi there,

I am a developer on the GlideinWMS project, and we are currently looking into implementing a blackhole detection mechanism for glideins. There was some discussion about this back in 2018, and while working on enabling this feature in GlideinWMS I have been referring to the notes from that discussion, which were made available internally within my team. All the details I describe next are based on those notes.

We have the following lines in our condor configuration:
STARTD.STATISTICS_TO_PUBLISH_LIST = $(STATISTICS_TO_PUBLISH_LIST) JobDuration, JobBusyTime
STARTD_SLOT_ATTRS = RecentJobBusyTimeAvg, RecentJobBusyTimeCount

The notes seemed to convey that 16 attributes are generated in each slot by the two statistics probes (JobDuration, JobBusyTime). While these attributes are not published by default (due to their number), publishing them can be enabled by adding the first line of the snippet above to the configuration of the execute nodes. In addition, as I understand it, setting STARTD_SLOT_ATTRS should enable two attributes per slot, slot<N>_RecentJobBusyTimeAvg and slot<N>_RecentJobBusyTimeCount, depending on the type of slot (fixed vs. partitionable). However, I do not see these two attributes in the ClassAd when I query it using the command:
`condor_status -l <slot1@glidein> | grep -i "job"`

I ran this on the client side. I wanted to reach out to understand whether I’m missing something, and to learn whether things have changed in HTCondor since 2018 (which is when the initial discussion about the blackhole mechanism took place between the GlideinWMS and HTCondor teams).

If you need further information about anything that I’ve described above, please let me know and I’ll be happy to share. Looking forward to your reply.

Thanks,
Namratha Urs (she/her)
Software Developer, Scientific Compute Services and Tools
Computational Science and AI Directorate, Fermi National Accelerator Laboratory
Ph.D. Candidate, Computer Science | University of North Texas