Dear list,

I just stumbled over an increased failure rate of ATLAS jobs at our site. ATLAS is running a mixture of single-core and multi-core jobs. To keep the multi-core jobs from starving, we run condor_defrag. It looks like condor_defrag is evicting single-core jobs, giving them MaxVacateTime to finish (DEFRAG_DRAINING_SCHEDULE = graceful):

10/18/20 19:19:53 slot1_2[33437.0]: max vacate time expired. Escalating to a fast shutdown of the job.
10/18/20 19:19:53 slot1_1[74229.0]: max vacate time expired. Escalating to a fast shutdown of the job.

However, this is unwanted: it actually kills jobs here. There is probably a knob for this, but which one do I need to turn so that the (partitionable) slot is simply drained until enough resources for the usual eight-core jobs are free, without actively vacating running jobs from the chosen machine?

Thanks,
Andreas

--
| Andreas Haupt    | E-Mail: andreas.haupt@xxxxxxx
| DESY Zeuthen     | WWW:    http://www-zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone:  +49/33762/7-7359
| D-15738 Zeuthen  | Fax:    +49/33762/7-7216