Hi list,

we are in the middle of migrating our infrastructure from CentOS 7 to Alma Linux 9. Most of the infra is on CC7 with condor 9.*, and we have one testing WN cluster with AL9 on condor 23.

So far the ATLAS workload works fine on the new cluster, but the ALICE jobs land on the WN and fail right away without producing any output.

One possible cause:

06/14/24 09:39:54 (pid:1) Unexpected permissions failure in setting hard limit for max core sizesetrlimit(4, new = [rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615]) : old = [rlim_cur = 0, rlim_max = 0], errno: 1(Operation not permitted). Attempting workaround.
06/14/24 09:39:54 (pid:1) Workaround not applicable, no hard limit enforcement for max core size.

We have disabled core dumps for the condor service (a local hack to keep our local users from flooding the FS with core files). But even after setting the testing WN to allow core dumps, the behaviour is still the same.

I do not know what to do next: neither condor_history nor the logs on the WN say why the job failed. The payload (job agent) works fine when run manually under the appropriate user.

To make things even more interesting, we have an outage planned for next week to migrate most of our infra to AL9 and HTC23.

Side notes: the CEs are ARC, and the WNs use v1 cgroups because v2 is not working.

Cheers
AM

--
Alexandr Mikula
Oddělení síťování a výpočetní techniky & Výpočetní středisko
Fyzikální ústav Akademie věd České republiky, v. v. i.
Institute of Physics of the Czech Academy of Sciences