Hi Carles,

you're not alone. Seeing the same on a 23.0.16 AP (excerpt from StarterLog):

10/21/24 09:36:56 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:57 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:58 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:59 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:00 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
10/21/24 09:37:01 (pid:2703990) Successfully moved procid 2703990 to cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
10/21/24 09:37:01 (pid:2703990) Error setting cgroup cpu weight of 100 in cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
10/21/24 09:37:01 (pid:2703990) Error enabling per-cgroup oom killing: 2 (No such file or directory)
10/21/24 09:37:01 (pid:2703835) Create_Process succeeded, pid=2703990
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) unhandled job exit: pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703838, signal=15

Looks a bit like https://opensciencegrid.atlassian.net/browse/HTCONDOR-2438 - but that's supposed to be fixed in 23.0.16

Cheers,
Andreas

On Fri, 2024-10-18 at 10:01 +0200, Carles Acosta wrote:
> Dear all,
>
> We've updated some of our EP to condor 23.0.16, but the condor_submit
> interactive fails after the upgrade.
>
> $ condor_submit -i submit_file
> Submitting job(s).
> 1 job(s) submitted to cluster 32.
> Waiting for job to start...
> Welcome to slot1_1@xxxxxxxxxxxxxx!
> Connection to condor-job.hnode51.pic.es closed by remote host.
> Connection to condor-job.hnode51.pic.es closed.
>
> On the EP side:
>
> [...]
> 10/18/24 09:47:06 (pid:1977164) error getting family usage for pid 1981447 in VanillaProc::JobReaper()
> 10/18/24 09:47:06 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:07 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:08 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:09 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:10 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:11 (pid:1977164) All jobs have exited... starter exiting
> 10/18/24 09:47:11 (pid:1977164) **** condor_starter (condor_STARTER) pid 1977164 EXITING WITH STATUS 0
>
> I can attach the complete error if needed.
>
> If the EPs are on stable version 23.0.15 or development version 23.10.1
> they don't show this issue. The AP is always on 23.0.10, but it seems
> to be a cgroups error focused in the startd.
>
> Thank you.
>
> Cheers,
>
> Carles
>
> --
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 08
> Fax: +34 93 581 41 10
> http://www.pic.es
> Avís - Aviso - Legal Notice: http://legal.ifae.es
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to
> htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
>
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/

--
| Andreas Haupt    | E-Mail: andreas.haupt@xxxxxxx
| DESY, Zeuthen    | WWW:    http://www.zeuthen.desy.de/~ahaupt
| Platanenallee 6  | Phone:  +49/33762/7-7359
| D-15738 Zeuthen  |