Hi Carles,
you're not alone. Seeing the same on a 23.0.16 AP (excerpt from
StarterLog):
10/21/24 09:36:56 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:57 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:58 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:59 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:00 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
10/21/24 09:37:01 (pid:2703990) Successfully moved procid 2703990 to cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
10/21/24 09:37:01 (pid:2703990) Error setting cgroup cpu weight of 100 in cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
10/21/24 09:37:01 (pid:2703990) Error enabling per-cgroup oom killing: 2 (No such file or directory)
10/21/24 09:37:01 (pid:2703835) Create_Process succeeded, pid=2703990
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) unhandled job exit: pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703838, signal=15
This looks a bit like
https://opensciencegrid.atlassian.net/browse/HTCONDOR-2438, but that is
supposed to be fixed in 23.0.16.
Cheers,
Andreas
On Fri, 2024-10-18 at 10:01 +0200, Carles Acosta wrote:
Dear all,
We've updated some of our EPs to condor 23.0.16, but interactive
condor_submit fails after the upgrade.
$ condor_submit -i submit_file
Submitting job(s).
1 job(s) submitted to cluster 32.
Waiting for job to start...
Welcome to slot1_1@xxxxxxxxxxxxxx!
Connection to condor-job.hnode51.pic.es closed by remote host.
Connection to condor-job.hnode51.pic.es closed.
On the EP side:
[...]
10/18/24 09:47:06 (pid:1977164) error getting family usage for pid 1981447 in VanillaProc::JobReaper()
10/18/24 09:47:06 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:07 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:08 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:09 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:10 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:11 (pid:1977164) All jobs have exited... starter exiting
10/18/24 09:47:11 (pid:1977164) **** condor_starter (condor_STARTER) pid 1977164 EXITING WITH STATUS 0
I can attach the complete error if needed.
EPs on stable version 23.0.15 or development version 23.10.1 do not
show this issue. The AP is always on 23.0.10, but this seems to be a
cgroups error on the startd side.
Thank you.
Cheers,
Carles
--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/