
Re: [HTCondor-users] condor_submit interactive broken on 23.0.16?



Hello Andreas,

You are correct. The fix for HTCONDOR-2438 was omitted from 23.0.16. We will release 23.0.17 soon with this fix.

...Tim

On 10/21/24 03:00, Andreas Haupt wrote:
Hi Carles,

you're not alone. Seeing the same on a 23.0.16 AP (excerpt from
StarterLog):

10/21/24 09:36:56 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:57 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:58 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:59 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:00 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
10/21/24 09:37:01 (pid:2703990) Successfully moved procid 2703990 to cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
10/21/24 09:37:01 (pid:2703990) Error setting cgroup cpu weight of 100 in cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
10/21/24 09:37:01 (pid:2703990) Error enabling per-cgroup oom killing: 2 (No such file or directory)
10/21/24 09:37:01 (pid:2703835) Create_Process succeeded, pid=2703990
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) unhandled job exit: pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703838, signal=15
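
For anyone who wants to check their own StarterLog for the same symptom, here is a minimal Python sketch. It is not part of HTCondor, and the names `join_wrapped` and `count_cgroup_errors` are made up for illustration. It assumes each record starts with the `MM/DD/YY HH:MM:SS (pid:N)` prefix seen in the excerpt above, re-joins records that a mail client may have hard-wrapped, and counts the `ProcFamilyDirectCgroupV2` messages per function name.

```python
import re
from collections import Counter

# Prefix that starts each record in the StarterLog excerpt (assumed format).
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} \(pid:\d+\)")
# The two kinds of ProcFamilyDirectCgroupV2 failure seen in the excerpt.
CGROUP_ERROR = re.compile(
    r"ProcFamilyDirectCgroupV2::(\w+) (?:cannot open|error writing to)"
)

def join_wrapped(lines):
    """Glue continuation lines back onto the record they belong to."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # No separator is added: wraps are assumed to fall mid-path.
            records[-1] += line
    return records

def count_cgroup_errors(lines):
    """Count cgroup v2 errors, keyed by the reporting function name."""
    counts = Counter()
    for record in join_wrapped(lines):
        match = CGROUP_ERROR.search(record)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing the lines of a log file (for example `count_cgroup_errors(open(path))`, where `path` is wherever your StarterLog lives) returns a `Counter`; repeated `processesInCgroup` entries for an `sshd` cgroup would match what is reported in this thread.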

This looks a bit like
https://opensciencegrid.atlassian.net/browse/HTCONDOR-2438, but that is
supposed to be fixed in 23.0.16.

Cheers,
Andreas

On Fri, 2024-10-18 at 10:01 +0200, Carles Acosta wrote:
Dear all,

We've updated some of our EPs to HTCondor 23.0.16, but interactive
condor_submit fails after the upgrade.

$ condor_submit -i submit_file
Submitting job(s).
1 job(s) submitted to cluster 32.
Waiting for job to start...
Welcome to slot1_1@xxxxxxxxxxxxxx!
Connection to condor-job.hnode51.pic.es closed by remote host.
Connection to condor-job.hnode51.pic.es closed.

On the EP side:

[...]
10/18/24 09:47:06 (pid:1977164) error getting family usage for pid 1981447 in VanillaProc::JobReaper()
10/18/24 09:47:06 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:07 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:08 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:09 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:10 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/18/24 09:47:11 (pid:1977164) All jobs have exited... starter exiting
10/18/24 09:47:11 (pid:1977164) **** condor_starter (condor_STARTER) pid 1977164 EXITING WITH STATUS 0

I can attach the complete error if needed.

EPs on stable version 23.0.15 or development version 23.10.1 don't show
this issue. The AP is on 23.0.10 in all cases, but this seems to be a
cgroups error located in the startd.

Thank you.

Cheers,

Carles

--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://legal.ifae.es
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to
htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

--
Tim Theisen (he, him, his)
Release Manager
HTCondor & Open Science Grid
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736