Re: [HTCondor-users] condor_submit interactive broken on 23.0.16?



Hi Carles,

you're not alone. Seeing the same on a 23.0.16 AP (excerpt from
StarterLog):

10/21/24 09:36:56 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:57 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:58 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:36:59 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:00 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
10/21/24 09:37:01 (pid:2703990) Successfully moved procid 2703990 to cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs
10/21/24 09:37:01 (pid:2703990) Error setting cgroup cpu weight of 100 in cgroup /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd: No such file or directory
10/21/24 09:37:01 (pid:2703990) Error enabling per-cgroup oom killing: 2 (No such file or directory)
10/21/24 09:37:01 (pid:2703835) Create_Process succeeded, pid=2703990
10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) unhandled job exit: pid=2703954, status=0
10/21/24 09:37:01 (pid:2703835) Process exited, pid=2703838, signal=15

This looks a bit like
https://opensciencegrid.atlassian.net/browse/HTCONDOR-2438, but that is
supposed to be fixed in 23.0.16.
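
For anyone who wants to check whether their own StarterLog shows the same pattern, here is a minimal Python sketch that counts ProcFamilyDirectCgroupV2 messages per method. It is not part of HTCondor. The function name, the regular expression, and the sample list are made up for illustration, and it assumes the log lines are unwrapped (one entry per line, as in the actual log file) and follow the timestamp/pid layout seen in the excerpt above.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
# "MM/DD/YY HH:MM:SS (pid:N) ProcFamilyDirectCgroupV2::<method> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\(pid:(?P<pid>\d+)\) "
    r"ProcFamilyDirectCgroupV2::(?P<method>\w+) (?P<msg>.*)$"
)


def summarize_cgroup_errors(lines):
    """Count ProcFamilyDirectCgroupV2 messages per method name (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("method")] += 1
    return counts


# Sample lines copied from the excerpt above, with the wrapped paths rejoined.
SAMPLE_LINES = [
    "10/21/24 09:36:56 (pid:2703835) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory",
    "10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/htcondor/condor_batch_slot1_1@xxxxxxxxxxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy",
    "10/21/24 09:37:01 (pid:2703835) Create_Process succeeded, pid=2703990",
    "10/21/24 09:37:01 (pid:2703835) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory",
]

if __name__ == "__main__":
    for method, count in summarize_cgroup_errors(SAMPLE_LINES).items():
        print(f"{method}: {count}")
```

To run it against a real log, pass an open file object instead of SAMPLE_LINES; the location of the StarterLog depends on the local configuration.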

Cheers,
Andreas

On Fri, 2024-10-18 at 10:01 +0200, Carles Acosta wrote:
> Dear all,
> 
> We've updated some of our EPs to condor 23.0.16, but the condor_submit
> interactive fails after the upgrade.
> 
> $ condor_submit -i submit_file 
> Submitting job(s).
> 1 job(s) submitted to cluster 32.
> Waiting for job to start...
> Welcome to slot1_1@xxxxxxxxxxxxxx!
> Connection to condor-job.hnode51.pic.es closed by remote host.
> Connection to condor-job.hnode51.pic.es closed.
> 
> On the EP side:
> 
> [...]
> 10/18/24 09:47:06 (pid:1977164) error getting family usage for pid 1981447 in VanillaProc::JobReaper()
> 10/18/24 09:47:06 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:07 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:08 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:09 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:10 (pid:1977164) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/htcondor/condor_nvme1_execute_slot1_1@xxxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
> 10/18/24 09:47:11 (pid:1977164) All jobs have exited... starter exiting
> 10/18/24 09:47:11 (pid:1977164) **** condor_starter (condor_STARTER) pid 1977164 EXITING WITH STATUS 0
> 
> I can attach the complete error if needed.
> 
> EPs on stable version 23.0.15 or development version 23.10.1 don't
> show this issue. The AP is on 23.0.10 in all cases, but it seems to be
> a cgroups error located in the startd.
> 
> Thank you.
> 
> Cheers,
> 
> Carles
> 
> -- 
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 08
> Fax: +34 93 581 41 10
> http://www.pic.es
> Avís - Aviso - Legal Notice: http://legal.ifae.es

-- 
| Andreas Haupt            | E-Mail: andreas.haupt@xxxxxxx
| DESY, Zeuthen            | WWW:    http://www.zeuthen.desy.de/~ahaupt
| Platanenallee 6          | Phone: +49/33762/7-7359
| D-15738 Zeuthen          |





