
[HTCondor-users] Error in slot log of 24.0.1



Hello Experts,

Since Rocky 9 defaults to cgroup v2, are the cgroup-related errors below expected in the condor slot log file when using dynamic slots?
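For context, a quick way to confirm which cgroup version a host is actually running (this is a generic check, not HTCondor-specific; the htcondor path below simply mirrors the one from the log and may not exist on other hosts):

```shell
# cgroup2fs means the unified cgroup v2 hierarchy (the Rocky 9 default);
# tmpfs indicates legacy v1 or hybrid mode.
fstype=$(stat -fc %T /sys/fs/cgroup 2>/dev/null || echo "unknown")
echo "cgroup filesystem: $fstype"

# On a v2 host, the per-job cgroups HTCondor creates live under BASE_CGROUP;
# listing the tree shows whether the paths from the log are present.
ls /sys/fs/cgroup/system.slice/htcondor/ 2>/dev/null || echo "no htcondor cgroup tree"
```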

11/06/24 12:55:11 (pid:268499) Using wrapper /usr/local/sbin/os/condor_ldpreload_wrapper.sh to exec /usr/sbin/sshd -i -e -f /spare/condor/dir_268499/.condor_ssh_to_job_1/sshd_config
11/06/24 12:55:11 (pid:268499) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
11/06/24 12:55:12 (pid:268499) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
11/06/24 12:55:13 (pid:268499) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
11/06/24 12:55:14 (pid:268499) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
11/06/24 12:55:15 (pid:268499) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs: 2 No such file or directory
11/06/24 12:55:16 (pid:268499) ProcFamilyDirectCgroupV2::track_family_via_cgroup error writing to /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/cgroup.subtree_control: Device or resource busy
11/06/24 12:55:16 (pid:270743) Successfully moved procid 270743 to cgroup /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/cgroup.procs
11/06/24 12:55:16 (pid:270743) Error setting cgroup cpu weight of 1400 in cgroup /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd: No such file or directory
11/06/24 12:55:16 (pid:270743) Error enabling per-cgroup oom killing: 2 (No such file or directory)
11/06/24 12:55:16 (pid:270743) cgroup v2 could not attach gpu device limiter to cgroup: Operation not permitted
11/06/24 12:55:16 (pid:268499) Create_Process succeeded, pid=270743
11/06/24 12:55:16 (pid:268499) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/memory.events: 2 No such file or directory
11/06/24 12:55:16 (pid:268499) Process exited, pid=270490, status=0
11/06/24 12:55:16 (pid:268499) unhandled job exit: pid=270490, status=0
11/06/24 12:55:56 (pid:268499) Process exited, pid=269162, status=0
11/06/24 12:55:56 (pid:268499) Failed to write ToE tag to .job.ad file (13): Permission denied
11/06/24 12:56:03 (pid:268499) Failed to open '.update.ad' to read update ad: No such file or directory (2).
11/06/24 12:56:03 (pid:268499) Failed to open '.update.ad' to read update ad: No such file or directory (2).
11/06/24 12:57:03 (pid:268499) Failed to open '.update.ad' to read update ad: No such file or directory (2).
11/06/24 12:57:03 (pid:268499) Failed to open '.update.ad' to read update ad: No such file or directory (2).
11/06/24 12:58:44 (pid:268499) ProcFamilyDirectCgroupV2::has_been_oom_killed cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/memory.events: 2 No such file or directory
11/06/24 12:58:44 (pid:268499) Process exited, pid=270743, status=255
11/06/24 12:58:44 (pid:268499) Removing /spare/condor/dir_268499/.condor_ssh_to_job_1
11/06/24 12:58:44 (pid:268499) ProcFamilyDirectCgroupV2::get_usage cannot open /sys/fs/cgroup/system.slice/htcondor/condor_spare_condor_slot1_2@xxxxxxxxxxxxx/sshd/memory.current: 2 No such file or directory
11/06/24 12:58:44 (pid:268499) error getting family usage for pid 270743 in VanillaProc::JobReaper()
11/06/24 12:58:44 (pid:268499) Not entering transfer queue because sandbox (20) is too small (<= 104857600).
11/06/24 12:58:44 (pid:268499) All jobs have exited... starter exiting

cgroup-related configuration:

$ condor_config_val -dump | grep -i cgroup
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = none
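In case it helps with diagnosis, `condor_config_val -verbose` can show where each of these knobs is defined (the knob names here are the ones from the dump above; output location will of course vary by install):

```shell
# -verbose reports the config file and line that set the value,
# which helps rule out an override shadowing the intended setting.
condor_config_val -verbose BASE_CGROUP CGROUP_MEMORY_LIMIT_POLICY
```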

Thanks & Regards,
Vikrant Aggarwal