Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

On Thu, Oct 30, 2025 at 8:41âAM Cole Bollig <cabollig@xxxxxxxx> wrote:

Hi Curtis,

The gap is not caused by assigning the job to a slot. At the point of the first message, the Starter should have completed setting up the jobs environment/scratch space and transferred all input files in preparation for executing the job. Due to potential job execution deferral via cron specifications, the Starter doesn't actually spawn the job right then. Rather it sets a timer (in the non-deferral case to zero) to handle in daemon core (our event driven code handler). Even though the job spawn is ready to go, daemon core may service other actions (i.e. Timers, commands, pipes, etc) that are ready to be handled. You could try upping the starts debug with D_FULLDEBUG to hopefully see what is happening between the two messages.

I have multiple things regarding your cgroups question and comments

I do not believe this is affecting the time gap between the two messages in the previous question since the starter does cgroup setup after the second message

USE_CGROUPS is no longer a configuration option. This was an option while we were still developing cgroup support and had less confidence in HTCondor's integration with cgroups. However, we are trying to move to a world of always using cgroups (when HTCondor is able) as cgroups provides much better monitoring, control, and guarantees of clean up.

What version of HTCondor is this EP running?

-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 29, 2025 3:37 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
Â

I'm trying to understand the logs in the `/var/log/condor/StarterLog.slot1_*` files. I have two questions:

1. Why is there a 12 second gap between the job being "set to execute immediately" and the job actually starting? Is that the amount of time between the job being queued/idle and being assigned a slot to start running? InÂother words, if there were a slot available when the job was queued, it would start immediately and there would be little to no delay?

```

10/29/25 12:55:34 (pid:1098229) Job 129525.0 set to execute immediately
10/29/25 12:55:46 (pid:1098229) Starting a VANILLA universe job with ID: 129525.0

```

2. Why does condor spend 5 seconds trying and failing to open the cgroup file? I have tried setting `USE_CGROUPS = False` and `CGROUP_MEMORY_LIMIT_POLICY = none` to try to try to eliminate this delay but that hasn't helped. I haven't decided whether I want to use cgroups or not so it would be nice to know how to eliminate this delay both when using cgroups and when not.

```

10/29/25 12:55:46 (pid:1098229) About to exec /path/to/example/exec

10/29/25 12:55:46 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:47 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:48 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:49 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:50 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:51 (pid:1102573) Successfully moved procid 1102573 to cgroup /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs
10/29/25 12:55:51 (pid:1098229) Create_Process succeeded, pid=1102573

```

Here are the CGROUP configuration macros before disabling cgroups:

```

~# condor_config_val -dump | grep -i cgroupÂ

BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = soft
CGROUP_POLLING_INTERVAL = 5

```

and after:

```

~# condor_config_val -dump | grep -i cgroupÂ

BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = none
CGROUP_POLLING_INTERVAL = 5
USE_CGROUPS = False

```

Thanks,

Curtis

Mailing List Archives

Authenticated access

Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs

I'm trying to understand the logs in the `/var/log/condor/StarterLog.slot1_*` files. I have two questions: