
Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs



Hi Cole,

I added `STARTER_DEBUG = D_FULLDEBUG` and re-ran my jobs. Here's what I see between the two lines mentioned earlier.
```
10/30/25 10:00:30 Job 131381.0 set to execute immediately
10/30/25 10:00:30 DaemonKeepAlive: in SendAliveToParent()
10/30/25 10:00:30 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 10:00:41 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 10:00:41 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 10:00:41 Starting a VANILLA universe job with ID: 131381.0
```
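
In case it's useful to anyone else, this is roughly the change I made (the config.d file name is just my local convention; adjust for your setup):
```
# /etc/condor/config.d/99-starter-debug.conf  (file name is illustrative)
STARTER_DEBUG = D_FULLDEBUG
```
followed by a `condor_reconfig` on the EP so the running daemons pick it up.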

Good to know about cgroups. I'm fine with using them, provided they don't have a negative impact on my cluster's performance. Here's the output of condor_version:
```
$CondorVersion: 24.0.13 2025-10-08 BuildID: 840511 PackageID: 24.0.13-1+deb13 GitSHA: 371e82aa $
$CondorPlatform: X86_64-Debian_13 $
```

Thanks,

Curtis

On Thu, Oct 30, 2025 at 8:41 AM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

  1. The gap is not caused by assigning the job to a slot. At the point of the first message, the Starter should have finished setting up the job's environment/scratch space and transferred all input files in preparation for executing the job. Because job execution can be deferred via cron specifications, the Starter doesn't actually spawn the job right then. Rather, it sets a timer (zero seconds in the non-deferral case) to be handled in daemon core (our event-driven code handler); see the submit-file sketch after this list for where deferral comes from. Even though the job spawn is ready to go, daemon core may service other actions (e.g. timers, commands, pipes) that are also ready to be handled. You could try raising the Starter's debug level with D_FULLDEBUG to hopefully see what is happening between the two messages.
  2. I have a few points regarding your cgroups question and comments:
    1. I do not believe this is affecting the time gap between the two messages in the previous question, since the Starter does cgroup setup after the second message.
    2. USE_CGROUPS is no longer a configuration option. It existed while we were still developing cgroup support and had less confidence in HTCondor's integration with cgroups. However, we are trying to move to a world of always using cgroups (when HTCondor is able), since cgroups provide much better monitoring, control, and guarantees of cleanup.
    3. What version of HTCondor is this EP running?
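
For context on the deferral mechanism mentioned in point 1, deferral comes from knobs in the submit description; a minimal illustrative sketch (the executable name is made up, and the commented-out knobs are the ones that would introduce a real delay):
```
# illustrative submit description; with no deferral_time or cron_* knobs,
# the Starter's spawn timer fires with a zero-second delay
universe      = vanilla
executable    = my_job.sh         # hypothetical executable
# deferral_time = 1767225600      # spawn no earlier than this Unix timestamp
# cron_minute   = 0               # or run on a cron-style schedule
queue
```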
-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 29, 2025 3:37 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs

I'm trying to understand the logs in the `/var/log/condor/StarterLog.slot1_*` files. I have two questions:

1. Why is there a 12-second gap between the job being "set to execute immediately" and the job actually starting? Is that the amount of time between the job being queued/idle and being assigned a slot to start running? In other words, if there were a slot available when the job was queued, would it start immediately with little to no delay?
```
10/29/25 12:55:34 (pid:1098229) Job 129525.0 set to execute immediately
10/29/25 12:55:46 (pid:1098229) Starting a VANILLA universe job with ID: 129525.0
```

2. Why does condor spend 5 seconds trying and failing to open the cgroup file? I have tried setting `USE_CGROUPS = False` and `CGROUP_MEMORY_LIMIT_POLICY = none` to try to eliminate this delay, but that hasn't helped. I haven't decided whether I want to use cgroups or not, so it would be nice to know how to eliminate this delay both when using cgroups and when not.
```
10/29/25 12:55:46 (pid:1098229) About to exec /path/to/example/exec
10/29/25 12:55:46 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:47 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:48 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:49 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:50 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:51 (pid:1102573) Successfully moved procid 1102573 to cgroup /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs
10/29/25 12:55:51 (pid:1098229) Create_Process succeeded, pid=1102573
```
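
If it helps to see this directly, a quick shell check (run on the EP while a job is starting; the cgroup path is copied verbatim from the log above, so substitute your slot's actual path) shows when the slot's cgroup appears:
```
# poll until the slot cgroup shows up, then list the pids in it
# (path taken from the StarterLog above; adjust for your slot)
CG=/sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx
while ! test -d "$CG"; do sleep 0.2; done
echo "cgroup created:"
cat "$CG/cgroup.procs"
```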

Here are the CGROUP configuration macros before disabling cgroups:
```
~# condor_config_val -dump | grep -i cgroup
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = soft
CGROUP_POLLING_INTERVAL = 5
```
and after:
```
~# condor_config_val -dump | grep -i cgroup
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = none
CGROUP_POLLING_INTERVAL = 5
USE_CGROUPS = False
```
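
Individual macros can also be queried directly, which is a quick way to confirm a change took effect after a reconfig:
```
~# condor_config_val CGROUP_MEMORY_LIMIT_POLICY
none
```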

Thanks,

Curtis