
Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs



Hi Curtis,

Sorry to ask for more logging/testing, but could you do this again with D_FULLDEBUG set for the StartD? Also, the output of condor_config_val -summary from one of the EPs would be helpful.
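In case it helps, here's a minimal sketch of what I mean (the drop-in file name is just an example; put it wherever your EP config lives):
```
# Hypothetical drop-in on the EP, e.g. /etc/condor/config.d/99-startd-debug.conf
STARTD_DEBUG = D_FULLDEBUG
```
Then run condor_reconfig on the EP, reproduce the delay, and send along the StartLog excerpt plus:
```
condor_config_val -summary
```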

-Cole

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Thursday, October 30, 2025 5:37 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
Yes, I am seeing this delay for all of my jobs on all of my EPs.

Here's what I see in the log files (forgive the different timestamps; I've been running jobs to test and my StartLog and StartLog.old no longer contain data from earlier in the day):

/var/log/condor/StarterLog.slot1_1
```
10/30/25 15:13:48 (pid:115368) Job 152380.0 set to execute immediately
10/30/25 15:13:48 DaemonKeepAlive: in SendAliveToParent()
10/30/25 15:13:48 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 15:13:54 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 15:13:54 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 15:13:54 (pid:115368) Starting a VANILLA universe job with ID: 152380.0
```

/var/log/condor/StartLog
```
10/30/25 15:13:48 slot1_1: Got activate_claim request from shadow (192.168.5.70)
10/30/25 15:13:48 slot1_1: Remote job ID is 152380.0
10/30/25 15:13:48 slot1_1: Got universe "VANILLA" (5) from request classad
10/30/25 15:13:48 slot1_1: State change: claim-activation protocol successful
10/30/25 15:13:48 slot1_1: Changing activity: Idle -> Busy
10/30/25 15:14:06 slot1_1: State change: starter exited
10/30/25 15:14:06 slot1_1: Changing activity: Busy -> Idle
```

I'm running Debian 13, which includes HTCondor v24.0.x LTS in its official repo, so I was planning to stick with that to avoid forced upgrades to new versions once the LTS versions are no longer maintained.

Thanks,

Curtis

On Thu, Oct 30, 2025 at 2:24 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

My original thought was correct, in that the Starter is processing a different daemon core event (sending the keep-alive to its parent daemon, i.e. the StartD). That said, it seems peculiar that the keep-alive is taking ten-plus seconds. What does the StartD log say for the same time period? Judging by your concern, is this delay seen for every job that executes on this EP?
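Something like this should be enough to pull out the relevant window (the timestamp prefix below is just an example; match it to the window in your StarterLog):
```
# Grab StartLog lines from the same minute as the Starter's keep-alive attempt
grep '^10/30/25 10:00' /var/log/condor/StartLog
```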

Also, you may see the cgroup open failure messages disappear if you upgrade to v25 LTS. Do note that if you have scripts on the host that utilize our Python API, the v1 API is no longer available and you will need to switch to using the v2 API.
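If you want a quick way to spot scripts that would be affected, something like this works (the directories are just examples; the v1 bindings are the htcondor/classad modules, the v2 ones are htcondor2/classad2):
```
# List scripts under these example directories that still import the v1 Python modules
grep -rlE '\b(import|from)[[:space:]]+(htcondor|classad)\b' /usr/local/bin /opt/scripts
```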

-Cole Bollig

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Thursday, October 30, 2025 12:10 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
Hi Cole,

I added `STARTER_DEBUG = D_FULLDEBUG` and re-ran my jobs. Here's what I see between the two lines mentioned earlier.
```
10/30/25 10:00:30 Job 131381.0 set to execute immediately
10/30/25 10:00:30 DaemonKeepAlive: in SendAliveToParent()
10/30/25 10:00:30 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 10:00:41 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 10:00:41 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 10:00:41 Starting a VANILLA universe job with ID: 131381.0
```

Good to know about cgroups. I'm fine with using them, provided they don't have a negative impact on my cluster's performance. Here's the output of condor_version:
```
$CondorVersion: 24.0.13 2025-10-08 BuildID: 840511 PackageID: 24.0.13-1+deb13 GitSHA: 371e82aa $
$CondorPlatform: X86_64-Debian_13 $
```

Thanks,

Curtis

On Thu, Oct 30, 2025 at 8:41 AM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

  1. The gap is not caused by assigning the job to a slot. At the point of the first message, the Starter should have completed setting up the job's environment/scratch space and transferred all input files in preparation for executing the job. Due to potential job execution deferral via cron specifications, the Starter doesn't actually spawn the job right then. Rather, it sets a timer (zero in the non-deferral case) to be handled in daemon core (our event-driven code handler). Even though the job spawn is ready to go, daemon core may service other actions (i.e. timers, commands, pipes, etc.) that are ready to be handled. You could try upping the Starter's debug level with D_FULLDEBUG to hopefully see what is happening between the two messages (see the config sketch after this list).
  2. I have a few points regarding your cgroups question and comments:
    1. I do not believe this is affecting the time gap between the two messages in the previous question, since the Starter does its cgroup setup after the second message.
    2. USE_CGROUPS is no longer a configuration option. This was an option while we were still developing cgroup support and had less confidence in HTCondor's integration with cgroups. However, we are trying to move to a world of always using cgroups (when HTCondor is able), as cgroups provide much better monitoring, control, and cleanup guarantees.
    3. What version of HTCondor is this EP running?
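For reference on item 1, a minimal sketch of the debug bump (the file name is just an example):
```
# Hypothetical drop-in on the EP, e.g. /etc/condor/config.d/99-starter-debug.conf
STARTER_DEBUG = D_FULLDEBUG
```
followed by a condor_reconfig on the EP, then run a test job and check the StarterLog again.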
-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 29, 2025 3:37 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 

I'm trying to understand the logs in the `/var/log/condor/StarterLog.slot1_*` files. I have two questions:

1. Why is there a 12-second gap between the job being "set to execute immediately" and the job actually starting? Is that the amount of time between the job being queued/idle and being assigned a slot to start running? In other words, if a slot had been available when the job was queued, would it have started immediately with little to no delay?
```
10/29/25 12:55:34 (pid:1098229) Job 129525.0 set to execute immediately
10/29/25 12:55:46 (pid:1098229) Starting a VANILLA universe job with ID: 129525.0
```

2. Why does HTCondor spend 5 seconds trying and failing to open the cgroup file? I have tried setting `USE_CGROUPS = False` and `CGROUP_MEMORY_LIMIT_POLICY = none` to try to eliminate this delay, but that hasn't helped. I haven't decided whether I want to use cgroups or not, so it would be nice to know how to eliminate this delay both when using cgroups and when not.
```
10/29/25 12:55:46 (pid:1098229) About to exec /path/to/example/exec
10/29/25 12:55:46 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:47 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:48 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:49 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:50 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:51 (pid:1102573) Successfully moved procid 1102573 to cgroup /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs
10/29/25 12:55:51 (pid:1098229) Create_Process succeeded, pid=1102573
```

Here are the CGROUP configuration macros before disabling cgroups:
```
~# condor_config_val -dump | grep -i cgroup 
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = soft
CGROUP_POLLING_INTERVAL = 5
```
and after:
```
~# condor_config_val -dump | grep -i cgroup 
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = none
CGROUP_POLLING_INTERVAL = 5
USE_CGROUPS = False
```

Thanks,

Curtis