Hi Curtis,
My original thought was correct: the Starter is processing a different DaemonCore event (sending a keep-alive to its parent daemon, i.e. the StartD). That said, it is peculiar that the keep-alive takes more than ten seconds. What does the StartD log say for the same time period? Judging by your concern, is this delay seen for every job that executes on this EP?
Also, you may see the cgroup open failure messages disappear if you upgrade to v25 LTS. Do note that if you have scripts on the host that utilize our Python API, the v1 API is no longer available and you will need to switch to using the v2 API.
-Cole Bollig
From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Thursday, October 30, 2025 12:10 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
Hi Colin,
I have added `STARTER_DEBUG = D_FULLDEBUG` and re-ran my jobs. Here's what I see in between the two lines mentioned earlier.
```
10/30/25 10:00:30 Job 131381.0 set to execute immediately
10/30/25 10:00:30 DaemonKeepAlive: in SendAliveToParent()
10/30/25 10:00:30 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 10:00:41 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 10:00:41 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 10:00:41 Starting a VANILLA universe job with ID: 131381.0
```
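For anyone who wants to spot this kind of stall in a longer StarterLog, here is a minimal Python sketch that reports gaps between consecutive log lines. It is not an HTCondor tool. The function name `find_gaps` and the 5-second threshold are made up for illustration, and it assumes every line starts with a timestamp in the `MM/DD/YY HH:MM:SS` form shown in the excerpt above.

```python
from datetime import datetime

# Assumption: timestamps look like "10/30/25 10:00:30", as in the excerpt above.
LOG_TS_FORMAT = "%m/%d/%y %H:%M:%S"
TS_WIDTH = 17  # length of "MM/DD/YY HH:MM:SS"


def find_gaps(lines, threshold_seconds=5):
    """Return (gap_seconds, previous_line, current_line) for each slow step."""
    gaps = []
    prev_ts, prev_line = None, None
    for raw in lines:
        line = raw.strip()
        try:
            ts = datetime.strptime(line[:TS_WIDTH], LOG_TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, prev_line, line))
        prev_ts, prev_line = ts, line
    return gaps


# The lines from the excerpt above, used as sample input.
sample = [
    "10/30/25 10:00:30 Job 131381.0 set to execute immediately",
    "10/30/25 10:00:30 DaemonKeepAlive: in SendAliveToParent()",
    "10/30/25 10:00:30 SharedPortClient: sent connection request to daemon "
    "at <192.168.5.71:9618> for shared port id startd_4419_a519",
    "10/30/25 10:00:41 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>",
    "10/30/25 10:00:41 DaemonKeepAlive: Leaving SendAliveToParent() - success",
    "10/30/25 10:00:41 Starting a VANILLA universe job with ID: 131381.0",
]

gaps = find_gaps(sample)
for seconds, before, after in gaps:
    print(f"{seconds:.0f}s gap between:\n  {before}\n  {after}")
```

On the sample it flags the single 11-second gap between the shared port connection request and the completed DC_CHILDALIVE. To scan a real log, pass an open file object instead of `sample`.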
Good to know about cgroups. I'm fine with using it, provided it doesn't have a negative impact on my cluster's performance. Here's the output of condor_version:
```
$CondorVersion: 24.0.13 2025-10-08 BuildID: 840511 PackageID: 24.0.13-1+deb13 GitSHA: 371e82aa $
$CondorPlatform: X86_64-Debian_13 $
```
Thanks,
Curtis
On Thu, Oct 30, 2025 at 8:41 AM Cole Bollig <cabollig@xxxxxxxx> wrote: