
Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs



Hi Curtis,

I'm still messing around with things on my end, but I wanted to share my current hypothesis: I think the StartD is getting increasingly busy from the churn of the fast-running jobs. It would be interesting to see what the duty cycle of the StartD is: condor_status -const 'Machine=="host.name"' -af:h RecentDaemonCoreDutyCycle
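For example, to sample that attribute every few seconds while the DAG is running (a rough sketch; "host.name" is just a placeholder for the EP's actual Machine name):

```
# Poll the EP's ad every 10 seconds and log the StartD's recent duty cycle.
# Plain -af (no :h) omits the header line so the output is easier to graph later.
while true; do
    printf '%s ' "$(date +%T)"
    condor_status -const 'Machine=="host.name"' -af RecentDaemonCoreDutyCycle
    sleep 10
done
```

A duty cycle creeping toward 1.0 while the short jobs are churning would support that hypothesis.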

-Cole Bollig

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Monday, November 3, 2025 2:53 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
Hi Cole,

No problem! Job duration varies from a couple of seconds to hours depending on the job. The particular set of jobs I have been testing with typically take anywhere from 2-10 seconds each and there are around 5,500 jobs in the DAG that I am submitting.

Thanks,

Curtis

On Mon, Nov 3, 2025 at 12:41 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

Another quick question while I look at the information you provided last week (sorry for taking a bit to look): how fast are the jobs in the cluster? Do they complete in seconds, minutes, or hours?

-Cole Bollig

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Friday, October 31, 2025 1:34 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
No problem. I appreciate the help! One thing I have noticed is that when I only submit a few (e.g. 20) jobs to run on the EP, there is no delay between "execute immediately" and "starting a VANILLA universe job". But if I submit lots of jobs (e.g. more than enough to occupy all of the slots on the EP), jobs start slowly. This particular EP has 256 threads/slots, and all of the jobs I submit are single-threaded and request only a single CPU.

Here's the output of `/var/log/condor/StarterLog.slot1_1`
```
10/31/25 09:27:02 Job 177042.0 set to execute immediately
10/31/25 09:27:02 DaemonKeepAlive: in SendAliveToParent()
10/31/25 09:27:02 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_82377_5941
10/31/25 09:27:14 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/31/25 09:27:14 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/31/25 09:27:14 Starting a VANILLA universe job with ID: 177042.0
```

Here's the output of `/var/log/condor/StartLog` during the gap above
```
10/31/25 09:27:02 slot1_1: Got activate_claim request from shadow (192.168.5.70)
10/31/25 09:27:02 slot1_1: Remote job ID is 177042.0
10/31/25 09:27:02 slot1_1: Got universe "VANILLA" (5) from request classad
10/31/25 09:27:02 slot1_1: State change: claim-activation protocol successful
10/31/25 09:27:02 slot1_1: Changing activity: Idle -> Busy
10/31/25 09:27:21 slot1_1: Called deactivate_claim_forcibly()
10/31/25 09:27:21 slot1_1: State change: starter exited
10/31/25 09:27:21 slot1_1: Changing activity: Busy -> Idle
```

Here's the output of `condor_config_val -summary`
```
# condor_config_val $CondorVersion: 24.0.13 2025-10-08 BuildID: 840511 PackageID: 24.0.13-1+deb13 GitSHA: 371e82aa $

#
# from /etc/condor/condor_config
#
LOCAL_DIR = /var
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
REQUIRE_LOCAL_CONFIG_FILE = false
LOCAL_CONFIG_DIR = /usr/share/condor/config.d,/etc/condor/config.d
RUN = $(LOCAL_DIR)/run/condor
LOG = $(LOCAL_DIR)/log/condor
LOCK = $(LOCAL_DIR)/lock/condor
SPOOL = $(LOCAL_DIR)/spool/condor
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
CRED_STORE_DIR = $(LOCAL_DIR)/lib/condor/cred_dir
ETC = /etc/condor
LIB = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
LIBEXEC = $(RELEASE_DIR)/libexec/condor
SHARE = $(RELEASE_DIR)/share/condor
SINGULARITY = /usr/bin/singularity
GANGLIA_LIB64_PATH = /lib,/usr/lib,/usr/local/lib
AUTH_SSL_SERVER_CERTFILE = /etc/ssl/certs/ssl-cert-snakeoil.pem
AUTH_SSL_SERVER_KEYFILE = /etc/ssl/private/ssl-cert-snakeoil.key

#
# from /etc/condor/config.d/00-custom.config
#
UID_DOMAIN = .ltc
TRUST_UID_DOMAIN = True
SEC_DEFAULT_AUTHENTICATION_METHODS = IDTOKENS
SEC_DAEMON_AUTHENTICATION = REQUIRED
SEC_DAEMON_AUTHENTICATION_METHODS = IDTOKENS
SEC_CLIENT_AUTHENTICATION_METHODS = IDTOKENS
SEC_TOOL_AUTHENTICATION = REQUIRED
SEC_TOOL_AUTHENTICATION_METHODS = IDTOKENS
SEC_WRITE_AUTHENTICATION = REQUIRED
SEC_WRITE_AUTHENTICATION_METHODS = IDTOKENS
SEC_ADMINISTRATOR_AUTHENTICATION = REQUIRED
SEC_ADMINISTRATOR_AUTHENTICATION_METHODS = IDTOKENS
SEC_CONFIG_AUTHENTICATION = REQUIRED
SEC_CONFIG_AUTHENTICATION_METHODS = IDTOKENS
SEC_DAEMON_INTEGRITY = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION = REQUIRED
SEC_NEGOTIATOR_AUTHENTICATION_METHODS = IDTOKENS
SEC_NEGOTIATOR_INTEGRITY = REQUIRED
SEC_TOKEN_SYSTEM_KEYFILE = /etc/condor/passwords.d/POOL
SEC_TOKEN_POOL_SIGNING_KEYFILE = /etc/condor/passwords.d/POOL
ALLOW_CONFIG = $(CONDOR_HOST)
CONDOR_HOST = basan.ltc
DAEMON_LIST = MASTER, STARTD
STARTD_ATTRS = , Machine
STARTD_CRON_JOBLIST =
STARTD_CRON_Machine = 0 */6 * * * /usr/bin/condor_status -long | grep -E "^Machine\s*=" | sed 's/^Machine\s*=\s*//' | sort | uniq -c | awk '{print $2 " " $1}' | head -1 | awk '{print $1}'
MAX_DEFAULT_LOG = 1000000
MAX_DEFAULT_LOG_HISTORY = 1000000
MAX_DEFAULT_LOG_HISTORY_SIZE = 1000000
COLLECTOR_SOCKET_CACHE_SIZE = 100
COLLECTOR_SOCKET_CACHE_TIMEOUT = 300
CGROUP_MEMORY_LIMIT_POLICY = soft
SCHEDD_INTERVAL = 10
QUEUE_SUPER_USERS = root,<redacted>
DAGMAN_MAX_SUBMITS_PER_INTERVAL = 1000
DAGMAN_MAX_POST_SCRIPTS = 3
JOB_DEFAULT_REQUESTMEMORY = 8192
POOL_HISTORY_MAX_STORAGE = 100000000
MAX_HISTORY_LOG = 104857600
CONDOR_Q_ONLY_MY_JOBS = false
CONDOR_Q_DASH_BATCH_IS_DEFAULT = false
CONDOR_ADMIN = <redacted>
DAGMAN_SUPPRESS_NOTIFICATION = true
EMAIL_DOMAIN = <redacted>
FILESYSTEM_DOMAIN = <redacted>
USE_NFS = true

#
# from /etc/condor/config.d/00-security
#
ALLOW_ADMINISTRATOR = condor@* condor@password root@*
ALLOW_DAEMON = condor@*  condor@password
ALLOW_NEGOTIATOR = condor@* condor@password
ALLOW_READ = *
ALLOW_WRITE = *
SEC_CLIENT_AUTHENTICATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION = required
SEC_DEFAULT_ENCRYPTION = required
SEC_DEFAULT_INTEGRITY = required
SEC_READ_AUTHENTICATION = OPTIONAL
SECURITY_MODEL = 24.0

#
# from /etc/condor/config.d/10-stash-plugin.conf
#
STASH_PLUGIN = $(LIBEXEC)/stash_plugin
FILETRANSFER_PLUGINS = $(LIBEXEC)/curl_plugin, $(LIBEXEC)/data_plugin, $(LIBEXEC)/box_plugin.py, $(LIBEXEC)/gdrive_plugin.py, $(LIBEXEC)/onedrive_plugin.py,$(STASH_PLUGIN)

#
# from /etc/condor/config.d/99-debug.conf
#
STARTER_DEBUG = D_FULLDEBUG
STARTD_DEBUG = D_FULLDEBUG
```

Thanks,

Curtis

On Fri, Oct 31, 2025 at 6:35 AM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

Sorry to ask for more logging/testing, but could you do this again with D_FULLDEBUG set for the StartD? Also, the output of condor_config_val -summary for one of the EPs would be helpful.
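In case it is useful, a minimal way to do that (assuming the EP picks up configuration from /etc/condor/config.d/ as in a stock install) is to drop the debug knobs into a small fragment and reconfig:

```
# /etc/condor/config.d/99-debug.conf on the EP
STARTD_DEBUG  = D_FULLDEBUG
STARTER_DEBUG = D_FULLDEBUG
```

Then run condor_reconfig on the EP so the running daemons pick up the change.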

-Cole

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Thursday, October 30, 2025 5:37 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
Yes, I am seeing this delay for all of my jobs on all of my EPs.

Here's what I see in the log files (forgive the different timestamps; I've been running jobs to test and my StartLog and StartLog.old no longer contain data from earlier in the day):

/var/log/condor/StarterLog.slot1_1
```
10/30/25 15:13:48 (pid:115368) Job 152380.0 set to execute immediately
10/30/25 15:13:48 DaemonKeepAlive: in SendAliveToParent()
10/30/25 15:13:48 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 15:13:54 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 15:13:54 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 15:13:54 (pid:115368) Starting a VANILLA universe job with ID: 152380.0
```

/var/log/condor/StartLog
```
10/30/25 15:13:48 slot1_1: Got activate_claim request from shadow (192.168.5.70)
10/30/25 15:13:48 slot1_1: Remote job ID is 152380.0
10/30/25 15:13:48 slot1_1: Got universe "VANILLA" (5) from request classad
10/30/25 15:13:48 slot1_1: State change: claim-activation protocol successful
10/30/25 15:13:48 slot1_1: Changing activity: Idle -> Busy
10/30/25 15:14:06 slot1_1: State change: starter exited
10/30/25 15:14:06 slot1_1: Changing activity: Busy -> Idle
```

I'm running Debian 13, which includes HTCondor v24.0.x LTS in its official repo, so I was planning to use that to avoid being forced onto newer versions when the LTS versions are no longer maintained.

Thanks,

Curtis

On Thu, Oct 30, 2025 at 2:24 PM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

My original thought was correct, in that the Starter is processing a different daemon core event (sending the keep-alive to its parent daemon, i.e. the StartD). That being said, it seems peculiar that the keep-alive is taking ten-plus seconds. What does the StartD log say for the same time period? Judging by your concern, is this delay seen for every job that executes on this EP?

Also, you may see the cgroup open failure messages disappear if you upgrade to v25 LTS. Do note that if you have scripts on the host that utilize our Python API, the v1 API is no longer available and you will need to switch to using the v2 API.
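A quick way to check for that ahead of time (a rough sketch; /opt/scripts is just a hypothetical location for your local tooling):

```
# Find local scripts that still import the v1 Python bindings
# (matches "import htcondor"/"import classad" but not the v2 modules).
grep -rlE 'import (htcondor|classad)\b' /opt/scripts 2>/dev/null

# Confirm the v2 bindings are importable on this host.
python3 -c "import htcondor2, classad2; print('v2 bindings available')"
```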

-Cole Bollig

From: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Sent: Thursday, October 30, 2025 12:10 PM
To: Cole Bollig <cabollig@xxxxxxxx>
Cc: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 
Hi Cole,

I have added `STARTER_DEBUG = D_FULLDEBUG` and re-run my jobs. Here's what I see in between the two lines mentioned earlier:
```
10/30/25 10:00:30 Job 131381.0 set to execute immediately
10/30/25 10:00:30 DaemonKeepAlive: in SendAliveToParent()
10/30/25 10:00:30 SharedPortClient: sent connection request to daemon at <192.168.5.71:9618> for shared port id startd_4419_a519
10/30/25 10:00:41 Completed DC_CHILDALIVE to daemon at <192.168.5.71:9618>
10/30/25 10:00:41 DaemonKeepAlive: Leaving SendAliveToParent() - success
10/30/25 10:00:41 Starting a VANILLA universe job with ID: 131381.0
```

Good to know about cgroups. I'm fine with using them, provided they don't have a negative impact on my cluster's performance. Here's the output of condor_version:
```
$CondorVersion: 24.0.13 2025-10-08 BuildID: 840511 PackageID: 24.0.13-1+deb13 GitSHA: 371e82aa $
$CondorPlatform: X86_64-Debian_13 $
```

Thanks,

Curtis

On Thu, Oct 30, 2025 at 8:41 AM Cole Bollig <cabollig@xxxxxxxx> wrote:
Hi Curtis,

  1. The gap is not caused by assigning the job to a slot. At the point of the first message, the Starter should have completed setting up the job's environment/scratch space and transferred all input files in preparation for executing the job. Due to potential job execution deferral via cron specifications, the Starter doesn't actually spawn the job right then. Rather, it sets a timer (zero in the non-deferral case) to be handled in daemon core (our event-driven code handler). Even though the job spawn is ready to go, daemon core may service other actions (i.e. timers, commands, pipes, etc.) that are ready to be handled first. You could try upping the Starter's debug level with D_FULLDEBUG to hopefully see what is happening between the two messages.
  2. I have multiple things to say regarding your cgroups question and comments:
    1. I do not believe this is affecting the time gap between the two messages in the previous question, since the Starter does cgroup setup after the second message.
    2. USE_CGROUPS is no longer a configuration option. It was an option while we were still developing cgroup support and had less confidence in HTCondor's integration with cgroups. However, we are trying to move to a world of always using cgroups (when HTCondor is able), as cgroups provide much better monitoring, control, and guarantees of cleanup.
    3. What version of HTCondor is this EP running?
-Cole Bollig

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Curtis Spencer via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, October 29, 2025 3:37 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Cc: Curtis Spencer <curtis.spencer@xxxxxxxxxxxx>
Subject: [HTCondor-users] Understanding /var/log/condor/StarterLog.slot* logs
 

I'm trying to understand the logs in the `/var/log/condor/StarterLog.slot1_*` files. I have two questions:

1. Why is there a 12 second gap between the job being "set to execute immediately" and the job actually starting? Is that the amount of time between the job being queued/idle and being assigned a slot to start running? In other words, if there were a slot available when the job was queued, it would start immediately and there would be little to no delay?
```
10/29/25 12:55:34 (pid:1098229) Job 129525.0 set to execute immediately
10/29/25 12:55:46 (pid:1098229) Starting a VANILLA universe job with ID: 129525.0
```

2. Why does condor spend 5 seconds trying and failing to open the cgroup file? I have tried setting `USE_CGROUPS = False` and `CGROUP_MEMORY_LIMIT_POLICY = none` to try to eliminate this delay, but that hasn't helped. I haven't decided whether I want to use cgroups or not, so it would be nice to know how to eliminate this delay both when using cgroups and when not.
```
10/29/25 12:55:46 (pid:1098229) About to exec /path/to/example/exec
10/29/25 12:55:46 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:47 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:48 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:49 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:50 (pid:1098229) ProcFamilyDirectCgroupV2::processesInCgroup cannot open /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs: 2 No such file or directory
10/29/25 12:55:51 (pid:1102573) Successfully moved procid 1102573 to cgroup /sys/fs/cgroup/system.slice/condor.service/htcondor/condor_var_lib_condor_execute_slot1_240@xxxxxxxxxxxx/cgroup.procs
10/29/25 12:55:51 (pid:1098229) Create_Process succeeded, pid=1102573
```

Here are the CGROUP configuration macros before disabling cgroups:
```
~# condor_config_val -dump | grep -i cgroup 
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = soft
CGROUP_POLLING_INTERVAL = 5
```
and after:
```
~# condor_config_val -dump | grep -i cgroup 
BASE_CGROUP = htcondor
CGROUP_IGNORE_CACHE_MEMORY = true
CGROUP_MEMORY_LIMIT_POLICY = none
CGROUP_POLLING_INTERVAL = 5
USE_CGROUPS = False
```

Thanks,

Curtis