Hi Steffen,

I guess RESERVED_DISK = 131072 might be the culprit. I just checked the documentation, and the value is in MB, i.e., a reservation of ~131 GB for non-Condor use would exceed the 124G available on your /var. (Unfortunately, prefixes/sizes are sometimes a bit inconsistent.)
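If that turns out to be the cause, a minimal sketch of a fix would be to lower the reservation in the local config; the 1 GB value below is only an illustrative choice, not a recommendation for your cluster:

```
# In /etc/condor/condor_config_local (value is in megabytes):
# reserve ~1 GB of /var for non-Condor use instead of ~131 GB
RESERVED_DISK = 1024
```

After changing it, a condor_reconfig on the execute node should make the startd re-advertise Disk.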
Another thing I noticed: is the execute directory on a dedicated volume? If not, STARTD_RECOMPUTE_DISK_FREE = false might be a problem in cases where /var gets filled by other processes (such as logs) and the available disk space shrinks for jobs as well.
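If the execute directory does share /var with other writers, one option (a sketch, assuming the default recomputation interval is acceptable for your site) is to re-enable the periodic re-measurement:

```
# Let the startd periodically re-measure free disk space, since /var is
# shared with other processes (logs etc.) that can consume it over time
STARTD_RECOMPUTE_DISK_FREE = true
```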
Cheers,
Thomas

On 21/04/2023 10.30, Steffen Grunewald wrote:
Good morning,

after setting up HTCondor 10.0.3 on our local cluster, I'm running into issues related to disk space and requirements.

root@h0402:~# condor_config_val -dump -expand EXECUTE
# Configuration from machine: h0402.hypatia.local
# Parameters with names that match EXECUTE:
ENCRYPT_EXECUTE_DIRECTORY = false
ENCRYPT_EXECUTE_DIRECTORY_FILENAMES = false
EXECUTE = /var/lib/condor/execute
GANGLIAD_PER_EXECUTE_NODE_METRICS = true
LOCAL_UNIV_EXECUTE = /var/lib/condor/spool/local_univ_execute
# Contributing configuration file(s):
#    /etc/condor/condor_config
#    /etc/condor/condor_config_local

root@h0402:~# df -h /var/lib/condor/execute
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5       125G  455M  124G   1% /var

root@h0402:~# condor_config_val -dump -expand DISK
# Configuration from machine: h0402.hypatia.local
# Parameters with names that match DISK:
CONSUMPTION_DISK = quantize(target.RequestDisk,{1024})
CREATE_LOCKS_ON_LOCAL_DISK = true
FILE_TRANSFER_DISK_LOAD_THROTTLE = 2.0
FILE_TRANSFER_DISK_LOAD_THROTTLE_LONG_HORIZON = 5m
FILE_TRANSFER_DISK_LOAD_THROTTLE_SHORT_HORIZON = 1m
FILE_TRANSFER_DISK_LOAD_THROTTLE_WAIT_BETWEEN_INCREMENTS = 60
JOB_DEFAULT_REQUESTDISK = 131072
LOCAL_DISK_LOCK_DIR =
MODIFY_REQUEST_EXPR_REQUESTDISK = quantize(RequestDisk,{1024})
RESERVED_DISK = 131072
SCHEDD_ROUND_ATTR_DiskUsage = 25%
STARTD_RECOMPUTE_DISK_FREE = false
# Contributing configuration file(s):
#    /etc/condor/condor_config
#    /etc/condor/condor_config_local

root@h0402:~# condor_status -l `hostname` | grep ^Disk
Disk = 0

Since $(JOB_DEFAULT_REQUESTDISK) > $(DISK), there's no way to run vanilla universe jobs. The manual, under DISK and RESERVED_DISK, suggests that the startd would determine the amount of available space (of which there's plenty), but for me it obviously doesn't. Is there a means to find out why?

Thanks,
Steffen
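A quick sanity check of the arithmetic (taking RESERVED_DISK to be in megabytes, and the 124G free on /var from the df output above) shows why the startd advertises Disk = 0 here:

```python
# Free space on /var as reported by df: 124 GiB, expressed in MB
avail_mb = 124 * 1024       # 126976 MB

# RESERVED_DISK from the config dump, in MB
reserved_mb = 131072        # 128 GiB reserved for non-Condor use

# The startd advertises roughly free space minus the reservation,
# which cannot go below zero -- so Disk ends up as 0
disk_mb = max(avail_mb - reserved_mb, 0)
print(disk_mb)  # 0
```

So the reservation alone is already larger than the whole free space on the volume, leaving nothing to advertise to jobs.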