Re: [HTCondor-users] Issue: TotalDisk is not the current amount of the free disk space on the machines
- Date: Thu, 25 Feb 2021 07:12:08 -0600
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Issue: TotalDisk is not the current amount of the free disk space on the machines
On 2/15/2021 7:31 PM, Carlos Luque wrote:
Hello all,
I'm looking into an issue with the free disk space detected by the
condor_startd daemon. The condor version is 8.8.11, running on
GNU/Linux.
I found that the disk space reported for the execute machines is
sometimes less than the actual free disk space, and sometimes more.
For example, on one machine the TotalDisk is 4529828 KiB, but the
current amount of disk space is 74357772 KiB. In another case, the
amount of disk space is 4 KiB and the TotalDisk detected is
54742440 KiB. None of the machines was running any job during the
check.
Hi Carlos,
HTCondor manages the disk space for job scratch directories. These
directories are created in the subdirectory specified by the EXECUTE
config knob (usually /var/lib/condor/execute). HTCondor assumes
that it is the only service using disk space on the volume where the
EXECUTE directory lives (enter "condor_config_val execute" to see
that path). If you have other services or users running on your
nodes that can use up significant disk space on the same volume
where the EXECUTE directory lives, it could cause problems.
Here at the University of Wisconsin, for example, our execute nodes
have a separate disk partition for EXECUTE for exclusive use by
HTCondor.
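A quick way to see whether EXECUTE shares a volume with other
directories is to compare device IDs. This is only a rough sketch:
the function name and the list of directories are illustrative
assumptions, and the path is the usual default mentioned above, not
necessarily yours. Bind mounts or btrfs subvolumes can also make this
check misleading.

```python
import os


def same_filesystem(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Assumption: the common default; use the output of
    # "condor_config_val execute" on your own nodes instead.
    execute_dir = "/var/lib/condor/execute"
    # Hypothetical list of directories that other services might fill.
    for other in ("/", "/home", "/tmp", "/var/log"):
        if os.path.exists(execute_dir) and os.path.exists(other):
            shared = same_filesystem(execute_dir, other)
            print(f"{other}: shares a volume with EXECUTE = {shared}")
```

If any directory that other users or services write to shares the
volume, its usage will eat into the space HTCondor thinks it manages.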
When the HTCondor service is started (specifically, when the
condor_startd launches), it examines the free disk space on the
volume where EXECUTE lives and publishes that as TotalDisk. In
other words, at startup it does the equivalent of setting TotalDisk to:
df -k --output=avail `condor_config_val execute`
HTCondor then assumes the available disk it discovered at startup is
what it should manage. If something other than HTCondor consumes a
lot of space, or frees a lot of space, on the disk volume where
EXECUTE lives after HTCondor is started, that could explain the
behavior you see above.
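To see how far the volume has drifted since the startd launched, you
could compare the advertised TotalDisk with the space available right
now. The following is a minimal sketch, not anything HTCondor ships:
the function names are made up for illustration, and you would pass in
the TotalDisk value yourself (for example, as shown by condor_status
-long) along with the EXECUTE path.

```python
import shutil


def available_kib(path: str) -> int:
    """Space available on the volume holding `path`, in units of 1024 bytes."""
    return shutil.disk_usage(path).free // 1024


def drift_kib(advertised_total_disk_kib: int, path: str) -> int:
    """Current available space minus the advertised TotalDisk.

    Positive: space was freed since startup. Negative: space was consumed.
    """
    return available_kib(path) - advertised_total_disk_kib


if __name__ == "__main__":
    # Assumption: TotalDisk taken from the first example in the question;
    # "." stands in for the EXECUTE directory.
    print(drift_kib(4529828, "."))
```

A large difference on an idle machine suggests that something other
than HTCondor has written to, or deleted from, the EXECUTE volume
since startup.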
If you are using static slots, you could try putting the following
in the config:
# Tell the condor_startd to periodically (every ~10 min) update
# TotalDisk based on available space on the EXECUTE volume. If this
# setting is switched back to False (which is the default), then the
# startd only sets TotalDisk once at startup.
STARTD_RECOMPUTE_DISK_FREE = True
Setting STARTD_RECOMPUTE_DISK_FREE to True is not recommended with
partitionable slots. And to be honest, no matter what you do, if
disk space is tight enough that you need it carefully managed, then
you need to ensure nothing else besides jobs managed by HTCondor is
reading/writing files on the EXECUTE disk partition.
More below...
Moreover, the explanation of the 'Disk' attribute says 23000 = 23MiB
in the section Machine ClassAd attribute. Is it KiB or kB for the
attribute Disk?
It is the number of bytes divided by 1024. So by the ISO/IEC 80000
standard it is KiB, and by the JEDEC standard it is KB.
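As a small worked example of that unit, here is a sketch that converts
a Disk or TotalDisk value back to bytes and to larger units. The
helper names are illustrative only; the input is the figure from the
question.

```python
def kib_to_bytes(kib: int) -> int:
    """Disk/TotalDisk values are bytes divided by 1024, so multiply back."""
    return kib * 1024


def describe(kib: int) -> str:
    """Show the same quantity in bytes, binary GiB, and decimal GB."""
    n_bytes = kib_to_bytes(kib)
    gib = n_bytes / 2**30   # binary prefix: 1 GiB = 1024**3 bytes
    gb = n_bytes / 10**9    # decimal prefix: 1 GB = 1000**3 bytes
    return f"{kib} KiB = {n_bytes} bytes = {gib:.2f} GiB = {gb:.2f} GB"


if __name__ == "__main__":
    print(describe(74357772))
    # 74357772 KiB = 76142358528 bytes = 70.91 GiB = 76.14 GB
```

The gap between the GiB and GB figures is why it matters which
convention a tool uses when you compare its output against TotalDisk.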
Could someone give me some hints to figure out this issue with the
amount of free space in TotalDisk?
Thanks in advance.
Hope the above helps,
regards,
Todd