
Re: [HTCondor-users] HTcondor disk resource related queries



Hi Vikrant,

The following configuration works for me. Not sure which version I'm running, should be 9+.

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
DISK_USAGE_EXCEEDED = (JobUniverse !=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
use POLICY: WANT_HOLD_IF = (DISK_USAGE_EXCEEDED, 105, my error string..).

Not sure if my error string.. should be surrounded by quotation marks, as I'm templating the file with Jinja.

Tomer.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Thursday, June 1, 2023 12:44 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTcondor disk resource related queries
 
Hello Experts,

I am testing this configuration to put jobs on hold when they breach the disk limit.

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
DISK_USAGE_EXCEEDED = (JobUniverse =!=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
WANT_HOLD = $(DISK_USAGE_EXCEEDED)
WANT_HOLD_REASON = "Job exceeded disk usage limits"

I can clearly see that jobs are using more than their RequestDisk size, yet they are not getting held.

# condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk

globaljobid                        disk      DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#412.0#1685567906  21356484  8192026    4271296648  21356484.0     16777216
test.example.com#413.0#1685567923  12813890  8192026    4271296648  12813890.0     8388608
test.example.com#414.0#1685567952  8542594   8192026    4271296648  8542594.0      3250000
test.example.com#415.0#1685568493  8542594   8192025    4271296648  8542594.0      3250000
test.example.com#416.0#1685568803  12813890  8192026    4271296648  12813890.0     10000000
test.example.com#417.0#1685568954  4271297   8192025    4271296648  4271297.0      1
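
To make it easier to see which of these jobs should trip the hold expression, here is a rough Python sketch that applies the same comparison to the DiskUsage and RequestDisk columns from the condor_who output. The job universe is not shown in that output, so the value 5 (vanilla) is an assumption, and the function is only a simplified stand-in for the ClassAd expression, not how the startd evaluates it.

```python
# (job id, DiskUsage, RequestDisk) copied from the condor_who output.
rows = [
    ("412.0", 8192026, 16777216),
    ("413.0", 8192026, 8388608),
    ("414.0", 8192026, 3250000),
    ("415.0", 8192025, 3250000),
    ("416.0", 8192026, 10000000),
    ("417.0", 8192025, 1),
]

ASSUMED_JOB_UNIVERSE = 5  # assumption: vanilla universe, not shown in the output


def disk_usage_exceeded(job_universe, disk_usage, request_disk):
    """Simplified stand-in for:
    (JobUniverse != 13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
    None plays the role of UNDEFINED here.
    """
    return (
        job_universe != 13
        and disk_usage is not None
        and disk_usage > request_disk
    )


exceeded = [
    job_id
    for job_id, disk_usage, request_disk in rows
    if disk_usage_exceeded(ASSUMED_JOB_UNIVERSE, disk_usage, request_disk)
]
print(exceeded)
```

By this check only 414.0, 415.0 and 417.0 are over their request; 412.0, 413.0 and 416.0 are still under it.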

I am using HTCondor version 9.0.17.


Thanks & Regards,
Vikrant Aggarwal


On Tue, May 30, 2023 at 1:09 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

A couple of queries:

- Why is it showing a negative Disk value for the primary partitionable slot?

# condor_status `hostname` -server
Name                                           OpSys       Arch   LoadAv Memory   Disk      Mips    KFlops  

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx   LINUX       X86_64  0.000   211398 -25210961   25601   1764976
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx LINUX       X86_64  0.000    19218   4278313   25601   1764976
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxx LINUX       X86_64  0.000    19218   4278313   25601   1764976

               Machines Avail  Memory        Disk        MIPS      KFLOPS

  X86_64/LINUX        3     3      249834 18446744073692897281       76803     5294928

         Total        3     3      249834 18446744073692897281       76803     5294928
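
The very large Disk total in the summary rows is numerically consistent with the three per-slot Disk values being summed and the negative result then being shown as an unsigned 64-bit integer. The short Python check below illustrates the arithmetic; that condor_status actually computes the total this way is an assumption, not something confirmed from the HTCondor source.

```python
# Disk values for slot1, slot1_1 and slot1_2, copied from the listing.
slot_disk = [-25210961, 4278313, 4278313]

signed_total = sum(slot_disk)

# Assumption: the negative sum is displayed as an unsigned 64-bit value,
# which is equivalent to taking it modulo 2**64.
unsigned_total = signed_total % 2**64

print(signed_total)
print(unsigned_total)
```

The unsigned result matches the 18446744073692897281 printed in the totals, so the odd total appears to be a side effect of the negative slot1 value rather than a separate problem.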


# condor_status -compact `hostname` -af Disk
4269756335


- I have this in the worker node configuration to change the job's requested disk to the value shown, but it has never worked. We use a similar expression for CPU and memory, and those work fine.

# condor_config_val MODIFY_REQUEST_EXPR_REQUESTDISK
80000

I'm not sure where it is picking this value up from.

# grep -r 'Disk =' /spare/condor/dir_14*/.machine.ad
/spare/condor/dir_1417831/.machine.ad:Disk = 4278313
/spare/condor/dir_1417831/.machine.ad:TotalDisk = 4278312960
/spare/condor/dir_1417831/.machine.ad:TotalSlotDisk = 4278313.0
/spare/condor/dir_1425169/.machine.ad:Disk = 4278313
/spare/condor/dir_1425169/.machine.ad:TotalDisk = 4278312960
/spare/condor/dir_1425169/.machine.ad:TotalSlotDisk = 4278313.0


# du -sh /spare/condor/dir_1425169
3.0G    /spare/condor/dir_1425169

Thanks & Regards,
Vikrant Aggarwal