
Re: [HTCondor-users] HTcondor disk resource related queries



Hello Tomer,

Thanks for sharing the configuration. It helps put jobs that exceed RequestDisk on hold. We have a problem in our infrastructure: people don't request disk in the job spec, so I want to modify the request on the worker machine based on some logic related to CPUs. I am seeing strange behavior.

RequestDisk stays at whatever we put in the job submit file (2GB), but I couldn't understand where the Disk attribute is coming from. By default it's ~4GB.

# condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk

globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#429.0#1685829846  4271297  27         4271296648  4271297.0      2097152

Attempt 1: Try to modify RequestDisk to 4GB, but Disk becomes 8GB. Maybe the default 4GB is being added.

MODIFY_REQUEST_EXPR_REQUESTDISK = 4194304

globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#430.0#1685830072  8542594  27         4271296648  8542594.0      2097152


Attempt 2: Try to modify RequestDisk to 6GB, but Disk becomes 8GB. By the 4GB-addition logic it should have been 10GB.

MODIFY_REQUEST_EXPR_REQUESTDISK = 6291456


globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#431.0#1685830179  8542594  2          4271296648  8542594.0      2097152

Attempt 3: Try to modify RequestDisk to 8GB; as expected, Disk becomes 12GB.

MODIFY_REQUEST_EXPR_REQUESTDISK = 8388608

globaljobid                        disk      DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#428.0#1685829703  12813890  8192027    4271296648  12813890.0     2097152

Attempt 4: Try to modify the disk size to 1GB. It retains the 4GB size.

MODIFY_REQUEST_EXPR_REQUESTDISK = 1048576

globaljobid                        disk     DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#432.0#1685830887  4271297  2          4271296648  4271297.0      2097152


Command used to grab outputs:

condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk
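
One pattern in the four attempts above can be checked with plain arithmetic: every Disk value reported is a whole multiple of one thousandth of TotalDisk, rounded up. The sketch below only verifies that observation against the numbers in this message. It is not a description of how the startd actually sizes dynamic slots, and the helper names (`ceil_div`, `units_of_total`) are hypothetical.

```python
# Values copied from the condor_who outputs above (all in KiB).
TOTAL_DISK = 4271296648

# job id -> (value set in MODIFY_REQUEST_EXPR_REQUESTDISK or None, Disk observed)
observed = {
    "429.0": (None, 4271297),       # no modify expression set
    "430.0": (4194304, 8542594),    # attempt 1
    "431.0": (6291456, 8542594),    # attempt 2
    "428.0": (8388608, 12813890),   # attempt 3
    "432.0": (1048576, 4271297),    # attempt 4
}


def ceil_div(a, b):
    """Integer division rounded up (for positive b)."""
    return -(-a // b)


def units_of_total(disk, total=TOTAL_DISK, parts=1000):
    """Return k if disk == ceil(k * total / parts), otherwise None."""
    for k in range(1, parts + 1):
        value = ceil_div(k * total, parts)
        if value == disk:
            return k
        if value > disk:
            return None
    return None


units = {job: units_of_total(disk) for job, (_, disk) in observed.items()}

for job, (modified, disk) in observed.items():
    print(f"job {job}: modify={modified} Disk={disk} units={units[job]}")
```

If the pattern holds, the "default 4GB" is simply one such unit (about 0.1% of TotalDisk), and the 8GB and 12GB values are two and three units. Why a given request maps to a particular number of units is not something these numbers alone can settle.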


Finally, more confusion: negative disk values in the following output:

# condor_status `hostname` -server
Name                                OpSys  Arch    LoadAv  Memory  Disk       Mips   KFlops

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64  0.000   172962  -57841021  22492  1705677
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   12813890   22492  1705677
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   8542594    22492  1705677
slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   8542594    22492  1705677
slot1_4@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4271297    22492  1705677

              Machines  Avail  Memory  Disk                  MIPS    KFLOPS

X86_64/LINUX  5         5      249834  18446744073685880970  112460  8528385

Total         5         5      249834  18446744073685880970  112460  8528385

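The huge Disk total in the summary rows is consistent with the negative sum of the Disk column being printed as an unsigned 64-bit integer. The sketch below checks that arithmetic using only the numbers from the output above. It does not explain why slot1 reports a negative Disk in the first place.

```python
# Disk column from the condor_status -server output above (KiB).
disk_column = {
    "slot1": -57841021,
    "slot1_1": 12813890,
    "slot1_2": 8542594,
    "slot1_3": 8542594,
    "slot1_4": 4271297,
}

# Disk value shown in the "Total" row of the same output.
REPORTED_TOTAL = 18446744073685880970

# Plain signed sum of the column.
signed_sum = sum(disk_column.values())

# The same value reinterpreted as an unsigned 64-bit integer (wraps modulo 2**64).
as_unsigned_64 = signed_sum % 2**64

print("signed sum:     ", signed_sum)
print("as unsigned 64: ", as_unsigned_64)
print("matches total:  ", as_unsigned_64 == REPORTED_TOTAL)
```
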



Questions:

- Where is it picking up the default 4GB Disk size from?
- Why is it setting the Disk size to values different from what we ask for in the modify expression?
- Why do we see a negative disk value in the -server output?


HTCondor version: 9.0.17



Regards,
Vikrant Aggarwal

On Thu, 1 Jun, 2023, 09:38 Tomer Pearl, <tomerp@xxxxxxxxxxx> wrote:
Hi Vikrant,

The following configuration works for me. Not sure which version I'm running, should be 9+.

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
DISK_USAGE_EXCEEDED = (JobUniverse !=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
use POLICY: WANT_HOLD_IF = (DISK_USAGE_EXCEEDED, 105, my error string..).

Not sure if "my error string.." should be surrounded by quotation marks, as I'm templating the file with Jinja.

Tomer.


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Vikrant Aggarwal <ervikrant06@xxxxxxxxx>
Sent: Thursday, June 1, 2023 12:44 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] HTcondor disk resource related queries

Hello Experts,

I am testing this configuration to put jobs that breach the disk limit on hold.

STARTD_JOB_ATTRS = $(STARTD_JOB_ATTRS) RequestDisk
DISK_USAGE_EXCEEDED = (JobUniverse =!=13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk)
WANT_HOLD = $(DISK_USAGE_EXCEEDED)
WANT_HOLD_REASON = "Job exceeded disk usage limits"

I can clearly see the jobs are using more than their RequestDisk size, yet they are not getting held.

# condor_who -af:h globaljobid disk DiskUsage TotalDisk TotalSlotDisk RequestDisk

globaljobid                        disk      DiskUsage  TotalDisk   TotalSlotDisk  RequestDisk
test.example.com#412.0#1685567906  21356484  8192026    4271296648  21356484.0     16777216
test.example.com#413.0#1685567923  12813890  8192026    4271296648  12813890.0     8388608
test.example.com#414.0#1685567952  8542594   8192026    4271296648  8542594.0      3250000
test.example.com#415.0#1685568493  8542594   8192025    4271296648  8542594.0      3250000
test.example.com#416.0#1685568803  12813890  8192026    4271296648  12813890.0     10000000
test.example.com#417.0#1685568954  4271297   8192025    4271296648  4271297.0      1
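
As a cross-check on which of these jobs the policy would be expected to hold, the DISK_USAGE_EXCEEDED condition can be applied by hand to the rows above. This is a rough Python imitation of the expression, not a ClassAd evaluation. JobUniverse is not shown in the output, so the sketch assumes none of the jobs is in universe 13, and `disk_usage_exceeded` is a hypothetical helper name.

```python
# (job id, DiskUsage, RequestDisk) copied from the condor_who output above (KiB).
rows = [
    ("412.0", 8192026, 16777216),
    ("413.0", 8192026, 8388608),
    ("414.0", 8192026, 3250000),
    ("415.0", 8192025, 3250000),
    ("416.0", 8192026, 10000000),
    ("417.0", 8192025, 1),
]


def disk_usage_exceeded(disk_usage, request_disk, job_universe=None):
    """Imitates: JobUniverse =!= 13 && DiskUsage =!= UNDEFINED && DiskUsage > RequestDisk.

    None stands in for UNDEFINED. JobUniverse is an assumption here because
    the output above does not include it.
    """
    return (
        job_universe != 13
        and disk_usage is not None
        and disk_usage > request_disk
    )


exceeding = [job for job, usage, request in rows if disk_usage_exceeded(usage, request)]
print("jobs over their RequestDisk:", exceeding)
```

By this reading only jobs 414.0, 415.0 and 417.0 are over their RequestDisk. The other three are still below it, so they would not be held even with a working policy.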

I am using HTCondor version 9.0.17.


Thanks & Regards,
Vikrant Aggarwal


On Tue, May 30, 2023 at 1:09 PM Vikrant Aggarwal <ervikrant06@xxxxxxxxx> wrote:
Hello Experts,

A couple of queries:

- Why is it showing a negative value for the primary partitionable slot?

# condor_status `hostname` -server
Name                                OpSys  Arch    LoadAv  Memory  Disk       Mips   KFlops

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxx    LINUX  X86_64  0.000   211398  -25210961  25601  1764976
slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4278313    25601  1764976
slot1_2@xxxxxxxxxxxxxxxxxxxxxxxxxx  LINUX  X86_64  0.000   19218   4278313    25601  1764976

              Machines  Avail  Memory  Disk                  MIPS   KFLOPS

X86_64/LINUX  3         3      249834  18446744073692897281  76803  5294928

Total         3         3      249834  18446744073692897281  76803  5294928


# condor_status -compact `hostname` -af Disk
4269756335


- I have this in the worker node configuration to modify the job's requested disk to the value shown, but it never worked. We are using a similar expression for CPU and memory, and it works fine.

# condor_config_val MODIFY_REQUEST_EXPR_REQUESTDISK
80000

Not sure where it's picking this value from.

# grep -r 'Disk =' /spare/condor/dir_14*/.machine.ad
/spare/condor/dir_1417831/.machine.ad:Disk = 4278313
/spare/condor/dir_1417831/.machine.ad:TotalDisk = 4278312960
/spare/condor/dir_1417831/.machine.ad:TotalSlotDisk = 4278313.0
/spare/condor/dir_1425169/.machine.ad:Disk = 4278313
/spare/condor/dir_1425169/.machine.ad:TotalDisk = 4278312960
/spare/condor/dir_1425169/.machine.ad:TotalSlotDisk = 4278313.0


# du -sh /spare/condor/dir_1425169
3.0G    /spare/condor/dir_1425169

Thanks & Regards,
Vikrant Aggarwal

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/