
Re: [HTCondor-users] Is our Defrag working?



Hi Jeff,

we are currently not using the drain daemon and are in the middle of the EL9 upgrade today ;)

Are you sure your two drain daemons are not interfering with each other?

The output is roughly as I remember it, and you can ignore a lot of it - it is more like a dump of the actual defrag-related ClassAds ...

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Jeff Templon" <templon@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 9. Juli 2024 10:10:19
Betreff: Re: [HTCondor-users] Is our Defrag working?

Hi Christoph,

On 9 Jul 2024, at 08:23, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi Jeff,

your logs are a bit puzzling indeed ;)

How did you configure the defrag daemon, and what is the total number of machines in the pool?

DAEMON_LIST = $(DAEMON_LIST) COLLECTOR DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_SCHEDULE = peaceful
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
DEFRAG_MAX_CONCURRENT_DRAINING = 20
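
For completeness, the effective values could be double-checked on the host running the defrag daemon - a minimal sketch using standard condor_config_val queries; DEFRAG_WHOLE_MACHINE_EXPR and DEFRAG_REQUIREMENTS are the stock knobs, which we have left at their defaults:

# Dump every configuration variable whose name contains DEFRAG,
# as the running daemon actually sees it
$ condor_config_val -dump DEFRAG

# The expressions that decide what counts as a whole machine
# and which machines are candidates for draining
$ condor_config_val DEFRAG_WHOLE_MACHINE_EXPR
$ condor_config_val DEFRAG_REQUIREMENTS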

Some of the Condor mechanisms do not scale down to very small numbers, in my experience.

It's not so small - almost 6000 cores.

When it comes to draining, if I remember correctly, the number of 'whole machines' includes those that become a whole machine through natural events like jobs finishing and so on, which leads to sometimes weird-looking behaviour of the daemon ...

That could be part of it, yet I should have seen at least one case where a machine was still draining after one cycle (the expected completion time in the log below is 580s, longer than the 300s DEFRAG_INTERVAL), but this never seems to be the case.  I can try to completely fill the cluster with test jobs to see if that makes a difference, but the cluster has been virtually full a few times during the observation period.


I think condor_status -defrag will give you the most correct view of which machines are currently draining?

The output is very strange!
The first part looks like one might expect:

root@stbc-019:~# condor_status -defrag | head -64
Name                                            Draining     Peak TotalDrained

DEFRAG@xxxxxxxxxxxxxxxxxx                              0        0          272
My Pool - stbc-019.nikhef.nl@xxxxxxxxxxxxxxxxxx
stbc-019.nikhef.nl
stbc-019.nikhef.nl
DEFRAG@xxxxxxxxxxxxxxxxxx                              0        0           57
stbc-020.nikhef.nl
stbc-020.nikhef.nl

Makes sense, because we have two pools - the main one and an express pool - each of which has its own defrag daemon.  Then comes a bunch of identifiers that look like accounting handles (see e.g. computer.templon and datagrid.templon, two groups under which I've submitted jobs):

axx@xxxxxxxxx
ayy@xxxxxxxxx
azzz@xxxxxxxxx
computer.templon@xxxxxxxxx
cosmics.kxxx@xxxxxxxxx
datagrid.templon@xxxxxxxxx
eggg@xxxxxxxxx
ghhhh@xxxxxxxxx
gravwav.abbbb@xxxxxxxxx

Then comes the schedd, twice:

taai-007.nikhef.nl
taai-007.nikhef.nl

Then some more things that look like usernames:

templon@xxxxxxxxx
tyyy@xxxxxxxxx
vrrr@xxxxxxxxx
vssss@xxxxxxxxx
zpppp@xxxxxxxxx
zqqqq@xxxxxxxxx

This, I think, is the accounting group 'none':

<none>
<none>

Then some more accounting identifiers:

azzzz@xxxxxxxxx
datagrid.templon@xxxxxxxxx
gravwav.axxxx@xxxxxxxxx
group_computer
group_theorie

Then a bunch of execution points and slots on them.

slot1@xxxxxxxxxxxxxxxxxxxxx
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1_1@xxxxxxxxxxxxxxxxxxxxx
slot1_2@xxxxxxxxxxxxxxxxxxxxx
slot1_3@xxxxxxxxxxxxxxxxxxxxx
wn-knek-011.nikhef.nl
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1_1@xxxxxxxxxxxxxxxxxxxxx
slot1_2@xxxxxxxxxxxxxxxxxxxxx
slot1_3@xxxxxxxxxxxxxxxxxxxxx
wn-knek-012.nikhef.nl

It looks like the command, our configuration, or both are broken.
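
As a cross-check on our side (just a sketch with standard queries, nothing we have run yet), constraining the query to startd ads, or to the defrag daemon's own ads, might avoid pulling in every ad type in the collector the way -defrag appears to do here:

# Only startd ads that report they are draining
$ condor_status -startd -constraint 'Draining =?= true' -af:h Machine State Activity

# Only the defrag daemon ads with all their attributes
# (assuming the defrag ads advertise MyType "Defrag")
$ condor_status -any -constraint 'MyType == "Defrag"' -long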

Does the output on your end look like this?

JT

Ps: thanks for responding!

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Jeff Templon" <templon@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Montag, 8. Juli 2024 15:01:49
Betreff: [HTCondor-users] Is our Defrag working?

Hi,
Our defrag daemon seems to be in a funny state of both working and not working.  In any case, the logging cannot be correct, as it says that it is both draining and not draining.

Exhibit 1: the defrag log never says it's draining:

$ grep urrently DefragLog | cut -d" " -f3-8 | sort -n | uniq -c
      5 Couldn't fetch startd ads using constraint
   1149 There are currently 0 draining and
      5 There are currently -1 draining and

Exhibit 2: the DefragLog regularly says it's draining:

$ grep nitiating DefragLog | cut -d" " -f3-8 | sort -n | uniq -c | sort -nr
     65 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      7 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxx
      4 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx

Exhibit 3: the DefragLog says it's both draining and not draining:

07/05/24 12:00:54 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 12:00:54 Expected draining completion time is 580s; expected draining badput is 4786 cpu-seconds
07/05/24 12:00:54 Drained maximum number of machines allowed in this cycle (1).
07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).
07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.
07/05/24 12:05:55 Set of current whole machines is
07/05/24 12:05:55        wn-sate-079.nikhef.nl
07/05/24 12:05:55 Set of current draining machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly Arrived whole machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly departed draining machines is
07/05/24 12:05:55 (no machines)

If it had just drained the machine, how did that not even take one second?  And why is it then not listed under the whole machines?
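
One more cross-check that might help (again only a sketch): asking the startd directly whether it still considers itself to be draining.  Here <drained-host> is just a placeholder for the machine whose name is masked in the 12:00:54 line above.

# <drained-host> is a placeholder, not a real hostname from our pool
$ condor_status -startd <drained-host> -af:h Name State Activity Draining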

JT




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/