
Re: [HTCondor-users] Is our Defrag working?



Hi Jeff,

we are currently not using the drain daemon and are in the middle of the EL9 upgrade today ;)

Are you sure your two drain daemons are not interfering with each other?

The output is roughly as I remember it, and you can ignore a lot of it - it is more like a dump of the actual defrag-related ClassAds ...

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Jeff Templon" <templon@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Dienstag, 9. Juli 2024 10:10:19
Betreff: Re: [HTCondor-users] Is our Defrag working?

Hi Christoph,

On 9 Jul 2024, at 08:23, Beyer, Christoph <christoph.beyer@xxxxxxx> wrote:

Hi Jeff,

your logs are a bit puzzling indeed ;)

How did you configure the defrag daemon, and what is the total number of machines in the pool?

DAEMON_LIST = $(DAEMON_LIST) COLLECTOR DEFRAG
DEFRAG_INTERVAL = 300
DEFRAG_SCHEDULE = peaceful
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
DEFRAG_MAX_CONCURRENT_DRAINING = 20
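
For completeness, the effective values could be double-checked on the host running the defrag daemon - a minimal sketch using standard condor_config_val queries; DEFRAG_WHOLE_MACHINE_EXPR and DEFRAG_REQUIREMENTS are the stock knobs, which we have left at their defaults:

# Dump every configuration variable whose name contains DEFRAG,
# as the running daemon actually sees it
$ condor_config_val -dump DEFRAG

# The expressions that decide what counts as a whole machine
# and which machines are candidates for draining
$ condor_config_val DEFRAG_WHOLE_MACHINE_EXPR
$ condor_config_val DEFRAG_REQUIREMENTS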

Some of the Condor mechanisms do not scale down to very small numbers, in my experience.

It's not so small - almost 6000 cores.

When it comes to draining, if I remember correctly, the number of 'whole machines' includes those that become a whole machine through natural events like jobs finishing and so on, which leads to sometimes weird-looking behaviour of the daemon ...

That could be part of it, yet I should have seen at least one case where a machine was still draining after one cycle (the expected completion time in the log below is 580s, longer than the 300s DEFRAG_INTERVAL), but this never seems to be the case.  I can try to completely fill the cluster with test jobs to see if that makes a difference, but the cluster has been virtually full a few times during the observation period.


I think condor_status -defrag will give you the most correct view of which machines are currently draining?

The output is very strange!
The first part looks like one might expect:

root@stbc-019:~# condor_status -defrag | head -64
Name                                            Draining     Peak TotalDrained

DEFRAG@xxxxxxxxxxxxxxxxxx                              0        0          272
My Pool - stbc-019.nikhef.nl@xxxxxxxxxxxxxxxxxx
stbc-019.nikhef.nl
stbc-019.nikhef.nl
DEFRAG@xxxxxxxxxxxxxxxxxx                              0        0           57
stbc-020.nikhef.nl
stbc-020.nikhef.nl

Makes sense, because we have two pools - the main one and an express pool - each of which has its own defrag daemon.  Then comes a bunch of identifiers that look like accounting handles (see e.g. computer.templon and datagrid.templon, two groups under which I've submitted jobs):

axx@xxxxxxxxx
ayy@xxxxxxxxx
azzz@xxxxxxxxx
computer.templon@xxxxxxxxx
cosmics.kxxx@xxxxxxxxx
datagrid.templon@xxxxxxxxx
eggg@xxxxxxxxx
ghhhh@xxxxxxxxx
gravwav.abbbb@xxxxxxxxx

Then comes the schedd, twice:

taai-007.nikhef.nl
taai-007.nikhef.nl

Then some more things that look like usernames:

templon@xxxxxxxxx
tyyy@xxxxxxxxx
vrrr@xxxxxxxxx
vssss@xxxxxxxxx
zpppp@xxxxxxxxx
zqqqq@xxxxxxxxx

This, I think, is the accounting group 'none':

<none>
<none>

Then some more accounting identifiers:

azzzz@xxxxxxxxx
datagrid.templon@xxxxxxxxx
gravwav.axxxx@xxxxxxxxx
group_computer
group_theorie

Then a bunch of execution points and slots on them.

slot1@xxxxxxxxxxxxxxxxxxxxx
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1_1@xxxxxxxxxxxxxxxxxxxxx
slot1_2@xxxxxxxxxxxxxxxxxxxxx
slot1_3@xxxxxxxxxxxxxxxxxxxxx
wn-knek-011.nikhef.nl
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1@xxxxxxxxxxxxxxxxxxxxx
slot1_1@xxxxxxxxxxxxxxxxxxxxx
slot1_2@xxxxxxxxxxxxxxxxxxxxx
slot1_3@xxxxxxxxxxxxxxxxxxxxx
wn-knek-012.nikhef.nl

It looks like the command, our configuration, or both are broken.
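
As a cross-check on our side (just a sketch with standard queries, nothing we have run yet), constraining the query to startd ads, or to the defrag daemon's own ads, might avoid pulling in every ad type in the collector the way -defrag appears to do here:

# Only startd ads that report they are draining
$ condor_status -startd -constraint 'Draining =?= true' -af:h Machine State Activity

# Only the defrag daemon ads with all their attributes
# (assuming the defrag ads advertise MyType "Defrag")
$ condor_status -any -constraint 'MyType == "Defrag"' -long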

Does the output on your end look like this?

JT

Ps: thanks for responding!

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


Von: "Jeff Templon" <templon@xxxxxxxxx>
An: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Gesendet: Montag, 8. Juli 2024 15:01:49
Betreff: [HTCondor-users] Is our Defrag working?

Hi,
Our defrag daemon seems to be in a funny state of both working and not working.  In any case, the logging cannot be correct, as it says that it is both draining and not draining.

Exhibit 1: the defrag log never says it's draining:

$ grep urrently DefragLog | cut -d" " -f3-8 | sort -n | uniq -c
      5 Couldn't fetch startd ads using constraint
   1149 There are currently 0 draining and
      5 There are currently -1 draining and

Exhibit 2: the DefragLog regularly says it's draining:

$ grep nitiating DefragLog | cut -d" " -f3-8 | sort -n | uniq -c | sort -nr
     65 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      7 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxx
      4 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx

Exhibit 3: the DefragLog says it's both draining and not draining:

07/05/24 12:00:54 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 12:00:54 Expected draining completion time is 580s; expected draining badput is 4786 cpu-seconds
07/05/24 12:00:54 Drained maximum number of machines allowed in this cycle (1).
07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).
07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.
07/05/24 12:05:55 Set of current whole machines is
07/05/24 12:05:55        wn-sate-079.nikhef.nl
07/05/24 12:05:55 Set of current draining machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly Arrived whole machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly departed draining machines is
07/05/24 12:05:55 (no machines)

If it had just drained the machine, how did that not even take one second?  And why is it then not listed under the whole machines?
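
One more cross-check that might help (again only a sketch): asking the startd directly whether it still considers itself to be draining.  Here <drained-host> is just a placeholder for the machine whose name is masked in the 12:00:54 line above.

# <drained-host> is a placeholder, not a real hostname from our pool
$ condor_status -startd <drained-host> -af:h Name State Activity Draining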

JT




_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/