Re: [HTCondor-users] Is our Defrag working?

Hi Jeff,

Your logs are indeed a bit puzzling ;)

How did you configure the defrag daemon, and how many machines are in the pool in total?

In my experience, some of the condor mechanisms do not scale down well to very small numbers.

When it comes to draining, if I remember correctly, the number of 'whole machines' includes those that became whole through natural events such as jobs finishing. This sometimes makes the daemon's behaviour look odd ...

I think condor_status -defrag will give you the most accurate view of which machines are currently draining.

Best

christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx

From: "Jeff Templon" <templon@xxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, 8 July 2024 15:01:49
Subject: [HTCondor-users] Is our Defrag working?

Hi,

Our defrag daemon seems to be in a funny state of working and also not working. In any case, the logging seems like it cannot be correct, as it says that it is draining and also not draining.

Exhibit 1: the defrag log never says it's draining:

$ grep urrently DefragLog | cut -d" " -f3-8 | sort -n | uniq -c
   5 Couldn't fetch startd ads using constraint
1149 There are currently 0 draining and
   5 There are currently -1 draining and

Exhibit 2: the DefragLog regularly says it's draining:

$ grep nitiating DefragLog | cut -d" " -f3-8 | sort -n | uniq -c | sort -nr
65 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
 7 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxx
 4 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
 3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx

Exhibit 3: the DefragLog says it's both draining and not draining:

07/05/24 12:00:54 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 12:00:54 Expected draining completion time is 580s; expected draining badput is 4786 cpu-seconds
07/05/24 12:00:54 Drained maximum number of machines allowed in this cycle (1).
07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).
07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.
07/05/24 12:05:55 Set of current whole machines is
07/05/24 12:05:55 wn-sate-079.nikhef.nl
07/05/24 12:05:55 Set of current draining machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly Arrived whole machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly departed draining machines is
07/05/24 12:05:55 (no machines)

If it had just drained the machine, how did that not even take one second? And why is it then not listed under the whole machines?
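
To check whether this pattern holds for every drain request and not just the one in Exhibit 3, the two kinds of log lines can be paired up. The following is only a minimal sketch: the regular expressions are derived from the log lines quoted above and may not match other HTCondor versions, the function name is made up for illustration, and the slot name in the sample is a placeholder because the real one is anonymized here.

```python
import re
from typing import Dict, Iterable, List

# Patterns derived from the DefragLog lines quoted in this thread (assumption:
# other HTCondor versions may word these messages differently).
INITIATE_RE = re.compile(r"Initiating graceful draining of (\S+)")
STATUS_RE = re.compile(
    r"There are currently (-?\d+) draining and (-?\d+) whole machines"
)


def pair_drains_with_next_status(lines: Iterable[str]) -> List[Dict]:
    """Attach to each drain request the counts from the next status line.

    Drain requests that are never followed by a status line are dropped.
    """
    pending: List[Dict] = []  # drain requests still waiting for a status line
    results: List[Dict] = []
    for line in lines:
        parts = line.split(maxsplit=2)  # date, time, message
        if len(parts) < 3:
            continue
        date, time, message = parts
        match = INITIATE_RE.search(message)
        if match:
            pending.append(
                {"slot": match.group(1), "initiated": f"{date} {time}"}
            )
            continue
        match = STATUS_RE.search(message)
        if match:
            for request in pending:
                request["status_at"] = f"{date} {time}"
                request["draining"] = int(match.group(1))
                request["whole"] = int(match.group(2))
                results.append(request)
            pending = []
    return results


# Sample built from Exhibit 3; "slot1@example-node" is a placeholder name.
SAMPLE = [
    "07/05/24 12:00:54 Initiating graceful draining of slot1@example-node",
    "07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).",
    "07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.",
]

for row in pair_drains_with_next_status(SAMPLE):
    print(row)
```

Running it over the full log (for example with `open("DefragLog")` as the input) would show, for each drain request, how many machines the daemon reported as draining at its next poll.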
