[HTCondor-users] Is our Defrag working?



Hi,

Our defrag daemon seems to be both working and not working.  At the very least, the logging cannot be correct, since it says that it is draining and also that it is not draining.

Exhibit 1: the DefragLog never says it's draining:

$ grep urrently DefragLog | cut -d" " -f3-8 | sort -n | uniq -c
      5 Couldn't fetch startd ads using constraint
   1149 There are currently 0 draining and
      5 There are currently -1 draining and

Exhibit 2: the DefragLog regularly says it's draining:

$ grep nitiating DefragLog | cut -d" " -f3-8 | sort -n | uniq -c | sort -nr
     65 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      7 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxx
      4 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
      3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx

Exhibit 3: the DefragLog says it's both draining and not draining:

07/05/24 12:00:54 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 12:00:54 Expected draining completion time is 580s; expected draining badput is 4786 cpu-seconds
07/05/24 12:00:54 Drained maximum number of machines allowed in this cycle (1).
07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).
07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.
07/05/24 12:05:55 Set of current whole machines is
07/05/24 12:05:55        wn-sate-079.nikhef.nl
07/05/24 12:05:55 Set of current draining machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly Arrived whole machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly departed draining machines is
07/05/24 12:05:55 (no machines)

If it had just drained the machine, how did that not even take one second?  And why is it then not listed under the whole machines?
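
To check how often this mismatch occurs across the whole log, rather than in one hand-picked excerpt, something like the following rough sketch could pair each drain request with the draining count reported at the next polling cycle. It assumes the DefragLog line format shown in the exhibits above (date, time, then the message), and the function name pair_drain_events is only an illustrative name, not anything that ships with HTCondor.

```shell
# Pair each "Initiating graceful draining" line with the draining count
# reported by the next "There are currently N draining" line.
# Reads a DefragLog on stdin. Assumes the line format shown in the exhibits.
# If several drains are initiated before the next report, only the last
# one is kept (our log shows at most one drain per cycle).
pair_drain_events() {
    awk '
        /Initiating graceful draining of/ {
            started = $1 " " $2   # date and time of the drain request
            target  = $NF         # slot name is the last field on the line
            pending = 1
            next
        }
        pending && /There are currently/ {
            # Field 6 is N in "There are currently N draining and ..."
            print started, target, "-> next cycle at", $2, "reports", $6, "draining"
            pending = 0
        }
    '
}
```

It would be run as `pair_drain_events < DefragLog`. If every drain request is followed by a report of 0 draining, that would suggest the mismatch is systematic rather than a one-off.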

JT