How did you configure the defrag daemon, and how many machines are in the pool in total?
In my experience, some of the HTCondor mechanisms do not scale down well to very small numbers.
When it comes to draining, if I remember correctly, the number of 'whole machines' includes those that become whole through natural events such as jobs finishing. That can make the daemon's behaviour look odd at times.
I think condor_status -defrag will give you the most accurate view of which machines are currently draining.
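
To compare what the daemon believes with what the startds report, something along these lines might help. It is only a sketch: the function name show_defrag_state is made up for illustration, it assumes the HTCondor command-line tools are on PATH on a host that may query the collector, and the Draining constraint relies on the startd advertising that attribute in its slot ads.

```shell
#!/bin/sh
# Sketch only: gather the defrag configuration and two views of draining state.
# show_defrag_state is a hypothetical helper name, not part of HTCondor.

show_defrag_state() {
    # Bail out early if the HTCondor tools are not available.
    if ! command -v condor_status >/dev/null 2>&1; then
        echo "condor_status not found on PATH" >&2
        return 1
    fi

    echo "== DEFRAG_* configuration =="
    # Dump every configuration macro whose name matches DEFRAG.
    condor_config_val -dump DEFRAG

    echo "== defrag daemon ad =="
    # The ad published by the defrag daemon itself.
    condor_status -defrag -long

    echo "== slots that report Draining =="
    # Assumption: the startd sets Draining in its ads while it drains,
    # so this view does not depend on what the defrag daemon logs.
    condor_status -constraint 'Draining =?= true' -af Machine Name
}
```

Running show_defrag_state shortly after an "Initiating graceful draining" line appears in the DefragLog should show whether the startd ever reported itself as draining, or whether it was already empty by the time the next polling cycle ran.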
Hi,
Our defrag daemon seems to be in a funny state of working and also not working. In any case, the logging seems like it cannot be correct, as it says that it is draining and also not draining.
Exhibit 1: the defrag log never says it's draining:
$ grep urrently DefragLog | cut -d" " -f3-8 | sort -n | uniq -c
5 Couldn't fetch startd ads using constraint
1149 There are currently 0 draining and
5 There are currently -1 draining and
Exhibit 2: the DefragLog regularly says it's draining:
$ grep nitiating DefragLog | cut -d" " -f3-8 | sort -n | uniq -c | sort -nr
65 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
7 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxx
4 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
3 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
Exhibit 3: the DefragLog says it's both draining and not draining:
07/05/24 12:00:54 Initiating graceful draining of slot1@xxxxxxxxxxxxxxxxxxxxxx
07/05/24 12:00:54 Expected draining completion time is 580s; expected draining badput is 4786 cpu-seconds
07/05/24 12:00:54 Drained maximum number of machines allowed in this cycle (1).
07/05/24 12:00:54 Drained 1 machines (wanted to drain 1 machines).
07/05/24 12:05:55 There are currently 0 draining and 1 whole machines.
07/05/24 12:05:55 Set of current whole machines is
07/05/24 12:05:55 wn-sate-079.nikhef.nl
07/05/24 12:05:55 Set of current draining machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly Arrived whole machines is
07/05/24 12:05:55 (no machines)
07/05/24 12:05:55 Newly departed draining machines is
07/05/24 12:05:55 (no machines)
If it had just drained the machine, how did that not even take one second? And why is it then not listed under the whole machines?
JT