After examining Accountantnew.log, we think we see what's happening.
1) On a job's first billing cycle, the job is considered to be running in the node's partitionable slot. Since we're using the default SlotWeight (= Cpus), the partitionable slot has a weight equal to the number of unused cores, and the job gets charged accordingly.
103 Resource.slot1@node0050.tower-research.com@<10.xxx:9618?addrs=10.xxx-9618&noUDP&sock=1281881_a2e2_3> SlotWeight 28.000000
2) On the second billing cycle, the job is assigned to the dynamic slot. This job requested 1 CPU.
103 Resource.slot1_1@node0050.skae.tower-research.com@<10.xxx:9618?addrs=10.xxx-9618&noUDP&sock=1281881_a2e2_3> SlotWeight 1.000000
We deal with this problem by defining SLOT_WEIGHT as ifThenElse(SlotType == "Partitionable", 1, Cpus). This undercharges for the first billing cycle for multicore jobs, but they comprise only a small percentage of our jobs.
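To make the effect of the two weight expressions concrete, here is a small Python sketch that mimics how they evaluate against the two slot ads above. It is only an illustration: the dictionaries are hypothetical stand-ins for the slot ClassAds (with the Cpus values 28 and 1 taken from the log lines), and the function names are made up for this example, not part of HTCondor.

```python
# Hypothetical stand-ins for the slot ClassAds seen in the log lines above.
partitionable_slot = {"Name": "slot1", "SlotType": "Partitionable", "Cpus": 28}
dynamic_slot = {"Name": "slot1_1", "SlotType": "Dynamic", "Cpus": 1}


def default_slot_weight(slot):
    """Mirror the default weight expression, SlotWeight = Cpus."""
    return slot["Cpus"]


def workaround_slot_weight(slot):
    """Mirror ifThenElse(SlotType == "Partitionable", 1, Cpus)."""
    if slot["SlotType"] == "Partitionable":
        return 1
    return slot["Cpus"]


for slot in (partitionable_slot, dynamic_slot):
    print(
        slot["Name"],
        "default:", default_slot_weight(slot),
        "workaround:", workaround_slot_weight(slot),
    )
```

With the workaround, a 1-CPU job is weighted 1 in both billing cycles. A job in a 4-CPU dynamic slot would be weighted 1 for its first cycle and 4 afterwards, which is the undercharge mentioned above.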
The second issue is that the length of the billing cycles is a function of NEGOTIATOR_INTERVAL and NEGOTIATOR_CYCLE_DELAY. We had reduced the former from the default of 60 seconds; hence the overcharge for jobs was approximately TotalCpus * NEGOTIATOR_INTERVAL = 28 * 30 = 560 seconds.
By reducing both of these negotiator parameters to 1 second, we can get reasonably accurate billing. So far, the NegotiatorRecentDaemonCoreDutyCycle is staying below 50%.
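For a rough feel of why the cycle length matters, the sketch below models the charge under one simplifying assumption: the job is billed at the partitionable slot's weight for exactly one billing cycle and at the dynamic slot's weight for the rest of its run. This is our reading of the log, not the accountant's actual algorithm, and the function names, the one-hour runtime, and the cycle lengths are illustrative only.

```python
def charged_slot_seconds(runtime_s, cycle_s, first_weight, steady_weight):
    """Weighted usage under the simplified one-cycle model."""
    # Time billed at the partitionable slot's weight (at most one cycle).
    first = min(cycle_s, runtime_s)
    return first * first_weight + (runtime_s - first) * steady_weight


def overcharge(runtime_s, cycle_s, first_weight, steady_weight):
    """Extra usage relative to billing the whole run at the steady weight."""
    charged = charged_slot_seconds(runtime_s, cycle_s, first_weight, steady_weight)
    return charged - runtime_s * steady_weight


# A 1-CPU job running for an hour on a node with 28 unused cores,
# compared across two illustrative billing-cycle lengths.
for cycle_s in (60, 1):
    extra = overcharge(3600, cycle_s, first_weight=28, steady_weight=1)
    print(f"cycle={cycle_s}s overcharge={extra} weighted seconds")
```

In this model the overcharge is (first_weight - steady_weight) * cycle length, so it shrinks in direct proportion to the billing cycle and does not depend on how long the job runs once it outlasts the first cycle.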
Thanks for everyone's help.
Jon