Hi everyone,

My on-premises HTCondor pool has a fixed size: a known number of cores across a fixed number of machines.
I recently found a suggestion that looked like it would help balance the load a little:
NEGOTIATOR_PRE_JOB_RANK = isUndefined(RemoteOwner) * (- SlotId)

And as advertised, this definitely changed the system so that instead of loading all of the jobs on one machine, it is spreading the jobs across all of the machines. BUT I just realized it broke another part of my system.

Under the default formula

NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED)) - (100000 * Cpus) - Memory

jobs match the core with the least amount of memory first, and my load balancing depends on that. I have some cores set aside for the big-memory jobs, and jobs with little or no memory
requirements should go to the other cores first and only use the big-memory cores if nothing else is available and nothing else needs them. The expression that spreads jobs across many nodes appears to be filling slot #1 first, but that is not one of my lower-memory slots. I want it to match the lower-memory cores first, leaving the cores with larger memory allocations available for the jobs that come later in the sequence and need them.

THE REAL QUESTION IS HERE ---------------

I am guessing there is a way to combine these two interests? I want to spread the jobs widely across machines instead of filling up all the cores on one machine first, but I also want the
memory requirements to be given weight.

More info that might help or might just confuse...

Ideally:
1st job goes to a low-memory core on machine x.
2nd job goes to a low-memory core on a different machine, etc.
Then, once there is one job on each machine, the next job goes to the second low-memory core on a node, etc.

In the meantime, when a job requiring more memory shows up, it can go to a larger-memory core on one of the machines. But if all of the low-memory cores fill up and a larger core is available, go ahead and let a small job use it. The job run times are short enough to absorb that wait.

The current problem
is that the small jobs are taking up the bigger cores in a way that is noticeably delaying the start of large jobs.

Mary

Mary Romelfanger
Deputy Branch Manager / Principal Computer Scientist
Data Systems Branch
Phone 410-338-6708
Space Telescope Science Institute
3700 San Martin Drive
Baltimore, MD 21218
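
P.S. To make the difference between the two expressions concrete, here is a minimal Python sketch (not HTCondor code) that evaluates both rank formulas over a made-up pool and lists the slots from most to least preferred. Everything about the pool is an assumption for illustration: two machines named x and y, three single-core static slots each, slot 1 being the big-memory slot, and the memory sizes. It also assumes a ClassAd boolean behaves as 1 or 0 in arithmetic and that a higher rank is preferred. The names Slot, spread_rank, default_rank, and match_order are hypothetical helpers, not part of any HTCondor API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    machine: str
    slot_id: int
    memory: int                         # MB, hypothetical sizes
    cpus: int = 1
    rank: float = 0.0                   # stands in for My.Rank
    remote_owner: Optional[str] = None  # None models RemoteOwner being UNDEFINED


def spread_rank(s: Slot) -> float:
    # isUndefined(RemoteOwner) * (- SlotId)
    return int(s.remote_owner is None) * (-s.slot_id)


def default_rank(s: Slot) -> float:
    # (10000000 * My.Rank) + (1000000 * (RemoteOwner =?= UNDEFINED))
    #   - (100000 * Cpus) - Memory
    return (10000000 * s.rank
            + 1000000 * int(s.remote_owner is None)
            - 100000 * s.cpus
            - s.memory)


def match_order(slots, rank_fn):
    # Highest rank first. Ties keep list order here only because Python's
    # sort is stable; the real negotiator breaks ties by other means.
    ordered = sorted(slots, key=rank_fn, reverse=True)
    return [(s.machine, s.slot_id) for s in ordered]


# Hypothetical pool: slot 1 on each machine is the big-memory slot.
slots = [
    Slot("x", 1, 16000), Slot("x", 2, 2000), Slot("x", 3, 2000),
    Slot("y", 1, 16000), Slot("y", 2, 2000), Slot("y", 3, 2000),
]

spread_order = match_order(slots, spread_rank)
default_order = match_order(slots, default_rank)

print("spread :", spread_order)
print("default:", default_order)
```

In this toy pool the spreading expression prefers slot 1 on both machines (the big-memory slots) before anything else, while the default expression leaves those two slots for last but says nothing about which machine a small job lands on. That is the trade-off I am trying to get rid of.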