Dear all,
I'm running numerical experiments, solving optimization problems and
collecting the log files to compare different algorithms.
The program requires about 12 GB of memory to solve a problem.
The machine I am using is a cluster of 27 nodes, each with 12 slots,
and each slot has 2 GB of memory.
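In other words, each node has 12 × 2 GB = 24 GB of memory in total.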
The following is my current condor submit file.
universe = vanilla
notification = never
should_transfer_files = yes
when_to_transfer_output = always
copy_to_spool = false
requirements = regexp("slot([1-9]|1[0-2])@pedigree-([1-9]|1[0-9]|2[0-7]).*",Name)
request_memory = 12000
executable = limit.sh
output = out
error  = err
log    = log
transfer_input_files = program, input_file
arguments = 22600 12000000 ./program -f input_file --algorithm search
queue
I am submitting 100-200 jobs at once, hoping that condor will schedule them for me.
Everything was fine as long as each job used less than 4 GB of memory.
What I am seeing is:
condor assigns many jobs to a single node, so more than 2 jobs end up on 1 node.
As the program solves the input problem, it takes more and more memory.
At some point, some of the jobs become suspended and eventually go idle.
I guess this is because HTCondor tries to allocate the resources within a single machine rather than using unclaimed slots on other machines.
I confirmed this by submitting a small number of jobs; HTCondor did not use the ~300 available slots.
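If my arithmetic is right, two of these jobs already need 24 GB, which is all the memory one node has, so as soon as more than two jobs land on the same node the node runs out of memory.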
I changed the submit file above as follows:
requirements = regexp("slot([1-9]|1[0-2])@pedigree-([1-9]|1[0-9]|2[0-7]).*",Name) && (Memory >= 12000)
request_memory = 12000
Unfortunately, it did not resolve the issue.
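If I understand the ClassAds correctly, the Memory attribute is per slot (2 GB here), so a condition like Memory >= 12000 may never match a statically partitioned slot. One thing I considered, as a rough sketch only, is to constrain on the machine-wide TotalMemory attribute instead and keep request_memory:

requirements = regexp("slot([1-9]|1[0-2])@pedigree-([1-9]|1[0-9]|2[0-7]).*",Name) && (TotalMemory >= 12000)
request_memory = 12000

But I am not sure this is the right approach, and I suspect it still would not prevent several 12 GB jobs from being matched to slots on the same node.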
Could someone suggest a way to modify the condor submit file?
Thanks in advance.
Best,
Junkyu Lee