We've run into a problem using partitionable slots on large multicore machines running Windows. We have several machines with 32 or more processor threads, but at most 9 of these cores get utilized for running jobs. Our slot configuration:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, ram=100%, swap=90%, disk=90%
SLOT_TYPE_1_PARTITIONABLE = true

Poring over the logs reveals that jobs do try to start on slots 10 through 32, but they are killed immediately with a 10054 error. It looks like the user account used to run these jobs (condor-reuse-slot1_XX) cannot be created, which results in permission errors. Windows usernames appear to be limited to 20 characters, so the 21-character condor username fails (condor-reuse-slot1_X is OK but condor-reuse-slot1_XX is not).
Our current workaround is to create 4 partitionable slots, each with a 25% share of resources. Of course, this means that the maximum amount of RAM etc. that any single job can use is more limited than it would be with a single partitionable slot. Note that this is not a problem on our Linux machines.
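For illustration, a four-slot workaround of this kind might look like the sketch below. It assumes the same slot type as above split into four equal shares. The 25% values for swap and disk are an assumption, not the exact settings in use.

```
# Sketch only: four partitionable slots of type 1, each with a quarter of the machine.
# The swap and disk percentages are assumed, not taken from the real configuration.
NUM_SLOTS_TYPE_1 = 4
SLOT_TYPE_1 = cpus=25%, ram=25%, swap=25%, disk=25%
SLOT_TYPE_1_PARTITIONABLE = true
```

On a 32-thread machine, each partitionable slot then has 8 CPUs. It should therefore never hand out more than 8 single-CPU dynamic slots, which keeps the numeric suffix to one digit, assuming the other slots' usernames follow the same condor-reuse-slotN_X pattern.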
Is this a known bug? Is there a better solution/workaround?
Thanks
Alex
-- Alex M. Chubaty, PhD
Postdoctoral Researcher | Stagiaire postdoctoral (recherche)
Canadian Forest Service
Pacific Forestry Centre
506 Burnside Road W
Victoria, BC V8Z 1M5
Université Laval
Faculté de foresterie, de géographie et de géomatique
Département des sciences du bois et de la forêt
phone: +1.250.298.2347
email: achubaty@xxxxxxxxxxx
http://alexchubaty.com