-----Original Message-----
From: Ian Chesal
Sent: Friday, November 21, 2008 4:58 PM
To: Ian Chesal; 'Condor-Users Mail List'
Subject: RE: 6.8.6 -> 7.0.5 on Windows taking a long time to vacate jobs
I'm in the process of moving Windows machines from 6.8.6 to 7.0.5 and
I was noting issues with thrashing and my startd RANK policy on
machines running 7.0.5. It appears that 7.0.5 takes a very long time
to preempt running jobs when a higher startd RANK job comes along. I
can switch in 6.8.6 for 7.0.5 and the same jobs take only a minute to
preempt, but when I move to 7.0.5 the jobs take 10 minutes to preempt.
In the time it takes to preempt jobs on 7.0.5-based machines the
waiting jobs give up their claim.
I tried increasing REQUEST_CLAIM_TIMEOUT from 900 to 1200 seconds but
it didn't make a difference. It's not desirable for my preemption
policy to push that number too much higher.
Has something changed from 6.8.6 to 7.0.5 in the way Condor is killing
jobs when they're preempted? I'm wondering why this operation takes so
much longer in 7.0.5 than it did in 6.8.6. These are plain vanilla
universe jobs, so no checkpointing.
Actually, if I change REQUEST_CLAIM_TIMEOUT and do a
'condor_reconfig -full -all' does it apply to newly spawned shadows or
do I have to restart Condor completely on my schedulers for this to
take effect?
Digging around a bit it could be related to:
http://www.cs.wisc.edu/condor/manual/v7.0/3_6Security.html#sec:RunAsNobody
I have:
SLOT1_USER=ALTERA\cndrusr1
SLOT2_USER=ALTERA\cndrusr2
SLOT3_USER=ALTERA\cndrusr3
SLOT4_USER=ALTERA\cndrusr4
SLOT5_USER=ALTERA\cndrusr5
SLOT6_USER=ALTERA\cndrusr6
SLOT7_USER=ALTERA\cndrusr7
SLOT8_USER=ALTERA\cndrusr8
But I set:
DEDICATED_EXECUTE_ACCOUNT_REGEXP = cndrusr[0-9]+
Should I have included the domain in that regexp?
DEDICATED_EXECUTE_ACCOUNT_REGEXP = ALTERA\\cndrusr[0-9]+
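The difference between the two regexps can be checked outside Condor.
The sketch below uses Python's re module, not Condor's own regexp
engine, and the helper name check_account is made up for illustration.
Two things are assumptions here, since the message does not settle
them: whether the account name being tested carries the ALTERA\ domain
prefix, and whether the match is anchored to the whole name
(fullmatch) or may hit anywhere inside it (search).

```python
import re

# The two candidate values from the message, written as raw strings so
# the backslashes reach the regex engine unchanged.
BARE = r"cndrusr[0-9]+"
QUALIFIED = r"ALTERA\\cndrusr[0-9]+"   # "\\" matches one literal backslash


def check_account(pattern, account):
    """Report how `pattern` fares against `account` under two semantics.

    Hypothetical helper: which of these semantics Condor applies is an
    assumption, not something confirmed by the message.
    """
    rx = re.compile(pattern)
    return {
        "search": rx.search(account) is not None,        # unanchored
        "fullmatch": rx.fullmatch(account) is not None,  # whole name
    }


if __name__ == "__main__":
    # One name with the domain prefix, one without.
    for account in (r"ALTERA\cndrusr1", "cndrusr1"):
        for label, pattern in (("bare", BARE), ("qualified", QUALIFIED)):
            print(account, label, check_account(pattern, account))
```

Under these assumptions the bare regexp only fails when the name is
domain-qualified and the match is anchored, while the qualified regexp
fails whenever the name arrives without the domain. Which case applies
to DEDICATED_EXECUTE_ACCOUNT_REGEXP would need to be confirmed against
the Condor documentation or source.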
My machines are saying USE_PROCD is undefined (and it is undefined in
my configs), but condor_procd is starting up anyway. Does that mean
condor_startd is using condor_procd to track and kill processes on my
Windows machines? Could this be my problem, that procd is doing this
work?
- Ian