Re: [Condor-users] Lazy jobs that never really start running
- Date: Wed, 6 Jul 2005 11:58:03 +0100
- From: Matt Hope <matthew.hope@xxxxxxxxx>
- Subject: Re: [Condor-users] Lazy jobs that never really start running
On 7/6/05, Horvatth Szabolcs <szabolcs@xxxxxxxxxxxxx> wrote:
> I forgot to add that I'm using 6.7.8 on Windows machines.
Have you run condor_store_cred? And have you changed your password
since you last did?
It is an annoying flaw/bug/gripe with the Windows functionality that
if your credential is wrongly stored, the jobs in the queue will
continue to match and attempt to run on a machine. The shadow is
started on the local machine as you but barfs, and the job gets kicked
off the machine it matched (after sitting there for a bit wasting
time). Rinse, repeat.
I recommend that any Windows pool run the following command on a regular* basis:
condor_q -global -constraint "JobRunCount>=500"
This can, however, produce false positives if you have long-running
jobs that can checkpoint. It does tend to spot people who have not
rerun condor_store_cred since a password change very nicely.
Matt
* I mean hourly here, since the query does put a load on the schedds.
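
For anyone who wants to run the check above from a scheduled task, here is a rough Python sketch of one way to do it. It is only an illustration, not something from the original post: the function names and the rule for skipping status lines are my own assumptions, and it assumes condor_q is on the PATH and that the -format option prints one Owner value per line.

```python
import subprocess
from collections import Counter


def count_owners(condor_q_output):
    """Count jobs per owner from output with one owner name per line.

    Assumption: any line containing internal whitespace is a status
    message (for example "All queues are empty"), not an owner name.
    """
    owners = []
    for line in condor_q_output.splitlines():
        name = line.strip()
        if name and len(name.split()) == 1:
            owners.append(name)
    return Counter(owners)


def find_suspect_owners(threshold=500):
    """Run the condor_q query from the post and tally matching jobs by owner."""
    cmd = [
        "condor_q", "-global",
        "-constraint", "JobRunCount>=%d" % threshold,
        "-format", "%s\n", "Owner",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return count_owners(result.stdout)
```

Calling find_suspect_owners().most_common() would list the owners with the most high-run-count jobs first. The same caveat applies as with the raw command: long-running jobs that checkpoint can show up too, so treat the result as a list of people to ask rather than a verdict.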