
Re: [Condor-users] Problem with schedd ad ?



On Fri, 06 Jan 2006 11:04:07 +0100  Jean-Christophe BACCON wrote:

> With condor 6.7.10, I have the following error message in my negociator
> logs :
...
> 1/5 18:36:40 Phase 4.1:  Negotiating with schedds ...
> 1/5 18:36:40   Error!  Could not get Name and ScheddIpAddr from ad
> 1/5 18:36:40 ---------- Finished Negotiation Cycle ----------
> 
> This message is repeated all time and no more job goes in RUN state (but
> previously running jobs continue normally).

weird.  we saw the same bug.  i fixed the negotiator in 6.7.14 so that
when this happens, it doesn't abort the entire negotiation cycle; it
just ignores the badly formed schedd classad and tries to negotiate
with the other schedds.  (sorry, the fix isn't in the version history.
it's also in the forthcoming 6.6.11 release, and we still don't have a
good system for documenting bug fixes that land in multiple releases.)
so, if you upgrade your central manager to 6.7.14, this problem will at
least no longer prevent other schedds from negotiating and running
jobs.
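
to make the difference concrete, here is a rough python sketch of the
two behaviors.  it is not the negotiator's actual code (that is C++);
the function name, the `skip_malformed` flag, and the dict-based ads are
all made up for illustration, and the "old" branch simply assumes the
cycle stops at the first bad ad.

```python
def negotiate_cycle(schedd_ads, negotiate_with, skip_malformed=True):
    """Walk the schedd ads and return the names we negotiated with.

    schedd_ads     -- list of dicts standing in for schedd classads (hypothetical)
    negotiate_with -- callback taking (name, addr); hypothetical stand-in
    skip_malformed -- True mimics the fixed behavior, False the old one
    """
    negotiated = []
    for ad in schedd_ads:
        name = ad.get("Name")
        addr = ad.get("ScheddIpAddr")
        if not name or not addr:
            print("Error!  Could not get Name and ScheddIpAddr from ad")
            if skip_malformed:
                continue  # fixed behavior: ignore only this ad
            break         # old behavior (assumed): give up on the cycle
        negotiate_with(name, addr)
        negotiated.append(name)
    return negotiated


ads = [
    {"Name": "schedd-a", "ScheddIpAddr": "<10.0.0.1:9618>"},
    {"Name": "schedd-bad"},  # malformed: no ScheddIpAddr
    {"Name": "schedd-b", "ScheddIpAddr": "<10.0.0.2:9618>"},
]
fixed = negotiate_cycle(ads, lambda name, addr: None)
old = negotiate_cycle(ads, lambda name, addr: None, skip_malformed=False)
```

with the fix, schedd-b still gets its turn even though the ad ahead of
it is malformed; without it, everything after the bad ad is starved.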

however, we were never able to reproduce the problem that was causing
the schedd ads to show up like this in the first place.  i have some
suspicions, because the code in the schedd responsible for generating
these classads is a mess and it needs to be rewritten (this has been
on our development to-do list for quite some time).  so, instead of
trying to really analyze what's causing this bug, we decided to just
fix the negotiator so it's not such a catastrophic failure when it
happens, re-write the schedd's code that's generating the ads, and
hope that the problem goes away once we clean everything up.

> But I have an "unexpanded" job in my queue :
...
> 104 jobs; 5 idle, 98 running, 0 held, 1 unexpanded
> 
> What does this mean ?

long ago, we distinguished between jobs that have never run
(unexpanded) and jobs that tried to run at least once but are
currently not running (idle).  so, when you first submitted jobs to
condor, they used to show up in the queue with status "U"
(unexpanded), and would only become "I" (idle) once they had started
running somewhere and were then evicted for some reason.  however, we
haven't used this "U" state in ages, so i don't know why condor_q is
telling you that one of your jobs is unexpanded... that's pretty
weird.  
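
for reference, the summary line in the quoted output is just a tally of
per-job states.  the sketch below shows that tally in python; it is
not condor_q's code, and the numeric codes (0 = unexpanded, 1 = idle,
2 = running, 5 = held) are an assumption based on the JobStatus job
attribute, so check them against your version's manual.

```python
from collections import Counter

# Assumed JobStatus codes; only the states shown in the summary line are listed.
STATUS_NAMES = {0: "unexpanded", 1: "idle", 2: "running", 5: "held"}


def summarize(statuses):
    """Build a condor_q-style summary line from a list of numeric job states."""
    counts = Counter(STATUS_NAMES.get(s, "other") for s in statuses)
    return "%d jobs; %d idle, %d running, %d held, %d unexpanded" % (
        len(statuses),
        counts["idle"],
        counts["running"],
        counts["held"],
        counts["unexpanded"],
    )


# 5 idle, 98 running, and one job whose state is still 0 (unexpanded).
line = summarize([1] * 5 + [2] * 98 + [0])
```

a job only lands in the "unexpanded" bucket if its state is still 0,
which is why a single stray count there is surprising.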

> What is the problem ?

unfortunately, i don't know.  i know the solution will be a newer
version of the condor_schedd, but i can't say exactly when we're going
to have a chance to fix this stuff.  certainly before 6.8.0, but i
don't know exactly what 6.7.x release it'll show up in.

sorry i can't be more help,
-derek