On Fri, Mar 26, 2010 at 12:44 PM, Nick LeRoy <nleroy@xxxxxxxxxxx> wrote:
Mag,
Once over 1000 jobs hit the pool, I start to see the question marks.
Is there some setting I can look at to fix this?
We just had a discussion here about this, and we have a number of questions.
1. What version of Condor are you running? A recent performance enhancement
could possibly be malfunctioning and causing the problems.
We are running version 7.2.4.
2. Do you know what the jobs are doing during these "events"? Is there a
pattern to them? For example, when you run your 'condor_q -run', do you
sometimes see all jobs good, and on other runs a grouping of '??????' jobs?
These jobs are heterogeneous: a mix of simple awk, Perl, R, and Octave
programs.
3. I think that it'd be helpful if you could post the following:
3a. job log snippet(s) around the window in which you've seen the problem
3b. ShadowLog snippet(s) of the same
Finally, some observations and a window into our thoughts:
1. When you run 'condor_q -run', it's equivalent to running:
condor_q -const 'JobStatus==2' -format ...
I will try this when the problem occurs. This usually occurs when the
other department lets us use their systems for overnight simulations.
2. It's possible that there's a race condition in which the job's status
(JobStatus) has been set to RUNNING (2) without the RemoteHost attribute
being set. This should never happen, but it evidently is happening. The
answers to the above questions may help us to isolate how this is happening.
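
One way to check for the suspected state directly is to ask the schedd for jobs that are RUNNING but have no RemoteHost. The following is only a rough sketch, not something from the Condor sources: the helper names (build_command, parse_job_ids, find_running_without_host) are made up for illustration, and it assumes that condor_q is on the PATH and that the ClassAd test `RemoteHost =?= UNDEFINED` behaves on your version as it does on current ones.

```python
import subprocess

# Jobs whose JobStatus is RUNNING (2) but whose RemoteHost attribute is unset.
# "=?=" is the ClassAd "is identical to" operator, so the comparison yields
# TRUE or FALSE instead of UNDEFINED when the attribute is missing.
CONSTRAINT = "JobStatus == 2 && RemoteHost =?= UNDEFINED"


def build_command(constraint=CONSTRAINT):
    """Return the condor_q argument list; prints one 'cluster.proc' per job."""
    return [
        "condor_q",
        "-constraint", constraint,
        "-format", "%d.", "ClusterId",
        "-format", "%d\n", "ProcId",
    ]


def parse_job_ids(output):
    """Turn 'cluster.proc' lines into (cluster, proc) tuples, skipping the rest."""
    ids = []
    for line in output.splitlines():
        cluster, _, proc = line.strip().partition(".")
        if cluster.isdigit() and proc.isdigit():
            ids.append((int(cluster), int(proc)))
    return ids


def find_running_without_host():
    """Run condor_q (assumed to be on PATH) and return the suspicious job ids."""
    result = subprocess.run(
        build_command(), capture_output=True, text=True, check=True
    )
    return parse_job_ids(result.stdout)
```

Calling find_running_without_host() in a loop while the pool is loaded, and noting the time whenever it returns a non-empty list, should give a window to look at in the job log and ShadowLog.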
Thanks Mag,
-Nick
--
<<< Welcome to the real world. >>>
/`-_ Nicholas R. LeRoy The Condor Project
{ }/ http://www.cs.wisc.edu/~nleroy http://www.cs.wisc.edu/condor
\ / nleroy@xxxxxxxxxxx The University of Wisconsin
|_*_| 608-265-5761 Department of Computer Sciences