Yeah, I forgot to mention: these are all 6.6.9 installs (maybe the occasional 6.6.8), vanilla universe, on the Windows platform.

Well, we switched the Condor master to a quad-processor machine with gigabit Ethernet running Windows 2000 Server, and we're also submitting jobs from there (and only there), but it doesn't run a startd, so it can't execute jobs itself.
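(If it matters, the no-startd part is just the usual daemon-list setup, something roughly like

    DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD

in the local config on that box, so it never runs a startd at all.)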
Unfortunately we're still having the same problem. None of the processors ever peaks, even during the longest queue of jobs, and the network card never registers over 25% or 30% of its total available bandwidth.
The jobs all have the same rank. I don't set anything in the submit files, so it's whatever the default is, 0 I guess.
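Just to be concrete about that: none of the submit files has a rank line at all, so they're basically bare-bones vanilla files along these lines (made-up names):

    universe    = vanilla
    executable  = crunch.exe
    arguments   = chunk_001.dat
    output      = chunk_001.out
    error       = chunk_001.err
    log         = chunk_001.log
    queue

and as far as I know that just falls back to a rank of 0.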
But on the machines that are claimed+idle, if I do a "condor_q -run" I get a whole bunch of jobs that think they're running on [???????????????????????] as the machine name. Only the jobs that are actually on a machine in the claimed+busy state have a real machine name next to them.
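(Side note: by claimed+idle I mean what condor_status reports for those machines; I assume I could list just the stuck ones with something like

    condor_status -constraint "State == \"Claimed\" && Activity == \"Idle\""

if I'm remembering the -constraint option right.)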
The machines are most definitely idle: it's after hours, and the loads show up as 0 in condor_status. If I only submit around, say, 100 jobs, they all start, but when the queue gets to 200 or more, only a handful of machines will keep running jobs. It's like clockwork: I submit 100, bam, they all run; I submit 100 more before they're done, and I only get a handful of claimed+busy machines until the queue chews back down to around 100 jobs.
What about the DEACTIVATE_CLAIM_FORCIBLY that shows up in all the logs? Does that mean the master is telling the worker node to stay idle because something is timing out?
I'm sort of stumped, since the negotiator and the schedd on the master/submitter never even register 25% on a processor in Performance Monitor, nothing else is running on that machine, and the network card never runs out of bandwidth. However, I'll change the PREEMPTION_REQUIREMENTS setting you mentioned.
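Assuming I've read your suggestion right, that'd be something along these lines in the config on the central manager (or whatever expression you actually had in mind), followed by a condor_reconfig:

    PREEMPTION_REQUIREMENTS = False

Correct me if that's not what you meant.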
-Zack