HTCondor Project List Archives



Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?



Hey Eric, thanks for the quick reply.

> 1. Is your schedd or submit machine very busy? The shadow doesn't do much
> for this call, just update the logfile and update the job queue for a few
> attributes. If your submit machine is very busy, I suppose the shadow
> could be blocked for 5 minutes trying to finish.

I'm flooding the system with 10-minute sleep jobs. We've got 200
VMs, 95% of them Windows and 5% Linux, that are running jobs. The Linux
machines are communicating just fine with the condor_shadow process on
sj-schedd1. For the past 2 weeks the Windows machines have consistently
failed, always going Claimed+Idle, after about 12 hours of running jobs
fine. They always fail with these assert errors.

sj-schedd1 (our submit machine) is busy, but it's not the busiest
scheduler we have at the company. In terms of both job turnover rate and
total queued jobs, it's about half as busy as the scheduler in our
Toronto pool. Both run on the same hardware (a 2x2 Sun AMD machine) but
on different versions of RHEL (RHEL3 in Toronto, RHEL4 in San Jose).

We're going over network topology now but nothing is jumping out. Linux
and Windows executors share the same switches in the server room and we
don't see starter <-> shadow communication issues on the Linux starters.
The machines had been running jobs just fine for several months prior to
this problem occurring.

> 2. Are you running a mixed Condor pool? I don't see anything that's
> changed between 6.8.0 and more recent versions that could cause the 
> protocol to have changed, but I suppose it's possible something snuck
> in without us realizing it.

I was just about to start diff'ing the 6.8.0 vs 6.8.5 starter code
actually. The pool is 6.8.0 for the central scheduler and central
collector, all on RHEL4 x86_64. The Windows clients are all running
6.8.0 on Windows XP (a mix of 32-bit and 64-bit). And the Linux clients
are actually running 6.7.12 for RHEL3.
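
For reference, the version mix described above can be tabulated with a
short script. This is only a sketch: the inventory is typed in by hand
from the numbers in this message, nothing here queries a real pool, and
the names INVENTORY, version_counts and is_mixed are made up for
illustration. The 190/10 split assumes the 95%/5% figures for 200 VMs
are exact, and one machine each for the scheduler and collector is also
an assumption.

```python
from collections import Counter

# Hand-entered inventory: (role, platform, condor_version, machine_count).
# Counts are assumptions derived from "200 VMs, 95% Windows and 5% Linux";
# one machine each is assumed for the scheduler and the collector.
INVENTORY = [
    ("schedd", "RHEL4 x86_64", "6.8.0", 1),
    ("collector", "RHEL4 x86_64", "6.8.0", 1),
    ("execute", "Windows XP", "6.8.0", 190),
    ("execute", "RHEL3", "6.7.12", 10),
]


def version_counts(inventory):
    """Return a Counter mapping Condor version -> number of machines."""
    counts = Counter()
    for _role, _platform, version, machine_count in inventory:
        counts[version] += machine_count
    return counts


def is_mixed(inventory):
    """A pool is 'mixed' here if more than one Condor version appears."""
    return len(version_counts(inventory)) > 1


if __name__ == "__main__":
    for version, n in sorted(version_counts(INVENTORY).items()):
        print(f"{version}: {n} machine(s)")
    print("mixed pool:", is_mixed(INVENTORY))
```

By this tally the pool is mixed, but only because of the Linux execute
nodes on 6.7.12; the Windows machines that are failing run the same
6.8.0 version as the scheduler.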

- Ian

-Erik