Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?
- Date: Mon, 11 Jun 2007 20:25:01 -0500
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?
On Mon, Jun 11, 2007 at 06:45:59PM -0400, Ian Chesal wrote:
> Hey Erik, thanks for the quick reply.
>
> > 1. Is your schedd or submit machine very busy? The shadow doesn't do
> > much for this call, just update the logfile and update the job queue
> > for a few attributes. If your submit machine is very busy, I suppose
> > the shadow could be blocked for 5 minutes trying to finish.
>
> I'm flooding the system with 10 minute-long sleep jobs. We've got 200
> VMs, 95% of them Windows and 5% Linux, that are running jobs. The Linux
> machines are communicating just fine with the condor_shadow process on
> sj-schedd1. For the past 2 weeks the Windows machines have consistently
> failed, always going Claimed+Idle, after about 12 hours of running jobs
> fine. They always fail with these assert errors.
>
How do you clear it up, or does it clear up on its own? Do you restart
a daemon, all the daemons, or the whole machine?

It does sort of sound like something specific to Windows and specific to
the execute side, but what does the ShadowLog say during one of these
failures? How about the StartLog and the MasterLog on the execute side?

> sj-schedd1 (our submit machine) is busy. But it's not the busiest pool
> we have at the company. Both in terms of job turnover rate and total
> queued jobs it's about 1/2 as busy as our scheduler in our Toronto pool.
> Same hardware (a 2x2 Sun AMD machine). But different version of RHEL (we
> run RHEL3 in Toronto and RHEL4 in San Jose).
>
> We're going over network topology now but nothing is jumping out. Linux
> and Windows executors share the same switches in the server room and we
> don't see starter <-> shadow communication issues on the Linux starters.
> The machines had been running jobs just fine for several months prior to
> this problem occurring.
>
Anything change on the Windows machines? (Automatic patches?)
-Erik