HTCondor Project List Archives




Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?



> How do you clear it up, or does it clear up on its own? Do you restart
> a daemon, all the daemons, or the whole machine?

It won't clear up automatically. I can clear it up by holding everything
in the queue and killing off the condor_shadow processes. Then when I
release the queue I'll get another short window of goodput time before
Claimed+Idle takes over again.
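
For anyone wanting to script that workaround, here is a minimal sketch of the hold / kill / release sequence. It assumes it runs on the submit machine as a user allowed to hold every job and signal the shadows, and that `pkill` is available there. The function names are made up for illustration, and it defaults to a dry run that only prints the commands.

```python
import subprocess


def recovery_commands(shadow_name="condor_shadow"):
    """Return the hold / kill / release command sequence, in order."""
    return [
        ["condor_hold", "-all"],       # put every job in the queue on hold
        ["pkill", "-x", shadow_name],  # kill any shadow processes still running
        ["condor_release", "-all"],    # release the held jobs again
    ]


def run_recovery(dry_run=True):
    """Print the sequence (dry run) or actually run it; return the commands."""
    handled = []
    for cmd in recovery_commands():
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # pkill exits non-zero when nothing matched, so do not treat
            # a non-zero exit status as fatal here.
            subprocess.run(cmd, check=False)
        handled.append(cmd)
    return handled


if __name__ == "__main__":
    run_recovery(dry_run=True)
```

Note that `condor_release -all` also releases jobs that were on hold before the workaround started, so this only suits a queue where nothing else is meant to stay held.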

> It does sort of sound like something specific to Windows and specific
> to the execute side, but what does the ShadowLog say during one of
> these failures? How about the StartLog and the MasterLog on the
> execute side?

Turning on D_ALL may have been a bad choice here. My log files have
rolled over and I only saved the StarterLog I quoted from initially. I
will be flooding this system again in a bit and I'll save all the log
files from the central scheduler, negotiator and a few execute nodes.
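
A rough sketch of one way to keep the logs from being lost to rollover again: copy them, together with any rotated copies, into a timestamped directory right after a failure. The function name and destination layout are hypothetical, and the log directory is whatever `condor_config_val LOG` reports on the machine in question.

```python
import shutil
import time
from pathlib import Path


def snapshot_logs(log_dir, dest_root,
                  names=("ShadowLog", "StartLog", "StarterLog", "MasterLog")):
    """Copy the named logs and their rotated copies into a timestamped directory."""
    dest = Path(dest_root) / time.strftime("%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        # The trailing wildcard also picks up rotated files such as ShadowLog.old.
        for src in sorted(Path(log_dir).glob(name + "*")):
            if src.is_file():
                shutil.copy2(src, dest / src.name)  # copy2 keeps modification times
                copied.append(src.name)
    return dest, copied
```

Calling `snapshot_logs(log_dir, "/some/archive/dir")` on the scheduler and on a few execute nodes would give a matching set of files per incident; the archive path here is only a placeholder.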

We have narrowed it down to communication between the Windows
condor_starter and the Linux condor_shadow processes. The Linux
condor_starter to condor_shadow communication seems fine. In the same
system, with Windows machines going Claimed+Idle, the Linux executors
continue to run their jobs successfully.

I'm in the process of upgrading our central scheduler to RHEL4 U5
because there were some errata published for TCP/IP networking issues in
RHEL4 U1 (the current RHEL4 install we're using on that machine). It
seems kind of weird that it would bite us "all of a sudden".

I've also upgraded my central scheduler and central negotiator/collector
to Condor 6.8.5 and turned debugging down to D_FULLDEBUG, D_SYSCALL,
D_PID.


> Anything change on the Windows machines? (Automatic patches?)

I have IS admins claiming nothing has been patched on these machines.
I'm verifying this myself now.

Thanks for the help! I'll send along more log information with the next
update.

- Ian