HTCondor Project List Archives



Re: [Condor-devel] What is the starter trying to do at line 321 in NTsenders.C?



>> I'm in the process of upgrading our central scheduler to RHEL4 U5 because
>> there were some errata published for TCP/IP networking issues in RHEL4 U1
>> (the current RHEL4 install we're using on that machine). Seems kind of
>> weird it'd bite us "all of sudden".

> After the RHEL4 U5 upgrade and the Condor 6.8.5 x86_64 upgrade things are
> sane. I'm not convinced the problem is gone yet. But it ran through jobs
> last night at a rate of ~400 jobs/hour.

> I'm going to change the character of my test jobs to stress the system and
> increase the job throughput rate to see if/where I can get it to fail.

Here's the update. The 6.8.5 x86_64 + RHEL4 U5 updates held up and I
wasn't able to crash the system. Even 2-minute jobs were flying through
without any issues. I pushed the system up to the 1200 jobs/hour mark, which
is one of the highest job throughput rates I've ever taken it to.

I rolled back the 6.8.5 changes to 6.8.0 x86 (not x86_64) and still
couldn't crash the system, so that put the focus on RHEL4 U5. Sure enough,
RHEL4 U5 included an update to the Gigabit NIC driver used by the NIC in
our Sun X4100 machine, which is our central scheduler. I was going to try
rolling back to RHEL4 U1 but time didn't permit it. It seems pretty likely,
though, that it was a NIC driver issue.

Thanks for your help everyone!

- Ian