There is a known bug in Condor 6.7.20 and 6.8.0 with non-blocking output.
There are two settings,
NONBLOCKING_COLLECTOR_UPDATE
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT
which, if set to FALSE, will cause the negotiator to crash periodically
because it runs out of file descriptors, but if left at their default of
TRUE will cause the schedd and collector to crash.
The bug is supposed to be fixed in 6.8.1. The Condor team gave me a
pre-release of the schedd, collector, and negotiator, which fixed the
problem at our site. If you have to pick your poison, set both to FALSE,
since Condor can easily recover from a negotiator crash.
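For reference, a rough sketch of what that workaround might look like in
a local config file on the central manager. The two setting names are the
ones above; the standard NAME = VALUE config syntax is assumed, and which
file you put it in and how you get the daemons to pick it up (for example
condor_reconfig, or a restart) depends on your site:

```ini
# Workaround sketch for Condor 6.7.20 / 6.8.0 only (assumed placement:
# the central manager's local config file).
# FALSE trades schedd/collector crashes for periodic negotiator crashes,
# which Condor can recover from.
NONBLOCKING_COLLECTOR_UPDATE = FALSE
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT = FALSE
```

Once you are on a release with the fix, these lines should presumably be
removed so the defaults apply again.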
Steve
------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525 timm@xxxxxxxx
http://home.fnal.gov/~timm/ Fermilab Computing Div/Core Support Services
Dept./Scientific Computing Section Assistant Group Leader, Farms and
Clustered Systems Group
Lead of Computing Farms Team
On Mon, 4 Sep 2006, Dr Ian C. Smith wrote:
Hi,
I recently upgraded to Condor 6.8.0 on our central manager in order to
fix a problem with Condor. See:
https://lists.cs.wisc.edu/archive/condor-users/2006-August/msg00039.shtml
This solved the problem, but instead I started to see exactly the
same "out of file descriptors" error messages as reported in
https://lists.cs.wisc.edu/archive/condor-users/2006-April/msg00191.shtml
The symptoms are the same: after the daily reboot of the Windows
execution hosts, a large number sit idle even though there is a big
queue (20,000) of jobs waiting to run. When I went back to 6.6.9 the
problem disappeared.
I'm wondering whether, as has been suggested, the "out of file
descriptors" message is a red herring: the OS is the same (Solaris 8) and
none of the limits have been changed. At most there are around 100 jobs
running concurrently in the vanilla universe. The default limit (ulimit
-n) is 256 (although I understand that this is per process).
Any ideas about this? Would a diff(1) of the two versions of the code show
up anything? I could move Condor-G to another host to get around the first
problem, but I'm more concerned that the Windows central manager is going
to get stuck with an out-of-date version of Condor.
cheers,
-ian.
-----------------------------------
Dr Ian C. Smith,
e-Science team,
University of Liverpool
Computing Services Department
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR