[Condor-users] condor_schedd slowness causing job leases to expire
- Date: Wed, 19 Mar 2008 13:02:02 -0400
- From: "Robert E. Parrott" <parrott@xxxxxxxxxxxxxxxx>
- Subject: [Condor-users] condor_schedd slowness causing job leases to expire
Hi Folks,
I'm seeing what appear to be some jobs that are ended due to slow
responsiveness of the condor_schedd.
In particular, one user's parallel job was terminated when another
user submitted something like 1500 parallel jobs all at once.
The condor_schedd became unresponsive, and condor_q reported that the
condor_schedd didn't respond for a time.
This was with Condor v7.0.1 on the head node and v6.8.5 on the compute nodes.
So I'm looking for the following:
1) Workarounds for the startd on the compute nodes, so that a slow
condor_schedd will not cause lease terminations like this (or will
cause them only after a much longer timeout period)
2) Fixes for handling larger numbers of parallel jobs.
Any suggestions here (with #1 being highest priority)?
thanks,
rob
==========================
Robert E. Parrott, Ph.D. (Phys. '06)
Associate Director, Grid and
Supercomputing Platforms
Project Manager, CrimsonGrid Initiative
Harvard University Sch. of Eng. and App. Sci.
Maxwell-Dworkin 211,
33 Oxford St.
Cambridge, MA 02138
(617)-495-5045