
[Condor-users] condor_schedd slowness causing job leases to expire

Date: Wed, 19 Mar 2008 13:02:02 -0400
From: "Robert E. Parrott" <parrott@xxxxxxxxxxxxxxxx>
Subject: [Condor-users] condor_schedd slowness causing job leases to expire

Hi Folks,

I'm seeing what appear to be some jobs that are terminated due to slow responsiveness of the condor_schedd.

In particular, one user's parallel job was terminated when another user submitted something like 1500 parallel jobs all at once.

The condor_schedd became unresponsive, and condor_q reported that the condor_schedd didn't respond for a time.


This was on condor v 7.0.1 on the head node, 6.8.5 on the compute nodes.

So I'm looking for the following:

1) Workarounds for the startd on the compute nodes, so that a slow condor_schedd will not cause lease terminations like this (or will do so only after a long timeout period)


2) Fixes for handling larger numbers of parallel jobs.

Any suggestions here (with #1 being highest priority)?

thanks,
rob

==========================
Robert E. Parrott, Ph.D. (Phys. '06)
Associate Director, Grid and
       Supercomputing Platforms
Project Manager, CrimsonGrid Initiative
Harvard University Sch. of Eng. and App. Sci.
Maxwell-Dworkin  211,
33 Oxford St.
Cambridge, MA 02138
(617)-495-5045
