Re: [Condor-users] Black hole node

Date: Mon, 23 Jan 2006 15:49:48 -0800
From: Terrence Martin <tmartin@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Black hole node

Alain Roy wrote:

   * There are lots of black holes: machines that cause segfaults (how
     do you distinguish from a user job that just segfaults?),
     machines that cause jobs to run slowly (how do you distinguish
     from slow jobs?), and machines that cause jobs to exit quickly.

I agree that it's nice to have such a black hole system, but it's definitely a challenge.

I am wondering if collecting information on my cluster might be a good place to start, to see if there is a pattern that black holes exhibit that may be different from, say, a standard failing job. For example, a black hole would be user independent: a single user's jobs all disappearing in, say, 120s or less would indicate a problem specific to that user, whereas a node that gobbles up jobs irrespective of user would flag much more strongly as a black hole. If there is a distinctive pattern, then it might be easier to devise a countermeasure.
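
As a rough illustration of that idea, here is a minimal Python sketch. It assumes job records have already been collected as (node, user, runtime in seconds) tuples; the record format, the function name classify_nodes, and the thresholds other than the 120s figure are all hypothetical choices for the example, not anything Condor provides.

```python
from collections import defaultdict

SHORT_RUNTIME_S = 120     # "120s or less", as suggested above
MIN_SHORT_JOBS = 5        # assumed: ignore nodes with too few short jobs
MIN_SHORT_FRACTION = 0.8  # assumed: most jobs on the node must be short
MIN_USERS = 2             # assumed: "user independent" means 2+ users


def classify_nodes(jobs):
    """Label nodes that mostly run short-lived jobs.

    jobs: iterable of (node, user, runtime_seconds) tuples
    (a hypothetical record format for this sketch).
    Returns a dict mapping node name to a label; nodes that
    do not look suspicious are left out.
    """
    total = defaultdict(int)        # all jobs seen per node
    short = defaultdict(int)        # short-lived jobs per node
    short_users = defaultdict(set)  # users owning those short jobs

    for node, user, runtime in jobs:
        total[node] += 1
        if runtime <= SHORT_RUNTIME_S:
            short[node] += 1
            short_users[node].add(user)

    labels = {}
    for node, n_short in short.items():
        if n_short < MIN_SHORT_JOBS:
            continue
        if n_short / total[node] < MIN_SHORT_FRACTION:
            continue
        if len(short_users[node]) >= MIN_USERS:
            # Short jobs irrespective of user: flags the node.
            labels[node] = "possible black hole"
        else:
            # Only one user's jobs vanish quickly: flags the user.
            labels[node] = "possible user problem"
    return labels
```

A node is only labelled once enough short jobs have been seen, so a single quick exit does not flag anything. Whether these thresholds separate real black holes from ordinary failures is exactly what the collected data would have to show.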


Terrence

-alain




