Alain Roy wrote:
* There are lots of black holes: machines that cause segfaults (how do you distinguish from a user job that just segfaults?), machines that cause jobs to run slowly (how do you distinguish from slow jobs?), and machines that cause jobs to exit quickly.
I agree that it would be nice to have a black hole detection system, but it's definitely a challenge. I am wondering whether collecting information from my cluster might be a good place to start, to see if black holes exhibit a pattern that differs from, say, an ordinary failing job. For example, a black hole would be user independent: a single user's jobs all disappearing in, say, 120s or less would point to a problem specific to that user, whereas a node that gobbles up jobs irrespective of user would flag much more strongly as a black hole. If there is a distinctive pattern, it might be easier to devise a countermeasure.
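To make the user-independence idea concrete, here is a rough Python sketch of the kind of check I have in mind. It is only an illustration under assumptions: the (user, machine, runtime) record format, the function name classify, and the MIN_SHORT_JOBS and MIN_DISTINCT cutoffs are all made up for the example, and nothing here is a Condor API. Only the 120s threshold comes from the example above.

```python
from collections import defaultdict

SHORT_RUN_SECONDS = 120  # "120s or less", from the example above
MIN_SHORT_JOBS = 3       # assumed cutoff: ignore anything with fewer short jobs
MIN_DISTINCT = 2         # assumed cutoff: distinct users (or machines) required


def classify(records):
    """Split short-lived jobs into machine-side and user-side suspects.

    records: iterable of (user, machine, runtime_seconds) tuples
    (a hypothetical format, not something Condor emits as-is).
    Returns (suspect_machines, suspect_users) as sorted lists.
    """
    users_by_machine = defaultdict(list)
    machines_by_user = defaultdict(list)
    for user, machine, runtime in records:
        if runtime <= SHORT_RUN_SECONDS:
            users_by_machine[machine].append(user)
            machines_by_user[user].append(machine)

    # Possible black hole: short jobs on one machine from several users.
    suspect_machines = sorted(
        m for m, users in users_by_machine.items()
        if len(users) >= MIN_SHORT_JOBS and len(set(users)) >= MIN_DISTINCT
    )

    # Possible user problem: one user's short jobs spread over several
    # machines that are not all already suspected of being black holes.
    suspect_users = sorted(
        u for u, machines in machines_by_user.items()
        if len(machines) >= MIN_SHORT_JOBS
        and len(set(machines)) >= MIN_DISTINCT
        and not set(machines) <= set(suspect_machines)
    )
    return suspect_machines, suspect_users


if __name__ == "__main__":
    sample = [
        ("alice", "node07", 30), ("bob", "node07", 45), ("carol", "node07", 20),
        ("dave", "node01", 15), ("dave", "node02", 60), ("dave", "node03", 90),
        ("alice", "node01", 3600), ("bob", "node02", 5400),
    ]
    print(classify(sample))  # (['node07'], ['dave'])
```

In the sample, node07 eats short jobs from three different users and gets flagged as a machine, while dave's short jobs fail across three otherwise healthy nodes and get flagged as a user problem. Real data would of course be noisier, and the cutoffs would need tuning against what the cluster actually shows.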
Terrence