Hi,
I am writing a small frontend for bioinformatics tasks, which will be
used by people who are largely unaware of the cluster behind it.
Since the cluster (128 CPUs) should work without continuous supervision,
I ran some torture tests with many very small jobs. The result is
zombie jobs which have finished successfully, but are still listed
as running on their nodes, slowly blocking the whole cluster.
Questions:
- Can this be avoided?
- If not: is there a better way to get the system back in sync than
removing all jobs with the forcex option?