Hi:
I've been managing a small Condor pool (~200 cores) for a little while now, and I've run into a small but irritating issue. I'm wondering whether others have experienced the same thing, or have solutions or ideas.
I have configured the pool to do various helpful things, like accept GPU jobs, provide dynamic slots on some of the more capable machines, etc. What I have found, though, is that once I've set the configuration, I rarely revisit it. This means that if something stops working, I won't know until someone complains. In principle that could reduce my workload, since if no one complains, nothing needs fixing; in practice I do get complaints, and they generally arrive in my inbox right before strict deadlines (not that anyone ever leaves things to the last minute :P).
Does anyone have a relatively simple system to continuously test their pool's services? Ideally, I'd like the test jobs to run with very low priority, so as not to interfere with regular workloads, but I would like them to run at least once a day (or as often as practically possible) and to keep track of the results (this could just be a log file or a summary email). Then, if any job fails, I'd like to be emailed about it.
I can think of a few approaches myself, but I thought I'd ask if anyone has already got something similar up and running.
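To make the question a bit more concrete, here is a rough Python sketch of the kind of thing I have in mind. It is only an illustration under a few assumptions, not something I have running: the helper names (build_submit_description, find_failures, build_alert) are made up for this example, I am assuming the nice_user submit command is an acceptable way to get "very low priority", and how the exit codes actually get collected from the finished jobs is left open.

```python
from email.message import EmailMessage


def build_submit_description(name, executable, request_gpus=0):
    """Return submit-description text for one low-priority test job.

    Hypothetical helper: the file names are simply derived from `name`.
    """
    lines = [
        f"executable = {executable}",
        f"output = {name}.out",
        f"error = {name}.err",
        f"log = {name}.log",
        # Assumption: nice_user is how the test jobs are kept out of the
        # way of regular workloads.
        "nice_user = True",
    ]
    if request_gpus > 0:
        # Only the GPU test asks for a GPU.
        lines.append(f"request_gpus = {request_gpus}")
    lines.append("queue")
    return "\n".join(lines) + "\n"


def find_failures(exit_codes):
    """Return the sorted names of tests that failed or never finished.

    `exit_codes` maps test name -> exit code, with None meaning the job
    did not complete in time (for example, it never matched a slot).
    """
    return sorted(name for name, code in exit_codes.items() if code != 0)


def build_alert(failures, sender, recipient):
    """Build an alert email for the failed tests, or None if all passed."""
    if not failures:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Condor pool self-test: {len(failures)} failure(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "Failed or unfinished test jobs:\n" + "\n".join(failures) + "\n"
    )
    return msg
```

The idea would be for a daily cron job to write each description to a file, hand it to condor_submit, and later pass whatever exit codes it gathered to find_failures and build_alert, sending the resulting message (if any) through the local mail setup.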