Matt Hope wrote:
> On 8/1/06, Michael Thomas <thomas@xxxxxxxxxxxxxxx> wrote:
>
>>We recently had a disk problem with one of the 60 machines in our condor
>>pool that caused jobs to fail quickly. As a result, most jobs ended up
>>landing on this node, which generated a large number of failed jobs out
>>of the total job submissions. Unfortunately, we were not aware of this
>>failing node until we took a long look at the job output logs.
>>
>>What kind of tools does condor provide for monitoring things like:
>>* Average job time to completion per node
>>* Number of jobs processed per node
>>
>>Any sort of host-level monitoring information that we can get from
>>condor would be useful to plug into a higher-level monitoring system
>>like MonALISA, and allow us to detect such problems as they occur and
>>not days after the fact.
>
>
> We had a similar problem a while back.
>
> Whilst general solutions are all nice, disk failure will almost
> certainly be the biggest cause of 'black holes' on your pool.
> Black holes are machines which accept a job but always fail to run it
> properly - often very fast, thus sending loads of your patiently queued
> jobs into a black hole.

While disk failures may be the biggest cause of black holes, I'm also
interested in looking at a general solution.

> On the Dell machines this event mechanism is very fast (seconds)
> whereas on the HP's it can be as much as 5 mins.

Even a 5 minute delay would be preferable to the situation I have now.
But I see your point in using a system-level error checking script which
can automatically update the condor classad for that machine. I got the
same suggestion on another list.

Additionally, one thing I would really like to see is a way to get these
per-host statistics into a higher-level monitoring infrastructure like
MonALISA. I already monitor the cluster load, network IO, and per-VO
jobs in MonALISA.
If condor provided a way to obtain the number of jobs completed per node
and the average time to completion per node, it would help me detect
both black holes and underperforming nodes at the same time.

--Mike
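
As a rough illustration of the per-node statistics described above, here is a minimal Python sketch. It assumes the job history has already been exported as tab-separated lines of execute host, wall-clock seconds, and exit code. That input format, the names `HostStats`, `summarize`, and `suspected_black_holes`, and the thresholds are all hypothetical, chosen only for illustration; none of this is something condor provides in this form.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HostStats:
    """Running totals for one execute host."""
    jobs: int = 0
    failed: int = 0
    total_seconds: float = 0.0

    @property
    def mean_seconds(self):
        # Average time to completion for jobs that ran on this host.
        return self.total_seconds / self.jobs if self.jobs else 0.0

    @property
    def failure_rate(self):
        # Fraction of jobs on this host that exited non-zero.
        return self.failed / self.jobs if self.jobs else 0.0


def summarize(lines):
    """Aggregate 'host<TAB>seconds<TAB>exit_code' records per host.

    The record layout is an assumption made for this sketch.
    """
    stats = defaultdict(HostStats)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, seconds, exit_code = line.split("\t")
        entry = stats[host]
        entry.jobs += 1
        entry.total_seconds += float(seconds)
        if int(exit_code) != 0:
            entry.failed += 1
    return dict(stats)


def suspected_black_holes(stats, min_jobs=20, max_mean_seconds=60.0,
                          min_failure_rate=0.9):
    """Return hosts that ran many jobs, very quickly, with mostly failures.

    The default thresholds are placeholders and would need tuning per pool.
    """
    return sorted(
        host for host, s in stats.items()
        if s.jobs >= min_jobs
        and s.mean_seconds <= max_mean_seconds
        and s.failure_rate >= min_failure_rate
    )
```

The per-host job count and mean completion time from `summarize` are the two numbers that could be pushed into a higher-level monitor, and a host that completes an unusually large number of jobs in an unusually short time with a high failure rate is the black-hole signature.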