We recently had a disk problem on one of the 60 machines in our Condor pool that caused jobs to fail quickly. Because the node kept becoming free again almost immediately, most jobs ended up landing on it, so a large fraction of our submitted jobs failed. Unfortunately, we were not aware of the failing node until we took a long look at the job output logs.

What kind of tools does Condor provide for monitoring things like:

* Average job time to completion per node
* Number of jobs processed per node

Any host-level monitoring information that we can get from Condor would be useful to plug into a higher-level monitoring system like MonALISA, and would allow us to detect such problems as they occur rather than days after the fact.

--Mike