Hello,

we are running Condor 7.4.1 and have observed that, on nodes running a startd, the condor_master process exits with exit code 0 and restarts from time to time. This happens on arbitrary nodes at arbitrary times. We have not yet been able to correlate it with a particular kind of job.

We increased the log verbosity on some nodes and collected the logs. I took the logs from around the time of one such event and put the CKPTLog, MasterLog and StartLog of the startd node, together with the CollectorLog of the submit host, into a tarball:

http://atlas1.atlas.aei.uni-hannover.de/~fehrmann/condor_log.tgz

Unfortunately, we were too slow: log rotation had already erased the corresponding events from the StarterLogs.

If you need the configuration or more logging, please tell us.

Thank you and cheers,
Henning