My group uses Condor on a local cluster. Our setup is relatively simple and our use of Condor is rather basic. The types of calculations we run cannot be checkpointed in the way that Condor would like, so we have turned off all preemption options.
Here is the event we cannot explain. Occasionally, jobs stop mid-stream and the Condor log file reports:
--------------------------START OF FILE-----------------------------
000 (12841.000.000) 12/13 14:45:37 Job submitted from host: <10.79.133.101:33346>
...
001 (12841.000.000) 12/13 14:45:42 Job executing on host: <10.79.133.112:32812>
...
006 (12841.000.000) 12/13 14:45:50 Image size of job updated: 1729736
...
006 (12841.000.000) 12/13 15:05:50 Image size of job updated: 1731508
...
007 (12841.000.000) 12/15 23:51:00 Shadow exception!
    Can no longer talk to condor_starter on execute machine (10.79.133.112)
    0  -  Run Bytes Sent By Job
    6891  -  Run Bytes Received By Job
...
------------------------END OF FILE-------------------------------------
Can someone explain what message 007 means and what sorts of pathologies this indicates?
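To see how often this happens across jobs, a log could be scanned for event 007 with something like the rough sketch below. It assumes each event header sits on its own line in the form "NNN (cluster.proc.subproc) MM/DD HH:MM:SS message", as in the excerpt above. The function name find_shadow_exceptions is made up for illustration and is not part of Condor.

```python
import re

# Assumed header layout of one user-log event, e.g.
# "007 (12841.000.000) 12/15 23:51:00 Shadow exception!"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$"
)


def find_shadow_exceptions(log_text):
    """Return (cluster, proc, date, time, message) for every 007 event header."""
    hits = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        # Detail lines and "..." separators do not match and are skipped.
        if match and match.group(1) == "007":
            hits.append(
                (
                    int(match.group(2)),  # cluster id
                    int(match.group(3)),  # proc id
                    match.group(5),       # date as MM/DD (the log carries no year)
                    match.group(6),       # time as HH:MM:SS
                    match.group(7),       # rest of the header line
                )
            )
    return hits
```

Calling find_shadow_exceptions on the contents of a job's log file would give one tuple per shadow exception, which makes it easier to check whether the failures cluster on particular execute machines or times.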
Regards, HPH
--
Hrant P. Hratchian, Ph.D.
E. R. Davidson Fellow
Department of Chemistry
Indiana University
Bloomington, Indiana 47405-7102
812.856.0829
hhratchi@xxxxxxxxxxx