This is what I found on the schedd, which to me looks like a smoking gun:
Jun 19 06:54:21 msched condor_schedd[1005]: ERROR: Child pid 1211506 appears hung! Killing it hard.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 successfully killed because the Shadow was hung.
Jun 19 06:54:21 msched condor_schedd[1005]: Shadow pid 1211506 for job 360.0 exited with status 4
Jun 19 08:57:23 msched condor_schedd[1005]: ERROR: Child pid 1216722 appears hung! Killing it hard.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 successfully killed because the Shadow was hung.
Jun 19 08:57:23 msched condor_schedd[1005]: Shadow pid 1216722 for job 360.0 exited with status 4
Jun 19 11:09:23 msched condor_schedd[1005]: ERROR: Child pid 1221143 appears hung! Killing it hard.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 successfully killed because the Shadow was hung.
Jun 19 11:09:23 msched condor_schedd[1005]: Shadow pid 1221143 for job 360.0 exited with status 4
Is this a local problem on the schedd machine running the shadow daemon?

At the same time I get this on the execution hosts:
Jun 19 06:54:21 pssproto04 condor_starter[24368]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 08:57:23 pssproto04 condor_starter[7887]: Connection to shadow may be lost, will test by sending whoami request.
Jun 19 11:09:23 pssproto04 condor_starter[24220]: Connection to shadow may be lost, will test by sending whoami request.
[...]
The time intervals look too regular to me for this to be network- or resource-related (see the quick check below). Any ideas?
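For reference, here is a quick Python sketch that computes the gaps between the three kill events, using the timestamps copied from the schedd log above:

from datetime import datetime

# Timestamps of the "appears hung" kill events, copied from the schedd log.
events = ["Jun 19 06:54:21", "Jun 19 08:57:23", "Jun 19 11:09:23"]

# Syslog lines carry no year; for computing differences that does not matter.
times = [datetime.strptime(t, "%b %d %H:%M:%S") for t in events]

for earlier, later in zip(times, times[1:]):
    print(f"{earlier.time()} -> {later.time()}: gap {later - earlier}")

Both gaps come out at a bit over two hours (2:03:02 and 2:12:00), which to me looks more like some timeout or retry cycle than random network trouble.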