Hi,
A user's application keeps exiting with the following message in the
SchedLog on the submitting machine:
2/2 16:56:10 Shadow pid 12591 for job 2605.0 exited with status 4
2/2 16:56:10 ERROR: Shadow exited with job exception code!
The job is then immediately rerun, which leads to a perpetual
cycle. The StarterLog on the execute machine shows nothing
unusual, but the StartLog reports:
2/2 16:56:10 Starter pid 19086 died on signal 11 (signal 11).
That's a segmentation violation (SIGSEGV). My question is: is this
Condor's way of telling me that the user's application is segfaulting,
or that the Starter daemon itself is? We see this behaviour on a number
of Linux boxes, all running dynamically linked versions of Condor 6.6.8
(we have seen it with 6.6.7 too), for glibc 2.2 and 2.3.
Help please, chaps.
Cheers,
Mark