My inbox was full of messages from my condor_schedd on the other side
of the world telling me about problems with condor_shadow. The emails
all looked like this:

    Subject: [Condor] Condor job 10725.239 put on hold

    This is an automated email from the Condor system
    on machine "pg-schedd1.altera.com". Do not reply.

    Condor job 10725.239 has been put on hold.

    No condor_shadow installed that supports vanilla jobs
    on resources older than V6.3.3

    Please correct this problem and release the job with
    "condor_release"

My first thought was that maybe the NFS file system where we host
Condor went down. Nope. I got wise to this years ago and now, on my
central servers, I keep Condor on local disk. So there's a copy of
condor_shadow in /opt/condor/sbin. And it says it's 6.8.6
I386-LINUX_RHEL3, just like it should.
Nothing has been changed in /opt/condor. Time stamps are fine.
Very mysterious. The emails arrived around 7:00 am. I didn't see them
until 10:00 am. Looking at the queue on the scheduler now, everything
is either I or R, so it all got released automatically.
Can anyone offer some insight into what might have occurred here?
We've *never* run anything older than 6.7.x at Altera. My guess is
that this message might get sent if a condor_shadow binary can't be
found -- is that possible? Perhaps /opt/condor/sbin/condor_shadow
somehow couldn't be seen by the condor_schedd process running on the
machine?