Re: [Condor-users] Standard Universe and Job Hooks (condor_starter vs condor_starter.std)
- Date: Wed, 13 Apr 2011 10:56:28 -0500
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Joan J. Piles wrote:
Hi all,
We have a hook that must be called for each job running in our cluster,
an instance of xxxxx_HOOK_JOB_EXIT. In the Vanilla universe (the one
most of our jobs use), there is no problem, and it works almost as
expected (I say almost because the exit reason is shown as "evict" even
when "condor_rm" is used, but that's not an important problem for us).
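For readers unfamiliar with this hook, here is a minimal sketch of what a job-exit hook script might look like. It assumes, as the job hook documentation describes to the best of my knowledge, that the starter passes the exit reason ("exit", "remove", "hold" or "evict") as the first command-line argument and writes the job ClassAd to the hook's standard input. The helper names (parse_classad, summarize) and the log format are hypothetical, not part of Condor.

```python
#!/usr/bin/env python3
"""Illustrative job-exit hook: logs how each job ended (a sketch, not Condor code)."""
import sys


def parse_classad(text):
    """Parse 'Attr = value' lines into a dict of raw (unevaluated) strings."""
    ad = {}
    for line in text.splitlines():
        # Split at the first '=', which is the assignment in 'Attr = value'.
        name, sep, value = line.partition("=")
        if sep:
            ad[name.strip()] = value.strip()
    return ad


def summarize(exit_reason, ad_text):
    """Build a one-line summary from the exit reason and the job ClassAd text."""
    ad = parse_classad(ad_text)
    job_id = "{}.{}".format(ad.get("ClusterId", "?"), ad.get("ProcId", "?"))
    return "job {} finished: reason={}".format(job_id, exit_reason)


def main(argv, stdin):
    # Assumption: argv[1] is the exit reason and stdin carries the job ClassAd.
    print(summarize(argv[1], stdin.read()), file=sys.stderr)
    return 0


# Only act as a hook when an exit reason was actually passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv, sys.stdin))
```

The script would be made executable and named in the xxxxx_HOOK_JOB_EXIT setting, where xxxxx is the hook keyword. Checking the first argument in such a script is also how one would observe the "evict" versus "remove" discrepancy mentioned above.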
We have recently found that this hook is completely ignored for Standard
universe jobs. According to the documentation it should work, and it is
condor_starter's job to run the hooks. However, there seem to be two
condor_starter executables, one for most jobs, and another one
(condor_starter.std) for Standard universe jobs. Furthermore, in the
source code there are two completely different implementations, and the
Standard universe one seems to have no hook capability at all, so I
don't know if this is a bug or a feature ;-)
What are our options for implementing hooks for Standard Universe jobs?
Is this being worked on (in development versions), or should we find a
workaround? We already tried ditching condor_starter.std, but the
default condor_starter doesn't seem to be able to start Standard
Universe jobs.
Thanks in advance,
Joan
Hi Joan -
You are correct, standard universe has its own shadow/starter pair that
does not support a bunch of mechanisms found in the newer shadow/starter
pair that supports other universes like Vanilla, Java, etc. Besides
hooks, other features like ssh_to_job and CCB do not work in standard
universe for this reason.
We are actively looking at moving some functionality from the
standard universe starter/shadow into the newer starter/shadow. (For
some details, see some thinking we did on this a couple of weeks ago at
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1956,67 .)
Question: do you primarily use standard universe for checkpointing, or
do you rely on remote system calls as well? I ask because another
option we are considering is to add support to the vanilla universe to
easily handle standalone checkpointing, where some signal is sent
periodically to create a checkpoint file in the vanilla job's output
sandbox, whether the executable is linked with Condor's standalone
checkpointing library or some other one.
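
To make the idea concrete, here is a minimal sketch of a job that checkpoints itself when signalled. This is only an illustration of the general pattern, not an existing Condor feature or its planned interface: the choice of SIGUSR2, the file name job.ckpt, and all function names are assumptions. The handler only sets a flag, and the main loop writes the checkpoint at a safe point between work steps, so the file never captures a half-updated state.

```python
import json
import os
import signal
import tempfile

CKPT_FILE = "job.ckpt"        # hypothetical file name in the job's sandbox
CKPT_SIGNAL = signal.SIGUSR2  # assumption: the signal used is not specified

_checkpoint_requested = False


def request_checkpoint(signum, frame):
    """Signal handler: only record the request; the main loop does the I/O."""
    global _checkpoint_requested
    _checkpoint_requested = True


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(CKPT_SIGNAL, request_checkpoint)


def save_checkpoint(state, path=CKPT_FILE):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic on POSIX, so readers never see a partial file


def load_checkpoint(path=CKPT_FILE):
    """Resume from an existing checkpoint, or start from scratch."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def run(steps, path=CKPT_FILE):
    """Toy workload: sum 0..steps-1, checkpointing whenever one was requested."""
    global _checkpoint_requested
    state = load_checkpoint(path)
    while state["step"] < steps:
        state["total"] += state["step"]
        state["step"] += 1
        if _checkpoint_requested:
            save_checkpoint(state, path)
            _checkpoint_requested = False
    return state
```

A job built this way would call install_handler() once at startup and then run(...). If it is evicted and restarted with job.ckpt transferred back into its sandbox, load_checkpoint() picks up from the last saved step.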
regards,
Todd