Re: [Condor-users] Standard Universe and Job Hooks (condor_starter vs condor_starter.std)
- Date: Wed, 13 Apr 2011 18:23:37 +0200
- From: "Joan J. Piles" <jpiles@xxxxxxxxx>
- Subject: Re: [Condor-users] Standard Universe and Job Hooks (condor_starter vs condor_starter.std)
Hi Todd,
Our primary focus is checkpointing. We have a shared filesystem, so I think
remote I/O is not needed at all (I think checkpointing takes care of
re-opening file descriptors). A checkpointing system for Vanilla
universe jobs would be the perfect solution for us (in fact, we
ran some tests with dmtcp and some bash wrappers, but it wasn't ready
yet for our needs).
Furthermore, as our job sizes can be fairly large (~20 GB isn't that
unusual here), we want to avoid periodic checkpointing and checkpoint
only on eviction (in case this makes things easier).
If there is any other way to run a command whenever a job finishes (with
access to the classad of the job just run), it would fit our needs as
well (perhaps some magic with USER_JOB_WRAPPER and bash scripting? I
don't know if a wrapper makes sense for a Standard universe job).
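
A minimal sketch of the wrapper idea, in Python rather than bash. It assumes the wrapper is invoked with the job's command line as its arguments (as USER_JOB_WRAPPER does for Vanilla jobs) and that the `_CONDOR_JOB_AD` environment variable, where the installed Condor version sets it, holds the path to a file containing the job ClassAd. `run_wrapped` and `log_exit` are hypothetical names used only for illustration, and whether a wrapper is honored for Standard universe jobs remains an open question.

```python
import os
import signal
import subprocess
import sys


def log_exit(status, job_ad_path):
    """Example post-job action: report the exit status and the job ad location."""
    sys.stderr.write(
        "job exited with status {}; job ad file: {}\n".format(status, job_ad_path)
    )


def run_wrapped(job_argv, on_exit):
    """Run the job as a child process, then call on_exit(status, job_ad_path)."""
    proc = subprocess.Popen(job_argv)

    def forward(signum, _frame):
        # Pass SIGTERM on to the job so it still sees the shutdown request.
        if proc.poll() is None:
            proc.send_signal(signum)

    previous = signal.signal(signal.SIGTERM, forward)
    try:
        status = proc.wait()
    finally:
        # Restore whatever handler was installed before the wrapper ran.
        signal.signal(
            signal.SIGTERM, previous if previous is not None else signal.SIG_DFL
        )

    # Assumption: _CONDOR_JOB_AD is set by the starter; it is None otherwise.
    on_exit(status, os.environ.get("_CONDOR_JOB_AD"))
    return status
```

A wrapper script built on this would call `run_wrapped(sys.argv[1:], log_exit)` and exit with the returned status. A negative status means the job was killed by a signal, which the script would have to map to an exit code itself.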
Regards,
Joan
On 13/04/11 17:56, Todd Tannenbaum wrote:
Joan J. Piles wrote:
Hi all,
We have a hook that must be called for each job running in our
cluster, an instance of xxxxx_HOOK_JOB_EXIT. In the Vanilla universe
(the one most of our jobs use), there is no problem, and it works
almost as expected (I say almost because the exit reason is shown as
"evict" even when "condor_rm" is used, but that's not an important
problem for us).
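
For reference, a minimal sketch of what such an exit hook can look like. It assumes the hook receives the exit reason (for example "exit", "remove", "hold", or "evict") as its first command-line argument and the job ClassAd on standard input in the old "Attr = value" text form. `parse_job_ad` and `describe_exit` are hypothetical helper names, not part of Condor.

```python
def parse_job_ad(text):
    """Parse 'Attr = value' lines into a dict of raw strings.

    This is a deliberately simple parser: it only strips the quotes
    around plain string values and does not evaluate expressions.
    """
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        ad[name.strip()] = value
    return ad


def describe_exit(reason, ad_text):
    """Build a one-line summary from the exit reason and the job ad text."""
    ad = parse_job_ad(ad_text)
    return "job {}.{} owned by {} ended: {}".format(
        ad.get("ClusterId", "?"),
        ad.get("ProcId", "?"),
        ad.get("Owner", "?"),
        reason,
    )
```

A hook script built on this would call `describe_exit(sys.argv[1], sys.stdin.read())` and write the result to its own log.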
We have recently found that this hook is completely ignored for
Standard universe jobs. According to the documentation it should
work, and it is condor_starter's job to run the hooks. However, there
seem to be two condor_starter executables, one for most jobs, and
another one (condor_starter.std) for Standard universe jobs.
Furthermore, in the source code there are two completely different
implementations, and the Standard universe one seems to have no hook
capability at all, so I don't know if this is a bug or a feature ;-)
What are our options for implementing hooks for Standard Universe
jobs? Is this being worked on (in development versions), or should
we find a workaround? We already tried ditching
condor_starter.std, but the default condor_starter doesn't seem to be
able to start Standard Universe jobs.
Thanks in advance,
Joan
Hi Joan -
You are correct, standard universe has its own shadow/starter pair
that does not support a bunch of mechanisms found in the newer
shadow/starter pair that supports other universes like Vanilla, Java,
etc. Besides hooks, other features like ssh_to_job and CCB do not
work in standard universe for this reason.
We are currently actively looking at moving some functionality from
the standard universe starter/shadow into the newer starter/shadow.
(For some details, see some thinking we did on this a couple of weeks ago
at https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1956,67 )
Question: do you primarily use standard universe for checkpointing, or
do you rely on remote system calls as well? I ask because another
option we are considering is to add support to the vanilla universe to
easily handle standalone checkpointing where some signal is sent
periodically to create a ckpt file in the vanilla job's output
sandbox, whether the executable is linked w/ Condor's standalone
checkpointing library or some other one.
regards,
Todd
--
--------------------------------------------------------------------------
Joan Josep Piles Contreras - Systems Analyst
I3A - Instituto de Investigación en Ingeniería de Aragón
Tel: 976 76 10 00 (ext. 5454)
http://i3a.unizar.es -- jpiles@xxxxxxxxx
--------------------------------------------------------------------------