Re: [Condor-users] New user question: Eviction of long jobs
- Date: Fri, 16 Apr 2010 14:53:00 -0500
- From: Alan De Smet <adesmet@xxxxxxxxxxx>
- Subject: Re: [Condor-users] New user question: Eviction of long jobs
Laura Balzano <sunbeam@xxxxxxxxxxxx> wrote:
> Hi all, I have just started using Condor for the first time. I submitted a
> matlab job that is going to take a long while (I estimate 8 hours). I saw
> that it already got evicted four times. I believe that means it had to
> start over on another machine, is that correct?
You understand correctly.
> Is there any way to tell condor to save my job for a machine/a
> time when it can run to completion?
Generally speaking, no. A particular installation might be able
to make commitments, but exactly how you would access those
resources would depend on your local configuration. I'm guessing
you're working on UW-Madison CAE resources, and I would recommend
asking the CAE if they can help. I don't actually know their
specific policies, but a common solution is to have some
dedicated computers that only run Condor jobs and generally won't
evict them. Those computers would be marked somehow, perhaps
with a setting like "IsCluster = TRUE". Then in your submit file
you might put something like "Requirements = IsCluster". But the
specifics will depend on the CAE's configuration, and I don't
know if they offer such a service.
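As a rough sketch only, a submit file using such a marker might look
like the following. The attribute name IsCluster, the wrapper script
run_matlab.sh, and the output file names are all hypothetical; the real
attribute (if any) is whatever your site administrators have defined.

```
# Hypothetical submit description file; adjust names for your site.
universe     = vanilla
# Assumed wrapper script that starts the MATLAB job.
executable   = run_matlab.sh
# Only match machines that advertise IsCluster as TRUE. The =?= operator
# evaluates to FALSE, rather than UNDEFINED, on machines that do not
# define the attribute at all.
requirements = (IsCluster =?= TRUE)
output       = job.out
error        = job.err
log          = job.log
queue
```

You can check whether any machines advertise the attribute with
something like: condor_status -constraint 'IsCluster =?= TRUE'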
I'm guessing the machines in question are more idle at night, so
if you haven't already, you might try leaving it alone overnight.
I would expect (but can't promise) that eventually you'll land on
a machine that's unoccupied and it will finish.
If none of the above helps, we may have additional Condor
computing options for UW-Madison users; contact us at
condor-admin@xxxxxxxxxxx
As a more complex option, you might be able to use DMTCP
(http://dmtcp.sourceforge.net/) to checkpoint your job at regular
intervals, allowing it to resume from the last checkpoint when
interrupted instead of starting over. Setting this up is fiddly,
but possible.
> Also is there any one place where documentation on these
> various things resides?
Our manual is here: http://www.cs.wisc.edu/condor/manual/v7.4/
Unfortunately, what you really need to know about is the CAE's
site-specific configuration. I don't know how the CAE manages or
documents that configuration.
--
Alan De Smet Condor Project Research
adesmet@xxxxxxxxxxx http://www.cs.wisc.edu/condor/