Re: [HTCondor-users] Problems with power outage etc
- Date: Thu, 6 Jun 2013 13:06:28 +0000
- From: "Cotton, Benjamin J" <bcotton@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Problems with power outage etc
Peter,
It sounds like your infrastructure is a little bit fragile. Without
knowing more about it or your jobs, here are my suggestions for
mitigating lost compute time:
* Checkpoint jobs by relinking against the Condor libraries[1] (for
standard universe jobs) or using a third-party wrapper[2] (for vanilla
universe jobs).
* If the power outages are brief, a UPS might work for your small
cluster. If you can't get enough battery capacity to support the entire
cluster, you can put a subset of nodes on UPS and use a custom ClassAd
attribute to indicate which ones have battery. You can have your
higher-priority jobs prefer the UPS'ed hosts.
* If the power situation is unmanageable, you might consider running on
another resource (e.g. Open Science Grid, Amazon EC2).
[1]http://research.cs.wisc.edu/htcondor/manual/v7.8/2_4Road_map_Running.html#SECTION00341100000000000000
[2]http://dmtcp.sourceforge.net/condor.html
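As a rough sketch of the custom ClassAd idea, assuming an attribute
named HAS_UPS (a name picked here for illustration, not a built-in
one), the local configuration on each battery-backed node could
advertise it like this:

```
# Local config on nodes that are on the UPS (HAS_UPS is a made-up name)
HAS_UPS = True
STARTD_ATTRS = $(STARTD_ATTRS) HAS_UPS
```

Higher-priority jobs could then prefer those machines with a rank
expression in the submit description file. This states a preference
only, so the jobs can still run on other nodes when no UPS'ed host is
free:

```
# Submit file: prefer machines that advertise HAS_UPS = True
rank = (HAS_UPS =?= True)
```

After a condor_reconfig on the affected nodes, something like
condor_status -constraint 'HAS_UPS =?= True' should list the machines
that advertise the attribute.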
Hope this helps!
BC
--
Ben Cotton
Purdue University