Re: [HTCondor-users] Make runs fail?
- Date: Fri, 19 Oct 2018 15:52:17 +0000
- From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Make runs fail?
Hi,
Some quick thoughts on the below --
1. Whenever I hear someone say they are doing a parameter sweep from
a probability distribution and want to throw out the jobs that run for a
long time (or use a lot of memory, or whatever), alarm bells go off. Be
aware that by removing these longer-running jobs, you may be introducing
a significant statistical bias into your sample that could render your
scientific results flawed! Proceed with caution....
2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold
1.10", which would free the cores and hopefully still keep
PEST/YAMR/Panther from submitting another job with the same parameters?
Then once all your modeling work is done, you could simply condor_rm
everything.
3. How do you go about telling if the jobs will be worthless to you
after the first hour? I ask because perhaps it is something that can be
easily automated, i.e. you can tell HTCondor to automatically
condor_hold or condor_rm jobs that run for more than an hour...
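
As a rough illustration of point 3, here is a minimal Python sketch that builds
a condor_hold command restricted to jobs that have been in the running state
for more than an hour. It assumes that wall-clock time in the current run is
an acceptable stand-in for "worthless", which may not match your real
criterion. The function names and the one-hour default are hypothetical;
JobStatus, EnteredCurrentStatus, time() and the -constraint option are
standard HTCondor pieces.

```python
import subprocess

RUNNING = 2  # JobStatus value HTCondor uses for a running job


def build_constraint(max_seconds):
    """ClassAd expression matching jobs running longer than max_seconds."""
    return (
        f"JobStatus == {RUNNING} && "
        f"(time() - EnteredCurrentStatus) > {int(max_seconds)}"
    )


def hold_command(max_seconds, tool="condor_hold"):
    """Argument list for condor_hold (or condor_rm) limited by the constraint."""
    return [tool, "-constraint", build_constraint(max_seconds)]


def hold_long_running(max_seconds=3600, dry_run=True):
    """Print the command (dry run) or actually run it against the local schedd."""
    cmd = hold_command(max_seconds)
    if dry_run:
        print(" ".join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling hold_long_running() with the default dry_run=True only prints the
command, so you can check the constraint before anything is held. If you
control the submit file for the worker jobs, the same expression could
probably be used there as a periodic_hold (or periodic_remove) expression
instead of running a command by hand.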
Hope the above thoughts help,
Todd
On 10/19/2018 9:58 AM, Kitlasten, Wesley via HTCondor-users wrote:
> Hello,
>
> I am using the parameter estimation software PEST to run multiple models
> (jobs). PEST uses YAMR and Panther, although I struggle to make sense of
> how everything works together.
>
> The parameters are determined from a probability distribution. Some
> parameter combinations (jobs) can take 12+ hours to run, and from previous
> experience I can tell the results of those runs will be worthless to
> me. I can usually tell which jobs will be useless within the first
> hour. I would like to remove these jobs after about 1 hour to free up
> cores for other runs, but have them returned as failed jobs and not be
> resubmitted to the pool.
>
> For example, if I type "condor_rm 1.10" it will remove that job, but
> the model with those parameters will just be resubmitted to another
> node and start over. However, if the job truly fails, a job with those
> parameters will not be resubmitted.
>
> Is there a way to remove a job and have condor return a failed status,
> rather than have the same parameters run under a different job name?
>
> References:
> https://github.com/dwelter/pestpp
> https://github.com/jtwhite79/pestpp/tree/master/bin/iwin
>
--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing Department of Computer Sciences
HTCondor Technical Lead 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685