Re: [Condor-users] Running long jobs
- Date: Mon, 5 Dec 2005 18:20:34 -0800
- From: "Finch, Ralph" <rfinch@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] Running long jobs
Do you have the condor_config and condor_config.local files you could
post or email?
The log files will show why a job was preempted, either MasterLog or
StartLog, I forget which. You'll probably have to ask your condor admin
for them.
Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA 95814
916-653-7552
rfinch@xxxxxxxxxxxx
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Daniel
> R Figueiredo
> Sent: Monday, December 05, 2005 2:51 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Running long jobs
>
> Hi Erik and Ralph,
>
> Thanks for your respective messages. I now understand better the idea of
> using two VMs per processor and how this could indeed lead to a solution.
> However, I still don't understand why a simpler solution, such as the one
> suggested by Ralph, would not work. To be clear, I don't know why Condor
> decides to evict the long jobs (those running for, say, around 15 hours).
> It could be keyboard activity, as suggested. However, it could also be due
> to user priorities (this is probably more likely). Recall that this job is
> running in a heavily loaded Condor cluster (several users, dispatch queue
> with large backlog), which could make the long job receive low priority
> (over time) compared to newly submitted jobs by users with few jobs. Can
> this case also be handled with an approach similar to the one suggested by
> Ralph? If not, is this why we need the VM approach?
>
> Sorry for the long exchange of messages in resolving this issue, but I
> would like to understand what is going on here.
>
> Thanks,
> Daniel
>
>
>
> On Sun, 4 Dec 2005, Finch, Ralph wrote:
>
> > I don't think Daniel needs two VMs; he simply wants his one job to
> > suspend for some reason, then resume when the "reason" no longer
> > applies.
> >
> > Looking at his original post, Daniel said:
> >
> > "The problem is that after the job has been running for some hours (say
> > 10 hours) Condor decides to evict the job from the machine."
> >
> > Why it gets evicted is not said, so we don't know the criteria for
> > suspending a job. I'll assume keyboard activity. Then "the minimal set
> > of configuration fields that must be changed in order to achieve
> > [suspension instead of eviction]" is:
> >
> > WANT_SUSPEND = TRUE
> > PREEMPT = FALSE
> > PREEMPTION_REQUIREMENTS = FALSE
> > KILL = FALSE
> >
> > ContinueIdleTime = 5 * $(MINUTE)
> > SUSPEND = $(KeyboardBusy)
> > CONTINUE = (KeyboardIdle > $(ContinueIdleTime))
> >
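[Editor's sketch] The suspend/continue settings quoted above can be read as a
small two-state machine: a running job is suspended when SUSPEND becomes true,
and a suspended job resumes when CONTINUE becomes true. Because PREEMPT and
KILL are FALSE, there is no transition to eviction. The Python below is only a
toy model of that reading, not Condor code. It assumes that $(KeyboardBusy)
means "keyboard idle for less than one minute"; the names Policy, step and
simulate are hypothetical.

```python
from dataclasses import dataclass

MINUTE = 60  # seconds


@dataclass(frozen=True)
class Policy:
    # Assumption: $(KeyboardBusy) is true when KeyboardIdle < 1 minute.
    keyboard_busy_threshold: int = MINUTE
    # Mirrors ContinueIdleTime = 5 * $(MINUTE) from the quoted config.
    continue_idle_time: int = 5 * MINUTE

    def suspend(self, keyboard_idle: int) -> bool:
        """Models SUSPEND = $(KeyboardBusy)."""
        return keyboard_idle < self.keyboard_busy_threshold

    def resume(self, keyboard_idle: int) -> bool:
        """Models CONTINUE = (KeyboardIdle > $(ContinueIdleTime))."""
        return keyboard_idle > self.continue_idle_time


def step(state: str, keyboard_idle: int, policy: Policy) -> str:
    """Advance the job state by one policy evaluation."""
    if state == "Running" and policy.suspend(keyboard_idle):
        return "Suspended"
    if state == "Suspended" and policy.resume(keyboard_idle):
        return "Running"
    # PREEMPT = FALSE and KILL = FALSE: no transition ever evicts the job.
    return state


def simulate(idle_samples, policy=Policy()):
    """Return the job state after each sampled KeyboardIdle value (seconds)."""
    state = "Running"
    history = []
    for idle in idle_samples:
        state = step(state, idle, policy)
        history.append(state)
    return history
```

For instance, simulate([600, 10, 120, 301, 5]) walks through a keyboard that is
idle, touched, idle for two minutes, idle for just over five minutes, and then
touched again. The gap between the two thresholds keeps the job from flapping
between states while someone is working intermittently at the machine.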
> > Ralph Finch, P.E.
> > Dept. of Water Resources
> > Bay-Delta Office, Room 215-13
> > Sacramento, CA 95814
> > 916-653-7552
> > rfinch@xxxxxxxxxxxx
> >
> >
> >> -----Original Message-----
> >> From: condor-users-bounces@xxxxxxxxxxx
> >> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
> >> Sent: Saturday, December 03, 2005 11:39 AM
> >> To: Condor-Users Mail List
> >> Subject: Re: [Condor-users] Running long jobs
> >>
> >> On Sat, Dec 03, 2005 at 07:01:43PM +0100, Daniel R Figueiredo wrote:
> >>>
> >>> On Wed, 30 Nov 2005, Erik Paulson wrote:
> >>>
> >>> Thanks for your message. It's now clear that I'll need support from the
> >>> Condor administrator. However, I looked through the report "Condor and The
> >>> Bologna Batch System" as you suggested, but it was not clear how to
> >>> configure Condor to run long jobs with preemption implemented via
> >>> suspension (as opposed to preemption via termination). In particular, I
> >>> would like to know: what is the minimal set of configuration fields that
> >>> must be changed in order to achieve this? Recall that I would like for
> >>> long jobs to be preempted via suspension (as opposed to terminated through
> >>> a signal) and later resume from where they stopped (as opposed to
> >>> restarting from the beginning). Any ideas on how to do this? I could then
> >>> suggest something concrete to our local Condor administrator.
> >>>
> >>
> >> You need to create 2 VMs. There is no way to have one VM suspend a job,
> >> start another one, and resume the first one later - if a job has state
> >> on a machine, it must have a VM watching over it, and a VM can only
> >> watch over one job at a time.
> >>
> >> You can emulate your desired behaviour with 2 VMs - the second VM can be
> >> configured to suspend the job whenever it sees the state of the first VM
> >> switch to "Claimed". The BBS document should give you all of the details
> >> you need.
> >>
> >> -Erik
> >> _______________________________________________
> >> Condor-users mailing list
> >> Condor-users@xxxxxxxxxxx
> >> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>