Hi Erik and Ralph,

Thanks for your respective messages. I now understand better the idea of using two VMs per processor and how this could indeed lead to a solution. However, I still don't understand why a simpler solution, such as the one suggested by Ralph, would not work.

To be clear, I don't know why Condor decides to evict the long jobs (say, around 15 hours). It could be keyboard activity, as suggested. However, it could also be due to user priorities (this is probably more likely). Recall that this job is running in a heavily loaded Condor cluster (several users, dispatch queue with a large backlog), which could cause the long job to receive lower priority over time compared to newly submitted jobs from users with few jobs. Can this case also be handled with an approach similar to the one Ralph suggested? If not, is this why we need the VM approach?

Sorry for the long exchange of messages in resolving this issue, but I would like to understand what is going on here.

Thanks,
Daniel

On Sun, 4 Dec 2005, Finch, Ralph wrote:

I don't think Daniel needs two VMs; he simply wants his one job to suspend for some reason, then resume when the "reason" no longer applies. Looking at his original post, Daniel said:

"The problem is that after the job has been running for some hours (say 10 hours) Condor decides to evict the job from the machine."

Why it gets evicted is not said, so we don't know the criteria for suspending a job. I'll assume keyboard activity. Then "the minimal set of configuration fields that must be changed in order to achieve [suspension instead of eviction]" is:

WANT_SUSPEND = TRUE
PREEMPT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
KILL = FALSE
ContinueIdleTime = 5 * $(MINUTE)
SUSPEND = $(KeyboardBusy)
CONTINUE = (KeyboardIdle > $(ContinueIdleTime))

Ralph Finch, P.E.
Dept. of Water Resources
Bay-Delta Office, Room 215-13
Sacramento, CA 95814
916-653-7552
rfinch@xxxxxxxxxxxx

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Erik Paulson
Sent: Saturday, December 03, 2005 11:39 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Running long jobs

On Sat, Dec 03, 2005 at 07:01:43PM +0100, Daniel R Figueiredo wrote:
> On Wed, 30 Nov 2005, Erik Paulson wrote:
>
> Thanks for your message. It's now clear that I'll need support from the Condor administrator. However, I looked through the report "Condor and The Bologna Batch System" as you suggested, but it was not clear how to configure Condor to run long jobs with preemption implemented via suspension (as opposed to preemption via termination). In particular, I would like to know what is the minimal set of configuration fields that must be changed in order to achieve this? Recall that I would like for long jobs to be preempted via suspension (as opposed to terminated through a signal) and later resume from where they stopped (as opposed to restarting from the beginning). Any ideas on how to do this?
> I could then suggest something concrete to our local Condor administrator.

You need to create 2 VMs. There is no way to have one VM suspend a job, start another one, and resume the first one later - if a job has state on a machine, it must have a VM watching over it, and a VM can only watch over one job at a time. You can emulate your desired behaviour with 2 VMs - the second VM can be configured to suspend the job whenever it sees the state of the first VM switch to "Claimed". The BBS document should give you all of the details you need.

-Erik
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
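
For readers following the thread, the suspend/continue policy in Ralph's configuration can be read as a small two-state machine. The Python below is only a toy model of that reading, not Condor code or any Condor API: the function names, the state strings, and the one-minute threshold standing in for $(KeyboardBusy) are all assumptions made for illustration. It covers keyboard activity only and says nothing about eviction caused by user priorities.

```python
MINUTE = 60  # seconds

# Mirrors "ContinueIdleTime = 5 * $(MINUTE)" from the configuration above.
CONTINUE_IDLE_TIME = 5 * MINUTE

# Assumption: the keyboard counts as busy if touched within the last minute.
# This is a hypothetical stand-in for the $(KeyboardBusy) macro.
KEYBOARD_BUSY_THRESHOLD = 1 * MINUTE


def next_state(state, keyboard_idle):
    """Return the job's next state ("running" or "suspended").

    keyboard_idle is the number of seconds since the last keyboard activity.
    """
    if state == "running" and keyboard_idle < KEYBOARD_BUSY_THRESHOLD:
        # SUSPEND = $(KeyboardBusy)
        return "suspended"
    if state == "suspended" and keyboard_idle > CONTINUE_IDLE_TIME:
        # CONTINUE = (KeyboardIdle > $(ContinueIdleTime))
        return "running"
    # PREEMPT = FALSE and KILL = FALSE: the job is never evicted in this
    # model, so it keeps whatever state it already has.
    return state


def simulate(idle_samples):
    """Feed a sequence of keyboard-idle readings through the state machine."""
    state = "running"
    history = []
    for idle in idle_samples:
        state = next_state(state, idle)
        history.append(state)
    return history


if __name__ == "__main__":
    # Idle, then a keypress, then the keyboard slowly going quiet again.
    for idle, state in zip([600, 10, 120, 301, 5],
                           simulate([600, 10, 120, 301, 5])):
        print(f"KeyboardIdle={idle:4d}s -> {state}")
```

The point of the sketch is that a suspended job keeps its place on the machine and resumes only after the keyboard has been idle for longer than the continue threshold, rather than being restarted from the beginning.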