Re: [condor-users] jobs die due to low free memory
- Date: Tue, 6 Apr 2004 12:26:37 +0200 (MEST)
- From: "Anika Boehm" <anika.boehm@xxxxxxx>
- Subject: Re: [condor-users] jobs die due to low free memory
Hi Chris,
thanks for your hints. I had hoped there would be a "cleaner" solution than
adding a command to every submit file.
By the way: the expression
on_exit_remove = (ExitBySignal == True) && (ExitSignal != 4)
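doesn't remove a successfully completed job (ExitCode = 0) from the queue (at
least not with condor-6.5.3 on Solaris). For a job that exits normally,
ExitBySignal is False, so the whole expression evaluates to FALSE and the
job is put back into the Idle state.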
It seems
on_exit_remove = ExitCode =?= 0
is the solution to my problem.
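
To illustrate the difference, here is a minimal Python sketch that mimics how the two expressions evaluate for a few exit scenarios. It is only a simplified model, not Condor code: the function names and the four scenarios are made up for illustration, and Python's None stands in for a ClassAd attribute that is UNDEFINED (ExitSignal after a normal exit, ExitCode after death by signal).

```python
UNDEFINED = None  # stand-in for a ClassAd attribute that is not defined


def manual_example(exit_by_signal, exit_signal):
    """Models: (ExitBySignal == True) && (ExitSignal != 4)"""
    if not exit_by_signal:
        # FALSE && <anything> is FALSE, so a normal exit never passes.
        return False
    return exit_signal != 4


def exit_code_is_zero(exit_code):
    """Models: ExitCode =?= 0 (strict comparison, FALSE if undefined)."""
    return exit_code is not UNDEFINED and exit_code == 0


# Hypothetical exit scenarios: (ExitBySignal, ExitCode, ExitSignal)
scenarios = {
    "exit code 0": (False, 0, UNDEFINED),
    "exit code 1": (False, 1, UNDEFINED),
    "signal 4": (True, UNDEFINED, 4),
    "signal 9": (True, UNDEFINED, 9),
}

# True means "job leaves the queue", False means "job goes back to Idle".
outcomes = {
    name: (manual_example(by_signal, signal), exit_code_is_zero(code))
    for name, (by_signal, code, signal) in scenarios.items()
}

for name, (manual, strict) in outcomes.items():
    print(f"{name:12s} manual example: {manual!s:5s}  ExitCode =?= 0: {strict}")
```

In this model the manual's example only lets a job leave the queue when it was killed by a signal other than 4, while ExitCode =?= 0 lets it leave only after a clean exit. Note that with the second expression a job that always fails would be rerun indefinitely, so it may be worth combining it with some other limit.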
Cheers
Anja
>
> I don't know whether setting the ImageSize macro in the submit file
> would help in this situation.
>
> ImageSize : Estimate of the memory image size of the job in kbytes. The
> initial estimate may be specified in the job submit file. Otherwise, the
> initial value is equal to the size of the executable. When the job
> checkpoints, the ImageSize attribute is set to the size of the
> checkpoint file (since the checkpoint file contains the job's memory
> image).
>
> This may also be a case of what UW calls a 'black hole' machine. Even
> if it's not a real black hole, putting this statement in the submit file
> will prevent the uncompleted job from being removed from the queue if it
> took less than 10 minutes to run:
>
> on_exit_remove = (CurrentTime - JobStartDate) > (10 * 60)
>
>
> From the manual:
>
> on_exit_remove = ClassAd Boolean Expression
> This expression is checked when the job exits and if true, then it
> allows the job to leave the queue normally. If false, then the job is
> placed back into the Idle state. If the user job is a vanilla job then
> it restarts from the beginning. If the user job is a standard job, then
> it restarts from the last checkpoint.
>
> For example: Suppose you have a job that occasionally segfaults but
> you know if you run it again on the same data, chances are it will
> finish successfully. This is how you would represent that with
> on_exit_remove (assuming the signal identifier for segmentation fault is
> 4):
>
> on_exit_remove = (ExitBySignal == True) && (ExitSignal != 4)
>
> The above expression will not let the job exit if it exited by a
> signal and that signal number was 4 (representing segmentation fault). In
> any other case of the job exiting, it will leave the queue as it
> normally would have done.
>
> If left unspecified, this will default to True.
>
> periodic_ expressions (defined elsewhere in this man page) take
> precedence over on_exit_ expressions, and a _hold expression takes
> precedence over a _remove expression.
>
> This expression is available for the vanilla and java universes. It
> is additionally available, when submitted from a Unix machine, for the
> standard universe.
>
>
> --
> Chris Horn
> p: 703.413.1100 x5193
> f: 703.413.8111
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
>