Re: [Condor-users] Standalone checkpoint error ...
- Date: Fri, 3 Feb 2006 14:09:42 -0600
- From: Erik Paulson <epaulson@xxxxxxxxxxx>
- Subject: Re: [Condor-users] Standalone checkpoint error ...
On Fri, Feb 03, 2006 at 03:29:44PM +0000, Goncalo Borges wrote:
>
> Hello everybody,
>
> I'm trying to use the standalone checkpoint features provided by condor in
> our cluster. Here are the features of our machines:
>
> [goncalo@lflip02 ~]$ uname -a
> Linux lflip02.lip.pt 2.4.21-32.0.1.ELsmp #1 SMP Wed May 25 15:42:26 CDT
> 2005 i686 i686 i386 GNU/Linux
>
That kernel probably has address space randomization enabled, which causes
problems with Condor checkpointing (things aren't where we expect them to be
when the checkpoint is written or restored).
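
One rough way to see whether the layout moves between runs is to compare the
memory maps of two fresh processes. This is only a sketch: the function name
layout_is_stable is made up for illustration, and it assumes a Linux /proc
filesystem where /proc/self/maps is readable.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): compare the memory maps of two
# fresh "cat" processes. If they differ, the address space layout is changing
# from run to run, which suggests randomization is active.
# Any arguments are used as a command prefix, e.g. "setarch i386".
layout_is_stable() {
    first=$("$@" cat /proc/self/maps) || return 2    # 2: could not read maps
    second=$("$@" cat /proc/self/maps) || return 2
    [ "$first" = "$second" ]                         # 0: same, 1: different
}
```

Calling "layout_is_stable" with no arguments checks the default layout, and
"layout_is_stable setarch i386" repeats the check with the setarch prefix.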
<...>
>
> I have compiled the ever.c program:
>
<...>
> When I test the program interactively, it starts running with
> the right messages:
>
> [goncalo@lflip02 ~]$ ./ever
> Condor: Notice: Will checkpoint to ./ever.ckpt
> Condor: Notice: Remote system calls disabled.
>
> Then, after logging in on another console, I do a "kill -s USR2 <pid>".
> The program is stopped with a segmentation fault error and it creates an
> ever.ckpt.tmp file.
>
> [goncalo@lflip02 ~]$ ./ever
> Condor: Notice: Will checkpoint to ./ever.ckpt
> Condor: Notice: Remote system calls disabled.
> Segmentation fault (core dumped)
>
Yeah, that's not what you should see.
>
> Then, I try to restart the program using the ever.ckpt.tmp file but it is
> immediately killed.
>
Yup, the .tmp file isn't a complete checkpoint, so you can't restart from it.
>
> [goncalo@lflip02 ~]$ ./ever -_condor_restart ever.ckpt.tmp
> Condor: Notice: Will restart from ever.ckpt.tmp
> Killed
>
> I guess this is not the expected behaviour. Maybe there is an obvious
> reason why this is happening, which I'm forgetting.
>
You need to run your program under the old memory layout:
[goncalo@lflip02 ~]$ setarch i386 ./ever
and then, to restart,
[goncalo@lflip02 ~]$ setarch i386 ./ever -_condor_restart ever.ckpt
(Condor automatically does the equivalent of a 'setarch i386' before running
standard universe jobs, which is why it works inside of Condor.)
-Erik
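
The commands from the reply above can be strung together roughly as follows.
This is a sketch under assumptions, not a tested recipe: the function names
and the PROG/CKPT variables are made up for illustration, the program and
checkpoint names (./ever, ever.ckpt) are the ones from the thread, and the
completeness check assumes the .tmp file goes away once the checkpoint has
been written successfully.

```shell
#!/bin/sh
# Sketch of a standalone checkpoint/restart cycle under the old memory layout.
# PROG and CKPT default to the names used in the thread.
PROG=${PROG:-./ever}
CKPT=${CKPT:-ever.ckpt}

# Start the program under the i386 personality, as suggested in the reply.
start_job() {
    setarch i386 "$PROG" &
    JOB_PID=$!    # setarch execs the program, so this is the job's own PID
}

# Ask the running job to write a checkpoint (the signal used in the thread).
request_checkpoint() {
    kill -s USR2 "$1"
}

# Assumption: a usable checkpoint means the final file exists and no
# leftover "$CKPT.tmp" (an incomplete checkpoint) is lying around.
checkpoint_complete() {
    [ -f "$CKPT" ] && [ ! -e "$CKPT.tmp" ]
}

# Restart from the completed checkpoint, again under the i386 personality.
restart_job() {
    setarch i386 "$PROG" -_condor_restart "$CKPT"
}
```

A session would then look like: start_job, then request_checkpoint "$JOB_PID",
then restart_job once checkpoint_complete succeeds.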