Re: [Condor-users] condor termsig = 11 on checkpoint restart
- Date: Thu, 9 Jun 2005 11:04:46 -0500
- From: Naveen Neelakantam <neelakan@xxxxxxxx>
- Subject: Re: [Condor-users] condor termsig = 11 on checkpoint restart
Hmmm, perhaps my description is too vague to attract a response.
Let's start with this.  How can I manually test if checkpointing is  
working?  I'd like to test checkpointing a job on an FC3 node and  
restarting it on an RH7.3 node.  Again, I am using condor 6.7.7  
installed from rpms.
Naveen
On Jun 2, 2005, at 6:38 PM, Naveen Neelakantam wrote:
Hello,
First let me start off by saying that I have been a very happy  
condor user for over a year now.  However, a few of our machines  
got upgraded to Fedora Core 3 and all hell has broken loose on our  
condor install.  Now, I am a frustrated condor user who is  
desperately trying to be happy again.  :-)
I am having problems with condor version 6.7.7 and standard  
universe jobs exiting with signal 11 after restarting from a  
checkpoint.  Every time the problem occurs I see something like the  
following in the shadow log:
6/2 17:26:37 (10.5) (312):Read: Done restoring file state
6/2 17:26:37 (10.5) (312):Read: About to restore signal state
6/2 17:26:37 (10.5) (312):Read: About to return to user code
6/2 17:26:37 (10.5) (312):Shadow: Job 10.5 exited, termsig = 11, coredump = 0, retcode = 0
6/2 17:26:37 (10.5) (312):Shadow: was killed by signal 11.
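To check how often this pattern shows up, the shadow log can be scanned for jobs that exit on a signal immediately after a checkpoint restore. The following is only a rough sketch in Python, assuming the log lines look exactly like the excerpt above; the function names (`find_restart_crashes`, `signal_name`) are made up for illustration and are not part of condor.

```python
import re
import signal

# Line prefix as seen in the excerpt: "6/2 17:26:37 (10.5) (312):..."
PREFIX_RE = re.compile(r"^\S+ \S+ \((?P<job>\d+\.\d+)\) \(\d+\):")
# Exit line as seen in the excerpt:
# "Shadow: Job 10.5 exited, termsig = 11, coredump = 0, retcode = 0"
EXIT_RE = re.compile(
    r"Shadow: Job (?P<job>\d+\.\d+) exited, termsig = (?P<sig>\d+),\s*"
    r"coredump = (?P<core>\d+), retcode = (?P<ret>\d+)"
)
# Last message logged before control returns to the restarted job.
RESTORE_MARK = "About to return to user code"


def signal_name(num):
    """Translate a signal number into its symbolic name where the platform knows it."""
    try:
        return signal.Signals(num).name
    except ValueError:
        return "signal %d" % num


def find_restart_crashes(log_text):
    """Return (job_id, termsig) for jobs killed by a signal right after a restore."""
    # Rejoin exit lines that were wrapped after "termsig = N," (e.g. by a mail client).
    text = re.sub(r",\s*\n\s*coredump", ", coredump", log_text)
    restored = set()   # jobs whose most recent event was a checkpoint restore
    crashes = []
    for line in text.splitlines():
        prefix = PREFIX_RE.match(line)
        if prefix and RESTORE_MARK in line:
            restored.add(prefix.group("job"))
            continue
        exit_line = EXIT_RE.search(line)
        if exit_line:
            job = exit_line.group("job")
            sig = int(exit_line.group("sig"))
            if job in restored and sig != 0:
                crashes.append((job, sig))
            restored.discard(job)
    return crashes
```

Feeding it the contents of the shadow log and passing each returned signal number through `signal_name` gives a quick list of affected jobs. The sketch only looks at the restore and exit messages, so it says nothing about which machine the checkpoint was taken on.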
Specifically the problem only occurs when a job running on a Fedora  
Core 3 node gets checkpointed and restarted on one of our Redhat  
Linux 7.3 nodes (and possibly vice versa).
Some info about our setup:
-Negotiator/Collector is running Redhat Enterprise Linux AS release  
3 and is not a member of the pool (not running condor_startd)
-compute nodes are either running Redhat Linux 7.3 or Fedora Core  
(1 and 3)
-all nodes were installed using the 6.7.7 dynamic lib rpms  
(glibc22 or glibc23 where appropriate)
-USER_JOB_WRAPPER is NOT set
I had previously been using condor version 6.6.9 with the  
USER_JOB_WRAPPER variable set as mentioned in order to work around  
the 2.6 kernel exec_shield problem.  However, I would still see the  
same symptoms (termsig = 11 on checkpoint restart).
In order to test if checkpointing works from the command line I  
tried the following:
fedora_node> setarch i386 ./myapp
fedora_node> kill -TSTP <myapp_pid>
RHL_7_3_node> ./myapp -_condor_restart myapp.ckpt
And the app segfaulted.
Any help would be greatly appreciated!
Thanks,
Naveen