Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

[Condor-users] condor termsig = 11 on checkpoint restart

Date: Thu, 2 Jun 2005 18:38:56 -0500
From: Naveen Neelakantam <neelakan@xxxxxxxx>
Subject: [Condor-users] condor termsig = 11 on checkpoint restart

Hello,

First let me start off by saying that I have been a very happy condor user for over a year now. However, a few of our machines got upgraded to Fedora Core 3 and all hell has broken loose on our condor install. Now, I am a frustrated condor user who is desperately trying to be happy again. :-)

I am having problems with condor version 6.7.7 and standard universe jobs exiting with signal 11 after restarting from a checkpoint. Every time the problem occurs I see something like the following in the shadow log:

6/2 17:26:37 (10.5) (312):Read: Done restoring file state
6/2 17:26:37 (10.5) (312):Read: About to restore signal state
6/2 17:26:37 (10.5) (312):Read: About to return to user code
6/2 17:26:37 (10.5) (312):Shadow: Job 10.5 exited, termsig = 11, coredump = 0, retcode = 0
6/2 17:26:37 (10.5) (312):Shadow: was killed by signal 11.
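
For anyone who wants to pull these exit records out of their own shadow log, here is a minimal Python sketch. The regular expression is inferred only from the excerpt above, not from any Condor documentation, and the function name find_signal_exits is hypothetical; other Condor versions may format these lines differently.

```python
import re

# Pattern assumed from the shadow log excerpt in this message only;
# it is not taken from Condor documentation and may need adjusting.
EXIT_RE = re.compile(
    r"\((?P<job>\d+\.\d+)\)\s+\((?P<pid>\d+)\):Shadow: Job \S+ exited, "
    r"termsig = (?P<termsig>\d+), coredump = (?P<coredump>\d+), "
    r"retcode = (?P<retcode>\d+)"
)


def find_signal_exits(text):
    """Return (job_id, termsig) for each exit record with a nonzero termsig."""
    return [
        (m.group("job"), int(m.group("termsig")))
        for m in EXIT_RE.finditer(text)
        if int(m.group("termsig")) != 0
    ]


sample = (
    "6/2 17:26:37 (10.5) (312):Shadow: Job 10.5 exited, "
    "termsig = 11, coredump = 0, retcode = 0"
)
print(find_signal_exits(sample))
```

Scanning the whole text with finditer rather than line by line means the sketch also copes with logs whose line breaks were lost in transit.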

Specifically the problem only occurs when a job running on a Fedora Core 3 node gets checkpointed and restarted on one of our Redhat Linux 7.3 nodes (and possibly vice versa).

Some info about our setup:
- Negotiator/Collector is running Redhat Enterprise Linux AS release 3 and is not a member of the pool (not running condor_startd)
- compute nodes are either running Redhat Linux 7.3 or Fedora Core (1 and 3)
- all nodes were installed using the 6.7.7 dynamic lib rpm's (glibc22 or glibc23 where appropriate)
- USER_JOB_WRAPPER is NOT set

I had previously been using condor version 6.6.9 with the USER_JOB_WRAPPER variable set as mentioned in order to work around the 2.6 kernel exec_shield problem. However, I still saw the same symptoms (termsig = 11 on checkpoint restart).

In order to test if checkpointing works from the command line I tried the following:

fedora_node> setarch i386 ./myapp
fedora_node> kill -TSTP <myapp_pid>
RHL_7_3_node> ./myapp -_condor_restart myapp.ckpt

And the app segfaulted.

Any help would be greatly appreciated!

Thanks,
Naveen

Follow-Ups:
- Re: [Condor-users] condor termsig = 11 on checkpoint restart
  - From: Naveen Neelakantam
- Re: [Condor-users] condor termsig = 11 on checkpoint restart
  - From: Dan Christensen
