Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


[Condor-users] jobs are being killed after 30-45 minutes

Date: Mon, 24 Jul 2006 17:02:07 +0100
From: Santanu Das <santanu@xxxxxxxxxxxxxxxxx>
Subject: [Condor-users] jobs are being killed after 30-45 minutes

Hi Guys,

A number of jobs are being killed after 30-45 minutes of running on our site.

An example is job (https://mu3.matrix.sara.nl:9000/yha-AADzc51_ytdhLM_9Ow), which ran on one of our WNs and for which the RB reports the following timestamps:

      Submitted        : Sat Jul 22 21:56:19 2006
      Waiting          : Sat Jul 22 22:47:46 2006
      Ready            : Sat Jul 22 21:57:04 2006
      Scheduled        : Sat Jul 22 21:57:18 2006
      Running          : Sat Jul 22 21:59:57 2006
      Done             : Sat Jul 22 22:47:38 2006
      Cleared          :                ---
      Aborted          : Sat Jul 22 22:47:48 2006
      Cancelled        :                ---
(All times are UTC.)
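
The gaps between these states can be worked out directly from the timestamps. A minimal sketch in Python, assuming only the values listed above; the dictionary and the helper name `elapsed` are hypothetical and exist purely for illustration:

```python
from datetime import datetime

# Format of the RB timestamps, e.g. "Sat Jul 22 21:59:57 2006".
FMT = "%a %b %d %H:%M:%S %Y"

# Timestamps copied from the RB report (UTC); states without a time are omitted.
events = {
    "Submitted": "Sat Jul 22 21:56:19 2006",
    "Ready": "Sat Jul 22 21:57:04 2006",
    "Scheduled": "Sat Jul 22 21:57:18 2006",
    "Running": "Sat Jul 22 21:59:57 2006",
    "Done": "Sat Jul 22 22:47:38 2006",
    "Waiting": "Sat Jul 22 22:47:46 2006",
    "Aborted": "Sat Jul 22 22:47:48 2006",
}

parsed = {name: datetime.strptime(ts, FMT) for name, ts in events.items()}


def elapsed(start, end):
    """Return the time between two recorded states as a timedelta."""
    return parsed[end] - parsed[start]


run_time = elapsed("Running", "Done")
to_abort = elapsed("Running", "Aborted")
print(f"Running -> Done:    {run_time}")   # 0:47:41
print(f"Running -> Aborted: {to_abort}")   # 0:47:51
```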

Now, on that execute node, the StarterLog looks pretty normal:


7/22 22:32:59 ******************************************************
7/22 22:32:59 ** condor_starter (CONDOR_STARTER) STARTING UP
7/22 22:32:59 ** /opt/condor-6.6.11/sbin/condor_starter
7/22 22:32:59 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/22 22:32:59 ** $CondorPlatform: I386-LINUX_RH9 $
7/22 22:32:59 ** PID = 28393
7/22 22:32:59 ******************************************************
7/22 22:32:59 Using config file: /etc/condor/condor_config
7/22 22:32:59 Using local config files: /home/condorr/condor_config.local
7/22 22:32:59 DaemonCore: Command Socket at <172.24.116.141:9583>
7/22 22:32:59 Done setting resource limits
7/22 22:32:59 Starter communicating with condor_shadow <172.24.116.151:9685>
7/22 22:32:59 Submitting machine is "serv03--hep--phy.grid.private.cam.ac.uk"
7/22 22:32:59 VM1_USER set, so running job as condor_user1
7/22 22:32:59 File transfer completed successfully.
7/22 22:33:00 Starting a VANILLA universe job with ID: 5169.0
7/22 22:33:00 IWD: /home/condorr/execute/dir_28393
7/22 22:33:00 Output file: /home/condorr/execute/dir_28393/globus-cache-export.q10134.batch.out
7/22 22:33:00 Error file: /home/condorr/execute/dir_28393/globus-cache-export.q10134.batch.err
7/22 22:33:00 Using wrapper /opt/condor/etc/condor_job_wrapper.sh to exec /home/condorr/execute/dir_28393/condor_exec.exe
7/22 22:33:00 Create_Process succeeded, pid=28395
7/22 22:56:28 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

except for the last entry:

7/22 22:56:28 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

This is not an NIS-based setup. I don't have a user named "condor" anywhere, but Condor is running as the same user everywhere.
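
getpwnam is the standard libc lookup of an account by name, and the odd "failed: Success" text is consistent with a lookup that found no entry without setting errno. A minimal sketch for checking whether a name resolves on a given node, using Python's `pwd` module; the helper name `account_exists` is hypothetical, and "condor" is simply the name from the log line:

```python
import pwd


def account_exists(name):
    """Return True if the local passwd database (or NSS) resolves this account name."""
    try:
        pwd.getpwnam(name)
        return True
    except KeyError:
        # pwd.getpwnam raises KeyError when no entry is found.
        return False


# "condor" is the account the log line complains about.
print("condor resolves:", account_exists("condor"))
```

Running this on each execute node and on the submit host would show whether the name resolves anywhere.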

This is what I have in my condor_config.local file on every execute node:


HEPCAM_CONTINUE         = ( $(CPUIdle) && ($(ActivityTimer) > 10) )
CONTINUE                = $(HEPCAM_CONTINUE)
KILL                    = FALSE
PERIODIC_CHECKPOINT     = FALSE

PREEMPT                 = ( (Activity == "Busy") && (State == "Claimed") && ($(ActivityTimer) > 2160) )

PREEMPTION_RANK         = FALSE
PREEMPTION_REQUIREMENTS = FALSE
START                   = TRUE
SUSPEND                 = FALSE
WANT_SUSPEND            = FALSE
WANT_VACATE             = FALSE
SEC_DEFAULT_NEGOTIATION = NEVER
USER_JOB_WRAPPER        = /opt/condor/etc/condor_job_wrapper.sh
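
The PREEMPT threshold is given in seconds, so it may help to see it in minutes next to the reported kill window. A minimal sketch of that arithmetic, assuming ActivityTimer counts seconds spent in the current activity (as in the stock Condor example configuration); the variable names are hypothetical, and this only converts units, it is not a diagnosis:

```python
# Threshold taken from the PREEMPT expression: $(ActivityTimer) > 2160
PREEMPT_THRESHOLD_S = 2160

# Window in which the jobs are reported to be killed, in minutes.
REPORTED_WINDOW_MIN = (30, 45)

threshold_min = PREEMPT_THRESHOLD_S / 60
in_window = REPORTED_WINDOW_MIN[0] <= threshold_min <= REPORTED_WINDOW_MIN[1]

print(f"PREEMPT threshold: {threshold_min:.0f} minutes")   # 36 minutes
print(f"Inside the reported 30-45 minute window: {in_window}")
```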

The ShadowLog on the submit host does not look so good. I get a lot of these in there:



7/23 20:14:58 ******************************************************
7/23 20:14:58 ** condor_shadow (CONDOR_SHADOW) STARTING UP
7/23 20:14:58 ** /opt/condor-6.6.11/sbin/condor_shadow
7/23 20:14:58 ** $CondorVersion: 6.6.11 Mar 23 2006 $
7/23 20:14:58 ** $CondorPlatform: I386-LINUX_RH9 $
7/23 20:14:58 ** PID = 3661
7/23 20:14:58 ******************************************************
7/23 20:14:58 Using config file: /opt/condor/etc/condor_config
7/23 20:14:58 Using local config files: /home/condorr/condor_config.local
7/23 20:14:58 DaemonCore: Command Socket at <172.24.116.151:9530>
7/23 20:14:59 Initializing a VANILLA shadow
7/23 20:14:59 (7426.0) (3661): Request to run on <172.24.116.155:9653> was ACCEPTED
7/23 20:14:59 (7462.0) (26192): IO: Failed to read packet header
7/23 20:14:59 (7462.0) (26192): ERROR "Can no longer talk to condor_starter on execute machine (172.24.116.161)" at line 63 in file NTreceivers.C
7/23 20:14:59 passwd_cache::cache_uid(): getpwnam("condor") failed: Success



and this from the StartLog on the respective execute node:

7/23 20:14:59 Starter pid 28113 died on signal 11 (signal 11)
7/23 20:14:59 vm2: State change: starter exited
7/23 20:14:59 vm2: Changing activity: Busy -> Idle
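
For reference, signal 11 on Linux is SIGSEGV (segmentation fault). A minimal sketch that pulls the pid and signal number out of the StartLog line and maps the number to its name with Python's `signal` module; the regular expression is an assumption based only on the single line shown above:

```python
import re
import signal

# Line copied from the StartLog excerpt.
line = "7/23 20:14:59 Starter pid 28113 died on signal 11 (signal 11)"

# Assumed pattern, derived from this one example line.
match = re.search(r"Starter pid (\d+) died on signal (\d+)", line)
pid = int(match.group(1))
signum = int(match.group(2))

# signal.Signals maps a signal number to its symbolic name on this platform.
name = signal.Signals(signum).name

print(f"starter pid {pid} died on signal {signum} ({name})")
```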


Any idea what's going on? Thanks in advance for your help.

Cheers,
Santanu
