[Condor-users] jobs terminated with SIGQUIT signal
- Date: Fri, 19 Dec 2008 10:04:40 -0500
- From: Yu Fu <yfu@xxxxxxxxxxxx>
- Subject: [Condor-users] jobs terminated with SIGQUIT signal
Hi there,
We have a strange problem on our systems: jobs are terminated with a
SIGQUIT signal a few minutes after they start. This happens from time
to time. It seems all jobs are affected, and no job can run for more
than 10 minutes. Below is a typical excerpt from StarterLog.slot* on a
worker node:
12/19 07:40:03 ******************************************************
12/19 07:40:03 ** condor_starter (CONDOR_STARTER) STARTING UP
12/19 07:40:03 ** /opt/condor-7.0.4/sbin/condor_starter
12/19 07:40:03 ** $CondorVersion: 7.0.4 Jul 16 2008 BuildID: 95033 $
12/19 07:40:03 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
12/19 07:40:03 ** PID = 7352
12/19 07:40:03 ** Log last touched 12/19 07:39:58
12/19 07:40:03 ******************************************************
12/19 07:40:03 Using config source: /etc/condor/condor_config
12/19 07:40:03 Using local config sources:
12/19 07:40:03 /opt/condor/condor_config.local
12/19 07:40:03 DaemonCore: Command Socket at <128.227.221.104:59102>
12/19 07:40:03 Done setting resource limits
12/19 07:40:03 Communicating with shadow <128.227.221.12:36599>
12/19 07:40:03 Submitting machine is "hg.ihepa.ufl.edu"
12/19 07:40:03 setting the orig job name in starter
12/19 07:40:03 setting the orig job iwd in starter
12/19 07:40:03 Job 1219662.0 set to execute immediately
12/19 07:40:03 Starting a VANILLA universe job with ID: 1219662.0
12/19 07:40:03 IWD: /share/home/cms31291/gram_scratch_RRbmV1FBMe
12/19 07:40:03 Output file: /share/home/cms31291/.globus/job/hg.ihepa.ufl.edu/358.1229689455/stdout
12/19 07:40:03 Error file: /share/home/cms31291/.globus/job/hg.ihepa.ufl.edu/358.1229689455/stderr
12/19 07:40:03 About to exec /share/home/cms31291/.globus/.gass_cache/local/md5/02/d5a0e9fd5e13006d0bf2c5381f3b0f/md5/96/55ad8460528337ce276628a2e787ad/data UI=000000:NS=0000000004:WM=000005:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000
12/19 07:40:03 Create_Process succeeded, pid=7353
12/19 07:46:57 Process exited, pid=7353, status=0
12/19 07:46:57 Got SIGQUIT. Performing fast shutdown.
12/19 07:46:57 ShutdownFast all jobs.
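To check how long each job ran before the starter got SIGQUIT, something like the following could be run over the StarterLog files. This is only a rough sketch that assumes the line format shown in the excerpt above. The function name sigquit_runtimes and the ASSUMED_YEAR constant are made up for illustration (the log timestamps carry no year), and nothing here is part of Condor itself.

```python
import re
from datetime import datetime

# Timestamp layout "MM/DD HH:MM:SS" as it appears in the excerpt.
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
START = re.compile(r"Create_Process succeeded, pid=(\d+)")

# Assumption: the log has no year, so one is supplied only to allow subtraction.
ASSUMED_YEAR = 2008


def sigquit_runtimes(lines):
    """Return (pid, seconds) for each job start that is followed by 'Got SIGQUIT'."""
    results = []
    started = None  # (pid, start time) of the most recent Create_Process line
    for line in lines:
        match = STAMP.match(line.strip())
        if not match:
            continue  # wrapped continuation lines carry no timestamp
        when = datetime.strptime(
            f"{ASSUMED_YEAR}/{match.group(1)}", "%Y/%m/%d %H:%M:%S"
        )
        text = match.group(2)
        start = START.search(text)
        if start:
            started = (int(start.group(1)), when)
        elif "Got SIGQUIT" in text and started is not None:
            pid, t0 = started
            results.append((pid, int((when - t0).total_seconds())))
            started = None
    return results


# Three lines copied from the excerpt above.
sample = [
    "12/19 07:40:03 Create_Process succeeded, pid=7353",
    "12/19 07:46:57 Process exited, pid=7353, status=0",
    "12/19 07:46:57 Got SIGQUIT. Performing fast shutdown.",
]
print(sigquit_runtimes(sample))  # [(7353, 414)]
```

For the job above this gives 414 seconds, i.e. just under 7 minutes between Create_Process and the SIGQUIT.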
According to the MatchLog on the gatekeeper, the jobs were preempted, but
I don't understand why jobs with higher rank and priority were preempted
by lower-ranked, lower-priority ones. We are running Condor 7.0.4.
Thanks,
Yu