[Condor-users] pseudo_ops.cpp on starter
- Date: Wed, 20 Oct 2010 04:22:36 -0700
- From: Mag Gam <magawake@xxxxxxxxx>
- Subject: [Condor-users] pseudo_ops.cpp on starter
Hello all,
I am having a problem where jobs are restarting on their own. The job
runcount is more than 1 for many jobs.
We keep seeing ... in our start log: "line 649 in file pseudo_ops.cpp".
Is this a known issue?
Schedd version: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Startd: CondorVersion = "$CondorVersion: 7.2.4 Jun 15 2009 BuildID: 159529 $"
Operating System Version=RHEL 5.2
I also have MAX_JOBS_RUNNING = 10000
ShadowLog:
The job runs fine, then suddenly this error appears in the ShadowLog:
10/19 17:33:56 (28.3) (29653): DaemonCore: Leaving SendAliveToParent() - success
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteSysCpu = 682.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(RemoteUserCpu = 62981.000000)
10/19 17:33:58 (28.3) (29653): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1287523744)
10/19 17:34:04 (28.3) (29653): Inside RemoteResource::updateFromStarter()
10/19 17:38:35 (28.3) (29653): FileLock::obtain(1) - @1287524315.981362 lock on /home/mech1//job.out.28 now WRITE
10/19 17:38:35 (28.3) (29653): FileLock::obtain(2) - @1287524315.985954 lock on /home/mech1//job.out.28 now UNLOCKED
10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file pseudo_ops.cpp
10/19 17:43:27 Initializing a VANILLA shadow for job 28.3
10/19 17:43:27 (28.3) (10012): FileLock object is updating timestamp on: /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): UserLog = /home/mech1//job.out.28
10/19 17:43:27 (28.3) (10012): *** Reserved Swap = 5120
10/19 17:43:27 (28.3) (10012): *** Free Swap = 8388440
10/19 17:43:27 (28.3) (10012): in RemoteResource::initStartdInfo()
10/19 17:43:27 (28.3) (10012): Entering DCStartd::activateClaim()
10/19 17:43:27 (28.3) (10012): Initialized the following authorization table:
10/19 17:43:27 (28.3) (10012): Authorizations yet to be resolved:
10/19 17:43:27 (28.3) (10012): allow READ: */*.mech.mich.edu
10/19 17:43:27 (28.3) (10012): allow WRITE: */*.mech.mich.edu
StarterLog.slot1.7:10/18 22:03:56 Job 28.3 set to execute immediately
StarterLog.slot1.7:10/18 22:03:56 Starting a VANILLA universe job with ID: 28.3
StarterLog.slot1.7:10/18 22:03:56 Output file: /home/mech1//stdout.28.3
StarterLog.slot1.7:10/18 22:03:56 Error file: /home/mech1//stderr.28.3
The job's run count now stands at 3.
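
To see which jobs are affected and how often, the ShadowLog can be scanned for shadow restarts and ERROR lines. Below is a minimal Python sketch, assuming the log lines look like the excerpt above (unwrapped, one entry per line). The function name summarize_shadow_log and both regular expressions are hypothetical helpers inferred from this excerpt, not part of Condor, and other Condor versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns inferred from the ShadowLog excerpt in this post (assumption:
# other Condor versions or debug levels may log these lines differently).
INIT_RE = re.compile(r"Initializing a \w+ shadow for job (\d+\.\d+)")
ERROR_RE = re.compile(r"\((\d+\.\d+)\) \(\d+\): ERROR (.*)")


def summarize_shadow_log(lines):
    """Return (shadow starts per job, list of (job_id, error text))."""
    starts = Counter()
    errors = []
    for line in lines:
        m = INIT_RE.search(line)
        if m:
            # Each "Initializing ... shadow" line marks one (re)start of the job.
            starts[m.group(1)] += 1
            continue
        m = ERROR_RE.search(line)
        if m:
            errors.append((m.group(1), m.group(2).strip()))
    return starts, errors


if __name__ == "__main__":
    # Two lines copied from the excerpt above, used as sample input.
    sample = [
        '10/19 17:38:35 (28.3) (29653): ERROR "Error from starter on '
        'slot1@xxxxxxxxxxxxxxxxxxxx: ProcD has failed" at line 649 in file '
        'pseudo_ops.cpp',
        "10/19 17:43:27 Initializing a VANILLA shadow for job 28.3",
    ]
    starts, errors = summarize_shadow_log(sample)
    for job, count in starts.items():
        print(job, "shadow starts:", count)
    for job, text in errors:
        print(job, "error:", text)
```

To run it against a real log, pass an open file object instead of the sample list, for example summarize_shadow_log(open(path)), where path is wherever the ShadowLog lives on the submit machine.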