[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Condor-users] Condor VMs receive signal 9, deactivate_claim_forcibly



Hi All

First I'll specify the platform:
  OS: Scientific Linux SL release 3.0.5 (SL)
  Worker Nodes: Dual Intel XEON 2.40GHz, 2 GB RAM
  Problem seen under Condor 6.7.3 and 6.7.11
  schedd on masternode004 (10.0.11.32)
  collector/negotiator on node222 (10.0.11.36)

condor_config has 3 VMs defined for each worker node, but VM1 is reserved 
for test jobs via the start definition
START = ( \
          ((VirtualMachineID == 1) && $(TEST_START) && ((CAFGroup =?= "test") \
           || ((CAFGroup =?= "hiprio") && $(CAF_HIPRIO_ON_TEST)) )) || \
          ((VirtualMachineID == 2) && $(PRODUCTION_START) && ((CAFGroup =?= "production") \
           || ((CAFGroup =?= "hiprio") && $(CAF_HIPRIO_ON_PRODUCTION)) )) || \
          ((VirtualMachineID > ($(NUM_CPUS)-$(NUM_DEFAULT_CPUS))) \
           && (VirtualMachineID > ($(NUM_CPUS)-$(CAF_DEFAULT_MAX))) && $(DEFAULT_START) \
           && (CAFGroup =!= "production") && (CAFGroup =!= "Held")) \
        )

I am having trouble with jobs receiving a signal 9
(deactivate_claim_forcibly), mostly when they reach the end of the job.  
Under Condor 6.7.3 this occurred always to jobs running under VM3 but I
upgraded to 6.7.11 yesterday and I now see similar failures under VM2 and
VM3.  A test job foo (reproduced below) that runs for 10 minutes and
utilizes a bit of CPU tends to fail in the last minute of the job.  Of 100
jobs submitted, 43 failed, 19 on VM2 and 24 on VM3. The same thing occurs
if I increase the running time to 30 mins, i.e. the job fails near the
end.  I suspect a bug in Condor???

I turned on FULLDEBUG and I picked a failure at random (110123.82). 
The StarterLog.vm2 and StartLog do not contain any helpful information 
that I can see.  Authentication is via Kerberos so I removed lots of 
extraneous logging info below.  

======================= StarterLog.vm2 ============================== 
9/23 11:56:29 ****************************************************** 
9/23 11:56:29 ** condor_starter (CONDOR_STARTER) STARTING UP 
9/23 11:56:29 ** /home/cdfcaf/condor/dist_6.7.11/sbin/condor_starter 
9/23 11:56:29 ** $CondorVersion: 6.7.11 Sep 15 2005 $ 
9/23 11:56:29 ** $CondorPlatform:I386-LINUX_RH9 $ 
9/23 11:56:29 ** PID = 18674 
9/23 11:56:29****************************************************** 
9/23 11:56:29 Communicating with shadow <10.0.11.32:40265> 
9/23 11:56:29 Shadow version: $CondorVersion: 6.7.11 Sep 15 2005 $ 
9/23 11:56:29 Starting a VANILLA universe job with ID: 110123.82 
9/23 11:56:29 Create_Process succeeded, pid=18679
   <snip>
9/23 12:06:01 Process exited, pid=18679, signal=9
9/23 12:06:01 in VanillaProc::JobCleanup()
9/23 12:06:01 Inside OsProc::JobCleanup()
9/23 12:06:01 Reaper: all=1 handled=1 ShuttingDown=0
9/23 12:06:01 Inside OsProc::JobExit()
9/23 12:06:01 In VanillaProc::PublishUpdateAd()
9/23 12:06:01 Inside OsProc::PublishUpdateAd()
9/23 12:06:01 Doing CONDOR_job_exit


================== StartLog ===================================
9/23 11:56:29 vm2: Remote job ID is 110123.82
9/23 11:56:29 ProcAPI::buildFamily() Found daddypid on the system: 18674
9/23 11:56:29 vm2: Got universe "VANILLA" (5) from request classad
9/23 11:56:29 vm2: State change: claim-activation protocol successful
9/23 11:56:29 vm2: Changing activity: Idle -> Busy
   <snip>
9/23 12:06:01 vm2: Called deactivate_claim_forcibly()
9/23 12:06:01 vm2: In Starter::kill() with pid 18674, sig 3 (SIGQUIT)
9/23 12:06:01 Starter pid 18674 exited with status 0
9/23 12:06:01 vm2: Canceled hardkill-starter timer (772)
9/23 12:06:01 ProcAPI::buildFamily failed: parent 18674 not found on system.
9/23 12:06:01 ProcAPI::getProcInfo() pid 18674 does not exist. (repeated)
9/23 12:06:01 vm2: State change: starter exited
9/23 12:06:01 vm2: Changing activity: Busy -> Idle


The ShadowLog on the schedd also does not give me any clues as to why this 
occurs:

================== ShadowLog ===================================
9/23 11:56:29 Initializing a VANILLA shadow for job 110123.82
<snip>
9/23 12:00:29 (110123.82) (26917): Getting monitoring info for pid 26917
9/23 12:04:29 (110123.82) (26917): Getting monitoring info for pid 26917
9/23 12:06:01 (110123.82) (26917): About to decode condor_sysnum
9/23 12:06:01 (110123.82) (26917): Got request for syscall -65
9/23 12:06:01 (110123.82) (26917): in pseudo_job_exit: status=9,reason=100
9/23 12:06:01 (110123.82) (26917): Inside RemoteResource::updateFromStarter()
9/23 12:06:01 (110123.82) (26917): Update ad:
MyType = "(unknown type)"
TargetType = "(unknown type)"
RemoteSysCpu = 112
RemoteUserCpu = 47
ImageSize = 15496
JobState = "Exited"
NumPids = 0
JobPid = 18679
JobStartDate = 1127490989
JobDuration = 571.654663
ExitBySignal = TRUE
ExitSignal = 9
ExitReason = "died on signal 9"
99/23 12:06:01 (110123.82) (26917): --- End of ClassAd ---
9/23 12:06:01 (110123.82) (26917): Inside RemoteResource::resourceExit()
9/23 12:06:01 (110123.82) (26917): setting exit reason on vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx to 100
9/23 12:06:01 (110123.82) (26917): Resource vm2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx changing state from EXECUTING 
to FINISHED
9/23 12:06:01 (110123.82) (26917):      rval = 0, errno = 0
9/23 12:06:01 (110123.82) (26917): Shadow: do_REMOTE_syscall returned < 0
9/23 12:06:01 (110123.82) (26917): Entering DCStartd::deactivateClaim(forceful)
9/23 12:06:01 (110123.82) (26917): STARTCOMMAND: starting 404 to <10.0.10.32:36881> on TCP port 40877.
9/23 12:06:01 (110123.82) (26917): SECMAN: command 404 to <10.0.10.32:36881> on TCP port 40877.
9/23 12:06:01 (110123.82) (26917): SECMAN: using session 
node186:20398:1127490989:5 for {<10.0.10.32:36881>,<404>}.
9/23 12:06:01 (110123.82) (26917): SECMAN: found cached session id 
node186:20398:1127490989:5 for {<10.0.10.32:36881>,<404>}.
9/23 12:06:01 (110123.82) (26917): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
9/23 12:06:01 (110123.82) (26917): DCStartd::deactivateClaim: successfully sent command
9/23 12:06:01 (110123.82) (26917): Killed starter (fast) at <10.0.10.32:36881>
9/23 12:06:01 (110123.82) (26917): Job 110123.82 terminated: killed by signal 9
9/23 12:06:01 (110123.82) (26917): Forking Mailer process...




================= Test job: foo ==================
#!/bin/bash
 
echo "****** Running foo on `date` ******"
echo "****** hostname -f ******"
hostname -f
echo "****** pwd **************"
pwd
 
start_time=`date +%s`
#end_time=$((start_time+600))
end_time=$((start_time+$1))
echo "start time $start_time"
echo "end time $end_time"
current_time=`date +%s`
time while  [ $current_time -lt  $end_time  ]
do
  current_time=`date +%s`
  junk=$((1+2+3+4+5))
done
echo "*** Job ending at `date`****"
exit
=============================================

Thanks
Leslie Groer 
groer@xxxxxxxxxxxxxxxxxxx