Some follow-up information for this issue. Looking at the StarterLog for one of the problem jobs I see the following:

09/30/15 08:34:52 ******************************************************
09/30/15 08:34:52 ** condor_starter (CONDOR_STARTER) STARTING UP
09/30/15 08:34:52 ** /btbal3600/opt/condor-8.2.8/sbin/condor_starter
09/30/15 08:34:52 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
09/30/15 08:34:52 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
09/30/15 08:34:52 ** $CondorVersion: 8.2.8 Apr 10 2015 $
09/30/15 08:34:52 ** $CondorPlatform: X86_64-sles_11 $
09/30/15 08:34:52 ** PID = 5593
09/30/15 08:34:52 ** Log last touched 9/30 07:51:10
09/30/15 08:34:52 ******************************************************
09/30/15 08:34:52 Using config source: /btbal3600/opt/condor-8.2.8/etc/condor_config
09/30/15 08:34:52 Using local config sources:
09/30/15 08:34:52    /opt/condor/local.btbal3600/condor_config.local
09/30/15 08:34:52    /opt/condor-8.2.8/local.btbal3600/condor_config.local
09/30/15 08:34:52 config Macros = 210, Sorted = 209, StringBytes = 5592, TablesBytes = 7616
09/30/15 08:34:52 CLASSAD_CACHING is OFF
09/30/15 08:34:52 Daemon Log is logging: D_ALWAYS D_ERROR
09/30/15 08:34:52 SharedPortEndpoint: waiting for connections to named socket 27196_cacb_878
09/30/15 08:34:52 DaemonCore: command socket at <10.11.132.72:51529?sock=27196_cacb_878>
09/30/15 08:34:52 DaemonCore: private command socket at <10.11.132.72:51529?sock=27196_cacb_878>
09/30/15 08:34:52 Communicating with shadow <10.11.129.10:47740?noUDP>
09/30/15 08:34:52 Submitting machine is "ci1makebal808.ds.susq.com"
09/30/15 08:34:52 setting the orig job name in starter
09/30/15 08:34:52 setting the orig job iwd in starter
09/30/15 08:34:52 Chirp config summary: IO false, Updates false, Delayed updates true.
09/30/15 08:34:52 Initialized IO Proxy.
09/30/15 08:34:52 Done setting resource limits
09/30/15 08:34:52 File transfer completed successfully.
09/30/15 08:34:53 Job 4565.0 set to execute immediately
09/30/15 08:34:53 Starting a VANILLA universe job with ID: 4565.0
09/30/15 08:34:53 IWD: /hostname/statbuildbot/PHXSLES11ValgrindNightly/build/SLES11SP2_GCC4.3.4_DEBUG_X86-64
09/30/15 08:34:53 Output file: /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/_condor_stdout
09/30/15 08:34:53 Error file: /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/_condor_stderr
09/30/15 08:34:53 Renice expr "0" evaluated to 0
09/30/15 08:34:53 About to exec /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/condor_exec.exe
09/30/15 08:34:53 Running job as user statbuildbot
09/30/15 08:34:53 Create_Process succeeded, pid=5601
09/30/15 09:37:17 Process exited, pid=5601, status=0

And condor_q shows the job is transferring data:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE    CMD
4565.0   statbuildbot    9/30 01:33   0+08:45:07 >  0   2197.3  global_yoda-arca_l

And then in the StartLog we see:

09/30/15 08:34:51 slot1_5: New machine resource of type -1 allocated
09/30/15 08:34:51 slot1: Changing state: Owner -> Unclaimed
09/30/15 08:34:51 slot1: State change: IS_OWNER is TRUE
09/30/15 08:34:51 slot1: Changing state: Unclaimed -> Owner
09/30/15 08:34:51 Setting up slot pairings
09/30/15 08:34:51 slot1_5: Request accepted.
09/30/15 08:34:51 slot1_5: Remote owner is statbuildbot@xxxxxxxxxxx
09/30/15 08:34:51 slot1_5: State change: claiming protocol successful
09/30/15 08:34:51 slot1_5: Changing state: Owner -> Claimed
09/30/15 08:34:52 slot1_5: match_info called
09/30/15 08:34:52 slot1_5: Got activate_claim request from shadow (10.11.129.10)
09/30/15 08:34:52 slot1_5: Remote job ID is 4565.0
09/30/15 08:34:52 slot1_5: Got universe "VANILLA" (5) from request classad
09/30/15 08:34:52 slot1_5: State change: claim-activation protocol successful
09/30/15 08:34:52 slot1_5: Changing activity: Idle -> Busy

Followed later by:

09/30/15 10:33:23 ERROR: Child pid 5593 appears hung! Killing it hard.
09/30/15 10:33:23 Starter pid 5593 died on signal 9 (signal 9 (Killed))
09/30/15 10:33:23 slot1_5: State change: starter exited
09/30/15 10:33:23 slot1_5: Changing activity: Busy -> Idle
09/30/15 10:33:23 slot1_5: State change: idle claim shutting down due to CLAIM_WORKLIFE
09/30/15 10:33:23 slot1_5: Changing state and activity: Claimed/Idle -> Preempting/Vacating
09/30/15 10:33:23 slot1_5: State change: No preempting claim, returning to owner
09/30/15 10:33:23 slot1_5: Changing state and activity: Preempting/Vacating -> Owner/Idle
09/30/15 10:33:23 slot1_5: Changing state: Owner -> Delete
09/30/15 10:33:23 slot1_5: Resource no longer needed, deleting
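A side note on reading the condor_q output above: as I understand it, the ">" in the ST column is the indicator that the job is in the transferring-output state. The job ClassAd should show the same thing; I have been checking it with something like the line below, though the exact attribute names are from memory, so treat it as a sketch rather than gospel:

    # Dump the full job ad and pick out the transfer-state attributes
    # (JobStatus, plus the Transferring*/TransferQueued booleans, if present).
    condor_q -l 4565.0 | grep -Ei 'JobStatus|Transfer'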
It appears to me that the job hangs while transferring output for about an hour after running to completion. After that hour, the condor_starter copying the data is judged to be hung and is killed hard. However, we do see the output file transferred to the schedd. Similar behavior is observed on all the jobs that don't "finish", and since all of the jobs are set up in the same way, the behavior only seems to appear in longer-running jobs.
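In case it helps diagnosis: the "appears hung! Killing it hard" line looks to me like DaemonCore's hang detection in the startd, which, as far as I can tell from the manual, hard-kills a child daemon that has not sent a keepalive within NOT_RESPONDING_TIMEOUT, with a documented default of 3600 seconds. That one-hour default would line up with the roughly one-hour gap between "Process exited" (09:37:17) and the kill (10:33:23). As a stopgap while we work out why the transfer blocks, I am considering raising the starter-specific override on the execute nodes; a minimal, untested sketch, assuming I have the parameter names right from the 8.2 manual:

    # condor_config.local on the execute node -- untested sketch.
    # NOT_RESPONDING_TIMEOUT is the DaemonCore-wide hang timeout
    # (documented default: 3600 seconds); the STARTER_ prefix should
    # override it for condor_starter children only.
    STARTER_NOT_RESPONDING_TIMEOUT = 14400

followed by a condor_reconfig (or a startd restart) on the node. Of course that only papers over the stall; I would still like to understand what the starter is blocked on for that hour.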
Thanks.

--
Will Deck