Hello all,

We are having some serious problems with our Condor setup and I am at a loss. I hope someone can help me.

We started seeing this problem this weekend: jobs are being evicted and restarted. I have one example below, but we have been seeing some other errors as well. They all seem to revolve around losing the connection with the schedd, though.

In the log of the job I see the following message:

007 (727.000.000) 08/29 05:29:59 Shadow exception!
    Assertion ERROR on (result)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job

I go to the ShadowLog and find the following messages:

08/29 05:29:58 (727.22) (27374): condor_write(): Socket closed when trying to write 13 bytes to startd slot12@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.8) (27376): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.9) (27377): condor_write(): Socket closed when trying to write 13 bytes to startd slot11@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.3) (27365): condor_write(): Socket closed when trying to write 13 bytes to startd slot5@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.1) (27359): condor_write(): Socket closed when trying to write 13 bytes to startd slot3@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.8) (27376): Buf::write(): condor_write() failed
08/29 05:29:58 (727.0) (27355): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (729.25) (27461): condor_write(): Socket closed when trying to write 13 bytes to startd slot4@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.8) (27376): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp
08/29 05:29:58 (729.7) (27465): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.3) (27365): Buf::write(): condor_write() failed
08/29 05:29:58 (729.27) (27449): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.3) (27365): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp
08/29 05:29:58 (727.0) (27355): Buf::write(): condor_write() failed
08/29 05:29:58 (729.3) (27456): condor_write(): Socket closed when trying to write 13 bytes to startd slot5@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.0) (27355): ERROR "Assertion ERROR on (result)" at line 238 in file NTreceivers.cpp
08/29 05:29:58 (727.19) (27369): condor_write(): Socket closed when trying to write 13 bytes to startd slot9@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (729.1) (27446): condor_write(): Socket closed when trying to write 13 bytes to startd slot3@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (729.23) (27448): condor_write(): Socket closed when trying to write 13 bytes to startd slot2@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:59 (729.1) (27446): Buf::write(): condor_write() failed
08/29 05:29:58 (729.18) (27472): condor_write(): Socket closed when trying to write 13 bytes to startd slot9@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (727.9) (27377): Buf::write(): condor_write() failed
08/29 05:29:58 (729.25) (27461): Buf::write(): condor_write() failed
08/29 05:29:58 (727.20) (27371): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7
08/29 05:29:58 (731.0) (28148): condor_write(): Socket closed when trying to write 13 bytes to startd slot8@xxxxxxxxxxxxxxxx, fd is 7

And in SchedLog:

08/29 05:29:59 (pid:2534) Shadow pid 27355 for job 727.0 exited with status 4
08/29 05:29:59 (pid:2534) ERROR: Shadow exited with job exception code!
08/29 05:29:59 (pid:2534) Checking consistency running and runnable jobs
08/29 05:29:59 (pid:2534) Tables are consistent
08/29 05:29:59 (pid:2534) Rebuilt prioritized runnable job list in 0.017s. (Expedited rebuild because no match was found)
08/29 05:29:59 (pid:2534) Starting add_shadow_birthdate(727.0)
08/29 05:29:59 (pid:2534) Started shadow for job 727.0 on slot2@xxxxxxxxxxxxxxxx <10.69.200.99:56059> for rni@xxxxxxxxxx, (shadow pid = 30888)

Here it seems that the shadow exited with an exception, which is bad.
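To get a feel for how many jobs and slots are hit at the same moment, the ShadowLog entries above could be tallied with a small script. This is only a rough sketch: the regular expression is inferred from the log lines quoted in this post, and the function name `tally_socket_closed` is made up for illustration (it is not part of Condor).

```python
import re
from collections import Counter

# Assumed line shape, inferred from the ShadowLog excerpt in this post:
#   "MM/DD HH:MM:SS (cluster.proc) (shadow pid): message"
LINE_RE = re.compile(
    r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \((\d+\.\d+)\) \((\d+)\): (.*)$"
)
# Slot name as it appears in the "Socket closed" messages.
SLOT_RE = re.compile(r"to startd (\S+?),")


def tally_socket_closed(lines):
    """Count 'Socket closed' write failures per job ID and per startd slot."""
    per_job = Counter()
    per_slot = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a ShadowLog entry
        _timestamp, job_id, _pid, msg = m.groups()
        if "Socket closed when trying to write" not in msg:
            continue
        per_job[job_id] += 1
        slot = SLOT_RE.search(msg)
        if slot:
            per_slot[slot.group(1)] += 1
    return per_job, per_slot


# A few lines copied from the ShadowLog excerpt as sample input.
sample = [
    "08/29 05:29:58 (727.22) (27374): condor_write(): Socket closed when trying to write 13 bytes to startd slot12@xxxxxxxxxxxxxxxx, fd is 7",
    "08/29 05:29:58 (727.8) (27376): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7",
    "08/29 05:29:58 (727.8) (27376): Buf::write(): condor_write() failed",
    "08/29 05:29:58 (729.7) (27465): condor_write(): Socket closed when trying to write 13 bytes to startd slot10@xxxxxxxxxxxxxxxx, fd is 7",
]

per_job, per_slot = tally_socket_closed(sample)
for job_id, count in sorted(per_job.items()):
    print(job_id, count)
```

Running it over the full ShadowLog (for example `tally_socket_closed(open(path))`, with `path` pointing at wherever the log lives on the submit machine) would show whether the failures are confined to one execute node or spread over several.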

And in StartLog:

08/29 05:29:03 slot2: State change: claim lease expired (condor_schedd gone?)
08/29 05:29:03 slot2: Changing state and activity: Claimed/Busy -> Preempting/Killing
08/29 05:29:33 slot2: starter (pid 6298) is not responding to the request to hardkill its job. The startd will now directly hard kill the starter and all its decendents.
08/29 05:29:33 Starter pid 6298 died on signal 9 (signal 9 (Killed))
08/29 05:29:33 slot2: State change: starter exited
08/29 05:29:33 slot2: State change: No preempting claim, returning to owner
08/29 05:29:33 slot2: Changing state and activity: Preempting/Killing -> Owner/Idle
08/29 05:29:33 slot2: State change: IS_OWNER is false
08/29 05:29:33 slot2: Changing state: Owner -> Unclaimed
08/29 05:29:33 State change: RunBenchmarks is TRUE
08/29 05:29:33 slot2: Changing activity: Idle -> Benchmarking
08/29 05:29:37 State change: benchmarks completed
08/29 05:29:37 slot2: Changing activity: Benchmarking -> Idle

StarterLog.slot2:

08/28 13:24:44 ******************************************************
08/28 13:24:44 ** condor_starter (CONDOR_STARTER) STARTING UP
08/28 13:24:44 ** /opt/condor/sbin/condor_starter
08/28 13:24:44 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
08/28 13:24:44 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
08/28 13:24:44 ** $CondorVersion: 7.4.4 Oct 13 2010 BuildID: 279383 $
08/28 13:24:44 ** $CondorPlatform: X86_64-LINUX_RHEL5 $
08/28 13:24:44 ** PID = 6298
08/28 13:24:44 ** Log last touched 8/28 12:40:26
08/28 13:24:44 ******************************************************
08/28 13:24:44 Using config source: /opt/condor/etc/condor_config
08/28 13:24:44 Using local config sources:
08/28 13:24:44 /home/condor/hosts/cmp03/condor_config.local
08/28 13:24:44 DaemonCore: Command Socket at <192.168.0.99:41591>
08/28 13:24:44 Done setting resource limits
08/28 13:24:44 Communicating with shadow <192.168.0.82:38708>
08/28 13:24:44 Submitting machine is "cmp04.hpcalc.net"
08/28 13:24:44 setting the orig job name in starter
08/28 13:24:44 setting the orig job iwd in starter
08/28 13:24:44 Job 727.0 set to execute immediately
08/28 13:24:44 Starting a VANILLA universe job with ID: 727.0
08/28 13:24:44 IWD: /data/proj/P04738_PetrojarlVarg/FLACS/Dispersion/Turret
08/28 13:24:44 Output file: /data/proj/Turret/flacs_011211.out
08/28 13:24:44 Error file: /data/proj/Turret/flacs_011211.err
08/28 13:24:44 About to exec /usr/local/bin/runflacs -j 011211
08/28 13:24:44 Create_Process succeeded, pid=6299

Regards
Peter