I think these are the examples I should have given, from the ShadowLog on
the submit machine:
12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 ** condor_shadow (CONDOR_SHADOW) STARTING UP
12/16/13 18:11:08 ** C:\Condor\bin\condor_shadow.exe
12/16/13 18:11:08 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
class=DAEMON(1)
12/16/13 18:11:08 ** Configuration: subsystem:SHADOW local:<NONE>
class:DAEMON
12/16/13 18:11:08 ** $CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
12/16/13 18:11:08 ** $CondorPlatform: x86_64_Windows7 $
12/16/13 18:11:08 ** PID = 5980
12/16/13 18:11:08 ** Log last touched 12/16 18:11:02
12/16/13 18:11:08 ******************************************************
12/16/13 18:11:08 Using config source: C:\condor\condor_config
12/16/13 18:11:08 Using local config sources:
12/16/13 18:11:08 C:\Condor/condor_config.local
12/16/13 18:11:08 DaemonCore: command socket at <x.y.z.189:9760>
12/16/13 18:11:08 DaemonCore: private command socket at <x.y.z.189:9760>
12/16/13 18:11:08 Initializing a VANILLA shadow for job 118.5
12/16/13 18:11:08 (118.5) (5980): Request to run on
slot2@xxxxxxxxxxxxxxxxxxx <x.y.z.158:9653> was ACCEPTED
12/16/13 18:11:08 (118.5) (5980): my_popen: CreateProcess failed
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: Failed to execute
C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:08 (118.5) (5980): FILETRANSFER: failed to add plugin
"C:\Condor/bin/curl_plugin" because: FILETRANSFER:1:Failed to execute
C:\Condor/bin/curl_plugin, ignoring
12/16/13 18:11:45 (118.8) (6256): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.10) (6252): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.13) (6020): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.14) (7624): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.12) (6456): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.9) (1192): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.7) (5136): ReliSock: put_file: TransmitFile()
failed, errno=10054
12/16/13 18:11:45 (118.11) (3172): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9716>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_5160\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.8) (6256): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.138:9635>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.138 failed to receive file
C:\Condor\execute\dir_10020\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.10) (6252): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9666>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_784\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.13) (6020): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.189:9768>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.189 failed to receive file
C:\Condor\execute\dir_7364\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.14) (7624): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.189:9635>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.189 failed to receive file
C:\Condor\execute\dir_7344\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.12) (6456): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.201:9782>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.201 failed to receive file
C:\Condor\execute\dir_5604\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.9) (1192): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.138:9650>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.138 failed to receive file
C:\Condor\execute\dir_13532\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.7) (5136): DoUpload: SHADOW at x.y.z.189 failed to
send file(s) to <x.y.z.158:9800>: error sending
D:\delta\models\201X-Calibration\PEST\Calib\Condor\PEST_Qual_Out.inp;
STARTER at x.y.z.158 failed to receive file
C:\Condor\execute\dir_4964\PEST_Qual_Out.inp
12/16/13 18:11:45 (118.11) (3172): ERROR "Error from
slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.10) (6252): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.8) (6256): ERROR "Error from
slot2@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.14) (7624): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.13) (6020): ERROR "Error from
slot1@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.12) (6456): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.9) (1192): ERROR "Error from
slot3@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:45 (118.7) (5136): ERROR "Error from
slot4@xxxxxxxxxxxxxxxxxxx: Failed to transfer files" at line 558 in file
c:\condor\execute\dir_18384\userdir\src\condor_shadow.v6.1\pseudo_ops.cpp
12/16/13 18:11:50 ******************************************************
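One way to check whether the upload failures cluster on a particular execute machine is to tally the DoUpload entries by destination address. The following is only a minimal Python sketch and not part of HTCondor. The function names are made up for illustration, and it assumes the entry layout seen in the excerpt above (a date stamp at the start of each entry, with continuation lines where the mail client wrapped them).

```python
import re
from collections import Counter

# An entry starts with a stamp such as "12/16/13 18:11:45 ".
STAMP = re.compile(r"^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")

# Captures the job id and the destination address of a failed upload.
UPLOAD_FAIL = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): DoUpload: SHADOW at \S+ failed to\s+"
    r"send file\(s\) to <([^:>]+):\d+>"
)


def join_wrapped(text):
    """Rejoin wrapped lines so that each log entry is a single string."""
    entries = []
    for line in text.splitlines():
        if STAMP.match(line) or not entries:
            entries.append(line.strip())
        else:
            # No date stamp: treat it as a continuation of the previous entry.
            entries[-1] += " " + line.strip()
    return entries


def failed_uploads_by_host(text):
    """Count DoUpload failures per destination (starter) address."""
    counts = Counter()
    for entry in join_wrapped(text):
        match = UPLOAD_FAIL.search(entry)
        if match:
            counts[match.group(2)] += 1
    return counts
```

Applied to the excerpt above, this tally would give three failures for x.y.z.201, two each for x.y.z.138 and x.y.z.189, and one for x.y.z.158, so in this sample the failures are not confined to a single execute machine.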
On Mon, Dec 16, 2013 at 6:30 PM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:
All-Windows 7 x64 pool:
$CondorVersion: 8.0.4 Oct 19 2013 BuildID: 189770 $
$CondorPlatform: x86_64_Windows7 $
I've been getting lots of shadow exceptions. Here's a typical one (from the
job log file):
000 (117.019.000) 12/16 18:02:12 Job submitted from host: <x.y.z.189:9728>
...
007 (117.019.000) 12/16 18:08:08 Shadow exception!
Error from slot4@xxxxxxxxxxxxxxxxx: Failed to transfer files
0 - Run Bytes Sent By Job
13252 - Run Bytes Received By Job
...
The ShadowLog on the submit machine (.189) (bdomo-002):
12/16/13 18:18:22 (117.1) (6616): Job 117.1 is being evicted from
slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:18:22 (117.1) (6616): **** condor_shadow (condor_SHADOW) pid
6616 EXITING WITH STATUS 102
12/16/13 18:19:38 (117.5) (8068): Job 117.5 is being evicted from
slot2@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:38 (117.5) (8068): **** condor_shadow (condor_SHADOW) pid
8068 EXITING WITH STATUS 102
12/16/13 18:19:40 (117.11) (7936): Job 117.11 is being evicted from
slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:19:40 (117.11) (7936): **** condor_shadow (condor_SHADOW) pid
7936 EXITING WITH STATUS 102
12/16/13 18:23:01 (117.2) (6880): Job 117.2 is being evicted from
slot3@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:01 (117.2) (6880): **** condor_shadow (condor_SHADOW) pid
6880 EXITING WITH STATUS 102
12/16/13 18:23:12 (117.3) (6196): Job 117.3 is being evicted from
slot4@xxxxxxxxxxxxxxxxxxx
12/16/13 18:23:12 (117.3) (6196): **** condor_shadow (condor_SHADOW) pid
6196 EXITING WITH STATUS 102
Our LAN uses a typical switch with a nominal speed of 1 Gb/s. Each submitted
job transfers a couple of dozen files, at most 200 MB in total. I submit 20
jobs to the queue at one time.
Should this really cause a problem? Is there a way to find out whether a
failure to transfer files REALLY is the problem? I'm thinking not. Even
though Condor starts new execute jobs, the master program (run
interactively from a command prompt window) usually doesn't see them. So I
submit another 20, kill the old set, and everything is good: no shadow
exceptions, and the master program finds its condorized slaves. Maybe the
shadows on my submit machine are giving up too quickly because of some delay?
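As a rough back-of-the-envelope check on the numbers above, the sketch below estimates the minimum time needed to push all the input files out of the submit machine. It is only an idealized lower bound: it assumes every transfer shares a single 1 Gb/s link, that MB means 10^6 bytes, and that there is no protocol or disk overhead. The function name is made up for illustration.

```python
def min_transfer_seconds(jobs, mb_per_job, link_gbps):
    """Ideal lower bound on the time to send every job's input files."""
    total_bits = jobs * mb_per_job * 8 * 10**6  # MB -> bits, decimal units
    return total_bits / (link_gbps * 10**9)     # Gb/s -> bits per second


# 20 jobs at up to 200 MB each over a nominal 1 Gb/s link.
print(min_transfer_seconds(20, 200, 1.0))  # 32.0
```

So even in the best case, sending 20 full sets of input files at once would keep the link busy for about half a minute.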
Ralph Finch
Calif. Dept. of Water Resources
Sacramento, Calif. USA