
[Condor-users] file transfer issues




Hello,

Condor sometimes fails to copy the job output to the correct location.
The output file is created when the job starts and is still 0 bytes when the job completes,
while the _condor_stdout_ file containing the job's actual output is left behind. Here's a snippet from the ShadowLog:

5/31 18:17:50 (493.0) (2412): Inside RemoteResource::updateFromStarter()
5/31 18:17:50 (493.0) (2412): Inside RemoteResource::resourceExit()
5/31 18:17:50 (493.0) (2412): setting exit reason on HED004 to 100
5/31 18:17:50 (493.0) (2412): Resource HED004 changing state from EXECUTING to FINISHED
5/31 18:17:50 (493.0) (2412): Entering DCStartd::deactivateClaim(forceful)
5/31 18:17:50 (493.0) (2412): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/31 18:17:50 (493.0) (2412): DCStartd::deactivateClaim: successfully sent command
5/31 18:17:50 (493.0) (2412): Killed starter (fast) at <1.2.3.4:3915>
5/31 18:17:50 (493.0) (2412): Job 493.0 terminated: exited with status 0
5/31 18:18:08 (493.0) (2412): moveOutputFile: failed to read from '_condor_stdout_493.0': Invalid argument
5/31 18:18:09 (493.0) (2412): BaseShadow::emailUser() called.
5/31 18:18:09 (493.0) (2412): Trying to email, but SMTP_SERVER not specified in config file
5/31 18:18:09 (493.0) (2412): Entering BaseShadow::updateJobInQueue
5/31 18:18:09 (493.0) (2412): SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 18:18:09 (493.0) (2412): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/31 18:18:09 (493.0) (2412): sspi_client_auth() entered
5/31 18:18:09 (493.0) (2412): sspi_client_auth() looping
5/31 18:18:09 (493.0) (2412): sspi_client_auth() exiting
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(ExitBySignal, FALSE)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(ExitCode, 0)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(BytesSent, 1289.000000)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(BytesRecvd, 21962962.000000)
5/31 18:18:09 (493.0) (2412): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
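
To see how many jobs are affected, one could scan the ShadowLog for the moveOutputFile failure. The following is only a rough sketch: it assumes every line follows the "date time (cluster.proc) (pid): message" layout seen in the snippet above, and the function name find_output_failures is made up for illustration.

```python
import re

# Assumed line layout, taken from the ShadowLog snippet above:
#   M/D HH:MM:SS (cluster.proc) (pid): message
LINE_RE = re.compile(
    r"^(?P<stamp>\d+/\d+ \d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) "
    r"\((?P<pid>\d+)\): "
    r"(?P<msg>.*)$"
)


def find_output_failures(lines):
    """Return (timestamp, job_id, message) for each moveOutputFile failure."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        # Lines that do not match the assumed layout are skipped silently.
        if m and m.group("msg").startswith("moveOutputFile: failed"):
            hits.append((m.group("stamp"), m.group("job"), m.group("msg")))
    return hits


# Two lines copied from the snippet above, used as sample input.
sample = [
    "5/31 18:17:50 (493.0) (2412): Job 493.0 terminated: exited with status 0",
    "5/31 18:18:08 (493.0) (2412): moveOutputFile: failed to read from "
    "'_condor_stdout_493.0': Invalid argument",
]

for stamp, job, msg in find_output_failures(sample):
    print(job, stamp, msg)
```

Against a real log, the list passed in would be the open ShadowLog file object instead of the two sample lines.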



While digging through the logs I also noticed problems sending the input files for a job.
It looks like the file transfer simply times out, correct? The good thing is that the job gets retried.
Should I blame this on excessive load (I was submitting several hundred jobs at the time, each with ~30 MB of input files)
and forget about it?

5/31 18:17:13 (563.0) (2508): ReliSock: put_file: sent 5835 bytes
5/31 18:17:13 (563.0) (2508): DoUpload: send file \models\file1.tbl
5/31 18:17:13 (563.0) (2508): ReliSock: put_file: sent 8006 bytes
5/31 18:17:13 (563.0) (2508): DoUpload: send file \models\file2.tbl
5/31 18:17:14 (563.0) (2508): ReliSock: put_file: sent 264 bytes
5/31 18:17:14 (563.0) (2508): DoUpload: send file \models\file3.csv
5/31 18:17:36 (563.0) (2508): ReliSock: put_file: TransmitFile() failed, errno=10054
5/31 18:17:49 (563.0) (2508): ERROR "DoUpload: Failed to send file \models\file3.csv, exiting at 1398
" at line 1397 in file ..\src\condor_c++_util\file_transfer.C
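
For reference, errno=10054 in the TransmitFile() line is the Winsock code WSAECONNRESET (connection reset by peer); a plain timeout would normally show up as 10060 (WSAETIMEDOUT). The sketch below decodes such codes from a log line. The table lists only a few standard Winsock values, and the helper name describe_errno is hypothetical, not part of Condor.

```python
import re

# A few standard Winsock error codes; this table is deliberately incomplete.
WINSOCK_ERRORS = {
    10053: "WSAECONNABORTED (connection aborted by local software)",
    10054: "WSAECONNRESET (connection reset by peer)",
    10060: "WSAETIMEDOUT (connection timed out)",
}

ERRNO_RE = re.compile(r"errno=(\d+)")


def describe_errno(log_line):
    """Return a description of the errno in a log line, or None if absent."""
    m = ERRNO_RE.search(log_line)
    if m is None:
        return None
    code = int(m.group(1))
    return WINSOCK_ERRORS.get(code, f"unknown code {code}")


# Line copied from the log excerpt above.
line = ("5/31 18:17:36 (563.0) (2508): ReliSock: put_file: "
        "TransmitFile() failed, errno=10054")
print(describe_errno(line))
```

A reset rather than a timeout suggests the remote side closed the connection, which may still be consistent with an overloaded machine, but that is a guess and not something these log lines prove.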

$CondorVersion: 6.6.8 Jan 31 2005 $
$CondorPlatform: INTEL-WINNT40 $

Thanks for any hints,
Pawel
