Condor sometimes seems to have problems copying the job output to the correct location.
The output file is created when the job starts and is still at 0 bytes after the job completes.
The _condor_stdout_ file, which holds the job's actual output, is left behind. Here's a snippet from the ShadowLog:
5/31 18:17:50 (493.0) (2412): Inside RemoteResource::updateFromStarter()
5/31 18:17:50 (493.0) (2412): Inside RemoteResource::resourceExit()
5/31 18:17:50 (493.0) (2412): setting exit reason on HED004 to 100
5/31 18:17:50 (493.0) (2412): Resource HED004 changing state from EXECUTING to FINISHED
5/31 18:17:50 (493.0) (2412): Entering DCStartd::deactivateClaim(forceful)
5/31 18:17:50 (493.0) (2412): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/31 18:17:50 (493.0) (2412): DCStartd::deactivateClaim: successfully sent command
5/31 18:17:50 (493.0) (2412): Killed starter (fast) at <1.2.3.4:3915>
5/31 18:17:50 (493.0) (2412): Job 493.0 terminated: exited with status 0
5/31 18:18:08 (493.0) (2412): moveOutputFile: failed to read from '_condor_stdout_493.0': Invalid argument
5/31 18:18:09 (493.0) (2412): BaseShadow::emailUser() called.
5/31 18:18:09 (493.0) (2412): Trying to email, but SMTP_SERVER not specified in config file
5/31 18:18:09 (493.0) (2412): Entering BaseShadow::updateJobInQueue
5/31 18:18:09 (493.0) (2412): SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
5/31 18:18:09 (493.0) (2412): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
5/31 18:18:09 (493.0) (2412): sspi_client_auth() entered
5/31 18:18:09 (493.0) (2412): sspi_client_auth() looping
5/31 18:18:09 (493.0) (2412): sspi_client_auth() exiting
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(ExitBySignal, FALSE)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(ExitCode, 0)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(BytesSent, 1289.000000)
5/31 18:18:09 (493.0) (2412): Updating Job Queue: SetAttribute(BytesRecvd, 21962962.000000)
5/31 18:18:09 (493.0) (2412): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100
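To see how many jobs hit the same failure, the ShadowLog can be scanned for `moveOutputFile` messages. The following is a minimal sketch, assuming every line follows the `date time (cluster.proc) (pid): message` layout of the excerpt above. The function name `find_move_failures` and the sample list are hypothetical, and only the single message form seen in this log has been checked against it.

```python
import re

# Assumed line layout, taken from the excerpt above:
#   5/31 18:18:08 (493.0) (2412): moveOutputFile: failed to read from '...': Invalid argument
LINE_RE = re.compile(
    r"^(?P<date>\d+/\d+) (?P<time>\d+:\d+:\d+) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\): "
    r"moveOutputFile: (?P<msg>.*)$"
)


def find_move_failures(lines):
    """Return one dict per moveOutputFile message found in the given lines."""
    failures = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


# Two lines copied from the excerpt above, used as sample input.
sample = [
    "5/31 18:17:50 (493.0) (2412): Job 493.0 terminated: exited with status 0",
    "5/31 18:18:08 (493.0) (2412): moveOutputFile: failed to read from "
    "'_condor_stdout_493.0': Invalid argument",
]

for failure in find_move_failures(sample):
    print(failure["date"], failure["time"], failure["job"], failure["msg"])
```

To run it against a real log, pass an open file object instead of `sample`, for example `find_move_failures(open(path))`, where `path` points at the ShadowLog on the submit machine.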
While digging through the logs I also noticed problems sending the input files for the job.
It looks like the file transfer simply times out, correct? The good thing is that the job gets tried again.
Should I blame this on excessive load (I was submitting several hundred jobs at the time, each with ~30 MB of input files)
and forget about it?
$CondorVersion: 6.6.8 Jan 31 2005 $
$CondorPlatform: INTEL-WINNT40 $
Thanks for any hints,
Pawel