Dan Bradley wrote:
Chris Miles wrote:
I have started completely fresh: reinstalled and started with no log files whatsoever.
The job file (hello.sub) contains:

executable = helloworld
universe = vanilla
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
requirements = (Arch == "X86_64") && (OpSys == "LINUX")
output = output_$(Process).out
error = error_$(Process).out
log = log.out
Queue 5
<snip> The only logs you sent that are relevant are the shadow logs. The starter logs on the execute machine (not the submit machine) would also be useful.
ShadowLog
10/12 01:38:27 ******************************************************
10/12 01:38:27 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/12 01:38:27 ** /home/condor/release/sbin/condor_shadow
10/12 01:38:27 ** $CondorVersion: 6.7.10 Aug 3 2005 $
10/12 01:38:27 ** $CondorPlatform: I386-LINUX_RH9 $
10/12 01:38:27 ** PID = 12878
10/12 01:38:27 ******************************************************
10/12 01:38:27 Using config file: /home/condor/etc/condor_config
10/12 01:38:27 Using local config files:
/home/condor/hosts/thebeast/condor_config.local
10/12 01:38:27 DaemonCore: Command Socket at <192.168.1.1:45639>
10/12 01:38:27 SEC_DEFAULT_SESSION_DURATION is undefined, using default
value of 3600
10/12 01:38:27 Reading job ClassAd from STDIN
10/12 01:38:27 Initializing a VANILLA shadow for job 1.0
10/12 01:38:27 (1.0) (12878): ENABLE_USERLOG_LOCKING is undefined, using
default value of True
10/12 01:38:27 (1.0) (12878): UserLog = /home/condor/jobs/helloworld/log.out
10/12 01:38:27 (1.0) (12878): *** Reserved Swap = 0
10/12 01:38:27 (1.0) (12878): *** Free Swap = 787168
10/12 01:38:27 (1.0) (12878): in RemoteResource::initStartdInfo()
10/12 01:38:27 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:27 (1.0) (12878): Entering DCStartd::activateClaim()
10/12 01:38:27 (1.0) (12878): DCStartd::activateClaim: successfully sent
command, reply is: 1
10/12 01:38:27 (1.0) (12878): Request to run on <192.168.1.101:35193> was
ACCEPTED
10/12 01:38:27 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from PRE to STARTUP
10/12 01:38:27 (1.0) (12878): Getting monitoring info for pid 12878
10/12 01:38:27 (1.0) (12878): entering FileTransfer::Init
10/12 01:38:27 (1.0) (12878): entering FileTransfer::SimpleInit
10/12 01:38:27 (1.0) (12878): entering FileTransfer::HandleCommands
10/12 01:38:27 (1.0) (12878): FileTransfer::HandleCommands read
transkey=1#434c5b036fe0c01059a0454b
10/12 01:38:27 (1.0) (12878): entering FileTransfer::Upload
10/12 01:38:27 (1.0) (12878): entering FileTransfer::DoUpload
10/12 01:38:27 (1.0) (12878): DoUpload: send file
/home/condor/hosts/thebeast/spool/cluster1.ickpt.subproc0
10/12 01:38:27 (1.0) (12878): ReliSock::put_file_with_permissions(): going
to send permissions 100755
10/12 01:38:27 (1.0) (12878): put_file: going to send from filename
/home/condor/hosts/thebeast/spool/cluster1.ickpt.subproc0
10/12 01:38:27 (1.0) (12878): put_file: Found file size 10457
10/12 01:38:27 (1.0) (12878): put_file: senting 10457 bytes
10/12 01:38:27 (1.0) (12878): ReliSock: put_file: sent 10457 bytes
10/12 01:38:27 (1.0) (12878): DoUpload: exiting at 1605
10/12 01:38:28 (1.0) (12878): DaemonCore: in SendAliveToParent()
10/12 01:38:28 (1.0) (12878): DaemonCore: attempting to connect to
'<192.168.1.1:45580>'
10/12 01:38:28 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:28 (1.0) (12878): SEC_TCP_SESSION_TIMEOUT is undefined, using
default value of 20
10/12 01:38:28 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from STARTUP to EXECUTING
10/12 01:38:28 (1.0) (12878): SHADOW_QUEUE_UPDATE_INTERVAL is undefined,
using default value of 900
10/12 01:38:28 (1.0) (12878): QmgrJobUpdater: started timer to update queue
(tid=7)
10/12 01:38:28 (1.0) (12878): Inside RemoteResource::updateFromStarter()
10/12 01:38:28 (1.0) (12878): Inside RemoteResource::resourceExit()
10/12 01:38:28 (1.0) (12878): setting exit reason on vm1@xxxxxxxxxxxxxxxxx
to 100
10/12 01:38:28 (1.0) (12878): Resource vm1@xxxxxxxxxxxxxxxxx changing state
from EXECUTING to FINISHED
10/12 01:38:28 (1.0) (12878): Entering DCStartd::deactivateClaim(forceful)
10/12 01:38:28 (1.0) (12878): SEC_DEBUG_PRINT_KEYS is undefined, using
default value of False
10/12 01:38:28 (1.0) (12878): DCStartd::deactivateClaim: successfully sent
command
10/12 01:38:28 (1.0) (12878): Killed starter (fast) at <192.168.1.101:35193>
10/12 01:38:28 (1.0) (12878): Job 1.0 terminated: exited with status 0
10/12 01:38:28 (1.0) (12878): Forking Mailer process...
10/12 01:38:28 (1.0) (12878): SHADOW_TIMEOUT_MULTIPLIER is undefined, using
default value of 0
10/12 01:38:28 (1.0) (12878): AUTHENTICATE_FS: used file /tmp/qmgr_Kl41Hy,
status: 1
10/12 01:38:28 (1.0) (12878): Updating Job Queue:
SetAttribute(LastJobLeaseRenewal = 1129077508)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(ExitBySignal
= FALSE)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(ExitCode = 0)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(BytesSent =
0.000000)
10/12 01:38:28 (1.0) (12878): Updating Job Queue: SetAttribute(BytesRecvd =
10457.000000)
10/12 01:38:28 (1.0) (12878): **** condor_shadow (condor_SHADOW) EXITING
WITH STATUS 100
10/12 01:38:29 PASSWD_CACHE_REFRESH is undefined, using default value of 300
I see no file downloads happening. There are log messages about put_file, but none about get_file. Therefore, it seems to me that either your job did not produce output, or something is going wrong on the execute machine. Please send the StarterLog from a machine that is executing one of these jobs.
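One quick sanity check is to run the executable by hand and confirm it actually writes something to stdout, the way Condor would redirect it. This is only a sketch: the stub script below is a hypothetical stand-in for your real helloworld binary, so substitute the actual executable named in hello.sub.

```shell
# Hypothetical stand-in for the real helloworld binary; replace with
# the actual executable from hello.sub when testing for real.
printf '#!/bin/sh\necho hello world\n' > helloworld
chmod +x helloworld

# Run it the way Condor would: stdout/stderr redirected to files.
./helloworld > output_test.out 2> error_test.out
echo "exit status: $?"

# A non-empty output_test.out means the job does produce output,
# which would point the problem at the execute machine instead.
ls -l output_test.out error_test.out
```

If output_test.out is empty even when run by hand, there is simply nothing for the starter to transfer back, which would explain the missing get_file messages.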
--Dan