Re: [Condor-users] Problem using schedd web service
- Date: Fri, 28 Oct 2005 09:37:32 +0100
- From: Peter Ledbrook <peter.ledbrook@xxxxxxxxx>
- Subject: Re: [Condor-users] Problem using schedd web service
Matthew Farrellee wrote:
> All the files that are going to be input or output (including Out/
> Err) should be declared.
> If after you declare Out and Err you still have trouble you should
> try to add StageInStart and StageInFinish, both set to some non-zero
> integer, to the job ad. CreateJobTemplate will add those attributes
> for you in future versions of Condor.
> matt
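As a rough illustration of the suggestion above, the following sketch adds StageInStart and StageInFinish as non-zero integer attributes to a job ad held as a simple list. The `Attr` class, the "INTEGER" type label, and the helper names are hypothetical stand-ins, not the classes generated from Condor's WSDL, so the real stub types will likely look different.

```java
import java.util.ArrayList;
import java.util.List;

class JobAdSketch {
    // Hypothetical stand-in for a job ad attribute (name, type, value).
    static final class Attr {
        final String name;
        final String type;   // assumed label, not Condor's actual enum
        final String value;

        Attr(String name, String type, String value) {
            this.name = name;
            this.type = type;
            this.value = value;
        }
    }

    // Adds an integer attribute, replacing any existing one of the same name.
    static void setInt(List<Attr> jobAd, String name, int value) {
        jobAd.removeIf(a -> a.name.equalsIgnoreCase(name));
        jobAd.add(new Attr(name, "INTEGER", Integer.toString(value)));
    }

    // Returns a copy of the job ad with both staging attributes set.
    static List<Attr> withStagingAttrs(List<Attr> jobAd, int start, int finish) {
        if (start == 0 || finish == 0) {
            throw new IllegalArgumentException("staging values must be non-zero");
        }
        List<Attr> copy = new ArrayList<>(jobAd);
        setInt(copy, "StageInStart", start);
        setInt(copy, "StageInFinish", finish);
        return copy;
    }

    public static void main(String[] args) {
        List<Attr> ad = withStagingAttrs(new ArrayList<>(), 10, 20);
        for (Attr a : ad) {
            System.out.println(a.name + " = " + a.value);
        }
    }
}
```

The resulting list would then be converted to whatever attribute type the generated stub expects before the job is submitted.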
I'm sorry to bother you again, but I'm still having a problem, although at
least it's a different problem now! I am declaring the stdout and stderr
files like this:
// Declare the stdout and stderr files.
Status retval = stub.declareFile(
    txn,
    clusterId,
    jobId,
    OUTPUT_FILENAME,
    Integer.MAX_VALUE, // Also tried with -1
    HashType.NOHASH,
    null);
retval = stub.declareFile(
    txn,
    clusterId,
    jobId,
    ERROR_FILENAME,
    Integer.MAX_VALUE,
    HashType.NOHASH,
    null);
and I have also included StageInStart ("10") and StageInFinish ("20") in
the JobAd. However, I am getting the following in the ShadowLog:
10/28 09:08:00 (140.0) (8981):Requesting Primary Starter
10/28 09:08:00 (140.0) (8981):Shadow: Request to run a job was ACCEPTED
10/28 09:08:00 (140.0) (8981):Shadow: RSC_SOCK connected, fd = 17
10/28 09:08:00 (140.0) (8981):Shadow: CLIENT_LOG connected, fd = 18
10/28 09:08:00 (140.0) (8981):My_Filesystem_Domain = "ixico.net"
10/28 09:08:00 (140.0) (8981):My_UID_Domain = "ixico.net"
10/28 09:08:00 (140.0) (8981): Entering pseudo_get_file_stream
10/28 09:08:00 (140.0) (8981): file = "/opt/condor-6.6.10/examples/env.remote"
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8983 exited with status 0
10/28 09:08:00 (140.0) (8981):Read: condor_restart:
10/28 09:08:00 (140.0) (8981):Read: Checkpoint file name is "/home/condor/spool/cluster140.proc0.subproc0"
10/28 09:08:00 (140.0) (8981): Entering pseudo_get_file_stream
10/28 09:08:00 (140.0) (8981): file = "/home/condor/spool/cluster140.proc0.subproc0"
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981): Weird 0xc0a8010b
10/28 09:08:00 (140.0) (8981):Read: Opened "/home/condor/spool/cluster140.proc0.subproc0" via file stream
10/28 09:08:00 (140.0) (8987):Failed to transfer 96 bytes (only sent -1)
10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8987 exited with status 1
10/28 09:08:00 (140.0) (8981):Shadow: Job 140.0 exited, termsig = 9, coredump = 0, retcode = 0
10/28 09:08:00 (140.0) (8981):Shadow: Job was kicked off without a checkpoint
10/28 09:08:00 (140.0) (8981):Shadow: DoCleanup: unlinking TmpCkpt '/home/condor/spool/cluster140.proc0.subproc0.tmp'
10/28 09:08:00 (140.0) (8981):Trying to unlink /home/condor/spool/cluster140.proc0.subproc0.tmp
10/28 09:08:00 (140.0) (8981):user_time = 1 ticks
10/28 09:08:00 (140.0) (8981):sys_time = 2 ticks
10/28 09:08:00 (140.0) (8981):********** Shadow Exiting(107) **********
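To pick the failure indicators out of a log like the one above, a small filter can help. The sketch below flags lines that report a failed transfer or a child process exiting with a non-zero status. The class name, method name, and patterns are illustrative assumptions based only on the message wording in this excerpt, not on any documented ShadowLog format, and it assumes wrapped log lines have been rejoined first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShadowLogScan {
    // Matches "pid N exited with status M" as seen in the excerpt.
    private static final Pattern EXIT =
        Pattern.compile("pid (\\d+) exited with\\s+status (\\d+)");

    // Returns the lines that look like failures: a failed transfer,
    // or a reaped child whose exit status is not 0.
    static List<String> suspicious(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = EXIT.matcher(line);
            boolean badExit = m.find() && !m.group(2).equals("0");
            if (badExit || line.contains("Failed to transfer")) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8983 exited with status 0",
            "10/28 09:08:00 (140.0) (8987):Failed to transfer 96 bytes (only sent -1)",
            "10/28 09:08:00 (140.0) (8981):Reaped child status - pid 8987 exited with status 1");
        for (String hit : suspicious(log)) {
            System.out.println(hit);
        }
    }
}
```

On this excerpt the filter would single out the failed 96-byte transfer and the exit of pid 8987 with status 1, which both occur just before the job is killed.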
It looks like Condor is trying to write to a directory
(/home/condor/spool/cluster140.proc0.subproc0) as if it were a file, but
I have no idea why. Any suggestions?
Thanks,
Peter