[Condor-users] Checkpointing Errors
- Date: Tue, 8 May 2007 12:15:36 +0100 (BST)
- From: Simon David Hammond <sdh@xxxxxxxxxxxxxxxxx>
- Subject: [Condor-users] Checkpointing Errors
Hi All,
I am having major problems with checkpointing. The jobs run OK, but when
interrupted I get the following messages in ShadowLog:
5/8 11:56:09 (1.1) (2846):Read: About to write checkpoint
5/8 11:56:09 (1.1) (2846):Read: Image::Write(): fd -1 file_name /var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0
5/8 11:56:09 (1.1) (2846):Read: Checkpoint name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0"
5/8 11:56:09 (1.1) (2846):Read: Tmp name is "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0.tmp"
5/8 11:56:09 (1.1) (2846): Entering pseudo_put_file_stream
5/8 11:56:09 (1.1) (2846): file = "/var/tmp/dcsoff-15-condor/spool/cluster1.proc1.subproc0.tmp"
5/8 11:56:09 (1.1) (2846): len = 66511871
5/8 11:56:09 (1.1) (2846): owner = condor
5/8 11:56:09 (1.1) (2846): Weird 0xf77cd89
5/8 11:56:09 (1.1) (2846):Returned addr
5/8 11:56:09 (1.1) (2846): 137.205.119.15
5/8 11:56:09 (1.1) (2846):Returned port 53211
5/8 11:56:09 (1.1) (2846):Read: connect() failed - errno = 111
5/8 11:56:09 (1.1) (2846):Read: open_tcp_stream() failed
5/8 11:56:09 (1.1) (2846):Read: ERROR:open_ckpt_file failed, aborting ckpt
5/8 11:56:09 (1.1) (2846):Read: Ckpt exit
5/8 11:56:09 (1.1) (2846):Read: Write failed with [-1]
5/8 11:56:09 (1.1) (2846):Shadow: Job 1.1 exited, termsig = 9, coredump = 0, retcode = 0
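
For readers following the log, here is a minimal Python sketch that pulls the failing endpoint and error number out of the excerpt above. The function name `summarize` and the parsing rules are illustrative assumptions based only on the lines quoted here, not on any documented ShadowLog format. On most Linux systems errno 111 maps to ECONNREFUSED (connection refused), but the mapping is platform dependent, so the script looks it up locally instead of hard-coding it.

```python
import errno
import re

# Lines copied from the ShadowLog excerpt above.
SHADOW_LOG = """\
5/8 11:56:09 (1.1) (2846):Returned addr
5/8 11:56:09 (1.1) (2846): 137.205.119.15
5/8 11:56:09 (1.1) (2846):Returned port 53211
5/8 11:56:09 (1.1) (2846):Read: connect() failed - errno = 111
"""

def summarize(log_text):
    """Return (addr, port, errno_code) found in the excerpt; missing values stay None."""
    addr = port = code = None
    expect_addr = False
    for line in log_text.splitlines():
        # Drop the "date time (job) (pid):" prefix seen on every line of the excerpt.
        msg = re.sub(r"^\S+ \S+ \(\S+\) \(\d+\):", "", line).strip()
        if expect_addr:
            # In the excerpt, the address is logged on the line after "Returned addr".
            addr = msg
            expect_addr = False
        elif msg == "Returned addr":
            expect_addr = True
        elif msg.startswith("Returned port"):
            port = int(msg.split()[-1])
        else:
            m = re.search(r"errno = (\d+)", msg)
            if m:
                code = int(m.group(1))
    return addr, port, code

addr, port, code = summarize(SHADOW_LOG)
# errno.errorcode reflects the platform the script runs on.
print(addr, port, code, errno.errorcode.get(code, "unknown"))
```

If the name printed for 111 is ECONNREFUSED, nothing was accepting connections on 137.205.119.15:53211 when the checkpoint transfer was attempted.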
Our LOWPORT is 9000; HIGHPORT is 9500 on the servers and 9060 on the clients.
I'm confused as to why the checkpointing system is picking port 53211, and I
can't find a configuration option to change it. There aren't any
checkpoint files on the disk, and the TransferLog shows a negative number
of bytes received, so I assume the transfer counts as failed.
The image size of the job does update in the condor_q listing, but the
jobs seem to run forever and never finish, which makes me think the
checkpointing isn't happening and they are being restarted.
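
The port mismatch described above can be stated as a one-line check. This is only a sketch using the values quoted in this post; the helper name `in_port_range` and the variable names are made up for illustration, and the check says nothing about how Condor actually chooses the port for checkpoint transfers.

```python
def in_port_range(port, low, high):
    """True if port lies within the inclusive range [low, high]."""
    return low <= port <= high

# Values quoted in the post (server-side range and the port seen in ShadowLog).
LOWPORT, HIGHPORT = 9000, 9500
ckpt_port = 53211

print(in_port_range(ckpt_port, LOWPORT, HIGHPORT))  # prints False
```

A result of False confirms that the port returned for the checkpoint stream lies outside the configured range, so a firewall that only opens 9000-9500 would block it.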
I'd be grateful for any help!
Thanks,
Si Hammond
Univ. of Warwick