Re: [Condor-users] Checkpointing Errors
- Date: Tue, 8 May 2007 14:49:48 +0100 (BST)
- From: Simon David Hammond <sdh@xxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Checkpointing Errors
Hi All,
> Is Condor configured to send the checkpoint back to the condor_shadow
> process, or have you configured a checkpoint server?
We have configured a checkpoint server. It runs on a machine that we
identify as a server, so it has a HIGHPORT of 9500 and a LOWPORT of 9000.
In the TransferLog I get:
5/8 13:54:54 R F -1075417848 bytes 120 sec 0.0.0.0
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:54:58 R F -1075417848 bytes 120 sec 0.0.0.0
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:54:58 R F -1075417848 bytes 120 sec 0.0.0.0
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 13:58:13 R F 0 bytes 120 sec 0.0.0.0
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
5/8 14:19:18 R F -1075417848 bytes 120 sec 0.0.0.0
condor@xxxxxxxxxxxxxxxxxxxxxxxxxxx@137.205.119.15
The number of bytes is obviously a concern, since it is negative (or, in
one case, zero).
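One way to look at the negative count is to reinterpret it as an unsigned
32-bit value. The sketch below does only that; the helper name
as_unsigned32 is made up for illustration, and the assumption that the
field is a signed 32-bit integer is a guess, not something the log states.

```python
def as_unsigned32(value: int) -> int:
    """Reinterpret a signed 32-bit integer as its unsigned bit pattern."""
    return value & 0xFFFFFFFF


# Byte count copied from the TransferLog lines above.
logged = -1075417848

raw = as_unsigned32(logged)
print(raw, hex(raw))  # 3219549448 0xbfe66d08
```

The result, 0xbfe66d08, does not look like a plausible transfer size.
Values just below 0xc0000000 are typical of stack addresses on 32-bit
Linux, so this may be an uninitialised or error value rather than a real
byte count, though nothing in the logs confirms that.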
5/8 11:17:14 (1.23) (4640):Shadow: Request to run a job was ACCEPTED
5/8 11:17:14 (1.23) (4640):Shadow: RSC_SOCK connected, fd = 17
5/8 11:17:14 (1.23) (4640):Shadow: CLIENT_LOG connected, fd = 18
5/8 11:17:14 (1.23) (4640):My_Filesystem_Domain = "dcs.warwick.ac.uk"
5/8 11:17:14 (1.23) (4640):My_UID_Domain = "dcs.warwick.ac.uk"
5/8 11:17:14 (1.23) (4640): Entering pseudo_get_file_stream
5/8 11:17:14 (1.23) (4640): file = "/dcs/condor/condor/bin/octave"
5/8 11:17:23 (1.23) (4640):Reaped child status - pid 4645 exited with
status 0
5/8 11:17:23 (1.23) (4640):Read: User Job - $CondorPlatform:
I386-LINUX_RHEL3 $
5/8 11:17:23 (1.23) (4640):Read: User Job - $CondorVersion: 6.9.2 Apr 9
2007 $
5/8 11:17:23 (1.23) (4640):Read: Checkpoint file name is
"/var/tmp/dcsoff-15-condor/spool/cluster1.proc23.subproc0"
5/8 11:17:45 (1.20) (3751):Read: Got SIGTSTP
5/8 11:17:45 (1.20) (3751):Read: Saved signal state.
5/8 11:17:45 (1.20) (3751):Read: About to save file state
5/8 11:17:45 (1.20) (3751):Read: CondorFileTable::checkpoint
5/8 11:17:45 (1.20) (3751):Read: OPEN FILE TABLE:
5/8 11:17:45 (1.20) (3751):Read: fd 0
5/8 11:17:45 (1.20) (3751):Read: logical name:
/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//octaveinput.txt
5/8 11:17:45 (1.20) (3751):Read: offset: 23
5/8 11:17:45 (1.20) (3751):Read: dups: 1
5/8 11:17:45 (1.20) (3751):Read: open flags: 0x0
5/8 11:17:45 (1.20) (3751):Read: url:
local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//octaveinput.txt
5/8 11:17:45 (1.20) (3751):Read: size: 23
5/8 11:17:45 (1.20) (3751):Read: opens: 1
5/8 11:17:45 (1.20) (3751):Read: fd 1
5/8 11:17:45 (1.20) (3751):Read: logical name:
/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//out.log
5/8 11:17:45 (1.20) (3751):Read: offset: 2472980
5/8 11:17:45 (1.20) (3751):Read: dups: 1
5/8 11:17:45 (1.20) (3751):Read: open flags: 0x1
5/8 11:17:45 (1.20) (3751):Read: url:
local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//out.log
5/8 11:17:45 (1.20) (3751):Read: size: 2472980
5/8 11:17:45 (1.20) (3751):Read: opens: 1
5/8 11:17:45 (1.20) (3751):Read: fd 2
5/8 11:17:45 (1.20) (3751):Read: logical name:
/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//err.log
5/8 11:17:45 (1.20) (3751):Read: offset: 90
5/8 11:17:45 (1.20) (3751):Read: dups: 1
5/8 11:17:45 (1.20) (3751):Read: open flags: 0x1
5/8 11:17:45 (1.20) (3751):Read: url:
local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//err.log
5/8 11:17:45 (1.20) (3751):Read: size: 90
5/8 11:17:45 (1.20) (3751):Read: opens: 1
5/8 11:17:45 (1.20) (3751):Read: fd 3
5/8 11:17:45 (1.20) (3751):Read: logical name:
/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//Y_20.txt
5/8 11:17:45 (1.20) (3751):Read: offset: 0
5/8 11:17:45 (1.20) (3751):Read: dups: 1
5/8 11:17:45 (1.20) (3751):Read: open flags: 0x1
5/8 11:17:45 (1.20) (3751):Read: url:
local:/dcs/condor/condor/riteshfiles/riteshgrid/./S_20//Y_20.txt
5/8 11:17:45 (1.20) (3751):Read: size: 0
5/8 11:17:45 (1.20) (3751):Read: opens: 1
5/8 11:17:45 (1.20) (3751):Read: working dir =
/dcs/condor/condor/riteshfiles/riteshgrid/./S_20/
5/8 11:17:45 (1.20) (3751):Read: Done saving file state
5/8 11:17:45 (1.20) (3751):Read: About to update MyImage
5/8 11:17:45 (1.20) (3751):Read: Size of ckpt image = 66511871 bytes
5/8 11:17:45 (1.20) (3751):Read: About to write checkpoint
5/8 11:17:45 (1.20) (3751):Read: Image::Write(): fd -1 file_name
/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0
5/8 11:17:45 (1.20) (3751):Read: Checkpoint name is
"/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0"
5/8 11:17:45 (1.20) (3751):Read: Tmp name is
"/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp"
5/8 11:17:45 (1.20) (3751): Entering pseudo_put_file_stream
5/8 11:17:45 (1.20) (3751): file =
"/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp"
5/8 11:17:45 (1.20) (3751): len = 66511871
5/8 11:17:45 (1.20) (3751): owner = condor
5/8 11:17:45 (1.20) (3751): Weird 0xf77cd89
5/8 11:17:45 (1.20) (3751):Returned addr
5/8 11:17:45 (1.20) (3751): 137.205.119.15
5/8 11:17:45 (1.20) (3751):Returned port 53075
5/8 11:17:45 (1.20) (3751):Read: connect() failed - errno = 111
5/8 11:17:45 (1.20) (3751):Read: open_tcp_stream() failed
5/8 11:17:45 (1.20) (3751):Read: ERROR:open_ckpt_file failed, aborting
ckpt
5/8 11:17:45 (1.20) (3751):Read: Ckpt exit
5/8 11:17:45 (1.20) (3751):Read: Write failed with [-1]
5/8 11:17:45 (1.20) (3751):Shadow: Job 1.20 exited, termsig = 9, coredump
= 0, retcode = 0
5/8 11:17:45 (1.20) (3751):Shadow: Job was kicked off without a checkpoint
5/8 11:17:45 (1.20) (3751):Shadow: DoCleanup: unlinking TmpCkpt
'/var/tmp/dcsoff-15-condor/spool/cluster1.proc20.subproc0.tmp'
The above is from the ShadowLog. As you can see, the port being used is
53075, which is not in the configured range. We have a checkpoint server
set up (as above), and the 'clients' are configured to use it. What is
very odd is that all of the checkpoints seem to come back to the server to
be written: in the TransferLog, all of the receives are from the server,
not the clients. Should this be the case?
I'm just not sure why the port is incorrect. Does checkpointing work by
opening a port to copy the file over? If so, why does it not use one
in the range 9000-9500?
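To make the mismatch concrete, here is a minimal Python sketch that pulls
the returned port and the errno out of the two relevant ShadowLog lines and
checks them against the configured range. The function name summarize and
the small LINUX_ERRNO table are my own for illustration. On Linux, errno
111 is ECONNREFUSED, which would mean nothing accepted the connection on
that port.

```python
import re

# Port range configured on the checkpoint server (from this message).
LOWPORT, HIGHPORT = 9000, 9500

# errno names as defined on Linux; other platforms number them differently.
LINUX_ERRNO = {111: "ECONNREFUSED"}

# Two lines copied from the ShadowLog excerpt above.
SHADOW_LOG = """\
5/8 11:17:45 (1.20) (3751):Returned port 53075
5/8 11:17:45 (1.20) (3751):Read: connect() failed - errno = 111
"""


def summarize(log: str):
    """Return (ports outside the configured range, errno names seen)."""
    ports = [int(p) for p in re.findall(r"Returned port (\d+)", log)]
    errnos = [int(e) for e in re.findall(r"errno = (\d+)", log)]
    out_of_range = [p for p in ports if not LOWPORT <= p <= HIGHPORT]
    names = [LINUX_ERRNO.get(e, "unknown") for e in errnos]
    return out_of_range, names


print(summarize(SHADOW_LOG))  # ([53075], ['ECONNREFUSED'])
```

So the job was told to connect to 137.205.119.15 on a port outside
9000-9500, and that connection was refused.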
Thanks for your help,
Si Hammond
Univ. of Warwick