[Condor-users] Errno=14, taking checkpoint does not complete
- Date: Thu, 12 Mar 2009 17:43:12 -0400
- From: Tanzima Zerin Islam <tz.islam@xxxxxxxxx>
- Subject: [Condor-users] Errno=14, taking checkpoint does not complete
Hi, I have an application compiled with condor_compile. I am trying to run it in standalone mode using:
./executable input -_condor_D_ALL
Then, from another shell, I send the checkpoint signal: kill -USR2 pid
But this is what I get:
..............................................
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint
OPEN FILE TABLE:
fd 0
logical name: default stdin
offset: 0
dups: 1
open flags: 0x0
not currently bound to a url.
fd 1
logical name: default stdout
offset: 315
dups: 1
open flags: 0x1
url: fd:1
size: 315
opens: 1
fd 2
logical name: default stderr
offset: 0
dups: 1
open flags: 0x1
not currently bound to a url.
working dir = /home/yara/sbagchi/tislam/condorExperiments/spec_429.mcf
Done saving file state
About to update MyImage
Adding a DATA segment: start[0x659000], end [0x694cd000]
Image::AddSegment: name=[DATA], start=[659000], end=[694cd000], length=[0x68e74000], prot=[0xffffffff00000000]
Adding a STACK segment: start[0x7fffbfa5d000], end [0x7fffbfa66fff]
Image::AddSegment: name=[STACK], start=[7fffbfa5d000], end=[7fffbfa66fff], length=[0x9fff], prot=[0x0]
Pos: 1759986720
Pos: 1760027679
Size of ckpt image = 1760027679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./mcf.ckpt
Checkpoint name is "./mcf.ckpt"
Tmp name is "./mcf.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0x659000,len=0x68e74000)
I wrote 745472 bytes with write...
I wrote -1 bytes with write...
in SegMap::Write(): fd = 3, write_size=1759240192
errno=14, core_loc=70f000
Write() Segment[0] of type DATA -> FAILED
errno = 14, nbytes = -1
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir = /home/mcf
OPEN FILE TABLE:
fd 0
logical name: default stdin
offset: 0
dups: 1
open flags: 0x0
not currently bound to a url.
fd 1
logical name: default stdout
offset: 315
dups: 1
open flags: 0x1
not currently bound to a url.
fd 2
logical name: default stderr
offset: 0
dups: 1
open flags: 0x1
not currently bound to a url.
Done restoring file state
About to restore signal state
About to return to user code
..............................................
The debug output clearly shows that an error occurred, so only mcf.ckpt.tmp is generated and the final mcf.ckpt never appears.
Any idea what errno=14 means? Could the checkpoint's size be the reason?
--Tan
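
For readers following the log above, here is a minimal sketch that looks up errno 14 and cross-checks the addresses reported around the failed write. It assumes a Linux host with the usual errno numbering; the constants are copied from the debug output, and the variable names are illustrative only, not part of Condor. On Linux, errno 14 is EFAULT ("Bad address"), which write() reports when part of the buffer lies outside the process's accessible address space.

```python
import errno
import os

# Values copied from the debug output above (names are illustrative).
ERRNO_FROM_LOG = 14        # "errno=14"
DATA_START = 0x659000      # core_loc passed to the first write()
DATA_LEN = 0x68e74000      # len passed to the first write()
FIRST_WRITE = 745472       # "I wrote 745472 bytes with write..."

# Translate the numeric errno into its symbolic name and message.
errno_name = errno.errorcode.get(ERRNO_FROM_LOG, "unknown")
errno_text = os.strerror(ERRNO_FROM_LOG)

# Address where the second write() would have started, and bytes left to write.
fail_addr = DATA_START + FIRST_WRITE
remaining = DATA_LEN - FIRST_WRITE

print(f"errno {ERRNO_FROM_LOG}: {errno_name} ({errno_text})")
print(f"failing address: {fail_addr:#x}")
print(f"remaining bytes: {remaining}")
```

The computed address and remaining size agree with the log lines "core_loc=70f000" and "write_size=1759240192", so the first write stopped 745472 bytes into the DATA segment and the next one failed at that address. That would point toward an unreadable or unmapped page inside the recorded DATA range rather than the image size itself, though the log alone does not confirm this.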