Subject: [Condor-users] A Problem while restarting a checkpoint file
Hi, I have written a shell script that runs a hello-world C program and then sends a kill -USR2 signal to that process. I used condor_compile to link my executable with Condor's checkpoint library.
The script can also identify whether the process should restart from an existing ckpt file or start as a new application. Both cases work fine, with one exception: after the job has been restarted, when my shell script sends kill -USR2 again, the job terminates abnormally.
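A minimal sketch of the kind of launcher described above, not the actual script. The function name start_job and the variable JOB_PID are made up for illustration; only the -_condor_restart option and the helloWorld file names come from the output further down.

```shell
#!/bin/sh
# Sketch only: start_job and JOB_PID are hypothetical names, not taken from
# the original script.

# Start the job in the background. If the checkpoint file exists, restart
# from it; otherwise start the program from scratch. The background PID is
# left in JOB_PID so the caller can signal it later.
start_job() {
    exe=$1
    ckpt=$2
    if [ -f "$ckpt" ]; then
        "$exe" -_condor_restart "$ckpt" &
    else
        "$exe" &
    fi
    JOB_PID=$!
}
```

It could be used roughly like this (the 5 second delay is arbitrary):

    start_job ./helloWorld helloWorld.ckpt
    sleep 5
    kill -USR2 "$JOB_PID"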
The debug output shows that the working dir is empty. The debug output is as follows:
Test.sh: Sending checkpoint signal to process: 22037
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint
OPEN FILE TABLE:
fd 0
    logical name: default stdin
    offset: 0
    dups: 1
    open flags: 0x0
    not currently bound to a url.
fd 1
    logical name: default stdout
    offset: 820
    dups: 1
    open flags: 0x1
    url: fd:1
    size: 820
    opens: 1
fd 2
    logical name: default stderr
    offset: 0
    dups: 1
    open flags: 0x1
    not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000], length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff], length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir =
Condor: Error: Couldn't move to '��p' (No such file or directory).
Please fix it.
./job.sh: line 61: 22037 Killed ./helloWorld -_condor_restart helloWorld.ckpt
-----------------------
Now look at the "working dir =" lines: why do they not show the working directory? I restarted the process with:
./helloWorld -_condor_restart helloWorld.ckpt
So the problem is: after a job has been restarted from its last checkpoint, it cannot be checkpointed again by sending SIGUSR2 or by pressing CTRL+Z (SIGTSTP).
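A small sketch for reproducing this outside the full script. The function name checkpoint_and_check, the variable CKPT_STATE, and the default 2 second wait are assumptions made for illustration, not part of the original script or of Condor.

```shell
#!/bin/sh
# Sketch only: checkpoint_and_check and CKPT_STATE are hypothetical names.

# Send SIGUSR2 to the given PID, wait a few seconds, and record in CKPT_STATE
# whether the process can still be signalled ("alive") or not ("dead").
checkpoint_and_check() {
    pid=$1
    settle=${2:-2}    # seconds to wait after signalling (arbitrary default)
    CKPT_STATE=dead
    # If the signal cannot be delivered, the process is already gone.
    kill -USR2 "$pid" 2>/dev/null || return 0
    sleep "$settle"
    # kill -0 delivers no signal; it only tests that the PID still exists.
    if kill -0 "$pid" 2>/dev/null; then
        CKPT_STATE=alive
    fi
    return 0
}
```

Running it twice against a restarted job shows which checkpoint request the job does not survive:

    ./helloWorld -_condor_restart helloWorld.ckpt &
    pid=$!
    checkpoint_and_check "$pid"; echo "first checkpoint: $CKPT_STATE"
    checkpoint_and_check "$pid"; echo "second checkpoint: $CKPT_STATE"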
Does anyone know of a remedy?