Hi,
I have written a shell script that runs a hello-world C program and then
sends a kill -USR2 signal to that process. I used condor_compile to link
my executable with Condor's checkpoint library.
The script also identifies whether the process needs to restart from an
existing ckpt file or is a new run.
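For reference, the logic of the script is roughly the following. This is a
simplified sketch, not the exact script: the variable names (APP, CKPT), the
function names, and the 5-second delay are placeholders chosen for
illustration.

```shell
#!/bin/sh
# Placeholder names; override APP and CKPT from the environment if needed.
APP=${APP:-./helloWorld}
CKPT=${CKPT:-helloWorld.ckpt}

# Print the command line to use: restart from the checkpoint file if one
# exists, otherwise start a fresh run.
launch_cmd() {
    if [ -f "$CKPT" ]; then
        echo "$APP -_condor_restart $CKPT"
    else
        echo "$APP"
    fi
}

# Start the job in the background, then ask it for a checkpoint.
run_job() {
    $(launch_cmd) &
    pid=$!
    sleep 5    # arbitrary delay so the job makes some progress first
    echo "Sending checkpoint signal to process: $pid"
    kill -USR2 "$pid"
    wait "$pid"
}

# Only run when the executable is actually present.
if [ -x "$APP" ]; then
    run_job
fi
```

The first run takes the "new run" branch; once helloWorld.ckpt exists, the
next run takes the restart branch, and that is where the second kill -USR2
fails.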
Both cases work fine, with one exception: after the job has been restarted,
when my shell script sends kill -USR2 again, the process terminates abnormally.
The debug output shows that the working dir is empty.
The debug output is as follows:
Test.sh: Sending checkpoint signal to process: 22037
Got SIGUSR2
Saved signal state.
About to save file state
CondorFileTable::checkpoint
OPEN FILE TABLE:
fd 0
logical name: default stdin
offset: 0
dups: 1
open flags: 0x0
not currently bound to a url.
fd 1
logical name: default stdout
offset: 820
dups: 1
open flags: 0x1
url: fd:1
size: 820
opens: 1
fd 2
logical name: default stderr
offset: 0
dups: 1
open flags: 0x1
not currently bound to a url.
working dir =
Done saving file state
About to update MyImage
Adding a DATA segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[DATA], start=[653000], end=[70b000],
length=[0xlx], prot=[0xb8000]
Adding a STACK segment: start[0xlx], end [0xlx]
Image::AddSegment: name=[STACK], start=[7fbfff6000], end=[7fbfffffff],
length=[0xlx], prot=[0x9fff]
Pos: 754720
Pos: 795679
Size of ckpt image = 795679 bytes
About to write checkpoint
Image::Write(): fd -1 file_name ./helloWorld.ckpt
Checkpoint name is "./helloWorld.ckpt"
Tmp name is "./helloWorld.ckpt.tmp"
Wrote headers OK
Wrote all SegMaps OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 753664 bytes with write...
Wrote Segment[0] of type DATA -> OK
write(fd=3,core_loc=0xlx,len=0xlx)
I wrote 40959 bytes with write...
Wrote Segment[1] of type STACK -> OK
Wrote all Segments OK
About to close ckpt fd (3)
Closed OK
About to rename "./helloWorld.ckpt.tmp" to "./helloWorld.ckpt"
Renamed OK
USER PROC: CHECKPOINT IMAGE SENT OK
Periodic Ckpt complete, doing a virtual restart...
About to restore file state
CondorFileTable::resume
working dir =
Condor: Error: Couldn't move to '��p' (No such file or directory).
Please fix it.
./job.sh: line 61: 22037 Killed ./helloWorld -_condor_restart helloWorld.ckpt
-----------------------
Note the "working dir" line: why does it not show the working
directory? I restarted the process with:
./helloWorld -_condor_restart helloWorld.ckpt
So the problem is: after a job is restarted from its last checkpoint, it
cannot be checkpointed again by sending USR2 or the CTRL+Z (SIGTSTP) signal.
Does anyone know a remedy?
-- Tan