Dear all,
I am coming back to the hot issue of checkpointing in the vanilla universe
under Windows. I have a Fortran 90 code which can run for a long time. For longer
runs, Condor’s throughput drops significantly because jobs get interrupted by
users. With no native checkpointing function and no way to use the “standard”
universe, an evicted job has to restart from the beginning on a different
machine. As a result, jobs seldom manage to finish. I changed my source code
to add a checkpointing feature. The code reads a “flag”
file (which is also one of the initial input files) and creates a checkpoint
file with all the required data to be able to resume a job from where it was
left off. The flag file initially contains a “0”. The first checkpoint takes
place after one hour of elapsed time, and further checkpoints follow every hour
after that. At each checkpoint the flag file is supposed to be updated to “1”
and a “history” file is written with the required checkpoint data. The
idea is that when the job gets evicted and restarts, the code will read “1”
from the flag file, load the last checkpoint data from the “history” file, and
resume from where it left off. This doesn’t seem to be working. I am not sure
whether the updated flag file is carried over and re-read when the job
restarts. I am also not sure whether Condor will make the “history” file
available to the restarted job, since it was created as an output file and is
not in the initial list of input files.
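To make the intended logic concrete, here is a minimal sketch of the flag/history scheme described above. It is written in Python rather than the actual Fortran 90, and the function names, the step counter, and the one-number "history" format are all hypothetical stand-ins. A checkpoint every few steps is assumed in place of the hourly timer, and "stop_after" merely simulates an eviction.

```python
import os

FLAG_FILE = "flag"        # "0" = fresh start, "1" = a checkpoint exists
HISTORY_FILE = "history"  # checkpoint data (here: last completed step only)


def read_flag(workdir):
    """Return the flag value; a missing flag file counts as a fresh start."""
    path = os.path.join(workdir, FLAG_FILE)
    if not os.path.exists(path):
        return "0"
    with open(path) as f:
        return f.read().strip()


def write_checkpoint(workdir, step):
    """Write the history file first, then set the flag to "1".

    Writing in this order means a flag of "1" implies a history file exists.
    """
    with open(os.path.join(workdir, HISTORY_FILE), "w") as f:
        f.write(str(step))
    with open(os.path.join(workdir, FLAG_FILE), "w") as f:
        f.write("1")


def starting_step(workdir):
    """Resume after the last checkpointed step if the flag is "1", else start at 0."""
    if read_flag(workdir) == "1":
        with open(os.path.join(workdir, HISTORY_FILE)) as f:
            return int(f.read().strip()) + 1
    return 0


def run(workdir, total_steps, checkpoint_every, stop_after=None):
    """Run the job, checkpointing every `checkpoint_every` steps.

    `stop_after` (hypothetical) ends the run early to simulate an eviction.
    Returns the next step that would have been executed.
    """
    step = starting_step(workdir)
    done = 0
    while step < total_steps:
        # ... one unit of real work would go here ...
        done += 1
        if (step + 1) % checkpoint_every == 0:
            write_checkpoint(workdir, step)
        step += 1
        if stop_after is not None and done >= stop_after:
            break
    return step
```

The sketch only resumes correctly if the restarted job sees the updated flag and history files in its working directory, which is exactly the part I am unsure about.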
Any ideas?
This is the current submit file I am using to accommodate the checkpoint
function:
************************
************************
Requirements = (Memory >= 900) && (Arch == "X86_64") && (OpSys == "WINDOWS")
Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
initialdir = \\htcondor\htcondorjobs\\****\T2
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250
************************
************************
Regards
Antonis