I don’t have any ideas, but I would be interested to see where you get to on this. I would be interested in using the checkpointing (dump files) in LS-DYNA to restart an analysis.

Has anyone been using LS-DYNA under Windows? It might be a similar case.

Andrew

From: htcondor-users-bounces@xxxxxxxxxxx [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Antonis Sergis

Dear all,

I am coming back to the hot “checkpointing the vanilla universe” issue under Windows.

I have a Fortran 90 code which can run for a while. For longer runs, Condor’s performance drops significantly because jobs get interrupted by users. With no native checkpointing function and no way to use the “standard” universe, the code has to restart from the beginning on a different machine. As a result, jobs seldom manage to finish.

I changed my source code to accommodate a checkpointing feature. The code reads a “flag” file (which is also one of the initial input files) and creates a checkpoint file with all the data required to resume a job from where it left off. The flag file initially contains a “0”. The first checkpoint takes place once a given elapsed time has passed (1 hour, and then every hour after that). At that point the flag file is supposed to be updated with a value of “1”, and a “history” file is created to save the required checkpoint data. The idea is that when the job is evicted and restarts, the code will read “1” from the flag file and then use the “history” file to read the last checkpoint data and resume from where it left off.

This doesn’t seem to be working. I am not sure whether the flag file gets updated and re-read when the job restarts. I am also not sure whether Condor will be able to read the “history” file, which was created as an output file and is not in the initial list of input files.

Any ideas?
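The flag and history scheme described above can be sketched as follows. This is a minimal Python illustration rather than the actual Fortran 90 code: the file names `flag` and `history` come from the post, while the JSON format, the `state` fields, the `stop_after` argument (which only simulates an eviction) and all function names are assumptions made for the example. It shows only the application-side logic and does not address whether Condor returns the files to the restarted job.

```python
import json
import os
import time

FLAG_FILE = "flag"        # from the post: "0" = fresh start, "1" = checkpoint exists
HISTORY_FILE = "history"  # from the post: holds the checkpoint data


def write_atomically(path, text):
    """Write to a temporary file, then rename it into place, so an
    interruption mid-write never leaves a half-written file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def save_checkpoint(state, workdir="."):
    # History first, flag last: a flag of "1" then always implies
    # that a complete history file is present.
    write_atomically(os.path.join(workdir, HISTORY_FILE), json.dumps(state))
    write_atomically(os.path.join(workdir, FLAG_FILE), "1")


def load_state(workdir="."):
    """Return the saved state if the flag reports a checkpoint,
    otherwise a fresh initial state."""
    with open(os.path.join(workdir, FLAG_FILE)) as f:
        flag = f.read().strip()
    if flag == "1":
        with open(os.path.join(workdir, HISTORY_FILE)) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}  # hypothetical initial state


def run(n_steps, workdir=".", checkpoint_every_s=3600.0, stop_after=None):
    """Advance a placeholder computation, saving a checkpoint every
    checkpoint_every_s seconds. stop_after simulates an eviction."""
    state = load_state(workdir)
    last_save = time.monotonic()
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated eviction: leave without finishing
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if time.monotonic() - last_save >= checkpoint_every_s:
            save_checkpoint(state, workdir)
            last_save = time.monotonic()
    return state
```

Writing the history file before the flag means a restart never sees a flag of “1” without a complete history file next to it. The same ordering could be used in the Fortran code.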
This is the current submit file I am using to accommodate the checkpoint function:

************************ ************************
Requirements = (Memory >= 900) && (Arch == "X86_64") && (OpSys == "WINDOWS")
Executable = \\htcondor\htcondorjobs\\****\T2\mds.exe
initialdir = \\htcondor\htcondorjobs\\****\T2
transfer_input_files = mds.exe, input, flag
Universe = vanilla
Getenv = False
output = Test_cores.out
error = Test_cores.err
log = Test_cores.log
should_transfer_files = ALWAYS
when_to_transfer_output = ON_EXIT_OR_EVICT
periodic_release = TRUE
Queue 250
************************ ************************

Regards
Antonis