HTCondor Project List Archives




[Condor-devel] Issues with output files and checkpointing



We have run across two issues dealing with interactions between output
files and checkpointing.


The first issue is the appearance of extra characters at the end of a
file.  The problem scenario is that a job runs to completion (or at
least far enough to have written all of its output) and is then evicted
and restarted from a previous checkpoint.  If the length of the
generated output changes on the rerun (results dependent on random
number generation, floating point results printed to high precision,
etc.) then the final desired output may be shorter than what has
already been written, and the leftover bytes from the first run remain
at the end of the file.

Question: Does Condor know both the size and offset of a file when the
checkpoint is taken (perusing the ShadowLog would seem to imply yes)?
If so, would it be possible on checkpoint restart to truncate the file
to the known size at the time the checkpoint was taken?

Hack: A source level fix is to "ftruncate(fd, lseek(fd, 0L, SEEK_CUR))"
when the file is closed.  Of course, it would be preferable to not have
to modify all of our source code.


The second issue is the appearance of null characters in the middle of
a file.  The problem scenario is that a job runs for a while and is
then evicted and checkpointed before accumulating MaxDiscardedRunTime.
On restart, if the checkpoint cannot be read the job will begin again
from scratch (with the result that files will most likely be reopened
with O_TRUNC).  If the job is then evicted without a new checkpoint
being taken, it is now possible for it to restart from the previous
checkpoint, but with the output files truncated (and possibly only
partially rewritten).  Writing then resumes at the file offset recorded
in the old checkpoint, which may lie past the current end of the file.
This leads directly to "holes" of zeroes in the output files.
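
The sequence can be reproduced outside Condor with plain POSIX calls.
The sketch below is an illustration only: the function names, the byte
counts, and the assumption that a restart reopens the file and seeks to
the offset saved in the checkpoint are ours, not taken from the Condor
source.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open path with flags, seek to offset, write len bytes, close. */
static int write_run(const char *path, int flags, off_t offset,
                     const char *data, size_t len)
{
    int fd = open(path, flags, 0600);
    if (fd == -1)
        return -1;
    int ok = lseek(fd, offset, SEEK_SET) == offset &&
             write(fd, data, len) == (ssize_t)len;
    if (close(fd) == -1)
        ok = 0;
    return ok ? 0 : -1;
}

/* Replay the three runs described above on path (hypothetical sizes). */
int replay_stale_checkpoint(const char *path)
{
    const off_t ckpt_offset = 20;   /* offset saved in the old checkpoint */

    /* Run 1: 20 bytes written, then checkpointed and evicted. */
    if (write_run(path, O_WRONLY | O_CREAT | O_TRUNC, 0,
                  "AAAAAAAAAAAAAAAAAAAA", 20) == -1)
        return -1;
    /* Run 2: checkpoint unreadable, start over with O_TRUNC, evicted early. */
    if (write_run(path, O_WRONLY | O_TRUNC, 0, "BBBBB", 5) == -1)
        return -1;
    /* Run 3: old checkpoint readable again, resume at its saved offset. */
    return write_run(path, O_WRONLY, ckpt_offset, "CCCCC", 5);
}

/* Count NUL bytes in the file; returns -1 on error. */
long count_nul_bytes(const char *path)
{
    char buf[4096];
    long nuls = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\0')
                nuls++;
    close(fd);
    return n == -1 ? -1 : nuls;
}
```

After replay_stale_checkpoint() the file holds "BBBBB", then 15 zero
bytes, then "CCCCC": the gap between the truncated length (5) and the
stale offset (20) is the hole.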

Solution: If the checkpoint cannot be read, reset the LastCkptServer
attribute in the JobAd so we will not try to read this particular
checkpoint again.  A patch for this follows:


--- condor_shadow.V6/pseudo_ops.C.ORIG	Fri Oct  7 02:10:42 2005
+++ condor_shadow.V6/pseudo_ops.C	Tue Jan 24 12:28:58 2006
@@ -2287,6 +2287,21 @@
 			}
 		}
 	} while(rval == -1 && LastCkptServer && accum_usage > MaxDiscardedRunTime);
+	if (rval != 1 && LastCkptServer) {
+		extern char *schedd;
+
+		dprintf(D_ALWAYS, "failed to stat checkpoint file on %s,"
+				" deleting attributes\n", LastCkptServer);
+		if (!ConnectQ(schedd, SHADOW_QMGMT_TIMEOUT)) {
+			EXCEPT("failed to connect to schedd");
+		}
+		DeleteAttribute(Proc->id.cluster, Proc->id.proc, ATTR_LAST_CKPT_SERVER);
+		DeleteAttribute(Proc->id.cluster, Proc->id.proc, ATTR_LAST_CKPT_TIME);
+		DisconnectQ(NULL);
+	 	free(LastCkptServer);
+		LastCkptServer = NULL;
+		LastCkptTime = -1;
+	}
 	if (rval == -1) { /* not on local disk & not using ckpt server */
 		rval = FALSE;
 	}


-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison