HTCondor Project List Archives




[Condor-devel] file transfer "random" crashes due to files totalling more than 2GB



condor 6.6.10 (and I am guessing HEAD on the stable stream) is totally
broken in its handling of large files.

Obviously individual files over 2GB just don't work, but neither do
transfers whose files sum to more than 2GB in total.

From file_transfer.C (my comments added):

int
FileTransfer::DoUpload(ReliSock *s)
{
    char *filename;
    char *basefilename;
    char fullname[_POSIX_PATH_MAX];
    int bytes, total_bytes=0;    // MATTH Arrrgh
    bool is_the_executable;
    StringList * filelist = FilesToSend;

    dprintf(D_FULLDEBUG,"entering FileTransfer::DoUpload\n");

    priv_state saved_priv = PRIV_UNKNOWN;
    if( want_priv_change ) {
        saved_priv = set_priv( desired_priv_state );
    }

    s->encode();

    // tell the server if this is the final transfer or not.
    // if it is the final transfer, the server places the files
    // into the user's Iwd.  if not, the files go into SpoolSpace.
    if( !s->code(m_final_transfer_flag) ) {
        return_and_resetpriv( -1 );
    }
    if( !s->end_of_message() ) {
        return_and_resetpriv( -1 );
    }
    if ( filelist ) {
        filelist->rewind();
    }

    // get ourselves a local copy that will be cleaned up if we exit
    char *tmpSpool = param("SPOOL");
    char Spool[_POSIX_PATH_MAX];
    if (tmpSpool) {
        strcpy (Spool, tmpSpool);
        free (tmpSpool);
    }

    while ( filelist && (filename=filelist->next()) ) {

        dprintf(D_FULLDEBUG,"DoUpload: send file %s\n",filename);

        // okay, what we do here is undo the old priv change. then,
        // depending on if the file is in spool or not, we switch to
        // condor priv for a spool file or else back to desired priv.
        if (saved_priv != PRIV_UNKNOWN) {
            _set_priv(saved_priv,__FILE__,__LINE__,1);
        }
        // look at the filename to see if it starts with the SPOOL dir
        if (strncmp(Spool, filename, strlen(Spool)) == 0) {
            saved_priv = set_condor_priv();
        } else {
            saved_priv = PRIV_UNKNOWN;
            if( want_priv_change ) {
                saved_priv = set_priv( desired_priv_state );
            }
        }

        if( !s->snd_int(1,FALSE) ) {
            dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);
            return_and_resetpriv( -1 );
        }
        if( !s->end_of_message() ) {
            dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);
            return_and_resetpriv( -1 );
        }

        if ( ExecFile && ( file_strcmp(ExecFile,filename)==0 ) ) {
            // this file is the job executable
            is_the_executable = true;
            basefilename = (char *)CONDOR_EXEC ;
        } else {
            // this file is _not_ the job executable
            is_the_executable = false;
            basefilename = basename(filename);
        }

        if( !s->code(basefilename) ) {
            dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);
            return_and_resetpriv( -1 );
        }

        if( filename[0] != '/' && filename[0] != '\\' && filename[1] != ':' ){
            sprintf(fullname,"%s%c%s",Iwd,DIR_DELIM_CHAR,filename);
        } else {
            strcpy(fullname,filename);
        }

        // check for read permission on this file, if we are supposed to check.
        // do not check the executable, since it is likely sitting in the SPOOL
        // directory.
#ifdef WIN32
        if( perm_obj && !is_the_executable &&
            (perm_obj->read_access(fullname) != 1) ) {
            // we do _not_ have permission to read this file!!
            dprintf(D_ALWAYS,"DoUpload: Permission denied to read file %s!\n",
                    fullname);
            dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);
            return_and_resetpriv( -1 );
        }
#endif

        /* MATTH ok so here is where a single 2GB file will overflow and
           become negative */

        if( ((bytes = s->put_file(fullname)) < 0) ) {
            /* if we fail to transfer a file, EXCEPT so the other side can */
            /* try again. SC2000 hackery. _WARNING_ - I think Keller changed */
            /* all of this. -epaulson 11/22/2000 */
            EXCEPT("DoUpload: Failed to send file %s, exiting at %d\n",
                   fullname,__LINE__);
            return_and_resetpriv( -1 );
        }

        if( !s->end_of_message() ) {
            dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);
            return_and_resetpriv( -1 );
        }

        /* MATTH and here is where a combination of files totalling more
           than 2GB but less than 4GB will kill it */

        total_bytes += bytes;
    }

    // tell our peer we have nothing more to send
    s->snd_int(0,TRUE);

    dprintf(D_FULLDEBUG,"DoUpload: exiting at %d\n",__LINE__);

    bytesSent += total_bytes;

    return_and_resetpriv( total_bytes );
}

The negative total then matters because of this in FileTransfer::Upload:

int
FileTransfer::Upload(ReliSock *s, bool blocking)
{
    dprintf(D_FULLDEBUG,"entering FileTransfer::Upload\n");

    if (ActiveTransferTid >= 0) {
        EXCEPT("FileTransfer::Upload called during active transfer!\n");
    }

    Info.duration = 0;
    Info.type = UploadFilesType;
    Info.success = true;
    Info.in_progress = true;
    TransferStart = time(NULL);

    if (blocking) {

        Info.bytes = DoUpload((ReliSock *)s);
        Info.duration = time(NULL)-TransferStart;
        Info.success = (Info.bytes >= 0); // MATTH oh dear
        Info.in_progress = false;
        return Info.success;

Nastily this simply causes an ASSERT to blow:

ERROR "Assertion ERROR on (filetrans->UploadFiles(true,
final_transfer))" at line 336 in file
..\src\condor_starter.V6.1\jic_shadow.C

and the job to rerun, since on the shadow side we just get a Windows
10054 as (I assume) the socket is simply closed on us:

11/30 19:09:04 (8.0) (5232): condor_read(): recv() returned -1, errno
= 10054, assuming failure.

The job will of course rerun and rerun. Most users spot that the job is
rerunning despite having all their files back, think "that's weird",
put it down to a network glitch that stopped the shadow finishing its
conversation with the starter (and thus made it believe the job ran
again), kill the job, and aren't too unhappy because at least they got
their data back.

I was looking at this because a user has already hit the single 2GB
file limit and was going to split things up. In fact this won't help
him at all.

There is only one possible workaround for them: when the files total
more than 2GB, deliberately pad the transfer above 4GB with a few 1GB
files, so that the 32-bit total wraps all the way round past zero and
comes back positive!

If this isn't fixed in the stable stream this is a total showstopper
for most of my users.

Please fix this in the 6.6.x stream, just doing

long bytes, total_bytes=0;    // better

(bytesSent is a float, so it should not cause a problem) and then wrapping

return_and_resetpriv( total_bytes );

with

int total_bytes_as_int;
if (total_bytes > INT_MAX)
{
    dprintf(D_FULLDEBUG,"DoUpload: total bytes transferred %ld greater"
            " than %d, capping as a workaround\n", total_bytes, INT_MAX);
    total_bytes_as_int = INT_MAX;
} else {
    total_bytes_as_int = (int)total_bytes;
}
return_and_resetpriv( total_bytes_as_int );

thus preventing any immediate issues.
Obviously DoUpload should return long rather than int, but at least
this would fix it, even as a hack.

I think posting the basics, if not the internals, of this to
condor-users would be a good idea, to let them know that attempting to
transfer back more than 2GB in total simply won't work on the stable
stream. But I leave that up to you in case you want to release details
of any fix at the same time.

Matt

P.S. Unfortunately my systems people have gone home, so I'll cc in
condor-admin@xxxxxxxxxxx as well.