HTCondor Project List Archives




[Condor-devel] mmap issue



Hi team,

I am periodically seeing the following crash when the new quill tries to
compute the checksum of an output file used by a job. I use mmap() to map
the file's data into memory before passing it to the checksum function.
The return value is checked against MAP_FAILED before the checksum is
computed.
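
For illustration, here is a minimal sketch of one alternative to mmap():
reading the file in chunks with read(). The signum=7 in the trace below is
SIGBUS on Linux x86, which is consistent with touching mapped pages that lie
past the file's current end after a truncation. With read(), a file that
shrinks simply yields a short read or end-of-file instead of a signal. The
names checksum_by_read and fnv1a_update are hypothetical, and the FNV-1a hash
is only a stand-in for the MD5 calls in condor_md.C, not the actual quill
code.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Stand-in for MD5_Update (assumption): folds a buffer into a
 * 64-bit FNV-1a hash so the sketch needs no external library. */
static uint64_t fnv1a_update(uint64_t h, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 0x100000001b3ULL;          /* FNV-1a 64-bit prime */
    }
    return h;
}

/* Hypothetical helper: checksum a file with read() instead of mmap().
 * Returns 0 on success and -1 on error. *bytes_seen reports how many
 * bytes were actually read, so the caller can compare it with the
 * size it expected and detect a file that changed underneath it. */
int checksum_by_read(const char *path, uint64_t *sum, size_t *bytes_seen)
{
    unsigned char buf[65536];
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    size_t total = 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            close(fd);
            return -1;
        }
        if (n == 0)
            break;                      /* end of file (possibly truncated) */
        h = fnv1a_update(h, buf, (size_t)n);
        total += (size_t)n;
    }

    close(fd);
    *sum = h;
    *bytes_seen = total;
    return 0;
}
```

A caller could compare *bytes_seen with the fileSize it was given and treat
a mismatch as "file changed, retry later" rather than crashing. This does
not make the checksum consistent if the file is rewritten mid-read; it only
avoids the signal.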

Note that the output file is shared by multiple jobs submitted in a
batch, so it is very likely that the file gets modified while or after
mmap() is called. Could this be the problem? Is there any way to lock
the file before doing the checksum, to prevent it from being modified by
other processes (e.g. the shadow)? A better question may be whether
the shadow locks a file before trying to modify it.
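
To make the locking question concrete, here is a rough sketch of taking a
shared advisory lock with flock() around the checksum. The helper names
open_with_shared_lock and unlock_and_close are hypothetical. This only helps
if every writer also takes an exclusive lock on the same file, because
advisory locks do not stop a process that never asks for one. Whether the
shadow does so is exactly what I do not know.

```c
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: open a file read-only and take a shared
 * advisory lock on it. Blocks while another open file description
 * holds an exclusive flock() lock. Returns the fd, or -1 on error. */
int open_with_shared_lock(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: release the lock and close the descriptor. */
void unlock_and_close(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}
```

The checksum would run between the two calls. A cooperating writer would
call flock(fd, LOCK_EX) before modifying the file and would wait until the
reader is done.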

Here is the stack trace from the crash. Thanks.

#0  WriteCoreDump (file_name=0x19 <Address 0x19 out of bounds>)
    at src/coredumper.c:137
#1  0x080fc52c in linux_sig_coredump (signum=7) at daemon_core_main.C:540
#2  <signal handler called>
#3  0x082a64ef in md5_block_host_order (c=0x8fe8150, data=0xb7851000,
    num=100033) at md5_dgst.c:99
#4  0x082a602e in MD5_Update (c=0x8fe8150, data_=0xb7851000, len=7086263)
    at md32_common.h:487
#5  0x0815e282 in Condor_MD_MAC::addMD (this=0x84f34b8,
    buffer=0xb7851000 <Address 0xb7851000 out of bounds>, length=7086263)
    at condor_md.C:137
#6  0x080e1e7f in file_checksum (
    filePathName=0xbfffe480 "/scratch/matrix_tests/job_matrixab_10min_van.jhuang.out", fileSize=7086263, sum=0xbfffe430 "") at ttmanager.C:3356
#7  0x080df42d in TTManager::insertFiles (this=0x84211c0, ad=0x8feb598)
    at ttmanager.C:2827
#8  0x080d1cb6 in TTManager::event_maintain (this=0x84211c0) at ttmanager.C:573
#9  0x080d0cc4 in TTManager::maintain (this=0x84211c0) at ttmanager.C:245
#10 0x080d0b69 in TTManager::pollingTime (this=0x84211c0) at ttmanager.C:197
#11 0x0812312a in TimerManager::Timeout (this=0x8421840) at timer_manager.C:422
#12 0x080f0248 in DaemonCore::Driver (this=0x844b8f0) at daemon_core.C:2316
#13 0x080fed68 in main (argc=1, argv=0xbffff2ac) at daemon_core_main.C:2022