[Condor-users] Condor at AEI -- revisited, now with segfault!
- Date: Tue, 8 Apr 2008 17:53:47 +0200 (CEST)
- From: Lucia Santamaria <lucia.santamaria@xxxxxxxxxx>
- Subject: [Condor-users] Condor at AEI -- revisited, now with segfault!
Hi everybody,
thanks a lot for your answer regarding my evicted jobs; indeed, I should
have been more careful and sent the logs to a local directory instead of
writing them to NFS.
Now we're facing another problem with Condor on deepthought, which
prevents us from even getting to the point where the previous problem was
happening: the DAGs now die with an unexplained signal 11 about 2-4 minutes
after submission.
I have located the first job that dies and run it under strace after
setting
environment = _CONDOR_DAGMAN_LOG=ihope.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0
The strace call is:
lucia@deepthought:/scratch/tmp/lucia/play_local/857232370-859651570$
strace /usr/bin/condor_dagman -f -l . -Debug 3 -Lockfile ihope.dag.lock \
    -AutoRescue 1 -DoRescueFrom 0 -Condorlog mylog.log -Dag ihope.dag \
    > strace.out 2> strace.err
which produces an empty mylog.log file, an empty strace.out file and a
non-empty strace.err with a SEGFAULT (aha!).
You can find it here:
http://pandora.aei.mpg.de/~lucia/strace.err
Also the corresponding ihope.dag.dagman.out is here:
http://pandora.aei.mpg.de/~lucia/ihope.dag.dagman.out
As you can see there, there is no error message at the end; it simply stops.
Also I must add that:
lucia@deepthought:~$ ldd /usr/bin/condor_dagman
libdl.so.2 => /lib/libdl.so.2 (0x00002b3137ed9000)
libcrypt.so.1 => /lib/libcrypt.so.1 (0x00002b3137fdd000)
libresolv.so.2 => /lib/libresolv.so.2 (0x00002b3138111000)
libstdc++.so.5 => /usr/lib/libstdc++.so.5 (0x00002b3138226000)
libm.so.6 => /lib/libm.so.6 (0x00002b3138403000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00002b3138585000)
libc.so.6 => /lib/libc.so.6 (0x00002b3138692000)
/lib64/ld-linux-x86-64.so.2 (0x00002b3137dc1000)
This is the 7.10-pre binary, unchanged from Kent's upload
(dated Mar 19, 5230376 bytes).
I'd like to track down this segfault myself, but as you can imagine,
the output in strace.err scares me a bit.
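One way to make the trace less daunting might be to look only at the last few system calls before the signal is reported. The following is a minimal sketch, not a definitive tool: the function name context_before_segv and its keep parameter are hypothetical, and it assumes strace marks the signal on a line containing the text "SIGSEGV" (as in "--- SIGSEGV (Segmentation fault) ... ---").

```python
from collections import deque


def context_before_segv(lines, keep=15):
    """Return up to `keep` lines preceding the first SIGSEGV line, plus that line.

    `lines` is any iterable of strings, e.g. an open strace.err file.
    Returns an empty list if no line mentions SIGSEGV.
    """
    # A bounded deque holds only the most recent keep + 1 lines,
    # so even a very large trace is processed in constant memory.
    window = deque(maxlen=keep + 1)
    for line in lines:
        window.append(line.rstrip("\n"))
        if "SIGSEGV" in line:  # assumed marker for the signal report
            return list(window)
    return []
```

It could be used along the lines of `with open("strace.err", errors="replace") as fh: print("\n".join(context_before_segv(fh)))`. The last calls before the signal (a failed open, an mmap, a read on a particular file descriptor) only hint at what the process was doing when it crashed; they do not by themselves identify the faulting code.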
Thank you very much for any insight you can provide.
Lucia
--
--------------------------------------------
Lucia Santamaria
Max-Planck-Institut fuer Gravitationsphysik
Albert-Einstein-Institut
Am Muehlenberg 1, 17746 Golm, Germany
Office: +49(0)331-567-7181
---------------------------------------------