Hi team,

We've seen schedd core dumps at a customer site (HTCondor 8.2.7 on
64-bit CentOS 6). The customer has been running much shorter jobs than
we originally planned for, so I suspect part of the problem is that
the spool is on a persistent EBS volume rather than the instance-local
ephemeral disk.

Unfortunately, I don't have the logs. I've asked the customer again
for them, but they may have rotated away by now. I do have two
separate core dumps, from two separate hosts, that fail in the same
place.

I can provide the core files off-list, but here's what I was able to
find with gdb. Does it look familiar to anyone?

[New Thread 8150]
Reading symbols from /lib64/ld-linux-x86-64.so.2...Reading symbols from /usr/lib/debug/lib64/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `condor_schedd -f -local-name Q1'.
#0 0x00007fb69272249c in ?? ()
(gdb) bt
#0 0x00007fb69272249c in ?? ()
#1 0x0000000000737472 in PrioRecArray ()
#2 0x0000000000000031 in ?? () at /slots/02/dir_42284/userdir/src/condor_utils/list.h:516
#3 0x00007fb69244b6d0 in ?? ()
#4 0x0000000002809cb0 in ?? ()
#5 0x0000000000000008 in ?? () at /slots/02/dir_42284/userdir/src/condor_utils/list.h:288
#6 0x0000000000000000 in ?? ()
(gdb) frame 1
#1 0x0000000000737472 in PrioRecArray ()
(gdb) frame 2
#2 0x0000000000000031 in ?? () at /slots/02/dir_42284/userdir/src/condor_utils/list.h:516
warning: Source file is more recent than executable.
516 item->next->prev = item->prev;
(gdb) fram 5
#5 0x0000000000000008 in ?? () at /slots/02/dir_42284/userdir/src/condor_utils/list.h:288
288 List<ObjType>::~List()
(gdb)
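
For anyone reading along without the source handy: the statement gdb shows at list.h:516 has the shape of the usual unlink step in a doubly linked list. Below is a minimal sketch of that step, not HTCondor's actual List code. The names Node, IntList, append and unlink are hypothetical, and the circular layout with a sentinel head is my assumption. The point is only that the statement dereferences item->next, so it will fault if either item or its neighbor has already been freed or overwritten. I'm not claiming that is what happened here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type for illustration; not HTCondor's Item<ObjType>.
struct Node {
    int value;
    Node* prev;
    Node* next;
};

// Circular doubly linked list with a sentinel head node. This layout is
// an assumption about the general shape, not a copy of list.h.
class IntList {
public:
    IntList() : size_(0) {
        head_.value = 0;
        head_.prev = &head_;
        head_.next = &head_;
    }
    IntList(const IntList&) = delete;
    IntList& operator=(const IntList&) = delete;

    // Destroying the list unlinks every remaining node.
    ~IntList() {
        while (head_.next != &head_) unlink(head_.next);
    }

    // Insert a new node just before the sentinel, i.e. at the tail.
    Node* append(int value) {
        Node* item = new Node{value, head_.prev, &head_};
        head_.prev->next = item;
        head_.prev = item;
        ++size_;
        return item;
    }

    // Detach one node and free it. Both neighbors must still be valid.
    void unlink(Node* item) {
        assert(item != nullptr && item != &head_);
        item->prev->next = item->next;
        item->next->prev = item->prev;  // same statement as list.h:516
        delete item;
        --size_;
    }

    std::size_t size() const { return size_; }

private:
    Node head_;
    std::size_t size_;
};
```

With a sentinel, next and prev are never null while the list is intact, so a crash on that statement would point at a dangling or corrupted pointer rather than an ordinary end-of-list case.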
--
Ben Cotton
main: 888.292.5320
Cycle Computing
Better Answers. Faster.
http://www.cyclecomputing.com
twitter: @cyclecomputing