
[Condor-users] Schedd Exiting With Status 1



Hello All:

We are currently running Condor 7.4.4 on a RHEL 5.4 dedicated cluster.
About 50% of the jobs are vanilla, and the other 50% are parallel
universe jobs.

We have seen the schedd hang numerous times (seven times in the last 12
hours), each time eventually getting killed by the master, usually while
running parallel universe jobs. It can take an hour or more for the
master to kill the hung schedd. This usually results in a restart of all
of the running DAGs, without a rescue DAG being written.
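
For what it's worth, the hour-or-more delay looks like it could simply be
the master's default responsiveness timeout. A minimal sketch of a
workaround we are considering, assuming the NOT_RESPONDING_TIMEOUT knob
(and its per-daemon SCHEDD_NOT_RESPONDING_TIMEOUT variant) is what
controls how long the master waits before killing an unresponsive schedd:

    # Assumption: the master kills a child daemon that has not reported
    # in within NOT_RESPONDING_TIMEOUT seconds (default believed to be
    # 3600). Lowering the schedd-specific value should shorten the
    # window during which the hung schedd lingers.
    SCHEDD_NOT_RESPONDING_TIMEOUT = 600

Of course this would not fix the underlying crash; it would only shorten
the time before the master restarts the hung schedd.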

Here is the stack trace from the schedd log:
Stack dump for process 4858 at timestamp 1300249227 (23 frames)
condor_schedd(dprintf_dump_stack+0xb7)[0x5e5800]
condor_schedd(_Z18linux_sig_coredumpi+0x2c)[0x5d5d28]
/lib64/libpthread.so.0[0x343ca0e7c0]
/lib64/libc.so.6(abort+0x28f)[0x33fa831e8f]
/lib64/libc.so.6[0x33fa86a84b]
/lib64/libc.so.6[0x33fa870583]
/lib64/libc.so.6[0x33fa872a1a]
/lib64/libc.so.6(__libc_malloc+0x6e)[0x33fa874bee]
/usr/lib64/libstdc++.so.6(_Znwm+0x1d)[0x37e4ebd17d]
/usr/lib64/libstdc++.so.6(_Znam+0x9)[0x37e4ebd299]
condor_schedd(_ZN9HashTableI10YourStringP12AttrListElemE17resize_hash_tableEi+0x3a)[0x6b60d4]
condor_schedd(_ZN9HashTableI10YourStringP12AttrListElemE7addItemERKS0_RKS2_+0x11c)[0x6b635e]
condor_schedd(_ZN9HashTableI10YourStringP12AttrListElemE6insertERKS0_RKS2_+0x10f)[0x6b6479]
condor_schedd(_ZN8AttrListC2ERS_+0x21c)[0x6b22f6]
condor_schedd(_ZN7ClassAdC1ERKS_+0x1e)[0x6b7f3a]
condor_schedd(_ZN18DedicatedScheduler13sortResourcesEv+0x281)[0x571bfd]
condor_schedd(_ZN18DedicatedScheduler19handleDedicatedJobsEv+0xc2)[0x579800]
condor_schedd(_ZN18DedicatedScheduler23callHandleDedicatedJobsEv+0x20)[0x57990c]
condor_schedd(_ZN12TimerManager7TimeoutEv+0x24b)[0x5e09ef]
condor_schedd(_ZN10DaemonCore6DriverEv+0x82a)[0x5bf944]
condor_schedd(main+0x18eb)[0x5d7fd7]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x33fa81d994]
condor_schedd(__gxx_personality_v0+0x421)[0x5253a9]

Has anyone seen this before? Is this a problem with our configuration or
a known Condor bug? If it is a bug, has it been fixed in the development
series?

Thanks for all of your help,
DJH