Steffen,
I suspect this has something to do with LDAP authentication rather than
with AMD64. We are trying to install Condor 6.6.8 on a Linux cluster running
RH9 on dual-Xeon nodes, and we get similar crashes (SCHEDD ... died on
signal 11) whenever it fails to identify the user, whether or not we use NFS.
Is anyone else successfully running Condor on systems where UIDs/GIDs are
provided via LDAP rather than by the passwd file?
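One way to narrow this down might be to check whether the account lookup works at all outside Condor. The following is only a minimal sketch, not anything taken from Condor itself: it uses Python's standard pwd module, which calls the C library's getpwnam() and therefore consults whatever sources the system's name service switch is configured to use (passwd file, LDAP, and so on). The helper name lookup_user is hypothetical.

```python
import pwd


def lookup_user(name):
    """Return (uid, gid) for `name` as resolved by the C library, or None.

    pwd.getpwnam() raises KeyError when no configured source knows the user,
    which corresponds to the "user not found" case.
    """
    try:
        entry = pwd.getpwnam(name)
    except KeyError:
        return None
    return entry.pw_uid, entry.pw_gid


if __name__ == "__main__":
    # "condor" is the account Condor looks up at startup; "root" is a
    # control that normally comes from the local passwd file.
    for name in ("condor", "root"):
        result = lookup_user(name)
        if result is None:
            print(f"{name}: user not found")
        else:
            print(f"{name}: uid={result[0]} gid={result[1]}")
```

If "condor" is not found here either, the problem is probably in the name service configuration rather than in Condor. Note that on an x86_64 host the interpreter is most likely a 64-bit process, so a successful lookup does not by itself show that a 32-bit binary can load matching 32-bit NSS modules; that would need a separate 32-bit test.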
Andrey
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of
Steffen Prohaska
Sent: Friday, March 04, 2005 3:16 PM
To: Condor-Users Mail List
Subject: [Condor-users] AMD Opteron Crashes
Hi,
In
https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01368.shtml
I read that it should be possible to use the
linux-x86-glibc23-dynamic binary on a 64-bit Opteron system to run
Condor.
Everything works fine until Condor tries to start a job; then the
condor_starter crashes with a segfault.
I tried this with condor-6.6.8-linux-x86-glibc22-dynamic.tar.gz,
condor-6.6.8-linux-x86-glibc23-dynamic.tar.gz, and
condor-6.7.5-linux-x86-glibc23-dynamic.tar.gz. The behaviour is always
similar. We're running SUSE Linux Enterprise Server 9. User information is
stored in LDAP. I have attached excerpts from the log files below. If more
details would help, I can provide them.
Any thoughts on this? Is anyone successfully running Condor on a
similar Opteron system?
Steffen
--- System info
acorn:/ # cat /etc/SuSE-release
SUSE LINUX Enterprise Server 9 (x86_64)
VERSION = 9
acorn:/ # uname -a
Linux acorn 2.6.5-7.139-smp #1 SMP Fri Jan 14 15:41:33 UTC 2005 x86_64 x86_64 x86_64 GNU/Linux
--- From StartLog:
StartLog:3/4 15:48:32 Starter pid 18488 died on signal 11 (signal 11)
--- From /var/log/messages
Mar 4 15:48:32 acorn kernel: condor_starter[18488]: segfault at 00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4
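For readers unfamiliar with the kernel's segfault message, the fields can be decoded mechanically. The sketch below is an illustration only, assuming the x86 page-fault error code layout (bit 0: page was present, bit 1: write access, bit 2: fault came from user mode) and assuming the kernel prints the error code in hexadecimal; the function name decode_segfault is hypothetical.

```python
import re

# Matches the kernel message format seen in the excerpt above; \s+ also
# tolerates the line having been wrapped by a mail client.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at\s+(?P<addr>[0-9a-f]+)"
    r"\s+rip\s+(?P<rip>[0-9a-f]+)\s+rsp\s+(?P<rsp>[0-9a-f]+)"
    r"\s+error\s+(?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Parse one kernel segfault line into a dict, or return None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "command": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("rip"), 16),
        "stack_pointer": int(m.group("rsp"), 16),
        "page_present": bool(err & 1),   # 0: the page was not mapped
        "write_access": bool(err & 2),   # 0: the access was a read
        "user_mode": bool(err & 4),      # 1: fault raised in user space
    }


if __name__ == "__main__":
    sample = (
        "Mar 4 15:48:32 acorn kernel: condor_starter[18488]: segfault at "
        "00000000a4e0efc5 rip 00000000559a4dac rsp 00000000ffffc4a8 error 4"
    )
    for key, value in decode_segfault(sample).items():
        print(key, hex(value) if "address" in key or "pointer" in key else value)
```

Under those assumptions, "error 4" reads as a user-mode read from an unmapped address, which would be consistent with the starter dereferencing an invalid pointer rather than, say, writing to read-only memory.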
--- From StarterLog.vm2
3/4 15:48:29 (fd:9) PASSWD_CACHE_REFRESH is undefined, using default value of 300
3/4 15:48:29 (fd:9) Finding local host information, calling gethostname()
[...]
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
3/4 15:48:29 (fd:9) PRIV_UNKNOWN --> PRIV_CONDOR at daemon_core_main.C:1382
3/4 15:48:29 (fd:9) KEYCACHE: created: 82ca8d8
3/4 15:48:29 (fd:9) ******************************************************
3/4 15:48:29 (fd:9) ** condor_starter (CONDOR_STARTER) STARTING UP
3/4 15:48:30 (fd:9) ** /vis/data/people/condor/linux-glibc23/sbin/condor_starter
3/4 15:48:30 (fd:9) ** $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:30 (fd:9) ** $CondorPlatform: I386-LINUX_RH9 $
3/4 15:48:30 (fd:9) ** PID = 18488
3/4 15:48:30 (fd:9) ** Running as root: Privilege switching in effect
3/4 15:48:30 (fd:9) ******************************************************
[...]
TransferSocket = "<130.73.68.82:21118>"
ShadowVersion = "$CondorVersion: 6.6.8 Jan 27 2005 $"
UidDomain = "zib.de"
3/4 15:48:32 (fd:11) --- End of ClassAd ---
3/4 15:48:32 (fd:11) STARTER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
3/4 15:48:32 (fd:11) New Daemon obj (shadow) name: "onyx3.zib.de", pool: "NULL", addr: "NULL"
3/4 15:48:32 (fd:11) Version of Shadow is $CondorVersion: 6.6.8 Jan 27 2005 $
3/4 15:48:32 (fd:11) Starter communicating with condor_shadow <130.73.68.82:21118>
3/4 15:48:32 (fd:11) Submitting machine is "onyx3.zib.de"
3/4 15:48:32 (fd:11) Doing CONDOR_register_starter_info
3/4 15:48:32 (fd:11) ShouldTransferFiles is "NO", NOT transfering files
3/4 15:48:32 (fd:11) Submit UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Local UidDomain: "zib.de"
3/4 15:48:32 (fd:11) Initialized user_priv as "..."
[ at this point the daemon crashes ]
--- End of log
--
Steffen Prohaska <prohaska@xxxxxx> <http://www.zib.de/prohaska/>
Zuse Institute Berlin, Takustraße 7, D-14195 Berlin-Dahlem, Germany
+49 (30) 841 85-337, fax -107
1024D/DA749299 print 8B59 83A8 A43D E0E2 DEDB D479 3157
2FEA DA74 9299