Hi Andrey,

My setup is not too different from yours; I'm just using SL-3.0.8 (due to gLite middleware dependencies) here and, as I said, Condor works pretty well right from boot on 3.0.8 too. Only a few of the nodes are misbehaving. This is what I have in my condor_config file:

    RELEASE_DIR = /opt/condor
    LOCAL_DIR = /home/condorr
    LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
    UID_DOMAIN = $(FULL_HOSTNAME)
    SOFT_UID_DOMAIN = TRUE
    FILESYSTEM_DOMAIN = $(FULL_HOSTNAME)

And this is the condor home directory:

    [root@farm016 /]# ll ~condorr
    total 20
    lrwxrwxrwx 1 condorr root 45 Mar 8 11:13 condor_config.local -> /opt/condor/local.farm016/condor_config.local
    -rw-r--r-- 1 root root 1814 Mar 8 11:13 condor_config.local-ORG
    drwxrwxrwt 2 condorr root 4096 Mar 8 20:14 execute
    drwxr-xr-x 2 condorr root 4096 Mar 12 22:12 log
    drwxr-xr-x 2 condorr root 4096 Mar 8 11:13 spool

Now I'm using condor.boot as advised, with run levels 3 and 5 on:

    [root@farm016 /]# chkconfig --list | grep condor
    condor 0:off 1:off 2:off 3:on 4:off 5:on 6:off

But still, after a reboot, there is no sign of condor:

    [santanu@baba1 ~]$ ssh root@xxxxxxxxxxxxxxxxxxxxxxxxx
    root@xxxxxxxxxxxxxxxxxxxxxxxxx's password:
    Last login: Mon Mar 12 22:28:50 2007 from host86-146-106-219.range86-146.btcentralplus.com
    [root@farm016 root]# ps -ef | grep condor
    root 2715 2562 0 23:12 pts/0 00:00:00 grep condor

and I see these in the MasterLog:

    [root@farm016 root]# tail -f ~condorr/log/MasterLog
    3/12 20:14:33 Using config source: /opt/condor/etc/condor_config
    3/12 20:14:33 Using local config sources:
    3/12 20:14:33 /home/condorr/condor_config.local
    3/12 20:14:33 DaemonCore: Command Socket at <172.24.116.146:9692>
    3/12 20:14:33 Started DaemonCore process "/opt/condor/sbin/condor_startd", pid and pgroup = 2731
    3/12 21:14:33 Preen pid is 3000
    3/12 21:14:33 Child 3000 died, but not a daemon -- Ignored
    3/12 22:12:57 Got SIGTERM. Performing graceful shutdown.
    3/12 22:12:57 SafeMsg: sending small msg failed.
    errno: 22
    3/12 22:12:57 Send_Signal: ERROR sending signal 15 to pid 2731
    3/12 22:12:57 ERROR: failed to send SIGTERM to pid 2731
    3/12 22:12:57 The STARTD (pid 2731) exited with status 0
    3/12 22:12:57 All daemons are gone. Exiting.
    3/12 22:12:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0

But once I run condor_master by hand, all goes well after that:

    [root@farm016 root]# condor_master
    [root@farm016 root]# ps -ef | grep condor
    condorr 2733 1 0 23:15 ? 00:00:00 condor_master
    condorr 2734 2733 23 23:15 ? 00:00:00 condor_startd -f
    root 2740 2562 0 23:15 pts/0 00:00:00 grep condor

The boxes that are giving trouble are all dual-core Xeon 5150, but I don't think that would be a problem. The rest are 2.8GHz Xeon and condor is perfectly okay on them with the same sort of configuration. I'm just confused. There must be something very small that's being overlooked.

Thanks for helping,
Santanu

Andrey Kaliazin wrote:

Hello Santanu

Condor works perfectly well on Scientific Linux 4.4 - this is the version we have installed. You really should consider starting Condor using Derek's condor.boot file, which can be found in the examples/ folder. I normally copy it to /etc/init.d/condor and change the MASTER line in it. Then just add the service to runlevels 3 and 5, either manually or via system-config-services (RedHat style, love it or hate it...)

We also run condor daemons as user condor, but you do not have to export CONDOR_IDS for that - I have it undefined in the condor_config file. Instead, we have a condor_config file which contains this -

    CONDOR_ROOT = /hpc/condor
    RELEASE_DIR = $(CONDOR_ROOT)/releases/x86
    LOCAL_DIR = $(TILDE)
    # Where is the machine-specific local config file for each host?
    LOCAL_CONFIG_FILE = $(CONDOR_ROOT)/hosts/$(HOSTNAME)
    REQUIRE_LOCAL_CONFIG_FILE = TRUE

/hpc/condor is shared out to all clients, while on each machine condor has its own home folder, with condor_config linked to the central one -

    # ll ~condor
    total 1
    lrwxrwxrwx 1 condor root 25 Sep 13 12:44 condor_config -> /hpc/condor/condor_config
    drwxr-xr-x 2 condor root 48 Sep 13 12:47 execute
    drwxr-xr-x 2 condor root 552 Mar 12 16:40 log
    drwxr-xr-x 3 condor root 256 Mar 12 12:13 spool

Each "local config file" sits in a shared folder on the server and is actually a symbolic link to one of just a few real config files, which reflect different architectures and setups. This way we have the central condor server, a view server and many clients happily running Debian (Ubuntu), RedHat, ScientificLinux, SuSE and SLES in both i386 and 64-bit. On all of them condor master is started by pretty much the same /etc/init.d/condor file - the only difference is in MASTER, which specifies the platform, the 32 or 64 bit version of the binaries.

cheers,
Andrey Kaliazin
Senior Server Engineer (cluster computing)
Information Systems Aston (ISA)
Aston University, Aston Triangle, Birmingham, B4 7ET
Tel: 0121 204 3465

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Santanu Das
Sent: Monday, March 12, 2007 8:14 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_master problem

Hi Nicolas,

Thanks for sharing this info, but unfortunately it still doesn't work on Scientific Linux. First of all, I had never used condor.boot before. This time I tried it (following the instructions written inside) and it didn't work either. I'm exporting CONDOR_CONFIG and PATH from /etc/profile.d/, but whatever I do, Condor just does not start until I run condor_master by hand. I must have missed some silly part(s). 1/3 of my nodes are okay; only the newly installed nodes are driving me crazy.
Cheers,
Santanu

Nicolas GUIOT wrote:

You said it started well when you run them by yourself? Maybe the PATH is not set when the daemon runs: that happens on my debian boxes. I have to add this to the condor.boot file:

    export CONDOR_CONFIG=/nfs/condor/etc/condor_config
    PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin
    MASTER=/nfs/opt/condor/sbin/condor_master
    PS="/bin/ps auwx"
    GREP="/bin/grep"
    AWK="/usr/bin/awk"

Hope this helps...
Nicolas

----------------

On Mon, 12 Mar 2007 16:22:44 +0000 Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:

Hi Steve,

Thanks for replying. I tried that but it didn't help. Whether or not I delete the file, running condor_master by hand starts condor nicely, but it still doesn't start automatically when I reboot. Is there anything else I am missing?

Cheers,
Santanu

Steven Timm wrote:

Remove that lock file in /tmp that is mentioned in the error message below, and condor will start.

Steve

------------------------------------------------------------------
Steven C. Timm, Ph.D (630) 840-8525
timm@xxxxxxxx http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.

On Sat, 10 Mar 2007, Santanu Das wrote:

Hi,

I'm still having the same problem - condor_master just doesn't start automatically after boot. Does anybody know anything about it? Thanks in advance for your help.

Cheers,
Santanu

Santanu Das wrote:

Hi all,

We have a ~150 CPU condor cluster; most of the nodes are dual-core Xeon and a few are single-core Xeon. Recently I upgraded to condor-6.8.4 and since then I see a problem, mostly on the dual-core nodes. I start condor from "rc.local", and the problem is that Condor no longer starts automatically on boot, in spite of having "condor_master" in the rc.local file. If I run condor_master by hand from the console, condor starts and everything goes fine after that.
For some reason, I run condor here as a different user (*NOT* as the default "condor" user), but I don't think that's the problem. CONDOR_IDS is correct in the local config file. There are no significant differences (from the configuration point of view) among the nodes; all are almost identically configured (apart from that dual-core and single-core issue). I just see these in the MasterLog:

    3/8 17:56:03 ******************************************************
    3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
    3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
    3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb 1 2007 $
    3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
    3/8 17:56:03 ** PID = 3216
    3/8 17:56:03 ** Log last touched 3/8 17:56:02
    3/8 17:56:03 ******************************************************
    3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
    3/8 17:56:03 Using local config sources:
    3/8 17:56:03 /home/condorr/condor_config.local
    3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
    3/8 17:56:03 ERROR "Can't get lock on "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978 in file master.C
    3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
    3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
    3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
    3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
    3/8 18:08:57 The STARTD (pid 3181) exited with status 0
    3/8 18:08:57 All daemons are gone. Exiting.
    3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
    3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
    3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

Any idea what might be the problem or what I am missing?
Cheers,
Santanu
HEP, Cavendish Laboratory
Cambridge

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a subject: Unsubscribe
You can also unsubscribe by visiting https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at either
https://lists.cs.wisc.edu/archive/condor-users/
http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR

----------------------------------------------------
CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
Institut de Biologie Physico-Chimique
13 rue Pierre et Marie Curie
75005 PARIS - FRANCE
Tel : +33 158 41 51 70 Fax : +33 158 41 50 26
----------------------------------------------------
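
For anyone hitting the same symptom, two of the checks discussed in this thread (the leftover lock in /tmp from the MasterLog error, and the MASTER line in the condor.boot style init script) could be scripted roughly as below. This is only a sketch under assumptions: the function names find_instance_locks and master_path_from_init are made up for illustration and are not part of Condor, the condor-lock.*/InstanceLock layout is taken from the single log line quoted above, and the MASTER= line is assumed to be unquoted as in Nicolas's excerpt.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not shipped with Condor.

# List InstanceLock files left under condor-lock.* directories.
# The directory layout is assumed from the MasterLog error in this thread.
# Usage: find_instance_locks [lock_root]   (lock_root defaults to /tmp)
find_instance_locks() {
    lock_root="${1:-/tmp}"
    for dir in "$lock_root"/condor-lock.*; do
        # When nothing matches, the glob stays literal; skip it.
        [ -d "$dir" ] || continue
        if [ -e "$dir/InstanceLock" ]; then
            printf '%s\n' "$dir/InstanceLock"
        fi
    done
}

# Print the path assigned on the first MASTER= line of an init script,
# so that it can be checked with `test -x`.
# Assumes an unquoted value, e.g. MASTER=/opt/condor/sbin/condor_master
# Usage: master_path_from_init /etc/init.d/condor
master_path_from_init() {
    sed -n 's/^[[:space:]]*MASTER=//p' "$1" | head -n 1
}
```

Running find_instance_locks with no argument looks in /tmp, and `test -x "$(master_path_from_init /etc/init.d/condor)"` tells whether the binary the init script points at is executable. A reported lock should only be removed after `ps -ef | grep condor` shows no condor_master running, since a lock held by a live master is expected.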