
Re: [Condor-users] condor_master problem



Hi Andrey,

My setup is not too different from yours; I'm just using SL-3.0.8 here (due to gLite middleware dependencies) and, as I said, Condor works fine right from boot on 3.0.8 as well. Only a few of the nodes are misbehaving. This is what I have in my condor_config file:

RELEASE_DIR             = /opt/condor
LOCAL_DIR               = /home/condorr
LOCAL_CONFIG_FILE       = $(LOCAL_DIR)/condor_config.local
UID_DOMAIN              = $(FULL_HOSTNAME)
SOFT_UID_DOMAIN         = TRUE
FILESYSTEM_DOMAIN       = $(FULL_HOSTNAME)

And this is the condor home directory:

[root@farm016 /]# ll ~condorr
total 20
lrwxrwxrwx    1 condorr  root           45 Mar  8 11:13 condor_config.local -> /opt/condor/local.farm016/condor_config.local
-rw-r--r--    1 root     root         1814 Mar  8 11:13 condor_config.local-ORG
drwxrwxrwt    2 condorr  root         4096 Mar  8 20:14 execute
drwxr-xr-x    2 condorr  root         4096 Mar 12 22:12 log
drwxr-xr-x    2 condorr  root         4096 Mar  8 11:13 spool

Now I'm using condor.boot as advised with run level 3 & 5 on:

[root@farm016 /]# chkconfig --list | grep condor
condor          0:off   1:off   2:off   3:on    4:off   5:on    6:off

But still, after a reboot, there is no sign of condor:

[santanu@baba1 ~]$ ssh root@xxxxxxxxxxxxxxxxxxxxxxxxx
root@xxxxxxxxxxxxxxxxxxxxxxxxx's password:
Last login: Mon Mar 12 22:28:50 2007 from host86-146-106-219.range86-146.btcentralplus.com
[root@farm016 root]# ps -ef | grep condor
root      2715  2562  0 23:12 pts/0    00:00:00 grep condor


and I see these in the MasterLog:

[root@farm016 root]# tail -f ~condorr/log/MasterLog
3/12 20:14:33 Using config source: /opt/condor/etc/condor_config
3/12 20:14:33 Using local config sources:
3/12 20:14:33    /home/condorr/condor_config.local
3/12 20:14:33 DaemonCore: Command Socket at <172.24.116.146:9692>
3/12 20:14:33 Started DaemonCore process "/opt/condor/sbin/condor_startd", pid and pgroup = 2731
3/12 21:14:33 Preen pid is 3000
3/12 21:14:33 Child 3000 died, but not a daemon -- Ignored
3/12 22:12:57 Got SIGTERM. Performing graceful shutdown.
3/12 22:12:57 SafeMsg: sending small msg failed. errno: 22
3/12 22:12:57 Send_Signal: ERROR sending signal 15 to pid 2731
3/12 22:12:57 ERROR: failed to send SIGTERM to pid 2731
3/12 22:12:57 The STARTD (pid 2731) exited with status 0
3/12 22:12:57 All daemons are gone.  Exiting.
3/12 22:12:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
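
One way to read this log: a master that starts writes a "STARTING UP" banner (as in the 3/8 excerpt further down the thread), so filtering the MasterLog for lifecycle events shows quickly whether anything was launched after the reboot. A minimal sketch, assuming the log lives at ~condorr/log/MasterLog as above; the function name master_events is made up for illustration:

```shell
#!/bin/sh
# Print only the lifecycle events from a Condor MasterLog.
# The patterns are taken from the log excerpts quoted in this thread.
master_events() {
    # $1: path to the MasterLog
    grep -E 'STARTING UP|Got SIGTERM|EXITING WITH STATUS|get lock' "$1"
}
```

It could be called as master_events ~condorr/log/MasterLog. If the last event is an exit with no later startup banner, the master was probably never launched at boot, which would point at the init script rather than at Condor itself.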


But once I run condor_master by hand, everything works fine after that:

[root@farm016 root]# condor_master
[root@farm016 root]# ps -ef | grep condor
condorr   2733     1  0 23:15 ?        00:00:00 condor_master
condorr   2734  2733 23 23:15 ?        00:00:00 condor_startd -f
root      2740  2562  0 23:15 pts/0    00:00:00 grep condor


The boxes that are giving trouble are all dual-core Xeon 5150s, but I don't think that would be a problem. The rest are 2.8GHz Xeons, and Condor is perfectly okay on them with the same sort of configuration. I'm just confused. There must be something very small that's being overlooked.

Thanks for helping,
Santanu



Andrey Kaliazin wrote:
Hello Santanu

Condor works perfectly well on Scientific Linux 4.4 - this is the
version we have installed. 
You really should consider starting Condor using Derek's condor.boot
file, which can be found in examples/ folder. I normally would copy
it to /etc/init.d/condor and change the MASTER line in it.
Then just add the service to the runlevels 3 and 5 either manually 
or via system-config-services (RedHat style, love it or hate it...)
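
The steps above could be sketched as follows. This is only an outline under assumptions: the location of condor.boot inside the release tree varies, the MASTER line still has to be edited by hand, and chkconfig --add only works if the script carries a "# chkconfig:" header line. The helper names run and install_condor_init are hypothetical; DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: install condor.boot as an init script and enable it in runlevels 3 and 5.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

install_condor_init() {
    # $1: path to condor.boot (its location is an assumption; check your release tree)
    src=$1
    run cp "$src" /etc/init.d/condor
    run chmod 755 /etc/init.d/condor
    # Register the service, then switch it on for runlevels 3 and 5
    run chkconfig --add condor
    run chkconfig --level 35 condor on
}
```

A dry run such as DRY_RUN=1 install_condor_init /path/to/condor.boot shows the four commands without touching the system.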

We also run condor daemons as user condor, but you do not have to 
export CONDOR_IDS for that - I have it undefined in the condor_config file.
Instead, we have condor_config file, which contains this -

CONDOR_ROOT             = /hpc/condor
RELEASE_DIR             = $(CONDOR_ROOT)/releases/x86
LOCAL_DIR               = $(TILDE)
#  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE       = $(CONDOR_ROOT)/hosts/$(HOSTNAME)
REQUIRE_LOCAL_CONFIG_FILE = TRUE

/hpc/condor - is shared out to all clients, while on each machine
condor has its own home folder, with condor_config linked to the
central one -
# ll ~condor
total 1
lrwxrwxrwx 1 condor root  25 Sep 13 12:44 condor_config -> /hpc/condor/condor_config
drwxr-xr-x 2 condor root  48 Sep 13 12:47 execute
drwxr-xr-x 2 condor root 552 Mar 12 16:40 log
drwxr-xr-x 3 condor root 256 Mar 12 12:13 spool

Each "local config file" sits in a shared folder on the server and is
actually a symbolic link to one of just a few real config files, 
which reflects different architecture and setup.
This way we have the central Condor server, a view server and many clients
happily running Debian (Ubuntu), RedHat, ScientificLinux, SuSE and SLES
in both i386 and 64-bit. On all of them condor master is started by pretty
much the same /etc/init.d/condor file - the only difference is in MASTER,
which specifies the platform: the 32- or 64-bit version of the binaries.

cheers,

Andrey Kaliazin
Senior Server Engineer (cluster computing)
Information Systems Aston (ISA)
Aston University, Aston Triangle,
Birmingham, B4 7ET 
Tel: 0121 204 3465 
 

  
-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx 
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Santanu Das
Sent: Monday, March 12, 2007 8:14 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] condor_master problem

Hi Nicolas,

Thanks for sharing this info, but unfortunately it still doesn't 
work on Scientific Linux. First of all, I had never used 
condor.boot before. This time I tried it (following the 
instructions written inside) and it didn't work either. I'm 
exporting CONDOR_CONFIG and PATH from /etc/profile.d/, but 
whatever I do, Condor just doesn't start until I run 
condor_master by hand. I must have missed some silly part(s). 
1/3 of my nodes are okay; just the newly installed nodes are 
driving me crazy.

Cheers,
Santanu


Nicolas GUIOT wrote: 

	You said it starts fine when you run it yourself? Maybe the PATH is
	not set when the daemon runs: that happens on my Debian boxes.
	I have to add this in the condor.boot file:
	
	export CONDOR_CONFIG=/nfs/condor/etc/condor_config
	PATH=/nfs/opt/condor/bin:/nfs/opt/condor/sbin
	
	MASTER=/nfs/opt/condor/sbin/condor_master
	PS="/bin/ps auwx"
	GREP="/bin/grep"
	AWK="/usr/bin/awk"
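
A related point: files in /etc/profile.d/ are sourced by login shells, so variables exported there are generally not visible to an init script at boot, which is why setting them inside condor.boot as above can matter. A small pre-flight check placed before the start command can make this kind of failure visible. A sketch only; check_condor_env is a hypothetical helper, not part of condor.boot:

```shell
#!/bin/sh
# Sketch of a pre-flight check for an init script: fail loudly if the
# files the master needs are not in place at boot time.
check_condor_env() {
    # $1: path to condor_master, $2: path to condor_config
    if [ ! -x "$1" ]; then
        echo "condor_master not executable: $1" >&2
        return 1
    fi
    if [ ! -r "$2" ]; then
        echo "condor_config not readable: $2" >&2
        return 1
    fi
    return 0
}
```

In condor.boot this might be called as check_condor_env "$MASTER" "$CONDOR_CONFIG" || exit 1 just before the master is started. On setups where these paths sit on NFS, a failure here at boot would suggest the filesystem is not mounted yet when the script runs.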
	
	
	Hope this helps...
	Nicolas
	
	----------------
	On Mon, 12 Mar 2007 16:22:44 +0000
	Santanu Das <santanu@xxxxxxxxxxxxxxxxx> wrote:
	
	  

		Hi Steve,
		
		Thanks for replying. I tried that, but it didn't help. Whether or not
		I delete the file, running condor_master by hand starts condor
		nicely, but it still doesn't start automatically when I reboot.
		Is there anything else I am missing?
		
		Cheers,
		Santanu
		
		
		Steven Timm wrote:
		    

			Remove that lock file in /tmp that is mentioned in the error message
			below, and condor will start.
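
Before deleting anything, it may be worth checking whether the lock is actually stale: errno 11 (EAGAIN) from a lock call usually means another process currently holds the lock, which would point to a condor_master that is already running. A minimal sketch for listing leftover lock directories, assuming the condor-lock.* naming seen in the error message; list_condor_locks is a made-up name:

```shell
#!/bin/sh
# List leftover Condor lock directories under a temp directory.
# The "condor-lock.*" name pattern is taken from the error message in this thread.
list_condor_locks() {
    # $1: directory to search (defaults to /tmp)
    dir=${1:-/tmp}
    for d in "$dir"/condor-lock.*; do
        # Skip the unexpanded pattern when nothing matches
        [ -e "$d" ] && echo "$d"
    done
    return 0
}
```

Once ps -ef | grep condor confirms that no master is running, the directories printed by list_condor_locks could be removed by hand.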
			
			Steve
			
			
			
			------------------------------------------------------------------
			Steven C. Timm, Ph.D  (630) 840-8525
			timm@xxxxxxxx  http://home.fnal.gov/~timm/
			Fermilab Computing Division, Scientific Computing Facilities,
			Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
			
			On Sat, 10 Mar 2007, Santanu Das wrote:
			
			  
			      

				Hi,
				I'm still having the same problem - condor_master just doesn't start
				automatically after boot. Does anybody know anything about it? Thanks in
				advance for your help.
				
				Cheers,
				Santanu
				
				Santanu Das wrote:
				    
				        

					Hi all,
					
					We have a ~150 CPU condor cluster; most of the nodes are dual-core
					Xeons and a few are single-core Xeons. Recently I upgraded to
					condor-6.8.4 and since then I see a problem, mostly on the dual-core
					nodes. I start condor from "rc.local", and the problem is that
					Condor no longer starts automatically on boot, in spite of having
					"condor_master" in the rc.local file. If I run condor_master by hand
					from the console, condor starts and everything goes fine after that.
					For some reason, I run condor here as a different user (*NOT* as the
					default "condor" user), but I don't think that's the problem.
					CONDOR_IDS is correct in the local config file. There are no
					significant differences (from the configuration point of view) among
					the nodes; all are almost identically configured (apart from the
					dual-core versus single-core issue). I just see these in the MasterLog:
					
					3/8 17:56:03 ******************************************************
					3/8 17:56:03 ** condor_master (CONDOR_MASTER) STARTING UP
					3/8 17:56:03 ** /opt/condor-6.8.4/sbin/condor_master
					3/8 17:56:03 ** $CondorVersion: 6.8.4 Feb  1 2007 $
					3/8 17:56:03 ** $CondorPlatform: I386-LINUX_RH9 $
					3/8 17:56:03 ** PID = 3216
					3/8 17:56:03 ** Log last touched 3/8 17:56:02
					3/8 17:56:03 ******************************************************
					3/8 17:56:03 Using config source: /opt/condor/etc/condor_config
					3/8 17:56:03 Using local config sources:
					3/8 17:56:03    /home/condorr/condor_config.local
					3/8 17:56:03 FileLock::obtain(1) failed - errno 11 (Resource temporarily unavailable)
					3/8 17:56:03 ERROR "Can't get lock on "/tmp/condor-lock.farm0420.21308906360446/InstanceLock"" at line 978 in file master.C
					3/8 18:08:57 Got SIGTERM. Performing graceful shutdown.
					3/8 18:08:57 SafeMsg: sending small msg failed. errno: 22
					3/8 18:08:57 Send_Signal: ERROR sending signal 15 to pid 3181
					3/8 18:08:57 ERROR: failed to send SIGTERM to pid 3181
					3/8 18:08:57 The STARTD (pid 3181) exited with status 0
					3/8 18:08:57 All daemons are gone.  Exiting.
					3/8 18:08:57 **** condor_master (condor_MASTER) EXITING WITH STATUS 0
					3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
					
					3/8 18:12:11 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
					
					Any idea what might be the problem or what I am missing?
					
					Cheers,
					Santanu
					HEP, Cavendish Laboratory
					Cambridge
					
					      
					          

				
				_______________________________________________
				Condor-users mailing list
				To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
				subject: Unsubscribe
				You can also unsubscribe by visiting
				https://lists.cs.wisc.edu/mailman/listinfo/condor-users
				
				The archives can be found at either
				https://lists.cs.wisc.edu/archive/condor-users/
				http://www.opencondor.org/spaces/viewmailarchive.action?key=CONDOR
				
				    
				        


	
	----------
	
	
	----------------------------------------------------
	CNRS - UPR 9080 : Laboratoire de Biochimie Theorique
	
	Institut de Biologie Physico-Chimique
	13 rue Pierre et Marie Curie
	75005 PARIS - FRANCE
	
	Tel : +33 158 41 51 70
	Fax : +33 158 41 50 26
	----------------------------------------------------



    
