Re: [Condor-users] Trying to run condor_glidein on the National Grid Service
- Date: Thu, 16 Oct 2008 21:47:42 +0100
- From: Jean-Alain Grunchec <jgrunche@xxxxxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Trying to run condor_glidein on the National Grid Service
Hi Dan,
Yes, that's the right address (I replaced the real address with
129.130.131.132). Yes, there is a firewall on the HPC.
I assume that may be the reason why the connection cannot be established.
Basically, I know the HPC firewall allows outbound connections on
port 80. If I were to run the collector on port 80, would that be OK?
(Some HPC systems on the NGS only allow connections through port 443, so I
may need to redirect connections if I did something like that.)
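One way to check which outbound ports the firewall actually permits is to attempt a plain TCP connection from the HPC node to the central manager. The sketch below is only an illustration: `probe` is a hypothetical helper, not part of Condor, and the address and ports in the usage note are the placeholders used in this thread. It reports the errno of a failed attempt, so a blocked route should show up much as it does in the glidein log.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Try a single TCP connection and return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all: a firewall may be silently dropping packets.
        return False, "timed out (packets may be silently dropped)"
    except OSError as exc:
        # Refused, unreachable, and similar failures carry an errno.
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"errno {exc.errno} ({name}): {exc.strerror or exc}"


if __name__ == "__main__":
    # Placeholder address from this thread; ports 80 and 9618 as discussed.
    for port in (80, 9618):
        print(port, probe("129.130.131.132", port))
```

Run from the HPC head node, a success on port 80 together with a failure on port 9618 would support the firewall explanation. It would not by itself show that the collector can be moved to port 80.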
Currently, I am trying to set up GCB, but I am having problems with it.
I added GCB_BROKER to the daemon list in
/home/condor/condor_config.local (DAEMON_LIST = MASTER, COLLECTOR,
NEGOTIATOR, STARTD, SCHEDD, GCB_BROKER).
I also appended the following lines to /home/condor/condor_config.local:
GCB_BROKER = $(RELEASE_DIR)/libexec/gcb_broker
GCB_RELAY = $(RELEASE_DIR)/libexec/gcb_relay_server
GCB_BROKER_ENV =
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_RELAY_SERVER=$(GCB_RELAY)
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_LOG_DIR=$(LOG)
GCB_BROKER_ENVIRONMENT = $(GCB_BROKER_ENV)
GCB_BROKER_IP = $(ip_address)
GCB_BROKER_ARGS = -i $(GCB_BROKER_IP)
NET_REMAP_ENABLE = true
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 129.130.131.132
NET_REMAP_ROUTE = /home/condor/condor_routetable.txt
BIND_ALL_INTERFACES = true
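The three GCB_BROKER_ENV lines build the value up step by step, each one referring to the previous definition. The sketch below models that with a deliberately simplified expander, which is an assumption for illustration and not Condor's actual parser: a self-reference is replaced by the earlier value at assignment time, and other `$(NAME)` references are expanded at lookup. The RELEASE_DIR and LOG values are the paths visible in the logs of this message.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def load(lines):
    """Parse NAME = value lines; a self-reference takes the prior value."""
    table = {}
    for line in lines:
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        previous = table.get(name, "")
        # Replace $(NAME) inside its own definition with the earlier value.
        table[name] = value.replace(f"$({name})", previous)
    return table


def expand(table, name):
    """Recursively expand references to other macros at lookup time."""
    return MACRO.sub(lambda m: expand(table, m.group(1)), table.get(name, ""))


config = [
    "RELEASE_DIR = /opt/condor-release-6.8.8",
    "LOG = /home/condor/log",
    "GCB_RELAY = $(RELEASE_DIR)/libexec/gcb_relay_server",
    "GCB_BROKER_ENV =",
    "GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_RELAY_SERVER=$(GCB_RELAY)",
    "GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_LOG_DIR=$(LOG)",
    "GCB_BROKER_ENVIRONMENT = $(GCB_BROKER_ENV)",
]

table = load(config)
print(expand(table, "GCB_BROKER_ENVIRONMENT"))
```

Under this model the environment string starts with a semicolon, because the first definition of GCB_BROKER_ENV is empty, followed by the GCB_RELAY_SERVER and GCB_LOG_DIR assignments.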
I also wrote a "route table":
[me@mycomputer ~]$ cat /home/condor/condor_routetable.txt
129.11.27.0/24 GCB
*/0 direct
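The route table pairs a destination network with a connection method. The sketch below shows one plausible reading of such a table: entries are checked in order and the first match decides, with `*/0` treated as a catch-all. Both the first-match rule and the `route_for` helper are assumptions made for illustration; the GCB documentation is the authority on the real semantics.

```python
import ipaddress

ROUTE_TABLE = """\
129.11.27.0/24 GCB
*/0 direct
"""


def parse_routes(text):
    """Return a list of (network or None, method); None is the catch-all."""
    routes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        target, method = line.split()
        if target.startswith("*"):
            routes.append((None, method))
        else:
            routes.append((ipaddress.ip_network(target), method))
    return routes


def route_for(routes, address):
    """First matching entry wins (assumed ordering rule)."""
    ip = ipaddress.ip_address(address)
    for network, method in routes:
        if network is None or ip in network:
            return method
    return None


routes = parse_routes(ROUTE_TABLE)
print(route_for(routes, "129.11.27.42"))
print(route_for(routes, "10.141.0.9"))
```

Under this reading, only destinations inside 129.11.27.0/24 would go through GCB. An address such as 10.141.0.9, which appears in the condor_master log from Leeds, would fall through to the catch-all entry.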
But immediately after I started Condor, I saw some warnings in the log
files (especially in the CollectorLog) and errors in SchedLog and StartLog:
MasterLog:
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_master (CONDOR_MASTER) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_master
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5221
10/16 21:28:54 ** Log last touched time unavailable (No such file or
directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54 /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9620>
10/16 21:28:54 Log file not found in config file: GCB_BROKER_LOG
10/16 21:28:54 Started DaemonCore process
"/opt/condor-release-6.8.8/sbin/condor_collector", pid and pgroup = 5222
10/16 21:28:54 Started DaemonCore process
"/opt/condor-release-6.8.8/sbin/condor_negotiator", pid and pgroup = 5223
10/16 21:28:54 Started DaemonCore process
"/opt/condor-release-6.8.8/sbin/condor_startd", pid and pgroup = 5224
10/16 21:28:54 Started DaemonCore process
"/opt/condor-release-6.8.8/sbin/condor_schedd", pid and pgroup = 5225
10/16 21:28:54 Started process
"/opt/condor-release-6.8.8/libexec/gcb_broker", pid and pgroup = 5226
[jgrunche@epistasis ~]$ cat /home/condor/log/StartLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_startd (CONDOR_STARTD) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_startd
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5224
10/16 21:28:54 ** Log last touched time unavailable (No such file or
directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54 /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9644>
10/16 21:28:55 vm1: New machine resource allocated
10/16 21:28:55 vm2: New machine resource allocated
10/16 21:28:55 About to run initial benchmarks.
10/16 21:29:02 Completed initial benchmarks.
10/16 21:29:02 vm1: State change: IS_OWNER is false
10/16 21:29:02 vm1: Changing state: Owner -> Unclaimed
10/16 21:29:02 vm2: State change: IS_OWNER is false
10/16 21:29:02 vm2: Changing state: Owner -> Unclaimed
10/16 21:29:02 GCB: ERROR "GCB_bind: binding the socket locally failed"
errno 98: Address already in use
10/16 21:29:07 GCB: ERROR "GCB_bind: binding the socket locally failed"
errno 98: Address already in use
[jgrunche@epistasis ~]$ cat /home/condor/log/SchedLog
10/16 21:28:54 (pid:5225)
******************************************************
10/16 21:28:54 (pid:5225) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 21:28:54 (pid:5225) ** /opt/condor-release-6.8.8/sbin/condor_schedd
10/16 21:28:54 (pid:5225) ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 (pid:5225) ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 (pid:5225) ** PID = 5225
10/16 21:28:54 (pid:5225) ** Log last touched time unavailable (No such
file or directory)
10/16 21:28:54 (pid:5225)
******************************************************
10/16 21:28:54 (pid:5225) Using config source: /home/condor/condor_config
10/16 21:28:54 (pid:5225) Using local config sources:
10/16 21:28:54 (pid:5225) /home/condor/condor_config.local
10/16 21:28:54 (pid:5225) DaemonCore: Command Socket at
<129.130.131.132:9623>
10/16 21:28:54 (pid:5225) History file rotation is enabled.
10/16 21:28:54 (pid:5225) Maximum history file size is: 20971520 bytes
10/16 21:28:54 (pid:5225) Number of rotated history files is: 2
10/16 21:28:54 (pid:5225) Sent ad to central manager for
me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) Sent ad to 1 collectors for me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) After chmod(), still can't remove
"/tmp/condor_g_scratch.0x9931278.4435" as directory owner, giving up!
10/16 21:28:54 (pid:5225) Started condor_gmanager for owner me pid=5239
10/16 21:30:49 (pid:5225) condor_gridmanager (PID 5239, owner me) exited
with return code 0.
10/16 21:33:54 (pid:5225) GCB: ERROR "GCB_bind: binding the socket
locally failed" errno 98: Address already in use
10/16 21:33:54 (pid:5225) Sent owner (0 jobs) ad to 1 collectors
[me@mycomputer ~]$ cat /home/condor/log/CollectorLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_collector
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5222
10/16 21:28:54 ** Log last touched time unavailable (No such file or
directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54 /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9618>
10/16 21:28:54 In ViewServer::Init()
10/16 21:28:54 In CollectorDaemon::Init()
10/16 21:28:54 In ViewServer::Config()
10/16 21:28:54 In CollectorDaemon::Config()
10/16 21:28:54 enable: Creating stats hash table
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 Got QUERY_STARTD_PVT_ADS
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 NegotiatorAd : Inserting ** "< mycomputer.ed.ac.uk >"
10/16 21:28:54 stats: Inserting new hashent for
'Negotiator':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 WARNING: No master ad for < mycomputer.ed.ac.uk >
10/16 21:28:54 ScheddAd : Inserting ** "< mycomputer.ed.ac.uk ,
129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for
'Schedd':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 SubmittorAd : Inserting ** "<
me@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for
'Submittor':'me@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:28:59 ** Master < mycomputer.ed.ac.uk > rejuvenated from
recently down
10/16 21:28:59 stats: Inserting new hashent for
'Master':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:29:06 WARNING: No master ad for < vm1@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:06 StartdAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx ,
129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for
'Start':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:06 StartdPvtAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx ,
129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for
'StartdPvt':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 WARNING: No master ad for < vm2@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:07 StartdAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx ,
129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for
'Start':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 StartdPvtAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx ,
129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for
'StartdPvt':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
[me@mycomputer ~]$ cat /home/condor/log/GCB_BrokerLog
10/16 21:28:54 ****************************************
10/16 21:28:54 New log file started
10/16 21:28:54 Max size = 640000
10/16 21:28:54 Log level: D_BASIC
10/16 21:28:54 ****************************************
10/16 21:28:54 [broker.C:199] ++++++++++++++++++++++++++++++
10/16 21:28:54 [broker.C:200] + STARTING Broker (pid: 5226)
10/16 21:28:54 [broker.C:201] + $GCBVersion: 1.3.2 $
10/16 21:28:54 [broker.C:202] + $GCBBuildDate: Dec 19 2007 $
10/16 21:28:54 [broker.C:255] + Listening at 129.130.131.132:65432
10/16 21:28:54 [broker.C:275] + Using relay_server:
/opt/condor-release-6.8.8/libexec/gcb_relay_server
10/16 21:28:54 [broker.C:276] ++++++++++++++++++++++++++++++
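For reference, the "errno 98: Address already in use" reported by GCB_bind in StartLog and SchedLog is EADDRINUSE on Linux: a bind was attempted on an address and port that another socket already holds. The sketch below reproduces that condition with two ordinary sockets. It says nothing about which Condor or GCB socket collides here; that would have to be checked on the machine itself, for example with netstat.

```python
import errno
import socket

first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))  # let the OS pick a free port
first.listen(1)
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same address and port as `first`
    collided = False
except OSError as exc:
    collided = exc.errno == errno.EADDRINUSE
    print("bind failed:", exc)
finally:
    second.close()
    first.close()
```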
Thank you for your help,
Jean-Alain
Jean-Alain,
It sounds like there are two problems when your glideins try to run:
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
Is that the address of the collector to which the glideins should be
advertising themselves? Are there any firewalls or anything that would
prevent them from connecting?
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
This confusing line in the logs should be ignored. It is no longer
produced in the 7.0 or 7.1 series of condor.
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C
I don't know what is going wrong. Do you have any special SMP-related
configuration in your glidein configuration? Normally, one configures
glideins with NUM_CPUS=1 to force each instance of glidein to only
advertise a single slot, rather than one per cpu on the machine.
--Dan
Jean-Alain Grunchec wrote:
Hi,
I am trying to run GlideIn jobs on the UK National Grid Service. I set
up a local machine as a Condor Central Manager and installed Globus
Toolkit 4 on it.
Now I am trying to submit GlideIn jobs to an HPC in Leeds (the ultimate idea
being the submission of many GlideIn jobs to several NGS resources).
So I run the following command, which at least starts something in Leeds:
[me@mycomputer test]$ condor_glidein -count 10 -arch
6.6.7-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork
ngs.leeds.ac.uk/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
Installing /home/ngs0123/Condor_glidein/glidein_condor_config.
Installing
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup.
Installing Condor daemons in
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Downloaded
http://www.cs.wisc.edu/condor/glidein/binaries/6.6.7-i686-pc-Linux-2.4.tar.gz
to /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Installation successfully completed.
Launching Glidein...
Submitting Glidein job...
Submitting job(s).
1 job(s) submitted to cluster 5.
You have new mail in /var/spool/mail/me
Ideally, at this point I would have 10 new machines added to my Condor
pool, so I check, but there are no new machines there. I read the email that was sent:
Date: Wed, 15 Oct 2008 22:14:19 +0100
From: Me <me@xxxxxxxxxxxxxxxxxxx>
Message-Id: <200810152114.m9FLEJAu003357@xxxxxxxxxxxxxxxxxxx>
To: me@xxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 4.0
This is an automated email from the Condor system
on machine "mycomputer.ed.ac.uk". Do not reply.
Your Condor job 4.0
/home/me/test/glidein_remote_setup.3117 $(HOME)/Condor_glidein
$(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4 6.6.7-i686-pc-Linux-2.4
$(HOME)/Condor_glidein/local
'http://www.cs.wisc.edu/condor/glidein/binaries,gsiftp://gridftp.cs.wisc.edu/p/condor/public/binaries/glidein'
0
has exited.
Submitted at: Wed Oct 15 22:10:55 2008
Completed at: Wed Oct 15 22:14:19 2008
Real Time: 0 00:03:24
Something has run, but I am not sure the GlideIn jobs really ran OK.
So I checked on the head node in Leeds for leftover temporary
files, and yes, there are a few.
10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_master (CONDOR_MASTER) STARTING UP
10/15 22:19:05 **
/nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_master
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20178
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file:
/home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:53102>
10/15 22:19:05 Started DaemonCore process
"/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd",
pid and pgroup = 20179
10/15 22:19:07 The STARTD (pid 20179) exited with status 4
10/15 22:19:07 Sending obituary for
"/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd"
10/15 22:19:07 restarting
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd in 10
seconds
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
10/15 22:19:10 Will keep trying for 10 seconds...
10/15 22:19:19 Connect failed for 10 seconds; returning FALSE
10/15 22:19:19 ERROR:
SECMAN:2003:TCP connection to <129.130.131.132:9618> failed
Here there is apparently a problem connecting to the server.
I looked at the condor_startd log in Leeds and it is even more bizarre.
[ngs0123@ngs log.10.141.0.9-20178]$ cat StartdLog
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_startd (CONDOR_STARTD) STARTING UP
10/15 22:19:05 **
/nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20179
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file:
/home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:48585>
10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
Available: Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file
ResMgr.C
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
Has anybody managed to use GlideIn on the NGS? Alternatively, if
somebody has used GlideIn on another Grid, your experience may help me.
Thank you very much,
J-A