Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Trying to run condor_glidein on the National Grid Service

Date: Thu, 16 Oct 2008 21:47:42 +0100
From: Jean-Alain Grunchec <jgrunche@xxxxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Trying to run condor_glidein on the National Grid Service

Hi Dan,

Yes, that's the right address (I replaced the real address with 129.130.131.132). Yes, there is a firewall on the HPC.

I assume that may be the reason why the connection cannot be established.

Basically, I know the firewall of the HPC allows outbound connections on port 80. If I were to run the collector on port 80, would that be OK? (Some HPCs on the NGS only allow connections through 443, so I may need to redirect connections if I were doing something like that...)
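
One way to check which outbound ports the firewall really permits is a small TCP probe run from an HPC node towards the central manager. The following is only a minimal sketch and is not part of Condor: the function name `probe` and the default timeout are assumptions made for illustration.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Attempt one TCP connection and return 'open' or a short failure reason."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout (packets are often silently dropped).
        return "timed out"
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one is known.
        name = errno.errorcode.get(exc.errno, "unknown")
        return f"failed: {name}"
```

For example, calling `probe("129.130.131.132", 9618)` and then the same with ports 80 and 443 from a worker node would show whether the collector port is reachable at all, and whether the web ports behave differently.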


Currently, I am trying to set up the GCB, but I have issues with it.

I added GCB_BROKER to the daemon list in /home/condor/condor_config.local (DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD, GCB_BROKER).

I also appended the following lines to /home/condor/condor_config.local


GCB_BROKER = $(RELEASE_DIR)/libexec/gcb_broker
GCB_RELAY = $(RELEASE_DIR)/libexec/gcb_relay_server
GCB_BROKER_ENV =
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_RELAY_SERVER=$(GCB_RELAY)
GCB_BROKER_ENV = $(GCB_BROKER_ENV);GCB_LOG_DIR=$(LOG)
GCB_BROKER_ENVIRONMENT = $(GCB_BROKER_ENV)
GCB_BROKER_IP = $(ip_address)
GCB_BROKER_ARGS = -i $(GCB_BROKER_IP)
NET_REMAP_ENABLE = true
NET_REMAP_SERVICE = GCB
NET_REMAP_INAGENT = 129.130.131.132
NET_REMAP_ROUTE = /home/condor/condor_routetable.txt
BIND_ALL_INTERFACES = true

I also wrote a "route table":

[me@mycomputer ~]$ cat /home/condor/condor_routetable.txt
129.11.27.0/24 GCB
*/0 direct

But immediately after I started Condor, I saw some warnings in the log files (especially in the CollectorLog) and errors in the SchedLog and StartLog:


MasterLog:
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_master (CONDOR_MASTER) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_master
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5221
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9620>
10/16 21:28:54 Log file not found in config file: GCB_BROKER_LOG
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_collector", pid and pgroup = 5222
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_negotiator", pid and pgroup = 5223
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_startd", pid and pgroup = 5224
10/16 21:28:54 Started DaemonCore process "/opt/condor-release-6.8.8/sbin/condor_schedd", pid and pgroup = 5225
10/16 21:28:54 Started process "/opt/condor-release-6.8.8/libexec/gcb_broker", pid and pgroup = 5226



[jgrunche@epistasis ~]$ cat /home/condor/log/StartLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_startd (CONDOR_STARTD) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_startd
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5224
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9644>
10/16 21:28:55 vm1: New machine resource allocated
10/16 21:28:55 vm2: New machine resource allocated
10/16 21:28:55 About to run initial benchmarks.
10/16 21:29:02 Completed initial benchmarks.
10/16 21:29:02 vm1: State change: IS_OWNER is false
10/16 21:29:02 vm1: Changing state: Owner -> Unclaimed
10/16 21:29:02 vm2: State change: IS_OWNER is false
10/16 21:29:02 vm2: Changing state: Owner -> Unclaimed

10/16 21:29:02 GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
10/16 21:29:07 GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
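
On Linux, errno 98 is EADDRINUSE: the bind was refused because another socket already holds the same address and port. The minimal sketch below reproduces that class of error with two plain TCP sockets. It does not involve GCB or Condor, the function name `bind_twice` is hypothetical, and it says nothing about which process actually held the port in the log above.

```python
import errno
import socket


def bind_twice():
    """Bind two sockets to the same address and port; return the errno name of the failure."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    first.bind(("127.0.0.1", 0))       # port 0 lets the kernel pick a free port
    first.listen(1)
    port = first.getsockname()[1]

    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        second.bind(("127.0.0.1", port))   # same address and port as the first socket
    except OSError as exc:
        return errno.errorcode.get(exc.errno)
    finally:
        second.close()
        first.close()
    return None                         # only reached if the second bind succeeded
```

On a system where the socket daemons use, `netstat` or a similar tool can then show which process owns the contested port.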




[jgrunche@epistasis ~]$ cat /home/condor/log/SchedLog

10/16 21:28:54 (pid:5225) ******************************************************
10/16 21:28:54 (pid:5225) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 21:28:54 (pid:5225) ** /opt/condor-release-6.8.8/sbin/condor_schedd
10/16 21:28:54 (pid:5225) ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 (pid:5225) ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 (pid:5225) ** PID = 5225
10/16 21:28:54 (pid:5225) ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 (pid:5225) ******************************************************
10/16 21:28:54 (pid:5225) Using config source: /home/condor/condor_config
10/16 21:28:54 (pid:5225) Using local config sources:
10/16 21:28:54 (pid:5225)    /home/condor/condor_config.local
10/16 21:28:54 (pid:5225) DaemonCore: Command Socket at <129.130.131.132:9623>
10/16 21:28:54 (pid:5225) History file rotation is enabled.
10/16 21:28:54 (pid:5225)   Maximum history file size is: 20971520 bytes
10/16 21:28:54 (pid:5225)   Number of rotated history files is: 2
10/16 21:28:54 (pid:5225) Sent ad to central manager for me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) Sent ad to 1 collectors for me@xxxxxxxxxxxxxxxxxxx
10/16 21:28:54 (pid:5225) After chmod(), still can't remove "/tmp/condor_g_scratch.0x9931278.4435" as directory owner, giving up!
10/16 21:28:54 (pid:5225) Started condor_gmanager for owner me pid=5239
10/16 21:30:49 (pid:5225) condor_gridmanager (PID 5239, owner me) exited with return code 0.
10/16 21:33:54 (pid:5225) GCB: ERROR "GCB_bind: binding the socket locally failed" errno 98: Address already in use
10/16 21:33:54 (pid:5225) Sent owner (0 jobs) ad to 1 collectors








[me@mycomputer ~]$ cat /home/condor/log/CollectorLog
10/16 21:28:54 ******************************************************
10/16 21:28:54 ** condor_collector (CONDOR_COLLECTOR) STARTING UP
10/16 21:28:54 ** /opt/condor-release-6.8.8/sbin/condor_collector
10/16 21:28:54 ** $CondorVersion: 6.8.8 Dec 19 2007 $
10/16 21:28:54 ** $CondorPlatform: I386-LINUX_RHEL3 $
10/16 21:28:54 ** PID = 5222
10/16 21:28:54 ** Log last touched time unavailable (No such file or directory)
10/16 21:28:54 ******************************************************
10/16 21:28:54 Using config source: /home/condor/condor_config
10/16 21:28:54 Using local config sources:
10/16 21:28:54    /home/condor/condor_config.local
10/16 21:28:54 DaemonCore: Command Socket at <129.130.131.132:9618>
10/16 21:28:54 In ViewServer::Init()
10/16 21:28:54 In CollectorDaemon::Init()
10/16 21:28:54 In ViewServer::Config()
10/16 21:28:54 In CollectorDaemon::Config()
10/16 21:28:54 enable: Creating stats hash table
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 Got QUERY_STARTD_PVT_ADS
10/16 21:28:54 (Sending 0 ads in response to query)
10/16 21:28:54 NegotiatorAd  : Inserting ** "< mycomputer.ed.ac.uk >"
10/16 21:28:54 stats: Inserting new hashent for 'Negotiator':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 WARNING:  No master ad for < mycomputer.ed.ac.uk >
10/16 21:28:54 ScheddAd : Inserting ** "< mycomputer.ed.ac.uk , 129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for 'Schedd':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:28:54 SubmittorAd : Inserting ** "<me@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:28:54 stats: Inserting new hashent for 'Submittor':'me@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:28:59 ** Master < mycomputer.ed.ac.uk > rejuvenated from recently down
10/16 21:28:59 stats: Inserting new hashent for 'Master':'mycomputer.ed.ac.uk':'129.130.131.132'
10/16 21:29:06 WARNING:  No master ad for < vm1@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:06 StartdAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for 'Start':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:06 StartdPvtAd : Inserting ** "< vm1@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:06 stats: Inserting new hashent for 'StartdPvt':'vm1@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 WARNING:  No master ad for < vm2@xxxxxxxxxxxxxxxxxxx >
10/16 21:29:07 StartdAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for 'Start':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'
10/16 21:29:07 StartdPvtAd : Inserting ** "< vm2@xxxxxxxxxxxxxxxxxxx , 129.130.131.132 >"
10/16 21:29:07 stats: Inserting new hashent for 'StartdPvt':'vm2@xxxxxxxxxxxxxxxxxxx':'129.130.131.132'




[me@mycomputer ~]$ cat /home/condor/log/GCB_BrokerLog
10/16 21:28:54 ****************************************
10/16 21:28:54 New log file started
10/16 21:28:54 Max size = 640000
10/16 21:28:54 Log level: D_BASIC
10/16 21:28:54 ****************************************
10/16 21:28:54 [broker.C:199] ++++++++++++++++++++++++++++++
10/16 21:28:54 [broker.C:200] + STARTING Broker (pid: 5226)
10/16 21:28:54 [broker.C:201] + $GCBVersion: 1.3.2 $
10/16 21:28:54 [broker.C:202] + $GCBBuildDate: Dec 19 2007 $
10/16 21:28:54 [broker.C:255] + Listening at 129.130.131.132:65432
10/16 21:28:54 [broker.C:275] + Using relay_server: /opt/condor-release-6.8.8/libexec/gcb_relay_server
10/16 21:28:54 [broker.C:276] ++++++++++++++++++++++++++++++

Thank you for your help,

Jean-Alain

Dan Bradley wrote:
Jean-Alain,

It sounds like there are two problems when your glideins try to run:

10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113

Is that the address of the collector to which the glideins should be advertising themselves? Are there any firewalls or anything that would prevent them from connecting?

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

This confusing line in the logs should be ignored. It is no longer produced in the 7.0 or 7.1 series of Condor.

10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C

I don't know what is going wrong. Do you have any special SMP-related configuration in your glidein configuration? Normally, one configures glideins with NUM_CPUS=1 to force each instance of glidein to advertise only a single slot, rather than one per CPU on the machine.

--Dan

Jean-Alain Grunchec wrote:
Hi,

I am trying to run GlideIn jobs on the UK National Grid Service. I set up a local machine as a Condor Central Manager. I put Globus-Toolkit 4 on it.

Now I try to submit GlideIn jobs on an HPC in Leeds (the ultimate idea being the submission of many GlideIn jobs to several NGS resources).

So I start the following command, which starts something at least in Leeds.

[me@mycomputer test]$ condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork ngs.leeds.ac.uk/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
Installing /home/ngs0123/Condor_glidein/glidein_condor_config.
Installing /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup.
Installing Condor daemons in /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Downloaded http://www.cs.wisc.edu/condor/glidein/binaries/6.6.7-i686-pc-Linux-2.4.tar.gz to /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Installation successfully completed.

Launching Glidein...
Submitting Glidein job...
Submitting job(s).
1 job(s) submitted to cluster 5.
You have new mail in /var/spool/mail/me
Ideally, at this point I would have 10 new machines added to my Condor pool, so I check, but there is no new machine there. I read the email that was sent:
Date: Wed, 15 Oct 2008 22:14:19 +0100
From: Me <me@xxxxxxxxxxxxxxxxxxx>
Message-Id: <200810152114.m9FLEJAu003357@xxxxxxxxxxxxxxxxxxx>
To: me@xxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 4.0

This is an automated email from the Condor system
on machine "mycomputer.ed.ac.uk".  Do not reply.

Your Condor job 4.0
/home/me/test/glidein_remote_setup.3117 $(HOME)/Condor_glidein $(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4 6.6.7-i686-pc-Linux-2.4 $(HOME)/Condor_glidein/local 'http://www.cs.wisc.edu/condor/glidein/binaries,gsiftp://gridftp.cs.wisc.edu/p/condor/public/binaries/glidein' 0
has exited.


Submitted at:        Wed Oct 15 22:10:55 2008
Completed at:        Wed Oct 15 22:14:19 2008
Real Time:             0 00:03:24

Something has run somehow, but I am not sure the GlideIn jobs really ran OK.
So I look on the head node in Leeds to see whether any temporary files were left, and yes, there are a few.
10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_master (CONDOR_MASTER) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_master
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20178
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:53102>
10/15 22:19:05 Started DaemonCore process "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd", pid and pgroup = 20179
10/15 22:19:07 The STARTD (pid 20179) exited with status 4
10/15 22:19:07 Sending obituary for "/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd"
10/15 22:19:07 restarting /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd in 10 seconds
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
10/15 22:19:10 Will keep trying for 10 seconds...
10/15 22:19:19 Connect failed for 10 seconds; returning FALSE
10/15 22:19:19 ERROR:
SECMAN:2003:TCP connection to <129.130.131.132:9618> failed


Here there is apparently a problem connecting to the server.

I read the condor_startd log in Leeds, and it is even more bizarre.

[ngs0123@ngs log.10.141.0.9-20178]$ cat StartdLog
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_startd (CONDOR_STARTD) STARTING UP
10/15 22:19:05 ** /nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20179
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file: /home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:48585>
10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
       Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
       Available:  Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file ResMgr.C
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success

10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
Has anybody managed to use glideIn on the NGS? Alternatively, if somebody has used glideIn on another Grid, your experience may help me.
Thank you very much,

J-A



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
