Hi,
I am trying to run GlideIn jobs on the UK National Grid Service (NGS). I
have set up a local machine as a Condor Central Manager and installed
Globus Toolkit 4 on it.
Now I am trying to submit GlideIn jobs to an HPC cluster in Leeds (the
ultimate idea being to submit many GlideIn jobs to several NGS resources).
So I run the following command, which at least starts something in Leeds:
[me@mycomputer test]$ condor_glidein -count 10 -arch 6.6.7-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork ngs.leeds.ac.uk/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
Installing /home/ngs0123/Condor_glidein/glidein_condor_config.
Installing
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/glidein_startup.
Installing Condor daemons in
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Downloaded
http://www.cs.wisc.edu/condor/glidein/binaries/6.6.7-i686-pc-Linux-2.4.tar.gz
to /home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4.
Installation successfully completed.
Launching Glidein...
Submitting Glidein job...
Submitting job(s).
1 job(s) submitted to cluster 5.
You have new mail in /var/spool/mail/me
Ideally, at this point I would have 10 new machines added to my Condor
pool, so I check, but no new machines have appeared. I read the email
that was sent:
Date: Wed, 15 Oct 2008 22:14:19 +0100
From: Me <me@xxxxxxxxxxxxxxxxxxx>
Message-Id: <200810152114.m9FLEJAu003357@xxxxxxxxxxxxxxxxxxx>
To: me@xxxxxxxxxxxxxxxxxxx
Subject: [Condor] Condor Job 4.0
This is an automated email from the Condor system
on machine "mycomputer.ed.ac.uk". Do not reply.
Your Condor job 4.0
/home/me/test/glidein_remote_setup.3117 $(HOME)/Condor_glidein
$(HOME)/Condor_glidein/6.6.7-i686-pc-Linux-2.4 6.6.7-i686-pc-Linux-2.4
$(HOME)/Condor_glidein/local
'http://www.cs.wisc.edu/condor/glidein/binaries,gsiftp://gridftp.cs.wisc.edu/p/condor/public/binaries/glidein'
0
has exited.
Submitted at: Wed Oct 15 22:10:55 2008
Completed at: Wed Oct 15 22:14:19 2008
Real Time: 0 00:03:24
Something has clearly run, but I am not sure the GlideIn jobs really ran OK.
So I looked on the head node in Leeds for any files left behind, and
there are a few. The condor_master log contains:
10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_master (CONDOR_MASTER) STARTING UP
10/15 22:19:05 **
/nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_master
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20178
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file:
/home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:53102>
10/15 22:19:05 Started DaemonCore process
"/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd",
pid and pgroup = 20179
10/15 22:19:07 The STARTD (pid 20179) exited with status 4
10/15 22:19:07 Sending obituary for
"/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd"
10/15 22:19:07 restarting
/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd in 10
seconds
10/15 22:19:10 Can't connect to <129.130.131.132:9618>:0, errno = 113
10/15 22:19:10 Will keep trying for 10 seconds...
10/15 22:19:19 Connect failed for 10 seconds; returning FALSE
10/15 22:19:19 ERROR:
SECMAN:2003:TCP connection to <129.130.131.132:9618> failed
So apparently the node in Leeds cannot connect back to my server.
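To narrow this down, a check along the following lines could be run on the Leeds node. It is only a sketch: `check_tcp` is a hypothetical helper name, and the address and port are simply the ones from the log above. On Linux, errno 113 is EHOSTUNREACH ("No route to host"), which usually points at routing or a firewall rather than at Condor itself.

```python
import errno
import socket


def check_tcp(host, port, timeout=5.0):
    """Try to open a TCP connection and return (ok, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # exc.errno can be None (for example on a timeout), so guard the lookup.
        code = exc.errno
        name = errno.errorcode.get(code, "unknown") if code is not None else "unknown"
        return False, f"failed: errno={code} ({name}): {exc}"


# Example, to be run on the Leeds node (address and port taken from the log):
#   print(check_tcp("129.130.131.132", 9618))
```

If this fails with the same errno from the Leeds node but succeeds from a machine on my own network, the problem is probably in the network path between the two sites and not in the GlideIn setup.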
I then read the condor_startd log in Leeds, and it is even more bizarre:
[ngs0123@ngs log.10.141.0.9-20178]$ cat StartdLog
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:05 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:05 ******************************************************
10/15 22:19:05 ** condor_startd (CONDOR_STARTD) STARTING UP
10/15 22:19:05 **
/nfs/ift02_h01/home/ngs0123/Condor_glidein/6.6.7-i686-pc-Linux-2.4/condor_startd
10/15 22:19:05 ** $CondorVersion: 6.6.7 Oct 11 2004 $
10/15 22:19:05 ** $CondorPlatform: I386-LINUX_RH9 $
10/15 22:19:05 ** PID = 20179
10/15 22:19:05 ******************************************************
10/15 22:19:05 Using config file:
/home/ngs0123/Condor_glidein/glidein_condor_config
10/15 22:19:05 DaemonCore: Command Socket at <10.141.0.9:48585>
10/15 22:19:07 ERROR: Can't allocate 5th virtual machine of type 0
Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%
Available: Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%
10/15 22:19:07 ERROR "Ran out of system resources" at line 361 in file
ResMgr.C
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
10/15 22:19:31 passwd_cache::cache_uid(): getpwnam("condor") failed: Success
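Comparing the "Requesting" and "Available" lines field by field shows which resource the startd ran out of. The sketch below does that comparison on the two lines copied from the log; `parse_resources` and `shortfalls` are hypothetical helper names, and it assumes the numbers on both lines use the same units.

```python
import re


def parse_resources(line):
    """Parse e.g. 'Cpus: 1, Memory: 1, Swap: 10.00%' into {name: float}."""
    pairs = re.findall(r"(\w+):\s*([\d.]+)%?", line)
    return {name: float(value) for name, value in pairs}


def shortfalls(requesting, available):
    """Return the resource names whose available amount is below the request."""
    req = parse_resources(requesting)
    avail = parse_resources(available)
    return [name for name, need in req.items() if avail.get(name, 0.0) < need]


# The two lines copied from the StartdLog above.
requesting = "Requesting: Cpus: 1, Memory: 1, Swap: 10.00%, Disk: 10.00%"
available = "Available: Cpus: 6, Memory: 0, Swap: 60.00%, Disk: 60.00%"
print(shortfalls(requesting, available))  # prints ['Memory']
```

So memory is the only field where the available amount is below the request, even though 6 CPUs are still free.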
Has anybody managed to use GlideIn on the NGS? Alternatively, if
somebody has used GlideIn on another Grid, your experience may help me.
Thank you very much,
J-A