Hi everyone,
I have installed condor-7.2.4, globus-4.2.1 and torque-2.3.7 on one
machine to see how Condor-G and Condor glidein work. I tried Condor-G
first and everything worked fine: I could submit jobs to Condor, and
through GRAM4 the jobs ran on PBS. But when I tried the condor_glidein
command, it just hung:
[agrid@server condor-test]$ condor_glidein -count 1 -arch 7.3.2-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork server.nova.cn/jobmanager-pbs
Running/verifying Glidein installation and setup...
Submitting Glidein setup job...
The following files were generated in the working directory:
[agrid@server condor-test]$ ll
Total 56
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:58 glidein_remote_setup.8140
-rwxr-xr-x 1 agrid agrid 5086 10-27 17:59 glidein_remote_setup.8170
-rw-rw-r-- 1 agrid agrid 0 10-27 17:59 glidein_setup.error.8170
-rw-rw-r-- 1 agrid agrid 314 10-27 17:59 glidein_setup.log.8170
-rw-rw-r-- 1 agrid agrid 0 10-27 17:59 glidein_setup.output.8170
-rw-rw-r-- 1 agrid agrid 516 10-27 17:58 glidein_setup.submit.8140
-rw-rw-r-- 1 agrid agrid 516 10-27 17:59 glidein_setup.submit.8170
Here is the content of the glidein_setup.log.8170 file:
000 (037.000.000) 10/27 17:59:02 Job submitted from host: <10.10.3.159:57089>
...
020 (037.000.000) 10/27 17:59:15 Detected Down Globus Resource
RM-Contact: server.nova.cn/jobmanager-fork
...
026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource
GridResource: gt2 server.nova.cn/jobmanager-fork
I noticed that Condor automatically set the grid resource type to gt2.
Is this right? What I am using is gt4.
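The grid type recorded for a job can be read straight out of the user log. Below is a minimal Python sketch, assuming the event layout shown in the log excerpt above; `grid_types` is a made-up helper name, not part of Condor.

```python
import re

# Matches lines such as "GridResource: gt2 server.nova.cn/jobmanager-fork"
GRID_RESOURCE = re.compile(r"^\s*GridResource:\s+(\S+)\s+(\S+)")

def grid_types(log_text):
    """Return (grid_type, contact) pairs found in a Condor user log."""
    pairs = []
    for line in log_text.splitlines():
        match = GRID_RESOURCE.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs

# Sample copied from glidein_setup.log.8170 above.
sample = """026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource
GridResource: gt2 server.nova.cn/jobmanager-fork
"""
print(grid_types(sample))
```

Any pair whose first element is gt2 was submitted through the pre-web-services GRAM interface rather than GRAM4.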
I also turned on debug mode for the gridmanager and got the following
in its log file:
10/27 17:59:07 ******************************************************
10/27 17:59:07 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
10/27 17:59:07 ** /opt/condor-7.2.4/sbin/condor_gridmanager
10/27 17:59:07 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(10) class=DAEMON(1)
10/27 17:59:07 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
10/27 17:59:07 ** $CondorVersion: 7.2.4 Jun 16 2009 BuildID: 159529 $
10/27 17:59:07 ** $CondorPlatform: I386-LINUX_RHEL5 $
10/27 17:59:07 ** PID = 8202
10/27 17:59:07 ** Log last touched 10/27 11:54:30
10/27 17:59:07 ******************************************************
10/27 17:59:07 Using config source: /opt/condor-7.2.4/etc/condor_config
10/27 17:59:07 Using local config sources:
10/27 17:59:07 /opt/condor-7.2.4/local.server/condor_config.local
10/27 17:59:07 Running as root. Enabling specialized core dump routines
10/27 17:59:07 DaemonCore: Command Socket at <10.10.3.159:34693>
10/27 17:59:07 Will use UDP to update collector server.nova.cn <10.10.3.159:9618>
10/27 17:59:07 [8202] Welcome to the all-singing, all dancing, "amazing" GridManager!
10/27 17:59:07 [8202] DaemonCore: in SendAliveToParent()
10/27 17:59:07 [8202] Initialized the following authorization table:
10/27 17:59:07 [8202] Authorizations yet to be resolved:
10/27 17:59:07 [8202] allow NEGOTIATOR: */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow ADMINISTRATOR: */server.nova.cn */10.10.3.159
10/27 17:59:07 [8202] allow OWNER: */server.nova.cn */server.nova.cn */10.10.3.159 */10.10.3.159
10/27 17:59:07 [8202] DaemonCore: Leaving SendAliveToParent() - success
10/27 17:59:07 [8202] Checking proxies
10/27 17:59:10 [8202] Received ADD_JOBS signal
10/27 17:59:10 [8202] in doContactSchedd()
10/27 17:59:10 [8202] querying for new jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (((Matched =!= FALSE) && (JobStatus != 5)) || (Managed =?= "External"))
10/27 17:59:10 [8202] Using job type Globus for job 37.0
10/27 17:59:10 [8202] (37.0) SetJobLeaseTimers()
10/27 17:59:10 [8202] Found job 37.0 --- inserting
10/27 17:59:10 [8202] Fetched 1 new job ads from schedd
10/27 17:59:10 [8202] querying for removed/held jobs
10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:10 [8202] Fetched 0 job ads from schedd
10/27 17:59:10 [8202] leaving doContactSchedd()
10/27 17:59:10 [8202] gahp server not up yet, delaying ping
10/27 17:59:10 [8202] *** UpdateLeases called
10/27 17:59:10 [8202] Leases not supported, cancelling timer
10/27 17:59:10 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:10 [8202] GAHP server not initialized yet, not submitting grid_monitor now
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_INIT, globusState 32
10/27 17:59:10 [8202] Create_Process: using fast clone() to create child process.
10/27 17:59:10 [8202] GAHP server pid = 8206
10/27 17:59:10 [8202] GAHP server version: $GahpVersion: 1.0.16 Jun 16 2009 UW Gahp $
10/27 17:59:10 [8202] GAHP[8206] <- 'COMMANDS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'COMMANDS' 'GASS_SERVER_INIT' 'GRAM_CALLBACK_ALLOW' 'GRAM_ERROR_STRING' 'GRAM_JOB_CALLBACK_REGISTER' 'GRAM_JOB_CANCEL' 'GRAM_JOB_REQUEST' 'GRAM_JOB_SIGNAL' 'GRAM_JOB_STATUS' 'GRAM_PING' 'INITIALIZE_FROM_FILE' 'QUIT' 'RESULTS' 'VERSION' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESPONSE_PREFIX' 'REFRESH_PROXY_FROM_FILE' 'CACHE_PROXY_FROM_FILE' 'USE_CACHED_PROXY' 'UNCACHE_PROXY' 'GRAM_JOB_REFRESH_PROXY'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESPONSE_PREFIX GAHP:'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'ASYNC_MODE_ON'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'INITIALIZE_FROM_FILE /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 2 /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'USE_CACHED_PROXY 2'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u502'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'GRAM_CALLBACK_ALLOW 2 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'https://server.nova.cn:56844/'
10/27 17:59:10 [8202] GAHP[8206] <- 'GASS_SERVER_INIT 3 0'
10/27 17:59:10 [8202] GAHP[8206] -> 'S'
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'R'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:10 [8202] GAHP[8206] -> '3' '0' 'https://server.nova.cn:51333'
10/27 17:59:10 [8202] (37.0) gm state change: GM_INIT -> GM_START
10/27 17:59:10 [8202] (37.0) gm state change: GM_START -> GM_CLEAR_REQUEST
10/27 17:59:10 [8202] (37.0) gm state change: GM_CLEAR_REQUEST -> GM_UNSUBMITTED
10/27 17:59:10 [8202] (37.0) gm state change: GM_UNSUBMITTED -> GM_SUBMIT
10/27 17:59:10 [8202] Final RSL: &(rsl_substitution=(GRIDMANAGER_GASS_URL https://server.nova.cn:51333))(executable=$(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_remote_setup.8170')(directory='/tmp')(arguments=$(HOME)#'/Condor_glidein' $(HOME)#'/Condor_glidein/7.3.2-i686-pc-Linux-2.4' '7.3.2-i686-pc-Linux-2.4' $(HOME)#'/Condor_glidein/local' 'http://www.cs.wisc.edu/condor/glidein/binaries' '0')(stdout=$(GLOBUS_CACHED_STDOUT))(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDOUT) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.output.8170')($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.error.8170'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))
10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:10 [8202] GAHP[8206] -> 'S' '0'
10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:15 [8202] GAHP[8206] <- 'GRAM_PING 4 server.nova.cn:2119'
10/27 17:59:15 [8202] GAHP[8206] -> 'S'
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119: first ping not done yet, will retry later
10/27 17:59:15 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:15 [8202] GAHP[8206] -> 'R'
10/27 17:59:15 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:15 [8202] GAHP[8206] -> '4' '79'
10/27 17:59:15 [8202] resource server.nova.cn:2119 is now down
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing globus down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.411181 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.431371 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
10/27 17:59:15 [8202] (37.0) Writing grid source down record to user logfile
10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.431812 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.437901 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
10/27 17:59:15 [8202] in doContactSchedd()
10/27 17:59:15 [8202] querying for removed/held jobs
10/27 17:59:15 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
10/27 17:59:15 [8202] Fetched 0 job ads from schedd
10/27 17:59:15 [8202] Updating classad values for 37.0:
10/27 17:59:15 [8202] GridResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202] GlobusResourceUnavailableTime = 1256637555
10/27 17:59:15 [8202] leaving doContactSchedd()
10/27 17:59:15 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_JOB_REQUEST 5 server.nova.cn:2119/jobmanager-fork NULL 1 &(executable=https://server.nova.cn:51333/opt/condor-7.2.4/sbin/grid_monitor.sh)(stdout=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-log)(arguments='--dest-url=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-job-status')'
10/27 17:59:20 [8202] GAHP[8206] -> 'S'
10/27 17:59:20 [8202] GAHP[8206] <- 'RESULTS'
10/27 17:59:20 [8202] GAHP[8206] -> 'R'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' '1'
10/27 17:59:20 [8202] GAHP[8206] -> '5' '12' 'NULL'
10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_ERROR_STRING 12'
10/27 17:59:20 [8202] GAHP[8206] -> 'S' 'the connection to the server failed (check host and port)'
10/27 17:59:20 [8202] grid_monitor job submit failed for resource server.nova.cn:2119, gram error 12 (the connection to the server failed (check host and port))
10/27 17:59:20 [8202] Giving up on grid_monitor for site server.nova.cn:2119. Will retry in 3600 seconds (60 minutes)
10/27 17:59:20 [8202] Stopping grid_monitor for resource server.nova.cn:2119
I think this log indicates that server.nova.cn:2119 is down. However,
I checked this port and got the following answer:
[agrid@server condor-test]$ netstat -nat | grep 2119
tcp 0 0 0.0.0.0:2119 0.0.0.0:* LISTEN
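netstat only shows that something is listening locally; it does not show whether a client can actually open a connection using the host name that Condor uses. Below is a minimal Python sketch of that check. `can_connect` is a hypothetical helper name, and the 5-second timeout is an arbitrary choice.

```python
import socket

def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host name and tries each address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Calling `can_connect("server.nova.cn", 2119)` from the submit machine would test the same host and port that the gridmanager pings. A True result only means the TCP handshake succeeded; it says nothing about whether the service behind the port speaks GRAM2 or accepts the proxy.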
Any ideas would be appreciated.
-Hailong
2009-10-27
------------------------------------------------------------------------
***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
* No.37 XueYuan Road,HaiDian District,
* Beijing,P.R.China,100191
***********************************************
------------------------------------------------------------------------