
Re: [Condor-users] Problems about condor glidein



Hi Dan,
 
I changed the Grid_Resource to "gt4 server.nova.cn PBS". After that, I submitted the glidein setup job manually and got the following error in the job error file:
 
[agrid@server condor-test]$ cat glidein_setup.error.29885
mkdir: Could not create directory “$(HOME)/Condor_glidein”: No such file or directory.
 
Here is my glidein_setup.submit.29885 file:
 
[agrid@server condor-test]$ cat glidein_setup.submit.29885
        universe = Grid
        Grid_Resource = gt4 server.nova.cn PBS
        executable = glidein_remote_setup.29885
        arguments = "'$(DOLLAR)(HOME)/Condor_glidein' '$(DOLLAR)(HOME)/Condor_glidein/7.3.2-i686-pc-Linux-2.4' '7.3.2-i686-pc-Linux-2.4' '$(DOLLAR)(HOME)/Condor_glidein/local' 'http://www.cs.wisc.edu/condor/glidein/binaries' '0'"
        #avoid trouble with scratch directory creation
        remote_initialdir = /tmp
        output = glidein_setup.output.29885
        error = glidein_setup.error.29885
        log = glidein_setup.log.29885
        queue
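
For reference, $(DOLLAR) is Condor's predefined macro for a literal dollar sign, so the arguments line above is handed on with the text $(HOME) still in it, to be substituted on the remote side. The path in the error message suggests that this substitution did not take place. A minimal Python sketch of the expansion follows; expand_dollar_macro is a hypothetical helper written for illustration, not Condor code, and it handles only this one macro.

```python
def expand_dollar_macro(value: str) -> str:
    """Replace Condor's $(DOLLAR) macro with a literal dollar sign.

    Hypothetical helper for illustration only; real submit-file macro
    expansion handles many more macros than this one.
    """
    return value.replace("$(DOLLAR)", "$")


if __name__ == "__main__":
    # First argument from the submit file above, without the quoting.
    arg = "$(DOLLAR)(HOME)/Condor_glidein"
    print(expand_dollar_macro(arg))  # prints $(HOME)/Condor_glidein
```

The printed path is the same string that appears in the mkdir error, which points at $(HOME) never being replaced by the remote job manager.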
 
Any clue?
 
-Hailong
 
2009-10-29

***********************************************
* Hailong Yang, PhD. Candidate
* Sino-German Joint Software Institute,
* School of Computer Science&Engineering, Beihang University
* Phone: (86-010)82315908
* Email: hailong.yang1115@xxxxxxxxx
* Address: G413, New Main Building in Beihang University,
*              No.37 XueYuan Road,HaiDian District,
*              Beijing,P.R.China,100191
***********************************************

From: Dan Bradley
Sent: 2009-10-27  23:29:42
To: Condor-Users Mail List
Cc:
Subject: Re: [Condor-users] Problems about condor glidein
Hello Hailong,
The problem you are having is that Condor is assuming the grid type is 
"gt2" instead of "gt4".  Currently, the way to adjust that is to use the 
-gensubmit option to condor_glidein.  Then, instead of submitting jobs 
to Condor directly, condor_glidein will write the submit file that it 
would have used and then exit.  You can then modify the submit file 
(changing the grid type from gt2 to gt4) and submit the job to condor 
yourself.
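
The edit to the generated submit file can also be scripted. What follows is a minimal sketch in Python, not part of condor_glidein: set_grid_resource is a hypothetical helper that rewrites the Grid_Resource line of a submit description, and the gt4 value in the demo is simply the one used elsewhere in this thread.

```python
import re

# Matches a whole "Grid_Resource = ..." line; the attribute name is
# case-insensitive and group 1 keeps the indentation and "=" as written.
_GRID_RESOURCE_LINE = re.compile(
    r"^([ \t]*grid_resource[ \t]*=[ \t]*).*$",
    re.IGNORECASE | re.MULTILINE,
)


def set_grid_resource(submit_text: str, new_value: str) -> str:
    """Return submit_text with the value of its Grid_Resource line replaced.

    Hypothetical helper for illustration; condor_glidein does not provide it.
    Raises ValueError if the text has no Grid_Resource line.
    """
    result, count = _GRID_RESOURCE_LINE.subn(
        lambda m: m.group(1) + new_value, submit_text
    )
    if count == 0:
        raise ValueError("no Grid_Resource line found in submit description")
    return result


if __name__ == "__main__":
    generated = (
        "        universe = Grid\n"
        "        Grid_Resource = gt2 server.nova.cn/jobmanager-fork\n"
        "        queue\n"
    )
    # The gt4 value is an example taken from this thread, not a recommendation.
    print(set_grid_resource(generated, "gt4 server.nova.cn PBS"))
```

The rewritten text would then be saved back to the submit file and passed to condor_submit by hand, as described above.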
--Dan
hailong.yang1115 wrote:
>  
> Hi everyone,
>  
> I have installed condor-7.2.4, globus-4.2.1 and torque-2.3.7 on one 
> machine to see how Condor-G and Condor glidein work. First I tried 
> Condor-G and everything worked fine: I could submit jobs to Condor, and 
> through GRAM4 the jobs ran on PBS. But when I tried the condor_glidein 
> command, it just hung.
>  
> [agrid@server condor-test]$ condor_glidein -count 1 -arch 7.3.2-i686-pc-Linux-2.4 -setup_jobmanager jobmanager-fork server.nova.cn/jobmanager-pbs
> Running/verifying Glidein installation and setup...
> Submitting Glidein setup job...
>  
>  
> The following files were generated in the working directory:
>  
> [agrid@server condor-test]$ ll
> Total 56
> -rwxr-xr-x 1 agrid agrid 5086 10-27 17:58 glidein_remote_setup.8140
> -rwxr-xr-x 1 agrid agrid 5086 10-27 17:59 glidein_remote_setup.8170
> -rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.error.8170
> -rw-rw-r-- 1 agrid agrid  314 10-27 17:59 glidein_setup.log.8170
> -rw-rw-r-- 1 agrid agrid    0 10-27 17:59 glidein_setup.output.8170
> -rw-rw-r-- 1 agrid agrid  516 10-27 17:58 glidein_setup.submit.8140
> -rw-rw-r-- 1 agrid agrid  516 10-27 17:59 glidein_setup.submit.8170
> Here is the content of the glidein_setup.log.8170 file:
>  
> 000 (037.000.000) 10/27 17:59:02 Job submitted from host: <10.10.3.159:57089>
> ...
> 020 (037.000.000) 10/27 17:59:15 Detected Down Globus Resource
>     RM-Contact: server.nova.cn/jobmanager-fork
> ...
> 026 (037.000.000) 10/27 17:59:15 Detected Down Grid Resource
>     GridResource: gt2 server.nova.cn/jobmanager-fork
>  
> I noticed that Condor automatically set the grid resource to gt2. Is 
> this right? I am using gt4. 
>  
> I also turned on debug mode for the gridmanager and got the following 
> information in the log file:
>  
> 10/27 17:59:07 ******************************************************
> 10/27 17:59:07 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
> 10/27 17:59:07 ** /opt/condor-7.2.4/sbin/condor_gridmanager
> 10/27 17:59:07 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(10) class=DAEMON(1)
> 10/27 17:59:07 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
> 10/27 17:59:07 ** $CondorVersion: 7.2.4 Jun 16 2009 BuildID: 159529 $
> 10/27 17:59:07 ** $CondorPlatform: I386-LINUX_RHEL5 $
> 10/27 17:59:07 ** PID = 8202
> 10/27 17:59:07 ** Log last touched 10/27 11:54:30
> 10/27 17:59:07 ******************************************************
> 10/27 17:59:07 Using config source: /opt/condor-7.2.4/etc/condor_config
> 10/27 17:59:07 Using local config sources: 
> 10/27 17:59:07    /opt/condor-7.2.4/local.server/condor_config.local
> 10/27 17:59:07 Running as root.  Enabling specialized core dump routines
> 10/27 17:59:07 DaemonCore: Command Socket at <10.10.3.159:34693>
> 10/27 17:59:07 Will use UDP to update collector server.nova.cn <10.10.3.159:9618>
> 10/27 17:59:07 [8202] Welcome to the all-singing, all dancing, "amazing" GridManager!
> 10/27 17:59:07 [8202] DaemonCore: in SendAliveToParent()
> 10/27 17:59:07 [8202] Initialized the following authorization table:
> 10/27 17:59:07 [8202] Authorizations yet to be resolved:
> 10/27 17:59:07 [8202] allow NEGOTIATOR:  */server.nova.cn */10.10.3.159
> 10/27 17:59:07 [8202] allow ADMINISTRATOR:  */server.nova.cn */10.10.3.159
> 10/27 17:59:07 [8202] allow OWNER:  */server.nova.cn */server.nova.cn */10.10.3.159 */10.10.3.159
> 10/27 17:59:07 [8202] DaemonCore: Leaving SendAliveToParent() - success
> 10/27 17:59:07 [8202] Checking proxies
> 10/27 17:59:10 [8202] Received ADD_JOBS signal
> 10/27 17:59:10 [8202] in doContactSchedd()
> 10/27 17:59:10 [8202] querying for new jobs
> 10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (((Matched =!= FALSE) && (JobStatus != 5)) || (Managed =?= "External"))
> 10/27 17:59:10 [8202] Using job type Globus for job 37.0
> 10/27 17:59:10 [8202] (37.0) SetJobLeaseTimers()
> 10/27 17:59:10 [8202] Found job 37.0 --- inserting
> 10/27 17:59:10 [8202] Fetched 1 new job ads from schedd
> 10/27 17:59:10 [8202] querying for removed/held jobs
> 10/27 17:59:10 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
> 10/27 17:59:10 [8202] Fetched 0 job ads from schedd
> 10/27 17:59:10 [8202] leaving doContactSchedd()
> 10/27 17:59:10 [8202] gahp server not up yet, delaying ping
> 10/27 17:59:10 [8202] *** UpdateLeases called
> 10/27 17:59:10 [8202]     Leases not supported, cancelling timer
> 10/27 17:59:10 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
> 10/27 17:59:10 [8202] GAHP server not initialized yet, not submitting grid_monitor now
> 10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_INIT, globusState 32
> 10/27 17:59:10 [8202] Create_Process: using fast clone() to create child process.
> 10/27 17:59:10 [8202] GAHP server pid = 8206
> 10/27 17:59:10 [8202] GAHP server version: $GahpVersion: 1.0.16 Jun 16 2009 UW Gahp $
> 10/27 17:59:10 [8202] GAHP[8206] <- 'COMMANDS'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'COMMANDS' 'GASS_SERVER_INIT' 'GRAM_CALLBACK_ALLOW' 'GRAM_ERROR_STRING' 'GRAM_JOB_CALLBACK_REGISTER' 'GRAM_JOB_CANCEL' 'GRAM_JOB_REQUEST' 'GRAM_JOB_SIGNAL' 'GRAM_JOB_STATUS' 'GRAM_PING' 'INITIALIZE_FROM_FILE' 'QUIT' 'RESULTS' 'VERSION' 'ASYNC_MODE_ON' 'ASYNC_MODE_OFF' 'RESPONSE_PREFIX' 'REFRESH_PROXY_FROM_FILE' 'CACHE_PROXY_FROM_FILE' 'USE_CACHED_PROXY' 'UNCACHE_PROXY' 'GRAM_JOB_REFRESH_PROXY'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'RESPONSE_PREFIX GAHP:'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'ASYNC_MODE_ON'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'INITIALIZE_FROM_FILE /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 2 /tmp/condor_g_scratch.0xa457c48.16424/master_proxy.2'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'USE_CACHED_PROXY 2'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'CACHE_PROXY_FROM_FILE 1 /tmp/x509up_u502'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'GRAM_CALLBACK_ALLOW 2 0'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S' 'https://server.nova.cn:56844/'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'GASS_SERVER_INIT 3 0'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'R'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S' '1'
> 10/27 17:59:10 [8202] GAHP[8206] -> '3' '0' 'https://server.nova.cn:51333'
> 10/27 17:59:10 [8202] (37.0) gm state change: GM_INIT -> GM_START
> 10/27 17:59:10 [8202] (37.0) gm state change: GM_START -> GM_CLEAR_REQUEST
> 10/27 17:59:10 [8202] (37.0) gm state change: GM_CLEAR_REQUEST -> GM_UNSUBMITTED
> 10/27 17:59:10 [8202] (37.0) gm state change: GM_UNSUBMITTED -> GM_SUBMIT
> 10/27 17:59:10 [8202] Final RSL: &(rsl_substitution=(GRIDMANAGER_GASS_URL https://server.nova.cn:51333))(executable=$(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_remote_setup.8170')(directory='/tmp')(arguments=$(HOME)#'/Condor_glidein' $(HOME)#'/Condor_glidein/7.3.2-i686-pc-Linux-2.4' '7.3.2-i686-pc-Linux-2.4' $(HOME)#'/Condor_glidein/local' 'http://www.cs.wisc.edu/condor/glidein/binaries' '0')(stdout=$(GLOBUS_CACHED_STDOUT))(stderr=$(GLOBUS_CACHED_STDERR))(file_stage_out=($(GLOBUS_CACHED_STDOUT) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.output.8170')($(GLOBUS_CACHED_STDERR) $(GRIDMANAGER_GASS_URL)#'/home/agrid/condor-test/glidein_setup.error.8170'))(proxy_timeout=240)(save_state=yes)(two_phase=600)(remote_io_url=$(GRIDMANAGER_GASS_URL))
> 10/27 17:59:10 [8202] GAHP[8206] <- 'RESULTS'
> 10/27 17:59:10 [8202] GAHP[8206] -> 'S' '0'
> 10/27 17:59:10 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
> 10/27 17:59:15 [8202] GAHP[8206] <- 'GRAM_PING 4 server.nova.cn:2119'
> 10/27 17:59:15 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
> 10/27 17:59:15 [8202] grid_monitor for server.nova.cn:2119: first ping not done yet, will retry later
> 10/27 17:59:15 [8202] GAHP[8206] <- 'RESULTS'
> 10/27 17:59:15 [8202] GAHP[8206] -> 'R'
> 10/27 17:59:15 [8202] GAHP[8206] -> 'S' '1'
> 10/27 17:59:15 [8202] GAHP[8206] -> '4' '79'
> 10/27 17:59:15 [8202] resource server.nova.cn:2119 is now down
> 10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
> 10/27 17:59:15 [8202] (37.0) Writing globus down record to user logfile
> 10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.411181 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
> 10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.431371 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
> 10/27 17:59:15 [8202] FileLock object is updating timestamp on: /home/agrid/condor-test/glidein_setup.log.8170
> 10/27 17:59:15 [8202] (37.0) Writing grid source down record to user logfile
> 10/27 17:59:15 [8202] FileLock::obtain(1) - @1256637555.431812 lock on /home/agrid/condor-test/glidein_setup.log.8170 now WRITE
> 10/27 17:59:15 [8202] FileLock::obtain(2) - @1256637555.437901 lock on /home/agrid/condor-test/glidein_setup.log.8170 now UNLOCKED
> 10/27 17:59:15 [8202] in doContactSchedd()
> 10/27 17:59:15 [8202] querying for removed/held jobs
> 10/27 17:59:15 [8202] Using constraint ((Owner=?="agrid"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
> 10/27 17:59:15 [8202] Fetched 0 job ads from schedd
> 10/27 17:59:15 [8202] Updating classad values for 37.0:
> 10/27 17:59:15 [8202]    GridResourceUnavailableTime = 1256637555
> 10/27 17:59:15 [8202]    GlobusResourceUnavailableTime = 1256637555
> 10/27 17:59:15 [8202] leaving doContactSchedd()
> 10/27 17:59:15 [8202] (37.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
> 10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
> 10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_JOB_REQUEST 5 server.nova.cn:2119/jobmanager-fork NULL 1 &(executable=https://server.nova.cn:51333/opt/condor-7.2.4/sbin/grid_monitor.sh)(stdout=https://server.nova.cn:51333/tmp/condor_g_scratch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-log)(arguments='--dest-url="">
> ch.0xa457c48.16424/grid-monitor.server.nova.cn:2119.1/grid-monitor-job-status')'
> 10/27 17:59:20 [8202] GAHP[8206] -> 'S'
> 10/27 17:59:20 [8202] GAHP[8206] <- 'RESULTS'
> 10/27 17:59:20 [8202] GAHP[8206] -> 'R'
> 10/27 17:59:20 [8202] GAHP[8206] -> 'S' '1'
> 10/27 17:59:20 [8202] GAHP[8206] -> '5' '12' 'NULL'
> 10/27 17:59:20 [8202] grid_monitor for server.nova.cn:2119 entering CheckMonitor
> 10/27 17:59:20 [8202] GAHP[8206] <- 'GRAM_ERROR_STRING 12'
> 10/27 17:59:20 [8202] GAHP[8206] -> 'S' 'the connection to the server failed (check host and port)'
> 10/27 17:59:20 [8202] grid_monitor job submit failed for resource server.nova.cn:2119, gram error 12 (the connection to the server failed (check host and port))
> 10/27 17:59:20 [8202] Giving up on grid_monitor for site server.nova.cn:2119.  Will retry in 3600 seconds (60 minutes)
> 10/27 17:59:20 [8202] Stopping grid_monitor for resource server.nova.cn:2119
>  
> I think this log file indicates that server.nova.cn:2119 is down. 
> However, I checked this port and got the following answer:
> [agrid@server condor-test]$ netstat -nat | grep 2119
> tcp        0      0 0.0.0.0:2119                0.0.0.0:*                   LISTEN
>  
> Any ideas would be appreciated.
>  
> -Hailong
>  
> 2009-10-27
>
> _______________________________________________
> Condor-users mailing list
> To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/condor-users
>
> The archives can be found at: 
> https://lists.cs.wisc.edu/archive/condor-users/
>   