Hi,

I have two identical installations: same hardware, same OS (Suse90), globus 4.0.3, condor 6.7.19 I386-LINUX_RH9. On both machines I installed WS-GRAM.

Machine gtx01.esrf.fr is the condor central manager with startd and schedd; machine gtx02.esrf.fr is a condor node with startd and schedd.

When I submit a condor-G job on gtx02, everything works fine. The same thing on gtx01 does not work.

Here is the job description file:

universe = grid
grid_resource = gt4 https://gtx02.esrf.fr:8443/wsrf/services/ManagedJobFactoryService Condor
executable = /users/klotz/pybench/pybench.py
transfer_executable = False
output = $(Cluster).$(Process).pybench.globus.out
stream_output = False
error = $(Cluster).$(Process).pybench.globus.err
stream_error = False
log = $(Cluster).pybench.globus.log
notification = Error
requirements = (Arch == "x86_64")
rank = Mips
initialdir = /users/klotz/tmp
queue

Here is a table that shows the situation in detail:

condor_submit    using grid resource    using condor-G
on               WS-GRAM on             gridmanager on      result
-----------------------------------------------------------------------------------------
gtx02            gtx02                  gtx02               OK
gtx01            gtx01                  gtx01               does not work
gtx02            gtx01                  gtx02               OK

On both machines I logged the gridmanager (with GRIDMANAGER_DEBUG=D_FULLDEBUG). The log files start to deviate when the job is submitted to WS-GRAM by the gahp server.
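One way to compare two such gridmanager logs is to reduce each to its job state transitions and failure messages. The following is only a minimal sketch: it assumes the line format seen in the excerpts in this message (date, time, pid in brackets, then the message), and the function names and regular expressions are illustrative, not part of Condor.

```python
import re

# Assumed log line prefix: "4/16 01:11:33 [18223] " (date, time, gridmanager pid).
PREFIX = re.compile(r"^\d+/\d+ \d\d:\d\d:\d\d \[\d+\] ")
# Assumed shape of a transition line: "(2684.0) gm state change: A -> B".
STATE = re.compile(r"\((\d+\.\d+)\) gm state change: (\w+) -> (\w+)")


def strip_prefix(line):
    """Drop the timestamp and pid so lines from different runs are comparable."""
    return PREFIX.sub("", line.strip())


def state_changes(lines):
    """Return (job_id, old_state, new_state) for every transition in the log."""
    found = []
    for line in lines:
        match = STATE.search(line)
        if match:
            found.append(match.groups())
    return found


def failures(lines):
    """Return the messages of all lines that mention a failure."""
    return [strip_prefix(line) for line in lines if "failed" in line.lower()]


def summarize(lines):
    """Build a short, timestamp-free summary of one gridmanager log."""
    summary = ["%s: %s -> %s" % change for change in state_changes(lines)]
    summary += ["FAILURE: %s" % message for message in failures(lines)]
    return summary
```

Calling `summarize(open(path).readlines())` on each log file and comparing the two results side by side shows at which state the non-working job stops advancing and which failure message appears around that point.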
Here is the output where things go well:
VVVVVVVVVVVVVVVVVVVVVVVVVV
4/16 01:11:33 [18223] found ProxyDelegation
4/16 01:11:33 [18223] GAHP[18272] <- 'RESULTS'
4/16 01:11:33 [18223] GAHP[18272] -> 'R'
4/16 01:11:33 [18223] GAHP[18272] -> 'S' '1'
4/16 01:11:33 [18223] GAHP[18272] -> '3' '0' 'https://160.103.6.173:8443/wsrf/services/DelegationService?a6fb09d0-eba6-11db-b037-fdb6d96494eb' 'NULL'
4/16 01:11:33 [18223] *** checkDelegation()
4/16 01:11:33 [18223] new delegation
4/16 01:11:33 [18223] https://160.103.6.173:8443/wsrf/services/DelegationService?a6fb09d0-eba6-11db-b037-fdb6d96494eb
4/16 01:11:33 [18223] signalling jobs for https://160.103.6.173:8443/wsrf/services/DelegationService?a6fb09d0-eba6-11db-b037-fdb6d96494eb
4/16 01:11:33 [18223] (2684.0) doEvaluateState called: gmState GM_DELEGATE_PROXY, globusState 32
4/16 01:11:33 [18223] *** getDelegationURI(/tmp/x509up_u202)
4/16 01:11:33 [18223] found ProxyDelegation
4/16 01:11:33 [18223] (2684.0) gm state change: GM_DELEGATE_PROXY -> GM_GENERATE_ID
4/16 01:11:33 [18223] GAHP[18272] <- 'GT4_GENERATE_SUBMIT_ID 5 '
4/16 01:11:33 [18223] GAHP[18272] -> 'S'
4/16 01:11:33 [18223] GAHP[18272] <- 'RESULTS'
4/16 01:11:33 [18223] GAHP[18272] -> 'R'
4/16 01:11:33 [18223] GAHP[18272] -> 'S' '1'
4/16 01:11:33 [18223] GAHP[18272] -> '5' 'uuid:a7108da0-eba6-11db-8c87-828f9e54ef05'
4/16 01:11:33 [18223] (2684.0) doEvaluateState called: gmState GM_GENERATE_ID, globusState 32
4/16 01:11:33 [18223] (2684.0) gm state change: GM_GENERATE_ID -> GM_SUBMIT_ID_SAVE
4/16 01:11:33 [18223] in doContactSchedd()
4/16 01:11:33 [18223] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 01:11:33 [18223] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 01:11:33 [18223] querying for removed/held jobs
4/16 01:11:33 [18223] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 01:11:33 [18223] Fetched 0 job ads from schedd
4/16 01:11:33 [18223] Updating classad values for 2684.0:
4/16 01:11:33 [18223] GridftpUrlBase = "gsiftp://gtx02.esrf.fr"
4/16 01:11:33 [18223] GlobusDelegationUri = "https://160.103.6.173:8443/wsrf/services/DelegationService?a6fb09d0-eba6-11db-b037-fdb6d96494eb"
4/16 01:11:33 [18223] GlobusSubmitId = "uuid:a7108da0-eba6-11db-8c87-828f9e54ef05"
4/16 01:11:33 [18223] leaving doContactSchedd()
4/16 01:11:33 [18223] (2684.0) doEvaluateState called: gmState GM_SUBMIT_ID_SAVE, globusState 32
4/16 01:11:33 [18223] (2684.0) gm state change: GM_SUBMIT_ID_SAVE -> GM_SUBMIT
4/16 01:11:33 [18223] GAHP[18272] <- 'USE_CACHED_PROXY 1'
4/16 01:11:33 [18223] GAHP[18272] -> 'S'
4/16 01:11:33 [18223] GAHP[18272] <- 'GT4_GRAM_JOB_SUBMIT 6 uuid:a7108da0-eba6-11db-8c87-828f9e54ef05 https://gtx02.esrf.fr:8443/wsrf/services/ManagedJobFactoryService Condor 1 <job><executable>/users/klotz/pybench/pybench.py</executable><directory>/${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05/</directory><argument>-n</argument><argument>100</argument><stdout>/${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05//2684.0.pybench.globus.out</stdout><stderr>/${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05//2684.0.pybench.globus.err</stderr><fileStageIn><maxAttempts>5</maxAttempts><transferCredentialEndpoint\ xsi:type="ns1:EndpointReferenceType"\ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\ xmlns:ns1="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:Address\ xsi:type="ns1:AttributedURI">https://160.103.6.173:8443/wsrf/services/DelegationService</ns1:Address><ns1:ReferenceProperties\ xsi:type="ns1:ReferencePropertiesType"><ns1:DelegationKey\ xmlns:ns1="http://www.globus.org/08/2004/delegationService">a6fb09d0-eba6-11db-b037-fdb6d96494eb</ns1:DelegationKey></ns1:ReferenceProperties><ns1:ReferenceParameters\
xsi:type="ns1:ReferenceParametersType"/></transferCredentialEndpoint><transfer><sourceUrl>gsiftp://gtx02.esrf.fr/tmp/condor_g_scratch.0x85c9368.2988/empty_dir_u202/</sourceUrl><destinationUrl>file:///${GLOBUS_SCRATCH_DIR}</destinationUrl></transfer><transfer><sourceUrl>gsiftp://gtx02.esrf.fr/tmp/condor_g_scratch.0x85c9368.2988/empty_dir_u202/</sourceUrl><destinationUrl>file:///${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05/</destinationUrl></transfer></fileStageIn><fileStageOut><maxAttempts>5</maxAttempts><transferCredentialEndpoint\ xsi:type="ns1:EndpointReferenceType"\ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\ xmlns:ns1="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:Address\ xsi:type="ns1:AttributedURI">https://160.103.6.173:8443/wsrf/services/DelegationService</ns1:Address><ns1:ReferenceProperties\ xsi:type="ns1:ReferencePropertiesType"><ns1:DelegationKey\ xmlns:ns1="http://www.globus.org/08/2004/delegationService">a6fb09d0-eba6-11db-b037-fdb6d96494eb</ns1:DelegationKey></ns1:ReferenceProperties><ns1:ReferenceParameters\ xsi:type="ns1:ReferenceParametersType"/></transferCredentialEndpoint><transfer><sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05/2684.0.pybench.globus.out</sourceUrl><destinationUrl>gsiftp://gtx02.esrf.fr/users/klotz/tmp/2684.0.pybench.globus.out</destinationUrl></transfer><transfer><sourceUrl>file:///${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05/2684.0.pybench.globus.err</sourceUrl><destinationUrl>gsiftp://gtx02.esrf.fr/users/klotz/tmp/2684.0.pybench.globus.err</destinationUrl></transfer></fileStageOut><fileCleanUp><transferCredentialEndpoint\ xsi:type="ns1:EndpointReferenceType"\ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\ xmlns:ns1="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:Address\ xsi:type="ns1:AttributedURI">https://160.103.6.173:8443/wsrf/services/DelegationService</ns1:Address><ns1:ReferenceProperties\ 
xsi:type="ns1:ReferencePropertiesType"><ns1:DelegationKey\ xmlns:ns1="http://www.globus.org/08/2004/delegationService">a6fb09d0-eba6-11db-b037-fdb6d96494eb</ns1:DelegationKey></ns1:ReferenceProperties><ns1:ReferenceParameters\ xsi:type="ns1:ReferenceParametersType"/></transferCredentialEndpoint><deletion><file>file:///${GLOBUS_SCRATCH_DIR}/job_a7108da0-eba6-11db-8c87-828f9e54ef05/</file></deletion></fileCleanUp><jobCredentialEndpoint\ xsi:type="ns1:EndpointReferenceType"\ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\ xmlns:ns1="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:Address\ xsi:type="ns1:AttributedURI">https://160.103.6.173:8443/wsrf/services/DelegationService</ns1:Address><ns1:ReferenceProperties\ xsi:type="ns1:ReferencePropertiesType"><ns1:DelegationKey\ xmlns:ns1="http://www.globus.org/08/2004/delegationService">a6fb09d0-eba6-11db-b037-fdb6d96494eb</ns1:DelegationKey></ns1:ReferenceProperties><ns1:ReferenceParameters\ xsi:type="ns1:ReferenceParametersType"/></jobCredentialEndpoint><stagingCredentialEndpoint\ xsi:type="ns1:EndpointReferenceType"\ xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"\ xmlns:ns1="http://schemas.xmlsoap.org/ws/2004/03/addressing"><ns1:Address\ xsi:type="ns1:AttributedURI">https://160.103.6.173:8443/wsrf/services/DelegationService</ns1:Address><ns1:ReferenceProperties\ xsi:type="ns1:ReferencePropertiesType"><ns1:DelegationKey\ xmlns:ns1="http://www.globus.org/08/2004/delegationService">a6fb09d0-eba6-11db-b037-fdb6d96494eb</ns1:DelegationKey></ns1:ReferenceProperties><ns1:ReferenceParameters\ xsi:type="ns1:ReferenceParametersType"/></stagingCredentialEndpoint><holdState>StageIn</holdState></job> NULL' 4/16 01:11:33 [18223] GAHP[18272] -> 'S' 4/16 01:11:36 [18223] GAHP[18272] <- 'RESULTS' 4/16 01:11:36 [18223] GAHP[18272] -> 'R' 4/16 01:11:36 [18223] GAHP[18272] -> 'S' '1' 4/16 01:11:36 [18223] GAHP[18272] -> '6' '0' 
'https://160.103.6.173:8443/wsrf/services/ManagedExecutableJobService?a7108da0-eba6-11db-8c87-828f9e54ef05' 'NULL'
4/16 01:11:36 [18223] (2684.0) doEvaluateState called: gmState GM_SUBMIT, globusState 32
4/16 01:11:36 [18223] (2684.0) gm state change: GM_SUBMIT -> GM_SUBMIT_SET_LIFETIME
4/16 01:11:36 [18223] Starting sent lease
4/16 01:11:36 [18223] *** (2684.0) CalculateLease: new lease should expire at 1176721896
4/16 01:11:36 [18223] GAHP[18272] <- 'GT4_SET_TERMINATION_TIME 7 https://160.103.6.173:8443/wsrf/services/ManagedExecutableJobService?a7108da0-eba6-11db-8c87-828f9e54ef05 43200'
4/16 01:11:36 [18223] GAHP[18272] -> 'S'
4/16 01:11:36 [18223] GAHP[18272] <- 'RESULTS'
4/16 01:11:36 [18223] GAHP[18272] -> 'R'
4/16 01:11:36 [18223] GAHP[18272] -> 'S' '1'
4/16 01:11:36 [18223] GAHP[18272] -> '7' '0' '1176721896' 'NULL'
4/16 01:11:36 [18223] (2684.0) doEvaluateState called: gmState GM_SUBMIT_SET_LIFETIME, globusState 32
4/16 01:11:36 [18223] (2684.0) UpdateJobLeaseSent(1176721896)
4/16 01:11:36 [18223] (2684.0) SetJobLeaseTimers()
....and so on......

And here is the output where the job is never submitted to WS-GRAM:
VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
4/16 00:51:00 [16900] found ProxyDelegation
4/16 00:51:07 [16900] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 25846 is alive.
4/16 00:51:52 [16900] Received CHECK_LEASES signal
4/16 00:51:52 [16900] Evaluating periodic job policy expressions.
4/16 00:51:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:51:52 [16900] in doContactSchedd()
4/16 00:51:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:51:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:51:52 [16900] querying for renewed leases
4/16 00:51:52 [16900] querying for removed/held jobs
4/16 00:51:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:51:52 [16900] Fetched 0 job ads from schedd
4/16 00:51:52 [16900] leaving doContactSchedd()
4/16 00:51:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:51:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:52:52 [16900] Received CHECK_LEASES signal
4/16 00:52:52 [16900] Evaluating periodic job policy expressions.
4/16 00:52:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:52:52 [16900] in doContactSchedd()
4/16 00:52:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:52:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:52:52 [16900] querying for renewed leases
4/16 00:52:52 [16900] querying for removed/held jobs
4/16 00:52:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:52:52 [16900] Fetched 0 job ads from schedd
4/16 00:52:52 [16900] leaving doContactSchedd()
4/16 00:52:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:52:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:53:07 [16900] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 25846 is alive.
4/16 00:53:52 [16900] Received CHECK_LEASES signal
4/16 00:53:52 [16900] Evaluating periodic job policy expressions.
4/16 00:53:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:53:52 [16900] in doContactSchedd()
4/16 00:53:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:53:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:53:52 [16900] querying for renewed leases
4/16 00:53:52 [16900] querying for removed/held jobs
4/16 00:53:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:53:52 [16900] Fetched 0 job ads from schedd
4/16 00:53:52 [16900] leaving doContactSchedd()
4/16 00:53:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:53:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:54:52 [16900] Getting monitoring info for pid 16900
4/16 00:54:52 [16900] Received CHECK_LEASES signal
4/16 00:54:52 [16900] Evaluating periodic job policy expressions.
4/16 00:54:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:54:52 [16900] in doContactSchedd()
4/16 00:54:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:54:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:54:52 [16900] querying for renewed leases
4/16 00:54:52 [16900] querying for removed/held jobs
4/16 00:54:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:54:52 [16900] Fetched 0 job ads from schedd
4/16 00:54:52 [16900] leaving doContactSchedd()
4/16 00:54:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:54:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:55:07 [16900] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 25846 is alive.
4/16 00:55:52 [16900] Received CHECK_LEASES signal
4/16 00:55:52 [16900] Evaluating periodic job policy expressions.
4/16 00:55:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:55:52 [16900] in doContactSchedd()
4/16 00:55:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:55:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:55:52 [16900] querying for renewed leases
4/16 00:55:52 [16900] querying for removed/held jobs
4/16 00:55:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:55:52 [16900] Fetched 0 job ads from schedd
4/16 00:55:52 [16900] leaving doContactSchedd()
4/16 00:55:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:55:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:55:57 [16900] *** checkDelegation()
4/16 00:55:57 [16900] new delegation
4/16 00:55:58 [16900] *** checkDelegation()
4/16 00:55:58 [16900] new delegation
4/16 00:55:58 [16900] delegate_credentials(https://gtx01.esrf.fr:8443/wsrf/services/DelegationFactoryService) failed!
4/16 00:56:52 [16900] Received CHECK_LEASES signal
4/16 00:56:52 [16900] Evaluating periodic job policy expressions.
4/16 00:56:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:56:52 [16900] in doContactSchedd()
4/16 00:56:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:56:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:56:52 [16900] querying for renewed leases
4/16 00:56:52 [16900] querying for removed/held jobs
4/16 00:56:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
4/16 00:56:52 [16900] Fetched 0 job ads from schedd
4/16 00:56:52 [16900] leaving doContactSchedd()
4/16 00:56:55 [16900] GAHP[16901] <- 'RESULTS'
4/16 00:56:55 [16900] GAHP[16901] -> 'S' '0'
4/16 00:57:07 [16900] DaemonCore::IsPidAlive(): kill returned EPERM, assuming pid 25846 is alive.
4/16 00:57:52 [16900] Received CHECK_LEASES signal
4/16 00:57:52 [16900] Evaluating periodic job policy expressions.
4/16 00:57:52 [16900] TOUCH_LOG_INTERVAL is undefined, using default value of 60
4/16 00:57:52 [16900] in doContactSchedd()
4/16 00:57:52 [16900] GRIDMANAGER_TIMEOUT_MULTIPLIER is undefined, using default value of 0
4/16 00:57:52 [16900] SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
4/16 00:57:52 [16900] querying for renewed leases
4/16 00:57:52 [16900] querying for removed/held jobs
4/16 00:57:52 [16900] Using constraint ((Owner=?="klotz"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
...and so on forever .....

condor_q -globus on the machine where the job stays in state 'I' forever shows "UNSUBMITTED".

Does anybody know why it works on one machine and not on the other? Can anybody tell me how gahp communicates with WS-GRAM, or how to get more logging from gahp? I would be glad to get a hint on where to look to settle this problem.

Cheers.....

--
Dr.W-D Klotz - Europ. Synch. Rad. Facility (ESRF) - 6 r Jules Horowitz, BP 220, 38043 Grenoble, FRANCE
work: +33(0)4.76.88.29.21 fax:...24.27 mobile: +33(0)6.87.38.59.27
mail: wdklotz@xxxxxxxxx or klotz@xxxxxxx chat: skype
Please avoid sending me Word(.doc) or PowerPoint(.ppt) attachments.