Re: [Condor-users] [Globus-discuss] error submitting jobs to condor pool
- Date: Tue, 12 Jun 2007 22:52:33 +0700
- From: "Nano Surbakti" <nano.surbakti@xxxxxxxxx>
- Subject: Re: [Condor-users] [Globus-discuss] error submitting jobs to condor pool
Hi Martin,
The staging job works:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
nano@elka-113:~/Experiments/grid$ globusrun-ws -submit -Ft Condor -streaming -S -f globusmultijob.rsl
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:c65ee45a-18fa-11dc-adad-001676c58b92
Termination time: 06/13/2007 15:37 GMT
Current job state: StageIn
Current job state: Pending
Current job state: Active
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The corresponding messages in container.log:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
2007-06-12 22:37:07,324 INFO exec.StateMachine [RunQueueThread_9,logJobAccepted:3193] Job c6effee0-18fa-11dc-aa0c-938be5c4dcca accepted for local user 'nano'
2007-06-12 22:37:21,206 INFO exec.StateMachine [RunQueueThread_10,logJobSucceeded:3204] Job c6effee0-18fa-11dc-aa0c-938be5c4dcca finished successfully
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I think these log entries belong to the same job, since the timestamps
match. The job ID in the log differs from the one printed by
globusrun-ws, though; I don't know whether that is normal.
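A quick way to check that the times line up is to convert the container.log timestamps to GMT and compare them with the client output. This is only a sketch: it assumes container.log is written in the server's local time and that the server uses the same +0700 offset as this mail's Date header.

```python
from datetime import datetime, timedelta, timezone

# Assumption: container.log uses local time at +0700 (offset taken from the Date header).
LOCAL_TZ = timezone(timedelta(hours=7))
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the two container.log entries.
accepted = datetime.strptime("2007-06-12 22:37:07,324", LOG_FORMAT).replace(tzinfo=LOCAL_TZ)
finished = datetime.strptime("2007-06-12 22:37:21,206", LOG_FORMAT).replace(tzinfo=LOCAL_TZ)

# "Termination time: 06/13/2007 15:37 GMT" from the globusrun-ws output.
termination = datetime(2007, 6, 13, 15, 37, tzinfo=timezone.utc)

accepted_utc = accepted.astimezone(timezone.utc)
lifetime = termination - accepted_utc.replace(second=0, microsecond=0)
runtime = finished - accepted

print("accepted (GMT):", accepted_utc)
print("termination minus acceptance (to the minute):", lifetime)
print("time between accepted and finished:", runtime)
```

Under that assumption the job was accepted at 15:37 GMT, exactly one day before the termination time the client printed, which supports reading both outputs as the same submission.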
This is the job file:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
<?xml version="1.0" encoding="UTF-8"?>
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
<factoryEndpoint>
<wsa:Address>https://elka-113.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
<wsa:ReferenceProperties>
<gram:ResourceID>Multi</gram:ResourceID>
</wsa:ReferenceProperties>
</factoryEndpoint>
<directory>${GLOBUS_USER_HOME}/test</directory>
<count>1</count>
<job>
<factoryEndpoint>
<wsa:Address>https://elka-113.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService</wsa:Address>
<wsa:ReferenceProperties>
<gram:ResourceID>Condor</gram:ResourceID>
</wsa:ReferenceProperties>
</factoryEndpoint>
<executable>/usr/bin/java</executable>
<argument>-classpath</argument>
<argument>.:jai_core.jar:jai_codec.jar</argument>
<argument>Encoder</argument>
<argument>xrayA-00-00.bmp</argument>
<stdout>${GLOBUS_USER_HOME}/target/stdout</stdout>
<stderr>${GLOBUS_USER_HOME}/target/stderr</stderr>
<fileStageIn>
<transfer>
<sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/Encoder.class</sourceUrl>
<destinationUrl>file:///${GLOBUS_USER_HOME}/target/Encoder.class</destinationUrl>
</transfer>
<transfer>
<sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/jai_core.jar</sourceUrl>
<destinationUrl>file:///${GLOBUS_USER_HOME}/target/jai_core.jar</destinationUrl>
</transfer>
<transfer>
<sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/jai_codec.jar</sourceUrl>
<destinationUrl>file:///${GLOBUS_USER_HOME}/target/jai_codec.jar</destinationUrl>
</transfer>
<transfer>
<sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/codebook</sourceUrl>
<destinationUrl>file:///${GLOBUS_USER_HOME}/target/codebook</destinationUrl>
</transfer>
<transfer>
<sourceUrl>gsiftp://elka-113.ee.itb.ac.id:2811/home/nano/test/xrayA-00-00.bmp</sourceUrl>
<destinationUrl>file:///${GLOBUS_USER_HOME}/target/xrayA-00-00.bmp</destinationUrl>
</transfer>
</fileStageIn>
<fileCleanUp>
<deletion><file>file:///${GLOBUS_USER_HOME}/target/Encoder.class</file></deletion>
<deletion><file>file:///${GLOBUS_USER_HOME}/target/jai_core.jar</file></deletion>
<deletion><file>file:///${GLOBUS_USER_HOME}/target/jai_codec.jar</file></deletion>
<deletion><file>file:///${GLOBUS_USER_HOME}/target/codebook</file></deletion>
<deletion><file>file:///${GLOBUS_USER_HOME}/target/xrayA-00-00.bmp</file></deletion>
</fileCleanUp>
<extensions>
<condorsubmit name="universe">Java</condorsubmit>
<condorsubmit name="should_transfer_files">YES</condorsubmit>
<condorsubmit name="when_to_transfer_output">ON_EXIT_OR_EVICT</condorsubmit>
<condorsubmit name="requirements">Arch == "INTEL" &amp;&amp; OpSys == "WINNT51" || Arch == "INTEL" &amp;&amp; OpSys == "LINUX"</condorsubmit>
</extensions>
</job>
</multiJob>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
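To double-check what the extensions block actually hands over, the job description can be parsed and the condorsubmit entries listed. The sketch below uses Python's standard xml.etree.ElementTree on a trimmed copy of the file (extensions only; staging and cleanup left out). A literal "&&" is not well-formed XML, so the trimmed copy writes it as "&amp;&amp;".

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the job description: only the parts needed to read the extensions.
RSL = """\
<multiJob xmlns:gram="http://www.globus.org/namespaces/2004/10/gram/job"
          xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/03/addressing">
  <job>
    <executable>/usr/bin/java</executable>
    <extensions>
      <condorsubmit name="universe">Java</condorsubmit>
      <condorsubmit name="should_transfer_files">YES</condorsubmit>
      <condorsubmit name="when_to_transfer_output">ON_EXIT_OR_EVICT</condorsubmit>
      <condorsubmit name="requirements">Arch == "INTEL" &amp;&amp;
        OpSys == "WINNT51" || Arch == "INTEL" &amp;&amp; OpSys ==
        "LINUX"</condorsubmit>
    </extensions>
  </job>
</multiJob>
"""

root = ET.fromstring(RSL)
for job in root.findall("job"):
    # Collapse line breaks inside each value so it prints on one line.
    settings = {
        entry.get("name"): " ".join(entry.text.split())
        for entry in job.findall("extensions/condorsubmit")
    }
    for name, value in settings.items():
        print(f"{name} = {value}")

# An unescaped "&&" is rejected by the XML parser.
try:
    ET.fromstring('<condorsubmit>Arch == "INTEL" && OpSys == "LINUX"</condorsubmit>')
    raw_ok = True
except ET.ParseError:
    raw_ok = False
print("unescaped && accepted:", raw_ok)
```

This only shows what the XML contains after parsing; it does not show what the GRAM Condor adapter does with these values.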
Although it works, one problem remains: when I query the job with
condor_q -better-analyze, it reports the job requirement as
Arch == "INTEL" && OpSys == "LINUX", even though I explicitly stated
in the job file that I also want the job to run on WINNT51.
The execute nodes in my Condor pool are 4 Windows machines and only
one Linux machine.
Here's the result of condor_q -better-analyze:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
128.000: Run analysis summary. Of 7 machines,
5 are rejected by your job's requirements
0 reject your job because of their own requirements
0 match but are serving users with a better priority in the pool
2 match but reject the job for unknown reasons
0 match but will not currently preempt their existing job
0 are available to run your job
The Requirements expression for your job is:
( target.OpSys == "LINUX" && target.Arch == "INTEL" ) &&
( target.Disk >= DiskUsage ) && ( ( target.Memory * 1024 ) >= ImageSize ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
    Condition                                               Machines Matched  Suggestion
    ---------                                               ----------------  ----------
1   target.OpSys == "LINUX"                                 2
2   ( TARGET.FileSystemDomain == "elka-113.ee.itb.ac.id" )  2
3   target.Arch == "INTEL"                                  7
4   ( target.Disk >= 10000 )                                7
5   ( ( 1024 * target.Memory ) >= 10000 )                   7
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
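To rule out an operator-precedence mistake in the job file, the requested expression and the OpSys/Arch part of the reported one can be compared on sample machines. This is a rough sketch in Python, whose and/or precedence matches the ClassAd && and ||. The two machine ads are hypothetical and carry only the attributes mentioned in this post.

```python
# Hypothetical machine ads, one per platform mentioned in the post.
machines = [
    {"Name": "windows-node", "Arch": "INTEL", "OpSys": "WINNT51"},
    {"Name": "linux-node", "Arch": "INTEL", "OpSys": "LINUX"},
]


def requested(ad):
    """Requirements as written in the job file (&& binds tighter than ||)."""
    return (ad["Arch"] == "INTEL" and ad["OpSys"] == "WINNT51"
            or ad["Arch"] == "INTEL" and ad["OpSys"] == "LINUX")


def reported(ad):
    """OpSys/Arch part of the expression shown by condor_q -better-analyze."""
    return ad["OpSys"] == "LINUX" and ad["Arch"] == "INTEL"


for ad in machines:
    print(f'{ad["Name"]}: requested={requested(ad)} reported={reported(ad)}')
```

The requested expression accepts both platforms, so the restriction to LINUX does not come from how the expression in the job file is parsed. Note that the table above also lists the FileSystemDomain condition as matching only 2 machines.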
Actually, I need to run hundreds of jobs, so I think it's better to use
condor_submit (Condor-G), right? Can you diagnose why I had problems
earlier when submitting the jobs with Condor-G?
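For illustration, this is roughly how a batch of such jobs could be written into a single Condor-G submit description. It is a sketch under assumptions, not a tested setup: the grid_resource line follows the "GridResource: gt4 ... Condor" form in the Condor log quoted below, the image names and the encoder.log, .out and .err file names are made up, and it does nothing about the staging failure seen with Condor-G.

```python
# Factory URL as used in the job description; "Condor" is the GRAM scheduler name.
FACTORY = "https://elka-113.ee.itb.ac.id:8443/wsrf/services/ManagedJobFactoryService"


def submit_description(images):
    """Build one submit description that queues a grid-universe job per image."""
    lines = [
        "universe = grid",
        f"grid_resource = gt4 {FACTORY} Condor",
        "executable = /usr/bin/java",
        "transfer_executable = false",
        "log = encoder.log",  # hypothetical log file name
    ]
    for image in images:
        lines += [
            f"arguments = -classpath .:jai_core.jar:jai_codec.jar Encoder {image}",
            f"transfer_input_files = Encoder.class, jai_core.jar, jai_codec.jar, codebook, {image}",
            f"output = {image}.out",  # hypothetical output file name
            f"error = {image}.err",   # hypothetical error file name
            "queue",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical input names modelled on xrayA-00-00.bmp.
images = [f"xrayA-00-{index:02d}.bmp" for index in range(3)]
text = submit_description(images)
print(text)
```

The printed text would be saved to a file and passed to condor_submit; whether input transfer works this way depends on the local GridFTP and delegation setup.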
Very best regards,
--
Nano Surbakti
On 6/12/07, feller@xxxxxxxxxxx <feller@xxxxxxxxxxx> wrote:
Nano,
see
http://www.globus.org/toolkit/docs/4.0/admin/docbook/quickstart.html#q-gram2
for how to submit a staging job.
At first glance it seems that delegation didn't work.
Please try the globusrun-ws job with staging, and then send
the output of the client and the relevant parts of the
container logfile.
Martin
> Hi Martin,
>
> While I'm reading the globusrun-ws manual, here is the container log:
>
> --------------------------------------
> 2007-06-12 10:15:45,455 INFO exec.StateMachine
> [RunQueueThread_4,logJobAccepted:3193] Job
> 333f5540-1893-11dc-bb3f-aec5afd22587 accepted for local user 'nano'
> 2007-06-12 10:15:50,878 ERROR exec.StateMachine
> [RunQueueThread_9,fileCleanUp:2730] A secondary fault occured while
> trying to gracefully fail.
> AxisFault
> faultCode:
> {http://schemas.xmlsoap.org/soap/envelope/}Server.userException
> faultSubcode:
> faultString: java.rmi.RemoteException: Unable to create RFT resource;
> nested exception is:
> org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
> faultActor:
> faultNode:
> faultDetail:
> {http://xml.apache.org/axis/}stackTrace:java.rmi.RemoteException:
> Unable to create RFT resource; nested exception is:
> org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
> at
> org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:245)
> at sun.reflect.GeneratedMethodAccessor287.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.apache.axis.providers.java.RPCProvider.invokeMethod(RPCProvider.java:384)
> at
> org.globus.axis.providers.RPCProvider.invokeMethodSub(RPCProvider.java:107)
> at
> org.globus.axis.providers.PrivilegedInvokeMethodAction.run(PrivilegedInvokeMethodAction.java:42)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at org.globus.gsi.jaas.GlobusSubject.runAs(GlobusSubject.java:55)
> at org.globus.gsi.jaas.JaasSubject.doAs(JaasSubject.java:90)
> at
> org.globus.axis.providers.RPCProvider.invokeMethod(RPCProvider.java:97)
> at
> org.apache.axis.providers.java.RPCProvider.processMessage(RPCProvider.java:281)
> at
> org.apache.axis.providers.java.JavaProvider.invoke(JavaProvider.java:319)
> at
> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
> at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
> at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
> at org.apache.axis.handlers.soap.SOAPService.invoke(SOAPService.java:450)
> at org.apache.axis.server.AxisServer.invoke(AxisServer.java:285)
> at org.globus.wsrf.container.ServiceThread.doPost(ServiceThread.java:664)
> at
> org.globus.wsrf.container.ServiceThread.process(ServiceThread.java:382)
> at
> org.globus.wsrf.container.GSIServiceThread.process(GSIServiceThread.java:147)
> at org.globus.wsrf.container.ServiceThread.run(ServiceThread.java:291)
> Caused by: org.globus.transfer.reliable.service.exception.RftException:
> Error processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
> at
> org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:391)
> at
> org.globus.transfer.reliable.service.ReliableFileTransferResource.processDelegatedCredential(ReliableFileTransferResource.java:354)
> at
> org.globus.transfer.reliable.service.ReliableFileTransferHome.create(ReliableFileTransferHome.java:134)
> at
> org.globus.transfer.reliable.service.factory.ReliableFileTransferFactoryService.createReliableFileTransfer(ReliableFileTransferFactoryService.java:235)
> ... 22 more
>
> {http://xml.apache.org/axis/}hostname:hobitton
>
> java.rmi.RemoteException: Unable to create RFT resource; nested exception
> is:
> org.globus.transfer.reliable.service.exception.RftException: Error
> processing delegated credentialError getting delegation resource
> [Caused by: org.globus.wsrf.NoSuchResourceException] [Caused by: Error
> getting delegation resource [Caused by:
> org.globus.wsrf.NoSuchResourceException]]
> at
> org.apache.axis.message.SOAPFaultBuilder.createFault(SOAPFaultBuilder.java:221)
> at
> org.apache.axis.message.SOAPFaultBuilder.endElement(SOAPFaultBuilder.java:128)
> at
> org.apache.axis.encoding.DeserializationContext.endElement(DeserializationContext.java:1087)
> at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
> at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown
> Source)
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
> at
> org.apache.axis.encoding.DeserializationContext.parse(DeserializationContext.java:227)
> at org.apache.axis.SOAPPart.getAsSOAPEnvelope(SOAPPart.java:645)
> at org.apache.axis.Message.getSOAPEnvelope(Message.java:424)
> at
> org.apache.axis.message.addressing.handler.AddressingHandler.processClientResponse(AddressingHandler.java:305)
> at
> org.apache.axis.message.addressing.handler.AddressingHandler.invoke(AddressingHandler.java:110)
> at
> org.apache.axis.strategies.InvocationStrategy.visit(InvocationStrategy.java:32)
> at org.apache.axis.SimpleChain.doVisiting(SimpleChain.java:118)
> at org.apache.axis.SimpleChain.invoke(SimpleChain.java:83)
> at org.apache.axis.client.AxisClient.invoke(AxisClient.java:190)
> at org.apache.axis.client.Call.invokeEngine(Call.java:2727)
> at org.apache.axis.client.Call.invoke(Call.java:2710)
> at org.apache.axis.client.Call.invoke(Call.java:2386)
> at org.apache.axis.client.Call.invoke(Call.java:2309)
> at org.apache.axis.client.Call.invoke(Call.java:1766)
> at
> org.globus.rft.generated.bindings.ReliableFileTransferFactoryPortTypeSOAPBindingStub.createReliableFileTransfer(ReliableFileTransferFactoryPortTypeSOAPBindingStub.java:874)
> at
> org.globus.exec.service.exec.utils.StagingHelper.submitStagingRequest(StagingHelper.java:168)
> at
> org.globus.exec.service.exec.StateMachine.fileCleanUp(StateMachine.java:2716)
> at
> org.globus.exec.service.exec.StateMachine.processFailureFileCleanUpState(StateMachine.java:2091)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.globus.exec.service.exec.StateMachine.processState(StateMachine.java:302)
> at org.globus.exec.service.exec.RunThread.run(RunThread.java:85)
> 2007-06-12 10:15:51,055 INFO exec.StateMachine
> [RunQueueThread_9,logJobFailed:3212] Job
> 333f5540-1893-11dc-bb3f-aec5afd22587 failed
> --------------------------------------
> This time I submitted only one job, to keep the log/error output short.
>
> To make the log complete :) ... here's what the Condor log said about the
> same job:
> --------------------------------------
> 017 (096.000.000) 06/12 10:15:50 Job submitted to Globus
> RM-Contact:
> https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
> JM-Contact:
> https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?333f5540-1893-11dc-bb3f-aec5afd22587
> Can-Restart-JM: 0
> ...
> 027 (096.000.000) 06/12 10:15:50 Job submitted to grid resource
> GridResource: gt4
> https://167.205.65.113:8443/wsrf/services/ManagedJobFactoryService
> Condor
> GridJobId: gt4
> https://167.205.65.113:8443/wsrf/services/ManagedExecutableJobService?333f5540-1893-11dc-bb3f-aec5afd22587
> ...
> 012 (096.000.000) 06/12 10:15:51 Job was held.
> Globus error: Staging error for RSL element fileStageIn.
> Code 0 Subcode 0
> --------------------------------------
>
> Big THANKS !!
>
> --
> Nano Surbakti
>
>
> On 6/12/07, feller@xxxxxxxxxxx <feller@xxxxxxxxxxx> wrote:
>> Ok, what does the server-side GT4 container logfile say?
>> If it's available, please post it to the list.
>> If not: Do you have the Condor's GridmanagerLog?
>> Also: please try to submit a staging job with globusrun-ws
>> (instead of condor-g). What's the output of the client and
>> what does the server-log say (if this fails too)?
>> Martin
>>
>
>