On 27 Sep 2006, at 16:37, Jaime Frey wrote:
snip...
Jaime,

Thanks for the info - it turned out that this was a firewall issue, resolved by moving my tests to a new pair of machines. However, I have now run up against a new problem. (I'm now submitting from a 6.8.1 Condor machine to a gatekeeper running Globus 4.0.2 in front of a 6.8.1 Condor pool; the firewalls between the two machines have been set to allow traffic free access in either direction.)

I have also simplified my submit script a bit, to try to work out what is going on - all I want to see is the hostname of the execute node on the remote Condor pool:

    Universe      = grid
    grid_resource = gt4 cete.niees.group.cam.ac.uk Condor
    Executable    = /bin/hostname
    Notification  = NEVER
    Output        = host_$(PROCESS).out
    Error         = host.err
    Log           = host.log
    Queue 1

Again the job enters the local queue, the GridFTP server starts up, and then the job fails and enters the held state. This time I have a different error in the log (Globus error: Staging error for RSL element fileStageIn):

    000 (192.000.000) 09/29 16:31:55 Job submitted from host: <131.111.20.163:9661>
    ...
    017 (192.000.000) 09/29 16:32:50 Job submitted to Globus
        RM-Contact: cete.niees.group.cam.ac.uk
        JM-Contact: https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
        Can-Restart-JM: 0
    ...
    027 (192.000.000) 09/29 16:32:50 Job submitted to grid resource
        GridResource: gt4 cete.niees.group.cam.ac.uk Condor
        GridJobId: gt4 https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
    ...
    012 (192.000.000) 09/29 16:32:53 Job was held.
        Globus error: Staging error for RSL element fileStageIn.
        Code 0 Subcode 0
    ...

However, running the equivalent command using the Globus client works (and the returned output file shows that the job ran on a Condor execute node):

    $ globusrun-ws -streaming -stdout-file testout -submit -job-delegate \
          -factory cete.niees.group.cam.ac.uk -factory-type Condor \
          -job-command /bin/hostname
    Delegating user credentials...Done.
    Submitting job...Done.
    Job ID: uuid:3cc015da-4faa-11db-8c27-00042388e7a7
    Termination time: 09/30/2006 11:04 GMT
    Current job state: Pending
    Current job state: Active
    Current job state: CleanUp-Hold
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.

Using Condor's GT2 interface also works as expected:

    Universe      = grid
    grid_resource = gt2 cete.niees.group.cam.ac.uk/jobmanager-condor
    Executable    = /bin/hostname
    Notification  = NEVER
    Output        = host_$(PROCESS).out
    Error         = host.err
    Log           = host.log
    Queue 1

And I see exactly the same behavior if I replace all the Condor jobmanager commands with fork commands.

Again I'm after some help finding a starting place for debugging. Does anybody have any idea where to start? (A few things I'm planning to try myself are in the P.S. below.)

Cheers,
Andrew

Dr Andrew Walker
Department of Earth Sciences
University of Cambridge
Downing Street
Cambridge CB2 3EQ
UK

phone +44 (0)1223 333432
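
P.S. A few things I'm planning to try next, in case anyone can tell me whether I'm looking in the right place - these are guesses on my part, not verified fixes. First, turning up the Gridmanager logging on the submit machine, since (as I understand it) the Gridmanager is what drives the GT4 staging:

    # Added to the submit machine's condor_config (picked up after a
    # condor_reconfig); the Gridmanager then writes a verbose log, which
    # on my machine appears as GridmanagerLog.<user> in the Condor log
    # directory:
    GRIDMANAGER_DEBUG = D_FULLDEBUG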
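
Second, since the hold message points at fileStageIn, testing the GridFTP path between the two machines by hand. I realise this may only be an approximation - my understanding is that Condor starts its own per-job GridFTP server on the submit side and the stage-in pulls from that - but it should at least show whether basic transfers work. The /tmp paths below are just scratch names I picked:

    # From the submit machine: push a file to the gatekeeper over
    # GridFTP (default port 2811)...
    globus-url-copy file:///bin/hostname \
        gsiftp://cete.niees.group.cam.ac.uk/tmp/stagein-test

    # ...and pull it back again:
    globus-url-copy gsiftp://cete.niees.group.cam.ac.uk/tmp/stagein-test \
        file:///tmp/stagein-test.back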
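
Finally, pulling the full hold reason out of the local queue and watching the GT4 container log on the gatekeeper, in case either of them says more than "Code 0 Subcode 0":

    # On the submit machine - the full HoldReason for cluster 192:
    condor_q -long 192 | grep -i HoldReason

    # On the gatekeeper - assuming the container was started detached,
    # so that it logs to $GLOBUS_LOCATION/var/container.log:
    tail -f $GLOBUS_LOCATION/var/container.log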