On 27 Sep 2006, at 16:37, Jaime Frey wrote:
snip...
Jaime,

Thanks for the info - it turned out that this was a firewall issue, resolved by moving my tests to a new pair of machines. However, I have now run up against a new problem. (I'm now submitting from a 6.8.1 Condor machine to a gatekeeper running Globus 4.0.2 in front of a 6.8.1 Condor pool; the firewalls between the two machines have been set to allow traffic free access in either direction.)

I have also simplified my submit script a bit, to try to work out what is going on - all I want to see is the hostname of the execute node on the remote Condor pool:

    Universe      = grid
    grid_resource = gt4 cete.niees.group.cam.ac.uk Condor
    Executable    = /bin/hostname
    Notification  = NEVER
    Output        = host_$(PROCESS).out
    Error         = host.err
    Log           = host.log
    Queue 1

Again the job enters the local queue, the GridFTP server starts up, and then the job fails and enters the held state. This time I have a different error in the log (Globus error: Staging error for RSL element fileStageIn):

    000 (192.000.000) 09/29 16:31:55 Job submitted from host: <131.111.20.163:9661>
    ...
    017 (192.000.000) 09/29 16:32:50 Job submitted to Globus
        RM-Contact: cete.niees.group.cam.ac.uk
        JM-Contact: https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
        Can-Restart-JM: 0
    ...
    027 (192.000.000) 09/29 16:32:50 Job submitted to grid resource
        GridResource: gt4 cete.niees.group.cam.ac.uk Condor
        GridJobId: gt4 https://128.232.232.28:8443/wsrf/services/ManagedExecutableJobService?b8486b60-4fcf-11db-ba9e-8b423672fa7f
    ...
    012 (192.000.000) 09/29 16:32:53 Job was held.
        Globus error: Staging error for RSL element fileStageIn.
        Code 0 Subcode 0
    ...

However, running the equivalent command using the Globus client works (and the returned output file shows that the job ran on a Condor execute node):

    $ globusrun-ws -streaming -stdout-file testout -submit -job-delegate \
          -factory cete.niees.group.cam.ac.uk -factory-type Condor \
          -job-command /bin/hostname
    Delegating user credentials...Done.
    Submitting job...Done.
    Job ID: uuid:3cc015da-4faa-11db-8c27-00042388e7a7
    Termination time: 09/30/2006 11:04 GMT
    Current job state: Pending
    Current job state: Active
    Current job state: CleanUp-Hold
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.

Using Condor's GT2 interface also works as expected:

    Universe      = grid
    grid_resource = gt2 cete.niees.group.cam.ac.uk/jobmanager-condor
    Executable    = /bin/hostname
    Notification  = NEVER
    Output        = host_$(PROCESS).out
    Error         = host.err
    Log           = host.log
    Queue 1

And I see exactly the same behavior if I replace all the Condor jobmanager commands with fork commands.

Again I'm after some help finding a starting place for debugging. Does anybody have any idea where to start? (A few things I'm planning to try myself are in the P.S. below.)

Cheers,
Andrew

Dr Andrew Walker
Department of Earth Sciences
University of Cambridge
Downing Street
Cambridge CB2 3EQ
UK

phone +44 (0)1223 333432
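
P.S. A few things I'm planning to try next, in case anyone can tell me whether I'm looking in the right place - these are guesses on my part, not verified fixes. First, turning up the Gridmanager logging on the submit machine, since (as I understand it) the Gridmanager is what drives the GT4 staging:

    # Added to the submit machine's condor_config (picked up after a
    # condor_reconfig); the Gridmanager then writes a verbose log, which
    # on my machine appears as GridmanagerLog.<user> in the Condor log
    # directory:
    GRIDMANAGER_DEBUG = D_FULLDEBUG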
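
Second, since the hold message points at fileStageIn, testing the GridFTP path between the two machines by hand. I realise this may only be an approximation - my understanding is that Condor starts its own per-job GridFTP server on the submit side and the stage-in pulls from that - but it should at least show whether basic transfers work. The /tmp paths below are just scratch names I picked:

    # From the submit machine: push a file to the gatekeeper over
    # GridFTP (default port 2811)...
    globus-url-copy file:///bin/hostname \
        gsiftp://cete.niees.group.cam.ac.uk/tmp/stagein-test

    # ...and pull it back again:
    globus-url-copy gsiftp://cete.niees.group.cam.ac.uk/tmp/stagein-test \
        file:///tmp/stagein-test.back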
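
Finally, pulling the full hold reason out of the local queue and watching the GT4 container log on the gatekeeper, in case either of them says more than "Code 0 Subcode 0":

    # On the submit machine - the full HoldReason for cluster 192:
    condor_q -long 192 | grep -i HoldReason

    # On the gatekeeper - assuming the container was started detached,
    # so that it logs to $GLOBUS_LOCATION/var/container.log:
    tail -f $GLOBUS_LOCATION/var/container.log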