[Condor-users] Condor-G and Globus-RSL
- Date: Thu, 06 Jul 2006 23:56:56 +0200
- From: Martin Feller <feller@xxxxxxxxxxx>
- Subject: [Condor-users] Condor-G and Globus-RSL
I have two questions concerning jobs submitted in the Globus universe via
condor_submit:
GT: 4.0.2
Condor: 6.7.19
1) Is there a way to influence the Globus RSL created by Condor-G and change
some of the settings it generates, rather than just inserting additional
name-value pairs via globus_xml at the end of the XML job description?
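What I can do today looks roughly like this (a sketch only; the <project>
element and its value are purely illustrative placeholders for whatever
name-value pair one wants to append):

# Sketch: globus_xml only appends extra XML at the end of the generated
# job description; it does not let me change elements such as maxAttempts
# that Condor-G itself already creates.
globus_xml = <project>myIllustrativeProject</project>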
Why this question: I'm doing throughput testing with WS-GRAM and submitting
3500 jobs via condor_submit to a GT4 container.
Condor Job description:
####################
Universe = grid
Grid_Type = gt4
Jobmanager_Type = Condor
GlobusScheduler = osg-test1.unl.edu:9443
Executable = mysleep
Arguments = test_output$(Process) test_input$(Process)
when_to_transfer_output = ON_EXIT
transfer_input_files = test_input
transfer_output_files = test_output$(Process)
Output = job_sleep_io.output$(Process)
Error = job_sleep_io.error$(Process)
Log = job_sleep_io.log
Queue 3500
Sometimes (!) 1-5 of the 3500 jobs keep hanging in state StageInResponse
and do not continue. Quite often there are errors during the stageIn
process. I had a look at the RSL created by Condor-G and found that there
are four transfers during stageIn:
Globus Job Description (stageIn-part):
###############################
...
<ns2:fileStageIn>
  <ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">5</ns13:maxAttempts>
  <ns14:transferCredentialEndpoint...>
    ...
  </ns14:transferCredentialEndpoint>
  <ns18:transfer ...>
    <ns18:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns18:sourceUrl>
    <ns18:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch</ns18:destinationUrl>
  </ns18:transfer>
  <ns19:transfer ...>
    <ns19:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns19:sourceUrl>
    <ns19:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/</ns19:destinationUrl>
  </ns19:transfer>
  <ns20:transfer ...>
    <ns20:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/mysleep</ns20:sourceUrl>
    <ns20:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep</ns20:destinationUrl>
  </ns20:transfer>
  <ns21:transfer ...>
    <ns21:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/test_input</ns21:sourceUrl>
    <ns21:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/test_input</ns21:destinationUrl>
  </ns21:transfer>
</ns2:fileStageIn>
...
I assume the following happens and is responsible for the errors:
if the container is busy (and it is busy with 3500 jobs), the first two
transfers are sometimes not finished when the third one starts. In that
case the directory job_8ff6c880-0cc4-11db-b248-a9807d8bba43 (created by
the second transfer) does not yet exist, which results in a staging
exception in the GT4 container.
The same error occurs for the fourth transfer, but much less frequently.
To reproduce jobs staying in state StageInResponse without continuing in
the GT4 container, or, even better, to help all jobs finish reliably, I
would like to try setting maxAttempts to a different value and see whether
it has any impact on these jobs.
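In other words, what I would like Condor-G to end up generating is a
stageIn part along these lines (a sketch only; the value 10 is purely
illustrative, and the namespace prefixes are copied from the RSL above):

<!-- Sketch of the desired result: the same stageIn transfers as above,
     but with a maxAttempts value chosen by me instead of the default 5. -->
<ns2:fileStageIn>
  <ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">10</ns13:maxAttempts>
  ...
</ns2:fileStageIn>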
2) How is the Globus job ID (e.g. 8ff6c880-0cc4-11db-b248-a9807d8bba43)
created by Condor-G?
I found such an ID in each Globus RSL, and it seems to correspond to the
Globus job IDs. Sometimes two of the 3500 Condor jobs are mapped to the
same Globus ID, and one of these two jobs then does not finish
successfully.
Thanks in advance for any explanation or advice!
Martin