[Condor-users] Condor-G and Globus-RSL
- Date: Thu, 06 Jul 2006 23:56:56 +0200
- From: Martin Feller <feller@xxxxxxxxxxx>
- Subject: [Condor-users] Condor-G and Globus-RSL
I have two questions concerning jobs submitted in the Globus universe via
condor_submit:
GT: 4.0.2
Condor: 6.7.19
1) Is there a way to influence the Globus RSL created by Condor-G and change
some of the settings it generates, rather than just inserting additional
name-value pairs via globus_xml at the end of the XML job description?
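What I can do today looks roughly like this (a sketch only; the <project>
element and its value are purely illustrative placeholders for whatever
name-value pair one wants to append):

# Sketch: globus_xml only appends extra XML at the end of the generated
# job description; it does not let me change elements such as maxAttempts
# that Condor-G itself already creates.
globus_xml = <project>myIllustrativeProject</project>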
Why this question: I'm doing throughput testing with WS-GRAM and submitting
3500 jobs via condor_submit to a GT4 container.
Condor Job description:
####################
Universe = grid
Grid_Type = gt4
Jobmanager_Type = Condor
GlobusScheduler = osg-test1.unl.edu:9443
Executable = mysleep
Arguments = test_output$(Process) test_input$(Process)
when_to_transfer_output = ON_EXIT
transfer_input_files = test_input
transfer_output_files = test_output$(Process)
Output = job_sleep_io.output$(Process)
Error = job_sleep_io.error$(Process)
Log = job_sleep_io.log
Queue 3500
Sometimes (!) 1-5 of the 3500 jobs keep hanging in state StageInResponse
and do not continue. Quite often there are errors during the stageIn
process. I had a look at the RSL created by Condor-G and found that there
are four transfers during stageIn:
Globus Job Description (stageIn-part):
###############################
...
<ns2:fileStageIn>
  <ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">5</ns13:maxAttempts>
  <ns14:transferCredentialEndpoint...>
    ...
  </ns14:transferCredentialEndpoint>
  <ns18:transfer ...>
    <ns18:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns18:sourceUrl>
    <ns18:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch</ns18:destinationUrl>
  </ns18:transfer>
  <ns19:transfer ...>
    <ns19:sourceUrl>gsiftp://osg-test2.unl.edu:2811/tmp/condor_g_scratch.0x9fa5438.30776/empty_dir_u1465/</ns19:sourceUrl>
    <ns19:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/</ns19:destinationUrl>
  </ns19:transfer>
  <ns20:transfer ...>
    <ns20:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/mysleep</ns20:sourceUrl>
    <ns20:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/mysleep</ns20:destinationUrl>
  </ns20:transfer>
  <ns21:transfer ...>
    <ns21:sourceUrl>gsiftp://osg-test2.unl.edu:2811/home/feller/myTests/3500_jobs_2006_07_06_Mxm1024M/test_input</ns21:sourceUrl>
    <ns21:destinationUrl>gsiftp://osg-test1.unl.edu:2811/home/gpn/.globus/scratch/job_8ff6c880-0cc4-11db-b248-a9807d8bba43/test_input</ns21:destinationUrl>
  </ns21:transfer>
</ns2:fileStageIn>
...
I assume the following happens and is responsible for the errors:
if the container is busy (and it is busy with 3500 jobs), the first two
transfers are sometimes not finished when the third one starts. In that
case the directory job_8ff6c880-0cc4-11db-b248-a9807d8bba43 (created by
the second transfer) does not yet exist, which results in a staging
exception in the GT4 container.
The same error occurs for the fourth transfer, but much less frequently.
To reproduce jobs staying in state StageInResponse without continuing in
the GT4 container, or, even better, to help all jobs finish reliably, I
would like to try setting maxAttempts to a different value and see whether
it has any impact on these jobs.
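In other words, what I would like Condor-G to end up generating is a
stageIn part along these lines (a sketch only; the value 10 is purely
illustrative, and the namespace prefixes are copied from the RSL above):

<!-- Sketch of the desired result: the same stageIn transfers as above,
     but with a maxAttempts value chosen by me instead of the default 5. -->
<ns2:fileStageIn>
  <ns13:maxAttempts xmlns:ns13="http://www.globus.org/namespaces/2004/10/rft">10</ns13:maxAttempts>
  ...
</ns2:fileStageIn>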
2) How is the Globus job ID (e.g. 8ff6c880-0cc4-11db-b248-a9807d8bba43)
created by Condor-G?
I found such an ID in each Globus RSL, and it seems to correspond to the
Globus job IDs. Sometimes two of the 3500 Condor jobs are mapped to the
same Globus ID, and one of these two jobs then does not finish
successfully.
Thanks in advance for any explanation or advice!
Martin