[HTCondor-users] Condor-C Grid Resource - multiple grid resources - one resource down
- Date: Tue, 19 Aug 2014 01:32:16 +0000
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: [HTCondor-users] Condor-C Grid Resource - multiple grid resources - one resource down
Hi All
Some more testing of Condor-C grid resource stuff.
I can specify multiple grid resources OK, as well as limit the number
of jobs submitted to each resource.
Submit file (excerpt) on originating schedd:
universe = grid
resource_name = condor $RANDOM_CHOICE(condorsubmit1.csiro.au, condorsubmit2.csiro.au, \
condorsubmit3.csiro.au, condorsubmit4.csiro.au, \
condorsubmit5.csiro.au, condorsubmit6.csiro.au) \
condor-centralmanager.csiro.au
Config_file (excerpt) on originating schedd:
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 5
With this in place, the jobs submitted from the originating schedd are spread
evenly across the (in this example) 6 grid resources, which are 6 other
remote schedds. When the job limit is also set, jobs are fed to each remote
schedd gradually and kept at a maximum of 5 per schedd (in this example).
If I deliberately disable one of the 6 remote schedds though, the gridmanager
notices and logs that the resource is down, but how can I tell it to retry
on another grid resource?
I thought of using periodic_hold and periodic_release for a job that's been in
the Idle state for more than, say, 30 minutes, but this will not work because the
grid_resource in the job classad has already been fixed at submit time by
$RANDOM_CHOICE, so a released job would just retry the same (down) resource.
I was hoping for something a bit more elegant/simple rather than having to run a
separate script running condor_q with a constraint looking for jobs idle > 30mins,
extracting the job cluster.process numbers, and looping through each using
condor_qedit to modify the GridResource job classad.
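
For reference, the workaround script described above might look roughly like
the Python sketch below. It is only an illustration under assumptions: the
function names are made up for this example, the 30 minute threshold is the
figure mentioned above, and the hostnames are the ones from the submit file
excerpt. It relies on condor_q -constraint/-af and on condor_qedit taking a
job id, an attribute name and a quoted string value. Whether the gridmanager
picks up the edited GridResource without a hold/release cycle has not been
verified here.

```python
import random
import subprocess

# Hostnames taken from the submit file excerpt above.
REMOTE_SCHEDDS = [f"condorsubmit{i}.csiro.au" for i in range(1, 7)]
POOL = "condor-centralmanager.csiro.au"
IDLE_SECONDS = 30 * 60  # assumed threshold ("say 30 minutes")


def build_constraint(idle_seconds):
    """Constraint for grid-universe jobs that have been Idle too long."""
    # JobUniverse 9 is the grid universe, JobStatus 1 is Idle.
    return (
        "JobUniverse == 9 && JobStatus == 1 && "
        f"(time() - EnteredCurrentStatus) > {idle_seconds}"
    )


def parse_jobs(condor_q_output):
    """Turn 'ClusterId ProcId condor <schedd> <pool>' lines into (job_id, schedd)."""
    jobs = []
    for line in condor_q_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        cluster, proc, _grid_type, schedd = fields[:4]
        jobs.append((f"{cluster}.{proc}", schedd))
    return jobs


def pick_other(current, candidates, rng=random):
    """Pick a random schedd other than the one the job is currently bound to."""
    return rng.choice([c for c in candidates if c != current])


def qedit_command(job_id, schedd, pool):
    """Build the condor_qedit call; string values need embedded double quotes."""
    return ["condor_qedit", job_id, "GridResource", f'"condor {schedd} {pool}"']


def reassign_stale_jobs():
    """Query the local schedd and repoint every long-idle grid job."""
    result = subprocess.run(
        ["condor_q", "-constraint", build_constraint(IDLE_SECONDS),
         "-af", "ClusterId", "ProcId", "GridResource"],
        check=True, capture_output=True, text=True,
    )
    for job_id, schedd in parse_jobs(result.stdout):
        new_schedd = pick_other(schedd, REMOTE_SCHEDDS)
        subprocess.run(qedit_command(job_id, new_schedd, POOL), check=True)
```

Calling reassign_stale_jobs() from cron on the originating schedd would do one
pass. Note that it picks the replacement at random, so it does not know which
resources are actually down and a job could be moved to another dead schedd.
That is part of why something built in would be preferable.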
Thanks for any info/help.
Cheers
Greg