[Condor-users] Problem with periodic_release and globus_resubmit
- Date: Thu, 12 Feb 2009 10:13:55 -0800
- From: Patrick Armstrong <patricka@xxxxxxx>
- Subject: [Condor-users] Problem with periodic_release and globus_resubmit
Hi there.
I've been using condor to submit jobs to gt4 resources, and I'd like
condor to resubmit jobs to a different resource when they fail. To
test this, I set up three Globus resources. One is deliberately
broken, so jobs sent there will always fail, and two resources are good.
I've been using the Condor-G documentation as a guide, and I've got it
working for the most part with a combination of periodic_release,
globus_resubmit, and LastMatchName, but I always seem to end up with one
or two jobs stuck in the idle state. I can nudge the stuck jobs into
being scheduled by submitting another job.
My periodic_release and globus_resubmit expressions are as follows:
PeriodicRelease = (NumSystemHolds >= NumJobMatches) &&
    (NumGlobusSubmits < 4) &&
    (HoldReason != "via condor_hold (by user (USER))") &&
    ((CurrentTime - EnteredCurrentStatus) > 60)
GlobusResubmit = (GridJobStatus =?= UNDEFINED) &&
    (NumSystemHolds > NumJobMatches)
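To make the difference between the two conditions easier to see, here is a rough Python sketch that mirrors them. It is only a model: ClassAd UNDEFINED is represented as None, full ClassAd three-valued logic is not reproduced, and the attribute values in example_ad (including the hold reason text) are made up for illustration rather than taken from a real job.

```python
# Sketch only: models the two ClassAd expressions above in plain Python.
# ClassAd UNDEFINED is represented as None; this is a simplification and
# does not reproduce full ClassAd three-valued logic.

USER_HOLD_REASON = "via condor_hold (by user (USER))"


def periodic_release(ad, now):
    """Mirror of the PeriodicRelease expression for a fully populated ad."""
    return (
        ad["NumSystemHolds"] >= ad["NumJobMatches"]
        and ad["NumGlobusSubmits"] < 4
        and ad["HoldReason"] != USER_HOLD_REASON
        and (now - ad["EnteredCurrentStatus"]) > 60
    )


def globus_resubmit(ad):
    """Mirror of the GlobusResubmit expression.

    `GridJobStatus =?= UNDEFINED` is modelled as "key missing or None".
    """
    return (
        ad.get("GridJobStatus") is None
        and ad["NumSystemHolds"] > ad["NumJobMatches"]
    )


# Hypothetical attribute values: a job held once by the system, matched
# once, with no grid status yet.
example_ad = {
    "NumSystemHolds": 1,
    "NumJobMatches": 1,
    "NumGlobusSubmits": 1,
    "HoldReason": "Globus error",  # made-up hold reason
    "EnteredCurrentStatus": 1000,
}

release = periodic_release(example_ad, now=1100)
resubmit = globus_resubmit(example_ad)
```

With equal hold and match counts, the release condition holds but the resubmit condition does not, because the first uses >= and the second uses a strict >. Once NumSystemHolds exceeds NumJobMatches, both hold in this model.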
Now, having submitted 20 jobs, all but two have completed
successfully. These two jobs are in the idle state, but they seem to
match my GlobusResubmit ClassAd expression:
[root@ms-gavia-testing ~]# condor_q -constraint "(GridJobStatus =?= UNDEFINED) && (NumSystemHolds > NumJobMatches)"
-- Submitter: ms-gavia-testing.phys.UVic.CA : <142.104.63.16:64055> : ms-gavia-testing.phys.UVic.CA
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
2269.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh
2273.0   dev07           2/10 16:38   0+00:00:00 I  0   0.0  run-run.sh
Here is an example of the logs the Negotiator is printing while in
this state:
2/11 10:09:20 ---------- Started Negotiation Cycle ----------
2/11 10:09:20 Phase 1: Obtaining ads from collector ...
2/11 10:09:20 Getting all public ads ...
2/11 10:09:20 Sorting 8 ads ...
2/11 10:09:20 Getting startd private ads ...
2/11 10:09:20 Got ads: 8 public and 1 private
2/11 10:09:20 Public ads include 1 submitter, 4 startd
2/11 10:09:20 Phase 2: Performing accounting ...
2/11 10:09:20 Phase 3: Sorting submitter ads by priority ...
2/11 10:09:20 Phase 4.1: Negotiating with schedds ...
2/11 10:09:20 Negotiating with dev07@xxxxxxxxxxxx at <142.104.63.16:52155>
2/11 10:09:20 0 seconds so far
2/11 10:09:20 Got NO_MORE_JOBS; done negotiating
2/11 10:09:20 ---------- Finished Negotiation Cycle ----------
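For anyone who wants to check their own NegotiatorLog for the same pattern, here is a minimal Python sketch that lists submitters whose negotiation went straight to NO_MORE_JOBS. The sample text is copied from the excerpt above with the wrapped line rejoined. The function name is hypothetical, and the sketch assumes that any job the schedd actually offered would show up as extra lines between "Negotiating with ..." and "Got NO_MORE_JOBS", which I have not verified against other Condor versions.

```python
import re

# Sample lines copied from the negotiator log excerpt above.
LOG = """\
2/11 10:09:20 ---------- Started Negotiation Cycle ----------
2/11 10:09:20 Phase 4.1: Negotiating with schedds ...
2/11 10:09:20 Negotiating with dev07@xxxxxxxxxxxx at <142.104.63.16:52155>
2/11 10:09:20 0 seconds so far
2/11 10:09:20 Got NO_MORE_JOBS; done negotiating
2/11 10:09:20 ---------- Finished Negotiation Cycle ----------
"""

# Matches the per-submitter line, not the "Negotiating with schedds" header.
NEGOTIATING = re.compile(r"Negotiating with (\S+) at")


def submitters_with_no_requests(log_text):
    """Return submitters whose negotiation ended with NO_MORE_JOBS and
    had no lines in between other than the elapsed-time line."""
    results = []
    current = None
    extra_lines = 0
    for line in log_text.splitlines():
        match = NEGOTIATING.search(line)
        if match:
            current = match.group(1)
            extra_lines = 0
        elif current is not None:
            if "Got NO_MORE_JOBS" in line:
                if extra_lines == 0:
                    results.append(current)
                current = None
            elif "seconds so far" not in line:
                extra_lines += 1
    return results


idle_submitters = submitters_with_no_requests(LOG)
```

On the excerpt above this returns the single dev07 submitter, which is consistent with the schedd not offering the two idle jobs to the negotiator at all.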
Why aren't these two jobs being rescheduled, and why does submitting
another job get them scheduled? I've also attached an example of a
full job description here: http://pastie.org/386200.txt
Any pointers would be very helpful.
--patrick