Re: [Condor-users] condor_rooster failing to crow
- Date: Tue, 12 Jan 2010 16:05:35 +0000
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] condor_rooster failing to crow
Hi Dan,
Thanks for the quick reply. I think something is falling through the cracks somewhere.
In the negotiator log I see
01/12 15:48:11 Phase 3: Sorting submitter ads by priority ...
01/12 15:48:11 Phase 4.1: Negotiating with schedds ...
01/12 15:48:11 numSlots = 1
01/12 15:48:11 slotWeightTotal = 1.000000
01/12 15:48:11 pieLeft = 1.000
01/12 15:48:11 NormalFactor = 1.000000
01/12 15:48:11 MaxPrioValue = 0.500000
01/12 15:48:11 NumSubmitterAds = 1
01/12 15:48:11 Negotiating with smithic@xxxxxxxxx at <138.253.100.178:58887>
01/12 15:48:11 0 seconds so far
01/12 15:48:11 Calculating submitter limit with the following parameters
01/12 15:48:11 SubmitterPrio = 0.500000
01/12 15:48:11 SubmitterPrioFactor = 1.000000
01/12 15:48:11 submitterShare = 1.000000
01/12 15:48:11 submitterAbsShare = 1.000000
01/12 15:48:11 submitterLimit = 1.000000
01/12 15:48:11 submitterUsage = 0.000000
01/12 15:48:11 Socket to smithic@xxxxxxxxx (<138.253.100.178:58887>) already in cache, reusing
01/12 15:48:11 Sending SEND_JOB_INFO/eom
01/12 15:48:11 Getting reply from schedd ...
01/12 15:48:11 Got JOB_INFO command; getting classad/eom
01/12 15:48:11 Request 00020.00000:
01/12 15:48:11 matchmakingAlgorithm: limit 1.000000 used 0.000000 pieLeft 1.000000
01/12 15:48:11 Rejected 20.0 smithic@xxxxxxxxx <138.253.100.178:58887>: no match found
01/12 15:48:11 Sending SEND_JOB_INFO/eom
01/12 15:48:11 Getting reply from schedd ...
01/12 15:48:11 Got NO_MORE_JOBS; done negotiating
01/12 15:48:11 Submitter smithic@xxxxxxxxx got all it wants; removing it.
which seems to imply no match, but when I use condor_q -ana it gives:
1 match but are currently offline
If I bring the machine online then the job does indeed run.
Any ideas?
regards,
-ian.
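
A minimal sketch of one way to check a negotiator log for the two kinds of line discussed in this thread: the "Registering attempt to match offline machine MACHINE by USER." message that Dan describes in the quoted message below, and "Rejected ... no match found" entries like the one in the log excerpt above. The function name scan_negotiator_log and both regular expressions are illustrative assumptions based only on the line formats shown in this thread; they are not part of Condor.

```python
import re

# Line the negotiator is said to log (with D_FULLDEBUG) when a job matches
# an offline machine. The pattern is an assumption based on the quoted text.
OFFLINE_MATCH = re.compile(
    r"Registering attempt to match offline machine (?P<machine>\S+) by (?P<user>\S+?)\.?$"
)

# Rejection line, modelled on the log excerpt in this message.
REJECTED = re.compile(
    r"Rejected (?P<job>\d+\.\d+) (?P<user>\S+) <[^>]*>: (?P<reason>.+)$"
)


def scan_negotiator_log(lines):
    """Return (offline_matches, rejections) found in an iterable of log lines.

    offline_matches is a list of (machine, user) tuples;
    rejections is a list of (job_id, reason) tuples.
    """
    offline, rejected = [], []
    for line in lines:
        line = line.rstrip()
        m = OFFLINE_MATCH.search(line)
        if m:
            offline.append((m["machine"], m["user"]))
            continue
        m = REJECTED.search(line)
        if m:
            rejected.append((m["job"], m["reason"]))
    return offline, rejected
```

Passing the open NegotiatorLog file to scan_negotiator_log would show whether any offline-match registration was logged at all. An empty first list alongside "no match found" rejections would be consistent with the behaviour described above, though it would not by itself explain the cause.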
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: 11 January 2010 17:02
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] condor_rooster failing to crow
>
> Ian,
>
> Sorry to hear you are having difficulties. If it is caused by a bug,
> I'll have to eat crow. Here are some things to help see where it might
> be going wrong.
>
> The setting of MachineLastMatchTime is initiated by the negotiator.
> With D_FULLDEBUG turned on, you should see a line like the following in
> your NegotiatorLog:
>
> Registering attempt to match offline machine MACHINE by USER.
>
> This results in a MERGE_STARTD_AD command being sent to the collector.
> If you have D_COMMAND turned on in the collector, you should see that
> command being received in CollectorLog.
>
> After that command has been received, the machine ad should contain
> MachineLastMatchTime. You should be able to see that with condor_status
> -long.
>
> If something overwrites the offline machine ad, then
> MachineLastMatchTime will go away until the next time the negotiator
> sets it (i.e. the next negotiation cycle where a job matches the offline
> machine).
>
> --Dan
>
> Smith, Ian wrote:
> > Dear All,
> >
> > I'm trying to use condor_rooster in Condor 7.4 to work with our Windows XP pool
> > but with only limited success. To keep compatibility with our current power saving
> > setup, I'm trying to avoid using the Condor power saving and instead I'm publishing
> > the ClassAds of offline machines via cron so that condor_rooster can wake up
> > the relevant ones.
> >
> > The crux of the matter seems to be in the UNHIBERNATE expression. In the documentation
> > (p 216) it states that the default value is MachineLastMatchTime =!= UNDEFINED, although
> > I find that it is actually MY.MachineLastMatchTime =!= UNDEFINED. I've tried both, and neither
> > seems to work, as neither MachineLastMatchTime nor MY.MachineLastMatchTime seems
> > to be set. The manual says that
> >
> > "the special attribute MachineLastMatchTime is updated in the ClassAds of offline machines
> > when the job would have been matched to the machine if it had been online"
> >
> > but this doesn't seem to be happening. Using condor_q -ana reveals
> >
> > 019.009: Run analysis summary. Of 1 machines,
> > 0 are rejected by your job's requirements
> > 0 reject your job because of their own requirements
> > 0 match but are serving users with a better priority in the pool
> > 0 match but reject the job for unknown reasons
> > 0 match but will not currently preempt their existing job
> > 1 match but are currently offline
> > 0 are available to run your job
> >
> > so the matchmaking is definitely working - it just seems that the machine ClassAd isn't
> > updated. If I set MachineLastMatchTime to some arbitrary value myself then
> >
> > ROOSTER_UNHIBERNATE=Offline && Unhibernate
> >
> > seems to evaluate to TRUE and the wake up kicks in.
> >
> > I've tried D_FULLDEBUG but I still can't track down where the problem is.
> >
> > Any ideas?
> >
> > regards,
> >
> > -ian.
> >
> >
> > --------------------------------------------
> > Dr Ian C. Smith,
> > e-Science Team,
> > The University of Liverpool,
> > Computing Services Department
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/
> >