Mailing List Archives
Re: [Condor-users] eviction problems with 7.4.2
- Date: Thu, 27 May 2010 14:50:26 +0100
- From: "Smith, Ian" <I.C.Smith@xxxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] eviction problems with 7.4.2
Nope, not that I can see. Neither seems to be crashing, and I believe
this (from the startd log) is the keyboard activity event:
5/27 14:43:10 PERMISSION GRANTED to ssl@unmappeduser from host 138.253.103.228 for command 427 (X_EVENT_NOTIFICATION), access level ALLOW: reason:
05/27 14:43:10 Received UDP command 427 (X_EVENT_NOTIFICATION) from ssl@unmappeduser <138.253.103.228:3422>, access level ALLOW
05/27 14:43:10 command_x_event() called.
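
For anyone wanting to check their own startd log for the same event, the lines above can be picked out with a short script. This is only a minimal Python sketch, assuming the log lines have the same layout as the ones quoted here; the function name x_event_times and the regular expression are illustrative and not part of Condor.

```python
import re

# Matches startd log lines laid out like the one quoted above, e.g.
# "05/27 14:43:10 Received UDP command 427 (X_EVENT_NOTIFICATION) from ..."
# The layout is an assumption based on that single sample.
X_EVENT = re.compile(
    r"^(?P<date>\d{1,2}/\d{1,2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"Received UDP command 427 \(X_EVENT_NOTIFICATION\)"
)


def x_event_times(lines):
    """Return (date, time) pairs for each X_EVENT_NOTIFICATION received."""
    hits = []
    for line in lines:
        match = X_EVENT.match(line)
        if match:
            hits.append((match.group("date"), match.group("time")))
    return hits


# Sample input copied from the log excerpt above.
sample = [
    "05/27 14:43:10 Received UDP command 427 (X_EVENT_NOTIFICATION) "
    "from ssl@unmappeduser <138.253.103.228:3422>, access level ALLOW",
    "05/27 14:43:10 command_x_event() called.",
]

print(x_event_times(sample))
```

Comparing these timestamps with the state changes in the same log should show whether the startd reacts to each keyboard event.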
regards,
-ian.
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Robert Rati
> Sent: 27 May 2010 13:32
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] eviction problems with 7.4.2
>
> Is there anything of interest in the Start[er]Log that was not in the
> logs of 7.0.2? Any Startd or Starter crashes?
>
> Rob
>
> Smith, Ian wrote:
> > Dear All,
> >
> > I've recently been taking a look at checkpointing under the vanilla
> > universe*. I had everything working fine using Condor 7.0.2 on
> > the execute hosts (running Win XP SP 3) but when I moved
> > to 7.4.2 there are problems when jobs get evicted.
> >
> > When this happens because of mouse/keyboard activity I see
> > the machine go through the usual Claimed/Busy ->
> > Preempting/Vacating -> Preempting/Killing -> Owner
> > states but the job carries on running according to condor_q
> > (and the log file).
> >
> > If I look on the execute host, then the
> > execute directory has been wiped but condor_q insists that
> > the job is still running. Eventually when the job starts again
> > I see a "job disconnected" error in the job's log file. As
> > well as this, none of the output files get returned to the $(SPOOL)
> > area.
> >
> > The execute hosts have this config:
> >
> > WANT_SUSPEND = FALSE
> > WANT_VACATE = TRUE
> > START = ( $(UWCS_START) && $(OfficeHours) \
> > || ( $(OfficeHours) == FALSE ) && ( $(ShutdownHours) == FALSE ) )
> > SUSPEND = FALSE
> > CONTINUE= $(UWCS_CONTINUE)
> > PREEMPT= $(UWCS_SUSPEND) && $(OfficeHours)
> > KILL= TRUE
> >
> > which worked fine with 7.0.2.
> >
> > Any ideas what may be wrong? Could it be something to do with one
> > of the daemons not receiving a signal from condor_kbdd?
> >
> > regards,
> >
> > -ian.
> >
> > * I've written up some detailed instructions on this for the benefit of
> > our users. If anyone is interested I'll post the link here.
> >
> > --------------------------------------------
> > Dr Ian C. Smith,
> > e-Science Team,
> > The University of Liverpool,
> > Computing Services Department
> >
> > _______________________________________________
> > Condor-users mailing list
> > To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
> > subject: Unsubscribe
> > You can also unsubscribe by visiting
> > https://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >
> > The archives can be found at:
> > https://lists.cs.wisc.edu/archive/condor-users/