Re: [Condor-users] Jobs repeatedly evicted after 30 mins
- Date: Mon, 13 Mar 2006 15:08:09 +0800
- From: <Greg.Hitchen@xxxxxxxx>
- Subject: Re: [Condor-users] Jobs repeatedly evicted after 30 mins
Thanks Jaime (and others who replied).
The problem was indeed caused by UDP being blocked.
Our original 9600-9700 port range had been expanded to
9000-10000, but ONLY for TCP. Once the change was CORRECTLY
redone to include UDP, the eviction problems disappeared.
Thanks.
Cheers
Greg
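
For anyone hitting the same symptom, a quick way to confirm that UDP (and not just TCP) passes between two machines is to send a single datagram and see whether it arrives. The following is only a minimal sketch: the function names, the payload, and the port 9000 in the usage note are illustrative assumptions, not anything Condor itself provides.

```python
import socket


def open_udp_listener(host, port):
    """Bind a UDP socket on (host, port); port 0 lets the OS pick a free one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def send_probe(host, port, payload=b"alive-probe"):
    """Send one UDP datagram to (host, port). UDP gives no delivery feedback."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


def wait_for_probe(sock, timeout=5.0):
    """Return the received payload, or None if nothing arrives in time."""
    sock.settimeout(timeout)
    try:
        data, _addr = sock.recvfrom(1024)
        return data
    except socket.timeout:
        return None
```

Hypothetical usage: on an execute machine, run `wait_for_probe(open_udp_listener("0.0.0.0", 9000), timeout=30)`, then on the submit machine call `send_probe("<execute-host>", 9000)`. A result of `None` on the listening side suggests that UDP is being dropped somewhere in between, although a single lost datagram is not conclusive, so it is worth repeating the probe a few times.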
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
> Sent: Friday, 3 March 2006 4:05 AM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Jobs repeatedly evicted after 30 mins
>
>
> On Mar 1, 2006, at 9:19 PM, <Greg.Hitchen@xxxxxxxx>
> <Greg.Hitchen@xxxxxxxx> wrote:
>
> > We have the situation where a user submits ~10 jobs,
> > all of which should run for ~5 hours. Many/most of
> > them get repeatedly evicted after 30 mins and requeued.
> > Below are the relevant logs from the submitting and execute machines
> > for one particular instance.
> >
> > I have tested this myself with different jobs and the eviction
> > ALWAYS happens a few seconds (20?) under 30 minutes into the run.
> >
> > The line in the START LOG:
> >
> > 3/1 05:57:16 State change: claim timed out (condor_schedd gone?)
> >
> > seems to be the relevant one?
> >
> > ALL of the evictions (for different execute machines and different
> > jobs, same submit machine) occur at 30 minutes.
>
> While a job is running, the schedd periodically sends an alive
> message to the startd via UDP. If the startd doesn't receive any
> alive messages for a while, it will kill the claim (and the job).
> The default is for the schedd to send an alive every 5 minutes and
> the startd will kill the job if it misses 6 alives, which matches
> your 30 minutes.
>
> So it appears that UDP packets aren't making it from your submit
> machine to your execute machines.
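
The arithmetic behind the 30-minute figure can be sketched as a toy model. This is a simplified illustration of the behaviour described above, not Condor's actual implementation; the function name and parameters are hypothetical, with defaults taken from the 5-minute interval and 6 missed alives mentioned in the message.

```python
def claim_expiry(delivered, alive_interval_s=300, max_missed=6):
    """Toy model of the startd's claim timeout.

    `delivered` is a sequence of booleans, one per alive message the schedd
    sends, saying whether that message reached the startd. Returns the number
    of seconds after the claim started at which the claim times out, or None
    if it survives the whole sequence.
    """
    timeout_s = alive_interval_s * max_missed  # 300 s * 6 = 1800 s = 30 min
    last_heard = 0  # the claim start counts as the last contact
    for i, arrived in enumerate(delivered, start=1):
        now = i * alive_interval_s
        if arrived:
            last_heard = now
        elif now - last_heard >= timeout_s:
            return now
    return None


# Every alive dropped (e.g. UDP blocked): the claim dies after 30 minutes.
print(claim_expiry([False] * 10))  # 1800
# Every alive delivered over 5 hours: the claim is never timed out.
print(claim_expiry([True] * 60))   # None
```

In this model a job that never receives an alive is evicted at 1800 seconds regardless of how long it needs to run, which matches the pattern reported in the original question.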
>
> +--------------------------------+-----------------------------------+
> | Jaime Frey | I used to be a heavy gambler. |
> | jfrey@xxxxxxxxxxx | But now I just make mental bets. |
> | http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind. |
> +--------------------------------+-----------------------------------+