Thank you Dan,
we'll change the configuration here in Chicago to avoid preemption.
I've been checking the StartLogs, but it is not always clear to me why a job
was preempted.
Host1
The StartLog is available here:
http://grid.uchicago.edu/marco/clogs/cl1-0/StartLog
Job log:
http://grid.uchicago.edu/marco/clogs/job1
Here the user farbin@local preempted the job on vm1 and the user ivdgl@local
preempted the job on vm2 (both starting around 3/23 10:34).
These users do not have any higher privilege, so I don't know why their jobs
preempted the running jobs of usatlas1@local.
A hypothesis (I don't know if it is correct) is that since the user usatlas1
already had many other jobs running, it somehow got penalized.
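If that hypothesis is correct, my reading of the manual (not tested yet, so
please correct me) is that this kind of priority-based preemption is decided
by the negotiator and could be switched off there with something like:

  # negotiator configuration: never preempt a running job in favor of a
  # better-priority user
  PREEMPTION_REQUIREMENTS = False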
Host2
The StartLog is available here:
http://grid.uchicago.edu/marco/clogs/cl0-8/StartLog
Job log:
http://grid.uchicago.edu/marco/clogs/gram_condor_log.31548.1111602079
Here vm1 gets a job request at about 3/23 12:22 (Got activate_claim
request from shadow) and goes Idle -> Busy.
Then it keeps changing state:
at 12:25 Busy -> Suspended (SUSPEND is TRUE)
at 12:27 Suspended -> Busy (SUSPEND is TRUE)
at 12:52 Busy -> Suspended
at 13:02 Claimed/Suspended -> Preempting/Vacating (PREEMPT is TRUE,
WANT_VACATE is TRUE), evicting the job
and then it goes back to Unclaimed!?
13:02 Preempting/Vacating -> Owner/Idle, Owner -> Unclaimed,
Unclaimed -> Owner, back and forth (Owner -> Unclaimed at 13:43, ...)
Here I don't know the reason for the preemption or for the earlier suspension.
Any suggestions on where to look further?
This was before changing the configuration as you suggested.
We just restarted the cluster.
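For reference, what I understood from the manual for a dedicated cluster that
should never suspend or preempt jobs is roughly the following startd policy
(simplified, please correct me if something is wrong):

  # startd policy on the execute nodes: always start jobs, never suspend them,
  # never preempt or kill them
  START        = True
  SUSPEND      = False
  CONTINUE     = True
  PREEMPT      = False
  WANT_SUSPEND = False
  WANT_VACATE  = False
  KILL         = False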
Thanks,
Marco
On Wed, 23 Mar 2005, Dan Bradley wrote:
Marco,
Job eviction can happen for a number of reasons. The best place to see why
a job was evicted is in the condor StartLog on the machine where the
eviction took place.
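It can also help to check which policy expressions the startd on that machine
is actually using; condor_config_val run on the execute node will print them,
for example:

  condor_config_val PREEMPT
  condor_config_val SUSPEND
  condor_config_val WANT_VACATE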
Here is a section in the manual that may be helpful to you in configuring
your policy to avoid preemption:
For 6.6:
http://www.cs.wisc.edu/condor/manual/v6.6/3_6Startd_Policy.html#SECTION00469500000000000000
For 6.7:
http://www.cs.wisc.edu/condor/manual/v6.7/3_6Startd_Policy.html#SECTION00469500000000000000
--Dan
Marco Mambelli wrote:
Hi,
in a dedicated cluster running Condor 6.6.8, many jobs get evicted from a
node and restarted shortly after on another node.
The jobs cannot be checkpointed, so they either restart from scratch or
sometimes fail if data recorded in an NFS-mounted job directory by the
previous attempt is inconsistent.
Is there any way to understand why?
Is it possible to disable this behavior?
Below is an example of the Condor job log of one of the evicted jobs.
Thanks,
Marco
000 (7707.000.000) 03/23 01:58:15 Job submitted from host:
<10.255.255.254:32806>
...
001 (7707.000.000) 03/23 01:58:18 Job executing on host:
<10.255.255.216:32811>
...
006 (7707.000.000) 03/23 01:58:26 Image size of job updated: 25884
...
006 (7707.000.000) 03/23 02:18:26 Image size of job updated: 541852
...
...............
...
006 (7707.000.000) 03/23 09:58:26 Image size of job updated: 562936
...
004 (7707.000.000) 03/23 10:39:50 Job was evicted.
(0) Job was not checkpointed.
Usr 0 17:02:20, Sys 0 00:00:31 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
...
001 (7707.000.000) 03/23 10:43:55 Job executing on host:
<10.255.255.230:40512>
...
006 (7707.000.000) 03/23 11:04:04 Image size of job updated: 541776
...