
Re: [HTCondor-users] HTCondor 8.6.3 - Jobs evicted even if other slots are free




Hello,

I'd like to revive this thread: we could not solve the problem, so we still have jobs evicted for no obvious reason -- there are free slots in the system where the new jobs could run independently. The longer a given job runs, the more likely it is to be affected by this issue, so the affected DAGs take much longer to run than they should.

Thanks in advance,

Nicolas

On 13/03/2019 at 12:00, Giuseppe Di Biase wrote:
Hi All,

Our HTCondor architecture consists of:

  * condorcl1:          DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR,
    GANGLIAD, DEFRAG
  * submit1:            DAEMON_LIST = MASTER, SCHEDD
  * olnode1..olnode64:  DAEMON_LIST = MASTER, STARTD

*condorcl1 config.local is:*

COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, GANGLIAD, DEFRAG
DEFRAG_INTERVAL = 3600
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_WHOLE_MACHINES = 20
DEFRAG_MAX_CONCURRENT_DRAINING = 10
DEFRAG_SCHEDULE = graceful

*submit1 config.local is:*

COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, SCHEDD
SUBMIT_REQUIREMENT_NAMES = $(SUBMIT_REQUIREMENT_NAMES) CheckExp
SUBMIT_REQUIREMENT_CheckExp = JobUniverse == 5 || JobUniverse == 7
SUBMIT_REQUIREMET_CheckExp_REASON = "Submissions must have +Experiment"
EVENT_LOG = /virgoLog/HTCondor/event_log/events.log

*olnodeXX config.local is:*

COLLECTOR_NAME = $(CONDOR_HOST)
DAEMON_LIST = MASTER, STARTD
NUM_SLOTS = 1
SLOT_TYPE_1 = cpus=100%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = TRUE
SUSPEND_VANILLA = False
PREEEMPT_VANILLA = False
KILL_VANILLA = False
START = TRUE
IsNoEMi = (Experiment =?= "NoEMi")
IsDetChar = (Experiment =?= "DetChar")
IscWB = (Experiment =?= "cWB")
IsNoEMiTest = (Experiment =?= "NoEMiTest")
RANK = $(IsDetChar)*70 + $(IsNoEMi)*10 + $(IsNoEMiTest)*12 + $(IscWB)*8
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.detchar.linefind.noemi = .30
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.detchar.transient.dqr = .30
GROUP_QUOTA_DYNAMIC_virgo.prod.o3.burst.allsky.cwbonline = .40

In this configuration, "NoEMiTest" jobs in the "virgo.prod.o3.detchar.linefind.noemi" accounting group (AccountingGroup) are always evicted by higher-priority jobs (Experiment=DetChar), because those jobs want to run on the same machines even though other machines are free.
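
To make the RANK expression in the olnodeXX config concrete, here is a minimal Python sketch of the value each Experiment would get. It is only an illustration: the helper names (machine_rank, would_outrank) are hypothetical and not part of HTCondor, and the sketch assumes that each =?= test is an exact, case-sensitive match that contributes 1 when true and 0 otherwise. As far as I understand, a startd may preempt a running job when it is matched with a job that has a strictly higher RANK, which would fit the evictions described here.

```python
# Weights copied from the RANK expression in the olnodeXX config.
# Only one IsXxx test can be true for a given Experiment string.
RANK_WEIGHTS = {"DetChar": 70, "NoEMiTest": 12, "NoEMi": 10, "cWB": 8}


def machine_rank(experiment):
    """Return the RANK value a startd would compute for a job.

    Assumption: mirrors the =?= tests, i.e. an exact, case-sensitive
    comparison; a missing Experiment attribute (None here) matches
    nothing and yields 0.
    """
    return sum(w for name, w in RANK_WEIGHTS.items() if experiment == name)


def would_outrank(new_experiment, running_experiment):
    """True if the new job has a strictly higher RANK than the running one."""
    return machine_rank(new_experiment) > machine_rank(running_experiment)


if __name__ == "__main__":
    for exp in ("DetChar", "NoEMiTest", "NoEMi", "cWB", None):
        print(exp, machine_rank(exp))
    print("DetChar outranks NoEMiTest:", would_outrank("DetChar", "NoEMiTest"))
```

With these weights a DetChar job (70) outranks every other experiment, while a NoEMiTest job (12) outranks only NoEMi (10) and cWB (8). The sketch says nothing about why the negotiator picks a busy machine instead of a free one; it only shows which jobs are eligible to displace which.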

Can you help me find out where the issue is?

Thanks

Giuseppe

--
===============================================
Giuseppe Di Biase - giuseppe.dibiase@xxxxxxxxx

European Gravitational Observatory - EGO
Via E.Amaldi - 56021 Cascina (Pisa) - IT
Phone: +39 050 752 577
===============================================

