[Condor-users] Parallel universe job can't match ?
- Date: Thu, 18 Jan 2007 13:37:59 -0800
- From: Yang Y Yang <yyang@xxxxxxxxxxxxx>
- Subject: [Condor-users] Parallel universe job can't match ?
Fellow Condor users:
I am trying to submit a parallel universe job to Condor.
My Condor pool is a single-machine pool, with the master, negotiator, and
collector all running under my own username, i.e. a "personal Condor".
I first added configuration to my local Condor config file (section 4 in the
attached debug file) to specify that the schedd is a dedicated scheduler and
that the startd accepts requests from that dedicated scheduler.
With this in place, I can submit normal vanilla jobs, and they are executed.
But when I submit a parallel universe job that requests a single node
(machine_count = 1), it is never executed. The Negotiator log (section 5 in
the attached debug file) reports the job as rejected with "no match found".
Why is this? condor_q and condor_status (sections 1 and 2 in the attached
file) show that the job is sitting idle and the machine is in the Unclaimed
state.
I don't know why the schedd ClassAd and the startd ClassAd can't be matched.
Could anybody give me a clue?
Also, I see that two schedd ClassAds are posted: one with the normal
yyang@hostname identifier, the other with
DedicatedScheduler@yyang@hostname. Is it true that whenever I submit a
parallel job, it goes to the negotiator as both a normal request and a
dedicated request, in the hope that one of them will match?
Thanks a lot
Yang
********************************************************************************
---- 1) condor_q output:
-- Submitter: stocksong.corp.yahoo.com : <10.72.107.32:38440> : stocksong.corp.yahoo.com
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   2.0   yyang           1/18 12:05   0+00:00:00 I  0   9.8  echo hello
1 jobs; 1 idle, 0 running, 0 held
********************************************************************************
---- 2) condor_status output:
Name          OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
stocksong.cor LINUX  INTEL  Unclaimed  Idle      0.310   1003  0+00:54:51
              Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/LINUX       1      0        0          1        0           0         0
Total             1      0        0          1        0           0         0
********************************************************************************
---- 3) par.job file
######################################
## Parallel example submit description file
######################################
universe = parallel
executable = /bin/echo
log = logfile
output = outfile.$(NODE)
error = errfile.$(NODE)
Arguments = hello
machine_count = 1
queue
********************************************************************************
---- 4) extra dedicated schedd and startd config
# schedd identity
DedicatedScheduler = "DedicatedScheduler@yyang@stocksong.corp.yahoo.com"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
# startd policy
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
NEGOTIATOR_INTERVAL = 10
ALL_DEBUG=D_ALL
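For reference, the RANK line above relies on the ClassAd `=?=` ("is identical to") operator, which always yields True or False and never UNDEFINED. The following is a minimal Python sketch of that behaviour, not Condor code: `UNDEFINED`, `meta_equals`, and `startd_rank` are hypothetical names, and it assumes the request ad would need to carry a `Scheduler` attribute for the comparison to succeed.

```python
# Toy model of the startd RANK expression:
#   RANK = Scheduler =?= $(DedicatedScheduler)
# All names here are illustrative; this is not Condor's implementation.

UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def meta_equals(left, right):
    """Model of `=?=`: always returns True or False, never UNDEFINED."""
    if left is UNDEFINED or right is UNDEFINED:
        # Only UNDEFINED is identical to UNDEFINED.
        return left is right
    # Both sides defined: require the same type and the same value.
    return type(left) is type(right) and left == right


def startd_rank(request_ad, dedicated_scheduler):
    """Evaluate the RANK line against a request ad given as a dict.

    A missing Scheduler attribute is treated as UNDEFINED, so the
    comparison is False and the rank is 0.0.
    """
    scheduler = request_ad.get("Scheduler", UNDEFINED)
    return 1.0 if meta_equals(scheduler, dedicated_scheduler) else 0.0


if __name__ == "__main__":
    dedicated = "DedicatedScheduler@yyang@stocksong.corp.yahoo.com"
    print(startd_rank({"Scheduler": dedicated}, dedicated))
    print(startd_rank({}, dedicated))
```

Under this model, a request without a matching `Scheduler` attribute ranks no higher than any other request, which may be worth checking when the negotiator says it will only consider startd ranks.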
********************************************************************************
---- 5) Negotiator log
1/18 12:58:04 (fd:7) (pid:10718) Phase 4.1: Negotiating with schedds ...
1/18 12:58:04 (fd:7) (pid:10718) NumStartdAds = 1
1/18 12:58:04 (fd:7) (pid:10718) NormalFactor = 2.000000
1/18 12:58:04 (fd:7) (pid:10718) MaxPrioValue = 0.500000
1/18 12:58:04 (fd:7) (pid:10718) NumScheddAds = 2
1/18 12:58:04 (fd:7) (pid:10718) Negotiating with DedicatedScheduler@yyang@stocksong.corp.yahoo.com at <10.72.107.32:38440>
1/18 12:58:04 (fd:7) (pid:10718) 0 seconds so far
1/18 12:58:04 (fd:7) (pid:10718) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
1/18 12:58:04 (fd:7) (pid:10718) Calculating schedd limit with the following parameters
1/18 12:58:04 (fd:7) (pid:10718) ScheddPrio = 0.500000
1/18 12:58:04 (fd:7) (pid:10718) ScheddPrioFactor = 1.000000
1/18 12:58:04 (fd:7) (pid:10718) scheddShare = 0.500000
1/18 12:58:04 (fd:7) (pid:10718) scheddAbsShare = 0.500000
1/18 12:58:04 (fd:7) (pid:10718) ScheddUsage = 0
1/18 12:58:04 (fd:7) (pid:10718) scheddLimit = 0
1/18 12:58:04 (fd:7) (pid:10718) MaxscheddLimit = 0
1/18 12:58:04 (fd:7) (pid:10718) Socket to <10.72.107.32:38440> already in cache, reusing
1/18 12:58:04 (fd:7) (pid:10718) Over submitter resource limit (0) ... only consider startd ranks
1/18 12:58:04 (fd:7) (pid:10718) Sending SEND_JOB_INFO/eom
1/18 12:58:04 (fd:7) (pid:10718) Getting reply from schedd ...
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) Got JOB_INFO command; getting classad/eom
1/18 12:58:04 (fd:7) (pid:10718) Request 00002.00000:
1/18 12:58:04 (fd:7) (pid:10718) Rejected 2.0 DedicatedScheduler@yyang@stocksong.corp.yahoo.com <10.72.107.32:38440>: no match found
1/18 12:58:04 (fd:7) (pid:10718) Sending SEND_JOB_INFO/eom
1/18 12:58:04 (fd:7) (pid:10718) Getting reply from schedd ...
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfds=7
1/18 12:58:04 (fd:7) (pid:10718) condor_read(): nfound=1
1/18 12:58:04 (fd:7) (pid:10718) Got NO_MORE_JOBS; done negotiating
1/18 12:58:04 (fd:7) (pid:10718) This schedd hit its scheddlimit.
1/18 12:58:04 (fd:7) (pid:10718) NEGOTIATOR_IGNORE_USER_PRIORITIES is undefined, using default value of False
1/18 12:58:04 (fd:7) (pid:10718) Negotiating with yyang@xxxxxxxxxxxxxxxxxxxxxxxx skipped because no idle jobs
1/18 12:58:04 (fd:7) (pid:10718) Schedd yyang@xxxxxxxxxxxxxxxxxxxxxxxx got all it wants; removing it.
1/18 12:58:04 (fd:7) (pid:10718) ---------- Finished Negotiation Cycle ----------
1/18 12:58:04 (fd:7) (pid:10718) in DaemonCore NewTimer()
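The share figures in the log above (NormalFactor = 2.0, scheddShare = 0.5, scheddLimit = 0) can be reproduced with a little arithmetic. The sketch below is an assumption about how the values relate, chosen because it agrees with the logged numbers; it is not taken from the Condor source. The function name `schedd_limits` and the second submitter name are hypothetical.

```python
# Reproduce the negotiator's logged share numbers under an ASSUMED formula:
#   max_prio      = largest priority value among the submitters
#   normal_factor = sum(max_prio / prio) over all submitters
#   share         = max_prio / (prio * normal_factor)
#   limit         = round(share * num_startd_ads - usage)
# This is an illustration, not Condor's actual code.


def schedd_limits(prios, num_startd_ads, usage=None):
    """Return {submitter: (share, limit)} for the given priorities."""
    usage = usage or {}
    max_prio = max(prios.values())
    normal_factor = sum(max_prio / p for p in prios.values())
    result = {}
    for name, prio in prios.items():
        share = max_prio / (prio * normal_factor)
        # Python's round() rounds halves to the nearest even integer,
        # so a value of exactly 0.5 becomes 0.
        limit = round(share * num_startd_ads - usage.get(name, 0))
        result[name] = (share, limit)
    return result


if __name__ == "__main__":
    # Two submitter ads (NumScheddAds = 2), both at priority 0.5,
    # competing for one machine (NumStartdAds = 1).
    prios = {
        "DedicatedScheduler@yyang@stocksong.corp.yahoo.com": 0.5,
        "yyang@hostname": 0.5,  # placeholder for the normal submitter
    }
    for name, (share, limit) in schedd_limits(prios, 1).items():
        print(name, share, limit)
```

If this reading is right, the single machine is split evenly between the two submitter ads, each half-share rounds down to a limit of 0, and that would be consistent with the "Over submitter resource limit (0)" line in the log.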