[Condor-users] condor-g & matching a cluster to multiple jobs at once
- Date: Wed, 3 Sep 2008 15:45:51 -0500
- From: Warren Smith <wsmith@xxxxxxxxxxxxxxx>
Hi, I'm working on deploying Condor-G with matchmaking. My problem is
that while jobs are being matched and executed, they are matched to a
system only one at a time. I'd like Condor-G to have several jobs
submitted to a system at the same time. I have a simple test job that
can match only a single ClassAd:
executable = /bin/hostname
arguments = --fqdn
transfer_executable = false
output = hostname-match-$(CLUSTER)-$(PROCESS).out
error = hostname-match-$(CLUSTER)-$(PROCESS).err
log = hostname-match-$(CLUSTER)-$(PROCESS).log
universe = grid
x509userproxy=/home/utexas/staff/wsmith/.globus/userproxy.pem
grid_resource = $$(GridResource)
Requirements = (Name=="tacc.lonestar.serial")
globusrsl = (maxWallTime=5)(count=1)(queue=$$(Queue))
queue 10
And the ClassAd for that resource in Condor is:
lslogin2$ condor_status -l tacc.lonestar.serial
MyType = "Machine"
TargetType = "Job"
Requirements = (TARGET.JobUniverse == 9)
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = TRUE
CurMatches = 0
Name = "tacc.lonestar.serial"
Machine = "gatekeeper.lonestar.tacc.teragrid.org"
StartdIpAddr = "<129.114.50.32>"
GridResource = "gt2 gatekeeper.lonestar.tacc.teragrid.org:2119/jobmanager-lsf"
State = "Unclaimed"
Activity = "Idle"
UpdateSequenceNumber = 1220367368
Arch = "X86_64"
OpSys = "LINUX"
LoadAvg = 0.865580
TotalMemory = 11840721
Memory = 1725537
Queue = "serial"
Priority = 0.030000
MaxWallTime = 720
MaxProcessors = 1
MyAddress = "<192.5.198.172:0>"
LastHeardFrom = 1220367369
UpdatesTotal = 1328
UpdatesSequenced = 0
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
From the Condor manual, it seems that setting WantAdRevaluate to True
should result in Condor matching multiple jobs to this system. What I'm
seeing instead is that the jobs are matched and run one at a time.
Here's part of the MatchLog:
9/2 09:48:49 Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49 Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:53:51 Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:53:51 Rejected 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:58:52 Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:58:52 Rejected 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:03:53 Matched 153.3 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:03:53 Rejected 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:08:55 Matched 153.4 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:08:55 Rejected 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:13:56 Matched 153.5 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:13:56 Rejected 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:18:58 Matched 153.6 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:18:58 Rejected 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:24:00 Matched 153.7 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:24:00 Rejected 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:29:01 Matched 153.8 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 10:29:01 Rejected 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 10:34:02 Matched 153.9 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
As you can see, all of the jobs get matched and run, but only one is
matched every 5 minutes (once per negotiator cycle?). The serial queue
on lonestar was empty, so the jobs ran quickly.
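The 5-minute spacing can be checked directly from the log. The following is a minimal sketch in Python, not part of Condor: it pulls the "Matched" timestamps out of MatchLog text and reports the gap between consecutive matches. The function names, the regular expression, and the year (the log lines carry none, so 2008 is assumed from the date of this message) are all illustrative assumptions. The sample text is the first few entries of the excerpt above, with wrapped lines joined.

```python
import re
from datetime import datetime

# First entries of the MatchLog excerpt above (wrapped lines joined).
SAMPLE_LOG = """\
9/2 09:48:49 Matched 153.0 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:48:49 Rejected 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:53:51 Matched 153.1 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
9/2 09:53:51 Rejected 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761>: no match found
9/2 09:58:52 Matched 153.2 wsmith@xxxxxxxxxxxxxxxxx <129.114.69.97:50761> preempting none <129.114.50.32> tacc.lonestar.serial
"""

# Hypothetical pattern: "<month>/<day> <HH:MM:SS> Matched <cluster>.<proc>".
MATCH_RE = re.compile(r"^(\d+/\d+ \d+:\d+:\d+)\s+Matched\s+(\d+\.\d+)")


def match_times(text, year=2008):
    """Return (job_id, timestamp) for every 'Matched' line.

    The year is an assumption; the log lines do not include one.
    """
    matches = []
    for line in text.splitlines():
        m = MATCH_RE.match(line)
        if m:
            stamp = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
            matches.append((m.group(2), stamp))
    return matches


def gaps_seconds(matches):
    """Seconds elapsed between each pair of consecutive matches."""
    return [(b[1] - a[1]).total_seconds() for a, b in zip(matches, matches[1:])]


if __name__ == "__main__":
    found = match_times(SAMPLE_LOG)
    for (job, _), gap in zip(found[1:], gaps_seconds(found)):
        print(f"job {job} matched {gap:.0f} s after the previous match")
```

On the sample entries the gaps come out at 302 and 301 seconds, which is consistent with one match per 5-minute interval; "Rejected" lines are skipped because the pattern only accepts "Matched".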
The collector and negotiator are from Condor 7.1.0. I sent an earlier
query to the list about a STARTD_AD_REEVAL_EXPR error message in my
NegotiatorLog that I don't think is related to this...
Thanks for the help,
Warren