Greetings all,
We have a QA Condor (8.4.6) system here with 2 slots so far, and more are coming soon (yay!). Both machines are running CentOS 6.7. All users are in the same UID_DOMAIN and all servers are in the same FILESYSTEM_DOMAIN.
We would like to set up groups where group_a gets more resources than group_b, which gets more than group_c, and so on. I would also like to set up GROUP_ACCEPT_SURPLUS rules so that unused slots can be reallocated, but I want to get the basic quotas sorted out first.
Here is how I have defined the groups.
GROUP_NAMES= group_a, group_b, group_c, group_d, group_e, group_f, group_g
# The following is based on 7 groups; the quotas must add up to 1.
# group_a gets 7x, group_b 6x, ..., group_g 1x, for a total of 28x = 1,
# so x = 1/28 = 0.0357142857142857
GROUP_QUOTA_DYNAMIC_group_a = 0.250
GROUP_QUOTA_DYNAMIC_group_b = 0.214
GROUP_QUOTA_DYNAMIC_group_c = 0.179
GROUP_QUOTA_DYNAMIC_group_d = 0.143
GROUP_QUOTA_DYNAMIC_group_e = 0.107
GROUP_QUOTA_DYNAMIC_group_f = 0.071
GROUP_QUOTA_DYNAMIC_group_g = 0.036
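To double-check the arithmetic in the comments (weights 7x down to 1x summing to 1), here is a small Python sketch that regenerates the quota lines. The helper name `quota_lines` is hypothetical and not part of HTCondor; it only reproduces the 7:6:5:4:3:2:1 split above, rounded to three decimals.

```python
from fractions import Fraction

# Relative weights from the config comments: group_a gets 7x, ..., group_g gets 1x.
groups = ["group_a", "group_b", "group_c", "group_d", "group_e", "group_f", "group_g"]
weights = dict(zip(groups, range(7, 0, -1)))

total = sum(weights.values())  # 7 + 6 + ... + 1 = 28
x = Fraction(1, total)         # one share, exactly 1/28


def quota_lines(weights, digits=3):
    """Return GROUP_QUOTA_DYNAMIC_* lines, each weight divided by the total weight."""
    total = sum(weights.values())
    return [
        f"GROUP_QUOTA_DYNAMIC_{name} = {w / total:.{digits}f}"
        for name, w in weights.items()
    ]


if __name__ == "__main__":
    print(f"x = 1/{total} = {float(x)}")
    for line in quota_lines(weights):
        print(line)
```

The rounded values match the ones in the config and still add up to 1.000.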
When I add accounting_group to the submit file, the job just sits in the Idle state. Here is the submit file:
# Unix submit description file
# sleep.sub -- simple sleep job
executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
accounting_group        = group_a
should_transfer_files  = Yes
when_to_transfer_output = ON_EXIT
queue
condor_q -better-analyze shows this:
[cyang@centos sleep]$ condor_q -better-analyze 148.0
User priority for cyang@xxxxxxxxxxx is not available, attempting to analyze without it. ---
148.000:  Run analysis summary.  Of 2 machines,
   0 are rejected by your job's requirements
   0 reject your job because of their own requirements
   0 match and are already running your jobs
   0 match but are serving other users
   0 are available to run your job
    No successful match recorded.
    Last failed match: Wed May  4 09:48:46 2016
    Reason for last match failure: no match found
The Requirements expression for your job is:
  ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
  ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
  ( TARGET.HasFileTransfer )
Your job defines the following attributes:
  DiskUsage = 1
  ImageSize = 1
  RequestDisk = 1
  RequestMemory = 1
The Requirements expression for your job reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  TARGET.Arch == "X86_64"
[1]           2  TARGET.OpSys == "LINUX"
[3]           2  TARGET.Disk >= RequestDisk
[5]           2  TARGET.Memory >= RequestMemory
[7]           2  TARGET.HasFileTransfer
Suggestions:
    Condition                                                                   Machines Matched    Suggestion
    ---------                                                                   ----------------    ----------
1   ( TARGET.Arch == "X86_64" )                                                 2
2   ( TARGET.OpSys == "LINUX" )                                                 2
3   ( TARGET.Disk >= 1 )                                                        2
4   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )   2
5   ( TARGET.HasFileTransfer )                                                  2
So it looks like the machines match, but the job still won't run. When I use the +AccountingGroup = group_a directive instead, it runs without any problem.
Additionally, condor_userprio shows only cyang@xxxxxxxxxxx, with no associated groups:
[cyang@rhw1160 sleepjob]$ condor_userprio
Last Priority Update:  5/4  09:51
                              Effective     Priority   Res    Total Usage  Time Since
User Name                     Priority      Factor     In Use (wghted-hrs) Last Usage
--------------------------- ------------ --------- ------ ------------ ----------
--------------------------- ------------ --------- ------ ------------ ----------
Number of users: 1                                      0         0.30    0+23:59
Any thoughts as to why the jobs stay idle? Or am I doing something obviously wrong?
Thanks.