I am a new Condor user and am stalled attempting to execute a Parallel Universe job on a setup with 2 host machines.
Our software is designed to scale to thousands of host machines, with the aggregate behaving as if it were a single logical database running on one huge central machine. As such, HTCondor seems like the ideal tool to manage our multi-host testing requirements.
Condor installation and initial setup were easy. Specifics:
- version 7.6.10
- non-root install
- Manager, Scheduler and 1 Execute node on host p1
- Pooled execute node on host p2
- after initial install a simple Vanilla Universe job ran fine
- have a shared main condor_config, plus a localized condor_config.local for each host
I made the following config file changes (note: in all these samples I have replaced IP addresses and full hostnames with short hostnames):
============ Main condor_config ================
## central pool manager
COLLECTOR_NAME = NuoDB-DHentchel-p1
## Map Scheduler to Parallel Universe group; this defines a pool for concurrent, parallel runs
SCHEDD_NAME = DedicatedScheduler
DedicatedScheduler = "DedicatedScheduler@p1"
ParallelSchedulingGroup = "P5"
## Security settings:
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
============ Each condor_config.local ================
## Bind to parent scheduler group, to enable parallel universe dispatch
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
STARTD_ATTRS = $(STARTD_ATTRS), ParallelSchedulingGroup
## Tune STARTD for dedicated, parallel scheduling
START = True
RANK = Scheduler =?= $(DedicatedScheduler)
LOCAL_DIR = /var/local/condor/$(HOSTNAME)
(For p1 only):
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), p2
(For p2 only):
DAEMON_LIST = MASTER, STARTD
I validated that the critical config variables are set on both machines:
condor_config_val COLLECTOR_NAME ==> NuoDB-DHentchel-p1
condor_config_val SCHEDD_NAME ==> DedicatedScheduler
condor_config_val DedicatedScheduler ==> "DedicatedScheduler@p1"
condor_config_val ParallelSchedulingGroup ==> "P1"
condor_config_val STARTD_ATTRS ==> COLLECTOR_HOST_STRING, DedicatedScheduler, ParallelSchedulingGroup
condor_config_val RANK ==> Scheduler =?= "DedicatedScheduler@p1"
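One caveat on this check: condor_config_val reports what the config files say, not what each startd actually advertises to the collector. A minimal Python sketch of that second check follows. It assumes `condor_status -long` prints one ClassAd per slot as `Attr = value` lines separated by blank lines; the helper names `parse_ads` and `slots_missing` are hypothetical, not part of Condor.

```python
import subprocess

# Attributes the config above is meant to publish in every slot ad.
REQUIRED = ("DedicatedScheduler", "ParallelSchedulingGroup")


def parse_ads(text):
    """Split `condor_status -long` style output into a list of dicts."""
    ads, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current ad.
            if current:
                ads.append(current)
                current = {}
            continue
        # Split on the first "=" only, so values such as
        # `Scheduler =?= "x"` stay intact.
        name, sep, value = line.partition("=")
        if sep:
            current[name.strip()] = value.strip()
    if current:
        ads.append(current)
    return ads


def slots_missing(ads, required=REQUIRED):
    """Map slot name -> required attributes absent from that slot's ad."""
    report = {}
    for ad in ads:
        missing = [attr for attr in required if attr not in ad]
        if missing:
            slot = ad.get("Name", "<unnamed>").strip('"')
            report[slot] = missing
    return report


if __name__ == "__main__":
    # Requires a reachable pool; run on a host where condor_status works.
    out = subprocess.run(
        ["condor_status", "-long"],
        capture_output=True, text=True, check=True,
    ).stdout
    for slot, missing in slots_missing(parse_ads(out)).items():
        print(slot, "is missing", ", ".join(missing))
```

If this prints nothing, every slot ad carries both attributes and the problem presumably lies elsewhere; if a slot is listed, the STARTD_ATTRS setting may not have taken effect on that host.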
Now I submit the following job:
universe = parallel
Scheduler = "DedicatedScheduler@p1"
executable = /bin/sleep
arguments = 30
machine_count = 2
+WantParallelSchedulingGroups = True
error = log/simple-sleep.$(PID).err
output = log/simple-sleep.$(PID).out
log = log/simple-sleep.$(PID).log
queue
This sits in the queue and is never dispatched:
-- Submitter: DedicatedScheduler@p1: <p1:51203> : p1
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
1.0 dhentchel 3/21 16:12 0+00:00:00 I 0 0.0 sleep 30
2.0 dhentchel 3/22 09:57 0+00:00:00 I 0 0.0 sleep 30
From the log files, I see some interesting messages.
CollectorLog:
03/21/13 16:12:34 SubmittorAd : Inserting ** "< dhentchel@xxxxxxxxx DedicatedScheduler@p1 , p1 >"
MatchLog:
03/22/13 09:57:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot1@p2
03/22/13 09:58:46 Matched 2.0 dhentchel@xxxxxxxxx <p1:51203> preempting none <p2:33543> slot2@p2
etc. All expected slots on both hosts show up.
SchedLog:
03/21/13 16:09:08 (pid:10111) ** SubsystemInfo: name=SCHEDD type=SCHEDD(5) class=DAEMON(1)
03/21/13 16:12:34 (pid:10111) TransferQueueManager stats: active up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
03/21/13 16:12:34 (pid:10111) Inserting new attribute Scheduler into non-active cluster cid=1 acid=-1
03/21/13 16:12:34 (pid:10111) Trying to satisfy job with group scheduling
03/21/13 16:12:34 (pid:10111) Job requested parallel scheduling groups, but no groups found
So all Condor processes seem to be communicating correctly, but the submit request with "universe = parallel" is failing with the message "Job requested parallel scheduling groups, but no groups found".
I'm hoping that someone can identify what is missing in the configuration, or perhaps give me advice on how to dig deeper to find where the machine pool setup is going astray.
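As I understand the manual, when a job sets WantParallelSchedulingGroups the dedicated scheduler tries to place all of its nodes on slots that share a single ParallelSchedulingGroup value. The sketch below shows one way to check whether any group is large enough for machine_count. It takes slot ads as plain dicts (for example, parsed from `condor_status -long`); the function names and the sample ads are invented for illustration and are not taken from my pool.

```python
from collections import defaultdict


def group_sizes(ads):
    """Map each ParallelSchedulingGroup value to the slots advertising it."""
    groups = defaultdict(list)
    for ad in ads:
        group = ad.get("ParallelSchedulingGroup")
        if group is None:
            # Slots without the attribute belong to no group at all.
            continue
        slot = ad.get("Name", "<unnamed>").strip('"')
        groups[group.strip('"')].append(slot)
    return dict(groups)


def groups_that_fit(ads, machine_count):
    """Return the groups holding at least `machine_count` slots."""
    sizes = group_sizes(ads)
    return sorted(g for g, slots in sizes.items() if len(slots) >= machine_count)


if __name__ == "__main__":
    # Invented example ads, NOT output from a real pool.
    example_ads = [
        {"Name": '"slot1@hostA"', "ParallelSchedulingGroup": '"groupA"'},
        {"Name": '"slot1@hostB"', "ParallelSchedulingGroup": '"groupB"'},
        {"Name": '"slot2@hostB"', "ParallelSchedulingGroup": '"groupB"'},
        {"Name": '"slot1@hostC"'},
    ]
    print(group_sizes(example_ads))
    print(groups_that_fit(example_ads, 2))
```

An empty result from `group_sizes` would correspond to no slot advertising a group at all, while an empty result from `groups_that_fit` alone would mean groups exist but none is big enough. Which of these, if either, produces the "no groups found" message is an assumption on my part.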