Re: [HTCondor-users] Debugging a DedicatedScheduler?
- Date: Fri, 20 Jun 2014 12:25:32 +0200
- From: Steffen Grunewald <Steffen.Grunewald@xxxxxxxxxx>
- Subject: Re: [HTCondor-users] Debugging a DedicatedScheduler?
On Thu, Jun 19, 2014 at 10:02:07AM -0500, Greg Thain wrote:
> On 06/19/2014 05:32 AM, Steffen Grunewald wrote:
> >
> >Apparently the DS isn't running - what am I missing, and how would
> >I find out more?
> >
>
> Currently, condor_q -analyze doesn't know about the dedicated
> scheduler. The first thing you want to do is make sure that the
> startd's idea of the schedd's name matches the schedd's idea. So, see
> which dedicated scheduler name the startds advertise they are
> willing to be managed by:
>
> condor_status -af DedicatedScheduler
# condor_status -af DedicatedScheduler | uniq -c
${node_count} undefined
> The output should be something like
>
> DedicatedScheduler@my_schedd_name
>
> Verify that the string after the (first) at sign matches
>
> condor_status -schedd -af Name
returns the public FQDN(s) properly.
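The comparison described above can be scripted. What follows is a minimal sketch, not a tested tool: it assumes the output of the two condor_status commands has been saved to files, and the function name check_ds_names and the file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: compare the DedicatedScheduler values advertised by
# the startds against the schedd names known to the collector.
#   $1: file holding the output of `condor_status -af DedicatedScheduler`
#   $2: file holding the output of `condor_status -schedd -af Name`
check_ds_names() {
    startd_file=$1
    schedd_file=$2
    status=0
    # One pass per distinct advertised value.
    for value in $(sort -u "$startd_file"); do
        # Strip everything up to and including the first at sign.
        name=${value#*@}
        if [ "$name" = "$value" ]; then
            echo "BAD: '$value' has no DedicatedScheduler@ prefix"
            status=1
        elif grep -Fqx -- "$name" "$schedd_file"; then
            echo "OK: $value"
        else
            echo "BAD: '$name' is not an advertised schedd name"
            status=1
        fi
    done
    return $status
}

# Usage on a machine with HTCondor installed:
#   condor_status -af DedicatedScheduler > startd.txt
#   condor_status -schedd -af Name > schedd.txt
#   check_ds_names startd.txt schedd.txt
```

A value such as "undefined" contains no at sign, so it is reported as BAD and the function returns a non-zero status.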
I suppose the "undefined" string is not what you'd expect, and I'd have
to ssh to one of the nodes to check why:
# condor_config_val -dump | grep STARTD
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
COLLECTOR_REPEAT_STARTD_ADS = 0
DAEMON_LIST = MASTER, STARTD
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
MAX_STARTD_LOG = 10000000
NEGOTIATOR_INFORM_STARTD = true
NEGOTIATOR_USE_NONBLOCKING_STARTD_CONTACT = true
SCHEDD_USES_STARTD_FOR_LOCAL_UNIVERSE = True
SETTABLE_ATTRS_ADVERTISE_STARTD =
STARTD = $(SBIN)/condor_startd
STARTD_AD_REEVAL_EXPR =
STARTD_ADDRESS_FILE = $(RUN)/StartdAddress
STARTD_ATTRS = COLLECTOR_HOST_STRING, DedicatedScheduler
STARTD_CLAIM_ID_FILE = $(RUN)/StartdClaimId
STARTD_COMPUTE_AVAIL_STATS = false
STARTD_CONTACT_TIMEOUT = 45
STARTD_CRON_AUTOPUBLISH = If_Changed
STARTD_CRON_NAME =
STARTD_DEBUG = D_COMMAND
STARTD_FACTORY_SCRIPT_AVAILABLE_PARTITIONS =
STARTD_FACTORY_SCRIPT_BACK_PARTITION =
STARTD_FACTORY_SCRIPT_BOOT_PARTITION =
STARTD_FACTORY_SCRIPT_DESTROY_PARTITION =
STARTD_FACTORY_SCRIPT_GENERATE_PARTITION =
STARTD_FACTORY_SCRIPT_QUERY_WORK_LOADS =
STARTD_FACTORY_SCRIPT_SHUTDOWN_PARTITION =
STARTD_HAS_BAD_UTMP = 0
STARTD_HISTORY = $(LOG)/StartdHistory
STARTD_JOB_EXPRS = ImageSize, ExecutableSize, JobUniverse, NiceUser
STARTD_JOB_HOOK_KEYWORD =
STARTD_LOG = $(LOG)/StartdLog
STARTD_MAX_AVAIL_PERIOD_SAMPLES = 100
STARTD_NAME =
STARTD_NOCLAIM_SHUTDOWN = 0
STARTD_RESOURCE_PREFIX =
STARTD_SENDS_ALIVES = peer
STARTD_SHOULD_WRITE_CLAIM_ID_FILE = true
STARTD_SLOT_ATTRS = State, Activity, EnteredCurrentActivity
STARTD_SLOT_EXPRS =
STARTD_VM_ATTRS =
STARTD_VM_EXPRS =
# condor_config_val -dump | grep Scheduler
DedicatedScheduler = $(DEDICATED_SCHEDULER)
IsScheduler = (TARGET.JobUniverse == $(SCHEDULER_U))
START_SCHEDULER_UNIVERSE = TotalSchedulerJobsRunning < 10
STARTD_ATTRS = COLLECTOR_HOST_STRING, DedicatedScheduler
# condor_config_val -dump | grep DEDICATED
DEDICATED_SCHEDULER = $(MASTER_MACHINE).(...)
...
and that one *is* properly defined, as MASTER_MACHINE is set
(it's the one running the collector and negotiator)
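Since -dump prints the macros unexpanded, querying a single name with condor_config_val shows what it actually expands to. The sketch below checks such a value. It rests on an assumption that should be verified against the manual for the installed version: that the attribute is expected to be a double-quoted string of the form "DedicatedScheduler@<schedd name>". The function name looks_like_ds_string is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: does an expanded config value look like
# "DedicatedScheduler@<name>", including the surrounding double quotes?
# (The expected form is an assumption, see the note above.)
looks_like_ds_string() {
    case $1 in
        '"DedicatedScheduler@'?*'"') return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on an execute node (requires HTCondor):
#   value=$(condor_config_val DedicatedScheduler)
#   looks_like_ds_string "$value" || echo "unexpected value: $value"
```

A bare hostname, or a value without the quotes, fails the check.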
I'm pretty sure I followed the config docs page by page, but I must've missed
something important along the way :(
What's worse: I can't see what and where :(:(
Thanks,
Steffen
--
Steffen Grunewald * Cluster Admin * steffen.grunewald(*)aei.mpg.de
MPI f. Gravitationsphysik (AEI) * Am Mühlenberg 1, D-14476 Potsdam
http://www.aei.mpg.de/ * ------- * +49-331-567-{fon:7274,fax:7298}