Re: [HTCondor-users] How to find the name of the condor_collecter and the name of condor_schedd daemon?Thank you very much.
- Date: Tue, 14 Jun 2016 09:16:24 +0800
- From: "btdan@xxxxxxx" <btdan@xxxxxxx>
- Subject: Re: [HTCondor-users] How to find the name of the condor_collecter and the name of condor_schedd daemon?Thank you very much.
Dear Todd:
Thank you very much for your reply.
Unfortunately, the problem is still there. Could you help me resolve it?
I have already modified the /etc/condor/condor_config file for pool A (named 181.nodeljA) by adding the following lines:
FLOCK_TO =188.nodeljB
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), $(IP_ADDRESS)
CONDOR_GAHP = $(SBIN)/condor_c-gahp
C_GAHP_LOG = /tmp/CGAHPLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOG = /tmp/CGAHPWorkerLog.$(USERNAME)
C_GAHP_WORKER_THREAD_LOCK = /tmp/CGAHPWorkerLock.$(USERNAME)
The same file for pool B (named 188.nodeljB) was modified as follows:
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
ALLOW_ADMINISTRATOR = $(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_OWNER = $(FULL_HOSTNAME), $(ALLOW_ADMINISTRATOR)
ALLOW_NEGOTIATOR = xxx@$(CONDOR_HOST), $(IP_ADDRESS)
ALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS), $(IP_ADDRESS)
ALLOW_WRITE_COLLECTOR = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_WRITE_STARTD = $(ALLOW_WRITE), $(FLOCK_FROM)
ALLOW_READ_COLLECTOR = $(ALLOW_READ), $(FLOCK_FROM)
ALLOW_READ_STARTD = $(ALLOW_READ), $(FLOCK_FROM)
LOCK = $(LOCAL_DIR)/lock/condor
SEC_DEFAULT_NEGOTIATION = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = CLAIMTOBE
The submit description file (named sub.txt) looks like this:
executable=/data/condor_test/CondorTest.class
input=/data/condor_test/list.txt
arguments=CondorTest181795_2014-05-14_152801.mp4
log=/data/condor_test/condor.log
error=/data/condor_test/condor.error
grid_resource=condor 188.nodeljB 188.nodeljB
+remote_universe=10
+remote_requirements=True
+remote_ShouldTransferFiles='YES'
queue
When I run the command "condor_submit sub.txt", all the jobs are held. The job log says:
012 (031.239.000) 06/14 08:54:52 Job was held.
GridResource missing pool name
Code 0 Subcode 0
And the content of /var/log/condor/GridmanagerLog.xxx is as follows:
06/14/16 09:01:06 ******************************************************
06/14/16 09:01:06 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
06/14/16 09:01:06 ** /usr/sbin/condor_gridmanager
06/14/16 09:01:06 ** SubsystemInfo: name=GRIDMANAGER type=DAEMON(12) class=DAEMON(1)
06/14/16 09:01:06 ** Configuration: subsystem:GRIDMANAGER local:<NONE> class:DAEMON
06/14/16 09:01:06 ** $CondorVersion: 8.4.7 Jun 03 2016 BuildID: 369249 $
06/14/16 09:01:06 ** $CondorPlatform: x86_64_RedHat6 $
06/14/16 09:01:06 ** PID = 3607
06/14/16 09:01:06 ** Log last touched 6/14 08:54:57
06/14/16 09:01:06 ******************************************************
06/14/16 09:01:06 Using config source: /etc/condor/condor_config
06/14/16 09:01:06 Using local config sources:
06/14/16 09:01:06 /etc/condor/condor_config.local
06/14/16 09:01:06 config Macros = 62, Sorted = 62, StringBytes = 1644, TablesBytes = 2272
06/14/16 09:01:06 CLASSAD_CACHING is ENABLED
06/14/16 09:01:06 Daemon Log is logging: D_ALWAYS D_ERROR
06/14/16 09:01:06 Daemoncore: Listening at <0.0.0.0:55344> on TCP (ReliSock) and UDP (SafeSock).
06/14/16 09:01:06 DaemonCore: command socket at <192.168.1.181:55344?addrs=192.168.1.181-55344>
06/14/16 09:01:06 DaemonCore: private command socket at <192.168.1.181:55344?addrs=192.168.1.181-55344>
06/14/16 09:01:09 [3607] Found job 32.0 --- inserting
06/14/16 09:01:09 [3607] Found job 32.1 --- inserting
06/14/16 09:01:10 [3607] (32.0) doEvaluateState called: gmState GM_HOLD, remoteState -1
06/14/16 09:01:10 [3607] (32.1) doEvaluateState called: gmState GM_HOLD, remoteState -1
06/14/16 09:01:15 [3607] No jobs left, shutting down
06/14/16 09:01:15 [3607] Got SIGTERM. Performing graceful shutdown.
06/14/16 09:01:15 [3607] **** condor_gridmanager (condor_GRIDMANAGER) pid 3607 EXITING WITH STATUS 0
I think the reason is that I have not given the right value for the grid_resource parameter in the submit description file.
The result of command "condor_status -schedd" is as follows:
[root@188 ~]# condor_status -schedd
Name         Machine      RunningJobs  IdleJobs  HeldJobs
188.nodeljB  188.nodeljB            0         0         0
       TotalRunningJobs  TotalIdleJobs  TotalHeldJobs
Total                 0              0              0
and the result of command "condor_status -collector" is as follows:
[root@188 ~]# condor_status -collector
Name                             Machine      RunningJobs  IdleJobs  HostsTotal
"Condor Pool of LJ"@151.nodelj   151.nodelj             0         0           0
"CPLJ"@188.nodeljB               188.nodeljB            0         0           4
"Condor Pool of LJ"@188.nodeljB  188.nodeljB            0         0           0
Can you help me find a solution to this problem?
Thank you very much.
Best regards.
Date: Mon, 13 Jun 2016 12:46:37 -0500
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] How to find the name of the
condor_collecter and the name of condor_schedd daemon?Thank you very
much.
Message-ID: <575EF17D.3010607@xxxxxxxxxxx>
Content-Type: text/plain; CHARSET=US-ASCII; format=flowed
On 6/12/2016 3:23 AM, HTCondor wrote:
> Dear all,
> I am configuring HTCondor flocking. There is a parameter in the
> submit description file for a job, namely grid_resource.
> According to the HTCondor manual, the third field is the name of
> the remote pool's condor_collector, and the second field is the name
> of the remote condor_schedd daemon.
> Can you tell me how to find their actual values?
> Thank you very much.
> Best regards
>
> David
>
Hi David,
Flocking is HTCondor's way of allowing jobs that cannot immediately run
within the pool of machines where the job was submitted to instead run
on a different HTCondor pool. If a machine within HTCondor pool A can
send jobs to be run on HTCondor pool B, then we say that jobs from
machine A flock to pool B.
If Flocking is what you want, you don't need to mess around with grid
universe, grid_resource, or any of that. On the condor_config for
machines in pool A just modify the FLOCK_TO line to include the hostname
of pool B central manager, and on the condor_configs for machines in
pool B just modify the FLOCK_FROM line to include the hostname of pool A
central manager.
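As a rough sketch only, the two changes described above might look like the lines below. This assumes that 181.nodeljA and 188.nodeljB (the names used earlier in this thread) are the hostnames of the central managers of pool A and pool B; substitute the real hostnames if they differ.

    # condor_config on the machines in pool A (the pool the jobs are submitted from)
    # Assumption: 188.nodeljB is the hostname of pool B's central manager.
    FLOCK_TO = 188.nodeljB

    # condor_config on the machines in pool B (the pool the jobs flock to)
    # Assumption: 181.nodeljA is the hostname of pool A's central manager.
    FLOCK_FROM = 181.nodeljA

After editing the files, running condor_reconfig on the affected machines should make the daemons pick up the new values.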
Details are in:
regards,
Todd