
Re: [Condor-users] Struggling with Condor Flocking



Sai,
I think FLOCK_FROM on cm1.mygrid.com must be

FLOCK_FROM = *.mygrid.com

and NOT

FLOCK_FROM = *.hclgrid.com

Please also check that FLOCK_TO is set in the Condor config files of the
submit (schedd) machine, which is cm.mygrid.com in your setup.
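Here is a minimal sketch of the flocking lines in the two condor_config.local files after this change. It assumes cm.mygrid.com is the only submit machine whose jobs should flock to cm1.mygrid.com. Only the flocking lines are shown, and everything else stays as you posted it:

```
# cm.mygrid.com (submit side: its schedd's idle jobs should flock out)
# Remote pool(s) to try, in order, when the local pool cannot run the jobs.
FLOCK_TO = cm1.mygrid.com

# cm1.mygrid.com (execute side: accepts flocked jobs)
# Hosts allowed to flock jobs into this pool. The pattern must match the
# submitting machine (cm.mygrid.com), not an unrelated domain.
FLOCK_FROM = *.mygrid.com
```

A stricter alternative on cm1 is FLOCK_FROM = cm.mygrid.com. After editing, I believe running condor_reconfig on both machines is enough for the daemons to pick up the change.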

Also, please make sure that the HOSTALLOW* entries in the config files
are set appropriately.  For details, consult
http://www.cs.wisc.edu/condor/manual/v6.7/5_2Connecting_Condor.html
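For reference, the stock condor_config shipped with the 6.6/6.7 series ties the host-based authorization lists to the flocking variables roughly as shown below. I am quoting this from memory, so compare it against your own condor_config rather than copying it blindly. Your cm1 CollectorLog shows the schedd ads from cm.mygrid.com arriving. If the global file on your shared file system redefines these lists without the FLOCK_* terms, the negotiator on cm1.mygrid.com may still be refused when it contacts the schedd on cm.mygrid.com.

```
# Derived from FLOCK_TO: the remote central managers this schedd reports
# to and accepts negotiation from.
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)

# Let the local and remote negotiators contact the local schedd.
HOSTALLOW_NEGOTIATOR_SCHEDD = $(CONDOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)

# Let hosts listed in FLOCK_FROM talk to this pool's collector and startds.
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD    = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR  = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD     = $(HOSTALLOW_READ), $(FLOCK_FROM)
```

To check what the daemons actually see, run condor_config_val HOSTALLOW_NEGOTIATOR_SCHEDD on cm.mygrid.com. If these defaults are in effect, the result should include cm1.mygrid.com.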

Let me know if that helps.
--
Rajesh Rajamani
Senior Member of Technical Staff
Direct : +1.408.321.9000
Fax    : +1.408.904.5992
Mobile : +1.408.321.9030
raj@xxxxxxxxxx


Optena Corporation 2860 Zanker Road, Suite 201 San Jose, CA 95134 www.optena.com




Aln Sai Srinivas - CTD, Chennai wrote:
Hi,
I'm trying to set up flocking between Condor pools. I'm getting this error in
the CollectorLog: "DC_AUTHENTICATE: attempt to open invalid session
cm.mygrid:27485:1119363742:5, failing", where cm is the host name of the
central manager of the pool whose jobs should flock to the other pool.
Flocking never happens.
Here is the scenario:
I'm using Red Hat Linux and Condor 6.6.9.
There are two central managers, cm.mygrid.com and cm1.mygrid.com, each
managing its own Condor pool.
The condor_config file is on a shared file system.
Here is the configuration:

======================================================================
$LOCAL_DIR/condor_config.local for cm.mygrid.com
=======================================================================
COLLECTOR = $(SBIN)/condor_collector
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
COLLECTOR_NAME = Collector at alpha
CONTINUE = True
FILESYSTEM_DOMAIN = cm.mygrid.com
FLOCK_FROM = *.hclgrid.com
FLOCK_TO = cm1.mygrid.com
PREEMPT = FALSE
SUSPEND = FALSE
LOCK = /tmp/condor-lock.$(HOSTNAME)0.885447545050742
UID_DOMAIN = cm.mygrid.com
NEGOTIATOR = $(SBIN)/condor_negotiator
VACATE = FALSE
CONDOR_ADMIN = root@xxxxxxxxxxxxx
START = TRUE
MAIL = /bin/mail
CONDOR_IDS = 504.504
RELEASE_DIR = /usr/local/condor
CONDOR_HOST = cm.mygrid.com
LOCAL_DIR = /usr/local/condor/local.$(HOSTNAME)

======================================================================
$LOCAL_DIR/condor_config.local for cm1.mygrid.com
=======================================================================
COLLECTOR = $(SBIN)/condor_collector
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, STARTD, SCHEDD
COLLECTOR_NAME = Collector at microgrid
CONTINUE = True
FILESYSTEM_DOMAIN = cm1.mygrid.com
FLOCK_FROM = *.hclgrid.com
FLOCK_TO = Not defined
PREEMPT = FALSE
SUSPEND = FALSE
LOCK = /tmp/condor-lock.$(HOSTNAME)0.505710791637288
UID_DOMAIN = cm1.mygrid.com
NEGOTIATOR = $(SBIN)/condor_negotiator
VACATE = FALSE
CONDOR_ADMIN = root@xxxxxxxxxxxxxx
START = TRUE
MAIL = /bin/mail
CONDOR_IDS = 504.504
RELEASE_DIR = /usr/local/condor
CONDOR_HOST = cm1.mygrid.com
LOCAL_DIR = /usr/local/condor/local.$(HOSTNAME)

============================================================================
CollectorLog at cm1.mygrid.com
============================================================================
6/21 20:03:42 WARNING:  No master ad for < cm.mygrid.com >
6/21 20:03:42 ScheddAd    : Inserting ** "< cm.mygrid.com , 10.100.207.10 >"
6/21 20:03:42 stats: Inserting new hashent for 'Schedd':'cm.mygrid.com':'10.100.207.10'
6/21 20:03:42 SubmittorAd  : Inserting ** "< condor@xxxxxxxxxxxxx , 10.100.207.10 >"
6/21 20:03:42 stats: Inserting new hashent for 'Submittor':'condor@xxxxxxxxxxxxx':'10.100.207.10'
6/21 20:04:09 Got QUERY_STARTD_ADS
6/21 20:04:09 (Sent 1 ads in response to query)
6/21 20:06:33 (Sent 5 ads in response to query)
6/21 20:06:33 Got QUERY_STARTD_PVT_ADS
6/21 20:06:33 (Sent 1 ads in response to query)

============================================================================
CollectorLog at cm.mygrid.com
============================================================================
6/21 20:02:21 DC_AUTHENTICATE: attempt to open invalid session alpha:27485:1119363742:5, failing.
6/21 20:03:21 SubmittorAd  : Inserting ** "< condor@xxxxxxxxxxxxx , 10.100.207.10 >"
6/21 20:03:21 stats: Inserting new hashent for 'Submittor':'condor@xxxxxxxxxxxxx':'10.100.207.10'
6/21 20:03:21 (Sent 4 ads in response to query)
6/21 20:03:21 Got QUERY_STARTD_PVT_ADS
6/21 20:03:21 (Sent 1 ads in response to query)
6/21 20:03:41 (Sent 4 ads in response to query)
6/21 20:03:41 Got QUERY_STARTD_PVT_ADS
6/21 20:03:41 (Sent 1 ads in response to query)
6/21 20:03:49 DC_AUTHENTICATE: attempt to open invalid session alpha:27485:1119363829:6, failing.

============================================================================
ScheddLog at cm.mygrid.com
============================================================================
6/21 20:03:20 DaemonCore: Command received via UDP from host <10.100.207.10:32990>
6/21 20:03:20 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
6/21 20:03:21 Sent ad to central manager for condor@xxxxxxxxxxxxx
6/21 20:03:21 Called reschedule_negotiator()
6/21 20:03:21 DaemonCore: Command received via TCP from host <10.100.207.10:41493>
6/21 20:03:21 DaemonCore: received command 416 (NEGOTIATE), calling handler (negotiate)
6/21 20:03:21 Negotiating for owner: condor@xxxxxxxxxxxxx
6/21 20:03:21 Checking consistency running and runnable jobs
6/21 20:03:21 Tables are consistent
6/21 20:03:21 Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
6/21 20:03:21 DaemonCore: Command received via UDP from host <10.100.207.10:32991>
6/21 20:03:21 DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
6/21 20:03:21 Called reschedule_negotiator()
6/21 20:03:23 Started shadow for job 558.0 on "<10.100.207.10:41475>", (shadow pid = 27559)
6/21 20:03:25 Sent ad to central manager for condor@xxxxxxxxxxxxx
6/21 20:03:42 Activity on stashed negotiator socket
6/21 20:03:42 Negotiating for owner: condor@xxxxxxxxxxxxx
6/21 20:03:42 Checking consistency running and runnable jobs
6/21 20:03:42 Tables are consistent
6/21 20:03:42 Out of servers - 0 jobs matched, 1 jobs idle, 1 jobs rejected
6/21 20:03:42 Increasing flock level for condor@xxxxxxxxxxxxx to 1.
6/21 20:03:42 Sent ad to central manager for condor@xxxxxxxxxxxxx
6/21 20:04:34 Shadow pid 27559 for job 558.0 exited with status 100
6/21 20:04:34 Started shadow for job 559.0 on "<10.100.207.10:41475>", (shadow pid = 27569)

Could you please help me figure out what I'm missing?

Regards
Sai



_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users