Hi all,
I am building a Condor cluster of Solaris machines. So far I have 10+ Solaris 8 machines running without problems. The trouble starts when I try to add Solaris 10 machines.
All machines share a condor user directory via NFS, so I have created two directories there, one for Solaris 8 and one for Solaris 10, each with the subdirectories bin, sbin, libexec and lib. Each machine’s config file points to one directory or the other, depending on the architecture.
The Solaris 10 machines start up the condor daemons without a problem, but the logs show the error “ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed”.
Here are some excerpts from the logs:
================ MasterLog ==========================
3/26 18:46:01 ProcAPI::buildFamily() Found daddypid on the system: 799
3/26 18:46:08 ProcAPI::buildFamily() Found daddypid on the system: 800
3/26 18:46:42 Getting monitoring info for pid 798
3/26 18:46:45 enter Daemons::UpdateCollector
3/26 18:46:45 Trying to update collector <10.95.5.97:9618>
3/26 18:46:45 Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>
3/26 18:46:45 exit Daemons::UpdateCollector
3/26 18:46:52 enter Daemons::CheckForNewExecutable
3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_master: 1167915061
3/26 18:46:52 GetTimeStamp returned: 1167915061
3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_schedd: 1167915026
3/26 18:46:52 GetTimeStamp returned: 1167915026
3/26 18:46:52 Time stamp of running /home/usu/condor/condor_5.10/sbin/condor_startd: 1167915022
3/26 18:46:52 GetTimeStamp returned: 1167915022
3/26 18:46:52 exit Daemons::CheckForNewExecutable
3/26 18:47:01 ProcAPI::buildFamily() Found daddypid on the system: 799
3/26 18:47:06 attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..
3/26 18:47:06 ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed
3/26 18:47:06 Failed to start non-blocking update to <10.95.5.97:9618>.
================= SchedLog =======================
3/26 18:37:26 (pid:799) attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..
3/26 18:37:26 (pid:799) ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed
3/26 18:37:26 (pid:799) Failed to start non-blocking update to <10.95.5.97:9618>.
3/26 18:38:51 (pid:799) Getting monitoring info for pid 799
3/26 18:42:05 (pid:799) JobsRunning = 0
3/26 18:42:05 (pid:799) JobsIdle = 0
3/26 18:42:05 (pid:799) JobsHeld = 0
3/26 18:42:05 (pid:799) JobsRemoved = 0
3/26 18:42:05 (pid:799) LocalUniverseJobsRunning = 0
3/26 18:42:05 (pid:799) LocalUniverseJobsIdle = 0
3/26 18:42:05 (pid:799) SchedUniverseJobsRunning = 0
3/26 18:42:06 (pid:799) SchedUniverseJobsIdle = 0
3/26 18:42:06 (pid:799) N_Owners = 0
3/26 18:42:06 (pid:799) MaxJobsRunning = 200
3/26 18:42:06 (pid:799) Trying to update collector <10.95.5.97:9618>
3/26 18:42:06 (pid:799) Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>
3/26 18:42:06 (pid:799) Sent HEART BEAT ad to 1 collectors. Number of submittors=0
3/26 18:42:06 (pid:799) ============ Begin clean_shadow_recs =============
3/26 18:42:06 (pid:799) ============ End clean_shadow_recs =============
3/26 18:42:06 (pid:799) -------- Begin starting jobs --------
3/26 18:42:06 (pid:799) -------- Done starting jobs --------
3/26 18:42:27 (pid:799) attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..
3/26 18:42:27 (pid:799) ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed
================= StartLog =====================================
3/26 18:42:35 Failed to start non-blocking update to <10.95.5.97:9618>.
3/26 18:43:10 Getting monitoring info for pid 800
3/26 18:45:10 DaemonCore: in SendAliveToParent()
3/26 18:45:10 DaemonCore: attempting to connect to '<10.95.109.196:32853>'
3/26 18:47:10 Swap space: 818600
3/26 18:47:10 3635528 kbytes available for "/home/usu/condor/hosts/kang/execute"
3/26 18:47:10 Looking up RESERVED_DISK parameter
3/26 18:47:10 Reserving 5120 kbytes for file system
3/26 18:47:10 Disk space: 3630408
3/26 18:47:10 State change: IS_OWNER is TRUE
3/26 18:47:10 Changing state: Unclaimed -> Owner
3/26 18:47:11 Getting monitoring info for pid 800
3/26 18:47:15 Trying to update collector <10.95.5.97:9618>
3/26 18:47:15 Attempting to send update via UDP to collector vitorino.hi.inet <10.95.5.97:9618>
3/26 18:47:15 Sent update to 1 collector(s)
3/26 18:47:36 attempt to connect to <10.95.5.97:9618> failed: Failed to set timeout..
3/26 18:47:36 ERROR: SECMAN:2003:TCP connection to <10.95.5.97:9618> failed
3/26 18:47:36 Failed to start non-blocking update to <10.95.5.97:9618>.
A bit more info: the Solaris 10 machines can run condor_status and get a list of all the other machines, but they themselves do not appear in that list.
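To separate a plain network problem from something inside Condor, a check like the one below could be run from one of the Solaris 10 machines. It is only a rough sketch and has nothing to do with Condor itself: the function name tcp_reachable is made up for illustration, and it assumes Python 3 is available on the machine. It simply tries to open a TCP connection to the collector address seen in the logs, with a timeout.

```python
import socket


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds.

    Hypothetical helper for illustration; not part of Condor.
    """
    try:
        # create_connection resolves the host, connects, and applies the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unresolvable hosts.
        return False


# Example call, using the collector address from the logs:
#   tcp_reachable("10.95.5.97", 9618)
```

If this returns True while the daemons still log "Failed to set timeout..", the network path is probably fine and the failure is more likely in how the Condor binaries set up their sockets on Solaris 10. That reading is a guess, not something the logs confirm.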
Thanks a lot for any help you can give me.