[HTCondor-users] Getting CentOS6 node into CO7 cluster
- Date: Wed, 08 Aug 2018 08:27:15 +0000
- From: Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] Getting CentOS6 node into CO7 cluster
Hi,
Apologies if this isn't the appropriate channel; this is my first post.
I have 1 master and 10 nodes, all on CentOS 7 with HTCondor 8.6.10.
I have to keep SL6.9 on one particular machine (fastpc31) and want to include it in the cluster.
condor_status shows the SL6.9 machine's slots as available, but they are never claimed (the job does run on the SL6.9 machine outside of condor).
slot8@fastpc30 LINUX X86_64 Claimed Busy 0.730 1970 0+00:00:03
slot1@fastpc31 LINUX X86_64 Unclaimed Idle 0.610 1994 0+00:44:37
[...]
condor_q -better-analyze 6750
6750.1069: Run analysis summary ignoring user priority. Of 252 machines,
0 are rejected by your job's requirements
0 reject your job because of their own requirements
244 match and are already running your jobs
0 match but are serving other users
0 are available to run your job
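The counts in that summary can be checked against the machine total. Below is a minimal sketch, assuming the input contains only the summary block as pasted above; the function name unaccounted_machines is made up for illustration and is not an HTCondor tool.

```shell
# unaccounted_machines: read a "condor_q -better-analyze" summary block on
# stdin and compare the machine total with the sum of the category counts.
# Assumption: stdin holds only the summary lines, in the layout shown above.
unaccounted_machines() {
    awk '
        /Of [0-9]+ machines/ {
            # The total is the number that follows the word "Of".
            for (i = 1; i < NF; i++) if ($i == "Of") total = $(i + 1)
            next
        }
        # Category lines start with a bare count, e.g. "244 match and are ...".
        $1 ~ /^[0-9]+$/ { sum += $1 }
        END {
            printf "total=%d categorised=%d unaccounted=%d\n", total, sum, total - sum
        }
    '
}
```

Fed the summary above, this prints total=252 categorised=244 unaccounted=8. The summary does not say what the remaining 8 are, so whether they correspond to the SL6.9 slots is not established by this check.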
----------------------
On the SL6.9 machine
----------------------
cat /var/log/messages | grep condor
Aug 7 14:34:34 fastpc31 yum[1962]: Installed: condor-8.6.10-1.el6.x86_64
Aug 7 14:35:21 fastpc31 htcondor: Not changing GLOBAL_MAX_FDS (/proc/sys/fs/file-max): new value (32768) <= old value (1606869).
Aug 7 14:35:21 fastpc31 htcondor: Not changing TCP_LISTEN_QUEUE (/proc/sys/net/core/somaxconn): new value (1024) <= old value (1024).
Aug 7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS (/proc/sys/kernel/keys/root_maxkeys): new value (1000000) <= old value (1000000).
Aug 7 14:35:21 fastpc31 htcondor: Not changing ROOT_MAXKEYS_BYTES (/proc/sys/kernel/keys/root_maxbytes): new value (25000000) <= old value (25000000).
Aug 7 14:35:21 fastpc31 htcondor: Changing FS_CACHE_DIRTY_BYTES (/proc/sys/vm/dirty_bytes) from 100000000 to 100000000
Aug 7 14:35:21 fastpc31 htcondor: Not changing MAX_RECEIVE_BUFFER (/proc/sys/net/core/rmem_max): new value (10485760) <= old value (10485760).
Version: 8.6.10 (I installed the same version as the rest of the cluster, as I had this same issue with 8.7.9)
id condor
uid=990(condor) gid=985(condor) groups=985(condor)
(same on all machines in the cluster)
On the SL6.9 node (kernel 2.6.32-754.2.1.el6.x86_64):
service condor status
condor_master (pid 2004) is running....
cat /var/log/condor/MasterLog
08/07/18 14:35:21 ******************************************************
08/07/18 14:35:21 ** condor_master (CONDOR_MASTER) STARTING UP
08/07/18 14:35:21 ** /usr/sbin/condor_master
08/07/18 14:35:21 ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
08/07/18 14:35:21 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
08/07/18 14:35:21 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
08/07/18 14:35:21 ** $CondorPlatform: x86_64_RedHat6 $
08/07/18 14:35:21 ** PID = 2004
08/07/18 14:35:21 ** Log last touched 8/7 14:22:46
08/07/18 14:35:21 ******************************************************
08/07/18 14:35:21 Using config source: /etc/condor/condor_config
08/07/18 14:35:21 Using local config sources:
08/07/18 14:35:21 /etc/condor/config.d/condor_execute_fastpc31.config
08/07/18 14:35:21 /etc/condor/condor_config.local
08/07/18 14:35:21 config Macros = 74, Sorted = 74, StringBytes = 1852, TablesBytes = 2712
08/07/18 14:35:21 CLASSAD_CACHING is OFF
08/07/18 14:35:21 Daemon Log is logging: D_ALWAYS D_ERROR
08/07/18 14:35:22 SharedPortEndpoint: waiting for connections to named socket 2004_7849
08/07/18 14:35:22 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
08/07/18 14:35:22 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
08/07/18 14:35:22 DaemonCore: private command socket at <10.0.0.31:0?sock=2004_7849>
08/07/18 14:35:22 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
08/07/18 14:35:22 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1520893905)
08/07/18 14:35:22 Collector port not defined, will use default: 9618
08/07/18 14:35:22 Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 2037
08/07/18 14:35:22 Waiting for /var/lock/condor/shared_port_ad to appear.
08/07/18 14:35:23 Found /var/lock/condor/shared_port_ad.
08/07/18 14:35:23 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 2038
08/07/18 14:35:33 Setting ready state 'Ready' for STARTD
This all looks OK to me. Does anyone have suggestions?
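For reference, a minimal sketch that pulls out the daemons the master reports starting, assuming the MasterLog line format shown above. The function name started_daemons is made up for illustration; the default path is the one used in this post.

```shell
# started_daemons: print the executable path of each daemon that the
# condor_master log reports as started.
# Assumption: lines look like
#   ... Started DaemonCore process "/path/to/daemon", pid and pgroup = N
started_daemons() {
    # $1 is the log file; fall back to the path used in this post.
    sed -n 's/.*Started DaemonCore process "\([^"]*\)".*/\1/p' \
        "${1:-/var/log/condor/MasterLog}"
}
```

For the log above, this lists /usr/libexec/condor/condor_shared_port and /usr/sbin/condor_startd, one per line. It only shows that the master launched them, not that the startd is being matched by the negotiator.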
Thanks,
Ben
----------------------------------------------------------------------------
Ben Pietras <ben.pietras@xxxxxxxxxxxxxxxx>
School of Physics and Astronomy, Tel. 0161-275-4231
The University of Manchester, Fax. 0161-275-5509
Manchester, M13 9PL.
----------------------------------------------------------------------------