
Re: [HTCondor-users] Ubuntu 24.04, HTCondor 24.6.1: Dozens of condor_procd's



On 4/11/25 07:13, Steffen Grunewald wrote:
Good afternoon,

I'm restarting an inhomogeneous pool (Debian 12, Ubuntu 24.04) after upgrading
to HTCondor 24.6.1 (from a 23.x version), and _only on the Ubuntu machines_
I'm getting dozens^Whundreds of condor_procd processes,
   condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666

Since this doesn't happen on Debian, and the difference of the output of
`condor_config_val -dump -expand` looks benign: What could cause this new
behaviour, and how do I get rid of it?


Hi Steffen:

An immediate fix might be to run without the procd or shared port, using the following config knobs:

USE_PROCD = false

USE_SHARED_PORT = false

But I fear there is some underlying system problem that would trip up something else pretty quickly.

Is it possible that the condor_master is not starting up as root? To check this, if it is still running, try

condor_status -master <name of machine> -af RealUid
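
If the master cannot be queried through condor_status, the same question (was it started as root?) can be answered locally on Linux by reading the Uid: line of /proc/<pid>/status, which lists the real, effective, saved, and filesystem UIDs in that order. The following is only a minimal sketch: the helper names parse_uids and uids_of_pid are made up for illustration, and the PID of the condor_master has to be supplied by hand. A real UID of 0 would mean the process was started as root.

```python
import sys
from pathlib import Path


def parse_uids(status_text):
    """Return (real, effective, saved, filesystem) UIDs from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Uid:"):
            # The line looks like "Uid:\t<real>\t<effective>\t<saved>\t<fs>".
            return tuple(int(field) for field in line.split()[1:])
    raise ValueError("no Uid: line found")


def uids_of_pid(pid):
    """Read the UIDs of a live process (Linux only)."""
    return parse_uids(Path(f"/proc/{pid}/status").read_text())


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 check_uid.py <pid of condor_master>
    if len(sys.argv) == 2:
        real, effective, saved, fs = uids_of_pid(int(sys.argv[1]))
        print(f"real={real} effective={effective} saved={saved} fs={fs}")
```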


-greg


`systemctl status condor` consequently tells me it would restart daemons
over and over,
       Status: "Problems: SHARED_PORT=RESTART in 10s, SCHEDD=RESTART in 10s, CREDD=RESTART in 10s, CREDMON_OAUTH=RESTART in 10s, STARTD=RESTART in 10s"

The MasterLog seems to have some hints; I just don't understand where to look:

2025-04-11T14:01:26+0200 Using config source: /etc/condor/condor_config
2025-04-11T14:01:26+0200 Using local config sources:
2025-04-11T14:01:26+0200    /etc/condor/condor_config_local|
2025-04-11T14:01:26+0200 config Macros = 352, Sorted = 352, StringBytes = 12933, TablesBytes = 12720
2025-04-11T14:01:26+0200 CLASSAD_CACHING is OFF
2025-04-11T14:01:26+0200 Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
2025-04-11T14:01:27+0200 SharedPortEndpoint: waiting for connections to named socket master_710997_61dd
2025-04-11T14:01:27+0200 SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
2025-04-11T14:01:27+0200 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
2025-04-11T14:01:27+0200 DaemonCore: private command socket at <10.150.100.220:0?alias=saraswati.aei.mpg.de&sock=master_710997_61dd>
2025-04-11T14:01:27+0200 DaemonCore: ERROR: Can't open address file /var/run/condor/MasterAddress.new
2025-04-11T14:01:27+0200 DaemonCore: ERROR: Can't open address file /var/run/condor/MasterAddress.new
2025-04-11T14:01:27+0200 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
2025-04-11T14:01:27+0200 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1742505766)
2025-04-11T14:01:27+0200 Starting shared port with port: 9618
2025-04-11T14:01:28+0200 mkfifo of /var/run/condor/procd_pipe.710997.0 error: Permission denied (13)
2025-04-11T14:01:28+0200 failed to initialize named pipe at /var/run/condor/procd_pipe.710997.0
2025-04-11T14:01:28+0200 LocalClient: error initializing NamedPipeReader
2025-04-11T14:01:28+0200 ProcFamilyClient: failed to start connection with ProcD
2025-04-11T14:01:28+0200 register_subfamily: ProcD communication error
2025-04-11T14:01:28+0200 Create_Process: error registering family for pid 711587
2025-04-11T14:01:28+0200 Create_Process(/usr/libexec/condor/condor_shared_port): child failed because it failed to register itself with the ProcD
2025-04-11T14:01:28+0200 ERROR: Create_Process failed trying to start /usr/libexec/condor/condor_shared_port
2025-04-11T14:01:28+0200 restarting /usr/libexec/condor/condor_shared_port in 10 seconds
etc.
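
In the excerpt above, the first line that carries an operating-system error number is the mkfifo failure; the ProcD and Create_Process messages after it follow from that. A small sketch for picking such a line out of a longer MasterLog is shown here. The regular expression only covers the "<operation> error: <message> (<errno>)" shape seen in this excerpt, which is an assumption about the log format rather than a documented one, and the function name first_errno_failure is made up for illustration.

```python
import re

# Matches "<timestamp> <operation> error: <message> (<errno>)" as seen in the excerpt.
ERRNO_RE = re.compile(
    r"^(?P<ts>\S+) (?P<what>.+?) error: (?P<msg>.+) \((?P<errno>\d+)\)$"
)


def first_errno_failure(lines):
    """Return (timestamp, operation, message, errno) for the first matching line, or None."""
    for line in lines:
        m = ERRNO_RE.match(line.strip())
        if m:
            return m["ts"], m["what"], m["msg"], int(m["errno"])
    return None


# Three lines copied from the MasterLog excerpt in this message.
SAMPLE = [
    "2025-04-11T14:01:27+0200 Starting shared port with port: 9618",
    "2025-04-11T14:01:28+0200 mkfifo of /var/run/condor/procd_pipe.710997.0 error: Permission denied (13)",
    "2025-04-11T14:01:28+0200 failed to initialize named pipe at /var/run/condor/procd_pipe.710997.0",
]

if __name__ == "__main__":
    print(first_errno_failure(SAMPLE))
```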

# ls -la /var/lock/condor
total 0
drwxrwxr-x 2 condor condor  60 Apr 11 13:38 .
drwxrwxrwt 5 root   root   120 Apr  8 09:34 ..
-rw-r--r-- 1 condor condor   0 Apr 11 14:01 InstanceLock
# ls -la /var/run/condor
total 0
drwxrwxr-x  2 condor condor   80 Apr 11 14:09 .
drwxr-xr-x 42 root   root   1580 Apr 11 14:08 ..
prw-------  1 condor root      0 Apr 11 14:09 procd_pipe
prw-------  1 condor root      0 Apr 11 14:09 procd_pipe.watchdog
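
The listing shows /var/run/condor owned by condor with mode drwxrwxr-x, yet the log reports mkfifo failing there with "Permission denied (13)". One way to narrow this down is to repeat just that step by hand. The sketch below tries to create and remove a FIFO in a given directory and reports the errno name on failure. The function name try_mkfifo and the file name probe_pipe are made up for illustration, and the check only reflects plain file permissions for whichever user runs it (for example the condor user), not anything else about the environment the daemons run in.

```python
import errno
import os
import tempfile


def try_mkfifo(directory, name="probe_pipe"):
    """Try to create and remove a FIFO in `directory`.

    Returns None on success, or the errno name (e.g. "EACCES") on failure.
    """
    path = os.path.join(directory, name)
    try:
        os.mkfifo(path, 0o600)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.unlink(path)  # clean up the probe so nothing is left behind
    return None


if __name__ == "__main__":
    # Demonstration on a scratch directory; pass /var/run/condor instead
    # when running on an affected machine.
    with tempfile.TemporaryDirectory() as scratch:
        print(scratch, try_mkfifo(scratch))
```

A result of "EACCES" corresponds to the error 13 in the MasterLog.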


Any idea is welcome...

Best, Steffen