
Re: [HTCondor-users] Fixed, don't know why - Re: Ubuntu 24.04, HTCondor 24.6.1: Dozens of condor_procd's



Hello Steffen,

Later this week, I'll attempt to reproduce the issue and investigate further.

...Tim

On 4/15/25 01:55, Steffen Grunewald wrote:
Good morning,

On Mon, 2025-04-14 at 11:10:04 -0500, HTCondor Users Mailinglist wrote:
On 4/11/25 07:13, Steffen Grunewald wrote:
Good afternoon,

I'm restarting an inhomogeneous pool (Debian 12, Ubuntu 24.04) after upgrading
to HTCondor 24.6.1 (from a 23.x version), and _only on the Ubuntu machines_
I'm getting dozens^Whundreds of condor_procd processes,
    condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666

Since this doesn't happen on Debian, and the difference of the output of
`condor_config_val -dump -expand` looks benign: What could cause this new
behaviour, and how do I get rid of it?

Hi Steffen:

An immediate fix might be to try to run without the procd or shared port,
with the following config knobs

USE_PROCD = false

USE_SHARED_PORT = false

I very much dislike this approach ;)

But, I fear that there is some underlying system problem that would trip
over something else pretty quickly.

Is it possible that the condor_master is not starting up as root? To check
this, if it is still running, try

condor_status -master <name of machine> -af RealUid

"ps" shows

condor    724190  0.0  0.0  23536  8192 ?        Ss   Apr11   0:13 /usr/sbin/condor_master -f
root      724779  0.2  0.0  36940 28672 ?        S    Apr11  11:13 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666
condor    724780  0.0  0.0  19608  8192 ?        Ss   Apr11   0:07 condor_shared_port
condor    724841  0.0  0.0  32288 14336 ?        Ss   Apr11   1:13 condor_schedd
condor    724842  0.0  0.0  19812  8192 ?        Ss   Apr11   0:04 condor_credd
root      724903  0.0  0.0 141216 30720 ?        Ssl  Apr11   0:00 /usr/bin/python3 /usr/sbin/condor_credmon_vault
condor    724964  0.0  0.0 357176 196608 ?       Ss   Apr11   1:15 condor_startd
root      725087  0.0  0.0  68484 27336 ?        S    Apr11   0:44 /usr/bin/python3 /usr/sbin/condor_credmon_vault

and the command above returns 0 as expected (otherwise there'd be no procd running
as "root", right?)
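
The "ps" listing shows condor_master under the "condor" user even though RealUid is 0, because the user column of that listing reflects the effective UID. A minimal sketch for seeing both values side by side on Linux, assuming /proc is mounted; the helper name uids_of is made up for illustration and is not an HTCondor tool:

```shell
# Print "<real uid> <effective uid>" for a given PID.
# Reads the "Uid:" line of /proc/<pid>/status, whose fields are
# real, effective, saved and filesystem UID (Linux only).
uids_of() {
    awk '/^Uid:/ { print $2, $3 }' "/proc/$1/status"
}

# Demonstration on the current shell. For the master one would pass
# its PID instead, e.g. 724190 from the listing above.
uids_of $$
```

A first field of 0 with a non-zero second field would be consistent with a daemon that was started as root and has switched its effective UID.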

The fix I had to apply was to give two /var/*/condor directories 0777 permissions
(instead of 0775, as before and as on the Debian machines), something I dislike almost
as much as the first suggestion - but this one at least would allow me to tighten
things again step by step.

Currently I have

# ls -la /var/lock/condor
total 4
drwxrwxrwx 2 condor condor 100 Apr 15 08:47 .
drwxrwxrwt 5 root   root   120 Apr  8 09:34 ..
-rw-r--r-- 1 condor condor   0 Apr 14 20:37 EventLogLock
-rw-r--r-- 1 condor condor   0 Apr 15 06:11 InstanceLock
-rw-r--r-- 1 condor condor 405 Apr 15 08:47 shared_port_ad

and

# ls -la /var/run/condor
total 16
drwxrwxrwx  3 condor condor  180 Apr 14 16:39 .
drwxr-xr-x 42 root   root   1600 Apr 14 23:10 ..
drwx------  2 root   root     40 Apr 11 14:14 cache
-rw-r--r--  1 condor condor  317 Apr 11 14:11 MasterAddress
prw-------  1 condor root      0 Apr 15 08:47 procd_pipe
prw-------  1 condor root      0 Apr 15 08:47 procd_pipe.watchdog
-rw-r--r--  1 condor condor  317 Apr 11 14:11 ScheddAddress
-rw-r--r--  1 condor condor  317 Apr 11 14:11 StartdAddress
-rw-------  1 condor condor  396 Apr 11 14:11 StartdClaimId.slot1

and I wouldn't expect any negative impact from removing the "other"
write permission... perhaps some 1777 mode would be a tad safer.
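
A small sketch for comparing the directory modes between the Ubuntu and Debian machines, assuming GNU coreutils "stat" (as shipped on both). The function name show_mode is made up for illustration, and the chmod line is left commented out because whether 1777 is sufficient for the procd has not been verified here:

```shell
# Print "<octal mode> <owner>:<group> <path>" for one directory.
show_mode() {
    stat -c '%a %U:%G %n' "$1"
}

# Report the two directories from the listings above, if present.
for d in /var/lock/condor /var/run/condor; do
    if [ -d "$d" ]; then
        show_mode "$d"
    fi
done

# Untested idea from the text above: world-writable plus sticky bit,
# so users cannot remove or rename each other's files.
# chmod 1777 /var/lock/condor /var/run/condor
```

Running the same loop on a Debian node should print 775 where the Ubuntu nodes currently print 777, which makes it easy to confirm when the modes have been tuned back.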

Btw, changing the permissions allowed the procd to succeed and the
other daemons to start almost immediately. (I'm afraid the logs are
long gone by now, and I'm not too eager to break things by trying to
reproduce the issue; we had a week of downtime and need to finish
the work that had to be postponed.)

Thanks so far,
  Steffen


--
Tim Theisen (he, him, his)
Release Manager
Center for High Throughput Computing
Department of Computer Sciences
University of Wisconsin - Madison
4261 Computer Sciences and Statistics
1210 W Dayton St
Madison, WI 53706-1685
+1 608 265 5736