[HTCondor-users] Fixed, don't know why - Re: Ubuntu 24.04, HTCondor 24.6.1: Dozens of condor_procd's
- Date: Tue, 15 Apr 2025 08:55:50 +0200
- From: Steffen Grunewald <steffen.grunewald@xxxxxxxxxx>
- Subject: [HTCondor-users] Fixed, don't know why - Re: Ubuntu 24.04, HTCondor 24.6.1: Dozens of condor_procd's
Good morning,
On Mon, 2025-04-14 at 11:10:04 -0500, HTCondor Users Mailinglist wrote:
> On 4/11/25 07:13, Steffen Grunewald wrote:
> > Good afternoon,
> >
> > I'm restarting an inhomogeneous pool (Debian 12, Ubuntu 24.04) after upgrading
> > to HTCondor 24.6.1 (from a 23.x version), and _only on the Ubuntu machines_
> > I'm getting dozens^Whundreds of condor_procd processes,
> > condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666
> >
> > Since this doesn't happen on Debian, and the difference of the output of
> > `condor_config_val -dump -expand` looks benign: What could cause this new
> > behaviour, and how do I get rid of it?
>
>
> Hi Steffen:
>
> An immediate fix might be to try to run without the procd or shared port,
> with the following config knobs
>
> USE_PROCD = false
>
> USE_SHARED_PORT = false
I very much dislike this approach ;)
> But, I fear that there is some underlying system problem that would trip
> over something else pretty quickly.
>
> Is it possible that the condor_master is not starting up as root? To check
> this, if it is still running, try
>
> condor_status -master <name of machine> -af RealUid
"ps" shows
condor 724190 0.0 0.0 23536 8192 ? Ss Apr11 0:13 /usr/sbin/condor_master -f
root 724779 0.2 0.0 36940 28672 ? S Apr11 11:13 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666
condor 724780 0.0 0.0 19608 8192 ? Ss Apr11 0:07 condor_shared_port
condor 724841 0.0 0.0 32288 14336 ? Ss Apr11 1:13 condor_schedd
condor 724842 0.0 0.0 19812 8192 ? Ss Apr11 0:04 condor_credd
root 724903 0.0 0.0 141216 30720 ? Ssl Apr11 0:00 /usr/bin/python3 /usr/sbin/condor_credmon_vault
condor 724964 0.0 0.0 357176 196608 ? Ss Apr11 1:15 condor_startd
root 725087 0.0 0.0 68484 27336 ? S Apr11 0:44 /usr/bin/python3 /usr/sbin/condor_credmon_vault
and the command above returns 0 as expected (otherwise there'd be no procd running
as "root", right?)
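To spot a pile-up of condor_procd processes like the one in the original
report, the "ps" output can be tallied per user and command. This is only a
minimal sketch, assuming "ps aux"-style columns as in the listing above; the
names count_by_user_and_command and SAMPLE_PS are hypothetical, and the sample
lines are copied from that listing.

```python
import os
from collections import Counter

# Sample lines copied from the "ps" listing in this message.
SAMPLE_PS = """
condor 724190 0.0 0.0 23536 8192 ? Ss Apr11 0:13 /usr/sbin/condor_master -f
root 724779 0.2 0.0 36940 28672 ? S Apr11 11:13 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 10000000 -S 60 -C 666
condor 724780 0.0 0.0 19608 8192 ? Ss Apr11 0:07 condor_shared_port
condor 724841 0.0 0.0 32288 14336 ? Ss Apr11 1:13 condor_schedd
"""


def count_by_user_and_command(ps_output: str) -> Counter:
    """Count processes per (user, executable name) in `ps aux`-style text."""
    counts = Counter()
    for line in ps_output.splitlines():
        # Assumed columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue  # blank or malformed line
        user = fields[0]
        # Use the basename of the first word of the command line.
        executable = os.path.basename(fields[10].split()[0])
        counts[(user, executable)] += 1
    return counts


counts = count_by_user_and_command(SAMPLE_PS)
for (user, executable), n in sorted(counts.items()):
    print(f"{n:4d}  {user:8s} {executable}")
```

On a healthy node there should be a single condor_procd entry; a count in the
dozens or hundreds would match the symptom described at the start of the thread.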
The fix I had to apply was to give two directories, /var/lock/condor and
/var/run/condor, 0777 permissions (instead of 0775 as before, and as still set
on the Debian machines), something I dislike almost as much as the first
suggestion - but this one at least would allow me to slowly tune things back.
Currently I have
# ls -la /var/lock/condor
total 4
drwxrwxrwx 2 condor condor 100 Apr 15 08:47 .
drwxrwxrwt 5 root root 120 Apr 8 09:34 ..
-rw-r--r-- 1 condor condor 0 Apr 14 20:37 EventLogLock
-rw-r--r-- 1 condor condor 0 Apr 15 06:11 InstanceLock
-rw-r--r-- 1 condor condor 405 Apr 15 08:47 shared_port_ad
and
# ls -la /var/run/condor
total 16
drwxrwxrwx 3 condor condor 180 Apr 14 16:39 .
drwxr-xr-x 42 root root 1600 Apr 14 23:10 ..
drwx------ 2 root root 40 Apr 11 14:14 cache
-rw-r--r-- 1 condor condor 317 Apr 11 14:11 MasterAddress
prw------- 1 condor root 0 Apr 15 08:47 procd_pipe
prw------- 1 condor root 0 Apr 15 08:47 procd_pipe.watchdog
-rw-r--r-- 1 condor condor 317 Apr 11 14:11 ScheddAddress
-rw-r--r-- 1 condor condor 317 Apr 11 14:11 StartdAddress
-rw------- 1 condor condor 396 Apr 11 14:11 StartdClaimId.slot1
and I wouldn't expect any negative impact from removing the "other"
write permission again... perhaps mode 1777 would be a tad safer.
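For reference, the three modes mentioned here differ only in the "other" write
bit and the sticky bit. With the sticky bit set on a world-writable directory,
entries can only be removed or renamed by their owner, the directory owner, or
root. A small sketch using Python's standard stat module to decode the modes;
the helper name describe_dir_mode is hypothetical.

```python
import stat


def describe_dir_mode(mode: int) -> dict:
    """Decode a directory permission mode such as 0o775, 0o777 or 0o1777."""
    return {
        # Same notation that "ls -l" prints, e.g. drwxrwxrwt
        "symbolic": stat.filemode(stat.S_IFDIR | mode),
        # True if users outside owner and group may create entries
        "other_write": bool(mode & stat.S_IWOTH),
        # True if the sticky bit restricts deletion and renaming
        "sticky": bool(mode & stat.S_ISVTX),
    }


for mode in (0o775, 0o777, 0o1777):
    print(f"{mode:04o}", describe_dir_mode(mode))
```

To inspect a real directory, stat.S_IMODE(os.stat(path).st_mode) yields the
numeric mode that can be passed to the helper.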
Btw, changing the permissions almost immediately allowed the
procd to succeed and the other daemons to start. (I'm afraid the
logs are long gone by now, and I'm not too eager to break things
by trying to reproduce the issue: we had a week of downtime and
need to finish the work that had to be postponed.)
Thanks so far,
Steffen
--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~