
Re: [HTCondor-users] quick start not working



On Thu, Aug 14, 2025 at 12:02 PM Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> > Looks like schedd is not starting.
>
>     Have you checked the master log on the submit node to see what's
> going on?

Here is the master log from the submitter node:

******************************************************
** condor_master (CONDOR_MASTER) STARTING UP
** /usr/sbin/condor_master
** SubsystemInfo: name=MASTER type=MASTER(1) class=DAEMON(1)
** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
** $CondorVersion: 24.0.10 2025-07-26 BuildID: 822770 PackageID: 24.0.10-1+ubu24 GitSHA: fdfac957 $
** $CondorPlatform: X86_64-Ubuntu_24.04 $
** PID = 284083 RealUID = 0
** Log last touched 8/14 13:14:22
******************************************************
Using config source: /etc/condor/condor_config
Using local config sources:
/etc/condor/config.d/00-security
/etc/condor/config.d/01-submit.config
/etc/condor/config.d/10-stash-plugin.conf
/etc/condor/condor_config.local
config Macros = 72, Sorted = 72, StringBytes = 2054, TablesBytes = 2656
CLASSAD_CACHING is OFF
Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
SharedPortEndpoint: waiting for connections to named socket master_284083_e330
SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
DaemonCore: private command socket at <{IP address of submitter node}:0?alias={hostname of submitter node}&sock=master_284083_e330>
Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1753498863)
Starting shared port with port: 9618
Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 284125
Waiting for /var/lock/condor/shared_port_ad to appear.
Found /var/lock/condor/shared_port_ad.
Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 284126
Daemons::StartAllDaemons all daemons were started
condor_read(): Socket closed abnormally when trying to read 5 bytes from collector {FQDN of central manager node} in non-blocking mode, errno=104 Connection reset by peer
SECMAN: no classad from server, failing
ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Failed to start non-blocking update to <{IP address of central manager node}:9618>.
Preen pid is 284259
Preen (pid 284259) exited with status 0
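
For anyone skimming the log above, the lines that stand out can be pulled out with a small filter. This is only a rough sketch: the keyword pattern and the function name `suspicious_lines` are my own assumptions about what is worth looking at, not anything HTCondor provides.

```python
import re

# Hypothetical keyword list: words that usually mark a problem in a daemon log.
# This is a guess for illustration, not an official HTCondor error taxonomy.
PATTERN = re.compile(r"\b(error|fail(ed|ing)?|abnormally)\b", re.IGNORECASE)


def suspicious_lines(log_text):
    """Return the log lines that contain one of the problem keywords."""
    return [line for line in log_text.splitlines() if PATTERN.search(line)]


if __name__ == "__main__":
    import sys

    # Usage: python filter_log.py < MasterLog
    for line in suspicious_lines(sys.stdin.read()):
        print(line)
```

Fed the master log above, this keeps the SECMAN lines and the "Connection reset by peer" line while dropping the routine startup messages.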


> > condor_status works, but condor_q
> > fails on all 3 nodes, but with different outputs:
>
>     `condor_q` isn't expected to work on anywhere other than the
> submit node without further configuration.
>
>     Since the submit node is definitionally where the schedd is
> running, it's more than a little alarming that `condor_q` is trying to
> find a schedd running on the central manager instead.

On machines running condor 9.12.0-1.1, which work fine, running condor_q on a submit node shows:

-- Schedd: {FQDN of central manager} : <{IP of central manager}:9618?... @ 08/14/25 15:19:09

> What does
> `condor_status -any` report?

`MyType         TargetType  Name

Collector      None        My Pool - {FQDN of central manager}@{hostname of central manager}.a
DaemonMaster   None        {FQDN of central manager}
Negotiator     None        {FQDN of central manager}
StartD         None        {hostname of execute node}
DaemonMaster   None        {hostname of execute node}
Machine        Job         slot1@{hostname of execute node}
Scheduler      None        {hostname of submit node}
DaemonMaster   None        {hostname of submit node}
Accounting     None        <none>`

This is my first time trying to get 24.0.10-1 working.