Re: [HTCondor-users] quick start not working



On Thu, Aug 14, 2025 at 12:21 PM Larry Martell <larry.martell@xxxxxxxxx> wrote:
On Thu, Aug 14, 2025 at 12:02 PM Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> > Looks like schedd is not starting.
>
>     Have you checked the master log on the submit node to see what's
> going on?

Here is the master log from the submitter node:

******************************************************
** condor_master (CONDOR_MASTER) STARTING UP
** /usr/sbin/condor_master
** SubsystemInfo: name=MASTER type=MASTER(1) class=DAEMON(1)
** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
** $CondorVersion: 24.0.10 2025-07-26 BuildID: 822770 PackageID: 24.0.10-1+ubu24 GitSHA: fdfac957 $
** $CondorPlatform: X86_64-Ubuntu_24.04 $
** PID = 284083 RealUID = 0
** Log last touched 8/14 13:14:22
******************************************************
Using config source: /etc/condor/condor_config
Using local config sources:
/etc/condor/config.d/00-security
/etc/condor/config.d/01-submit.config
/etc/condor/config.d/10-stash-plugin.conf
/etc/condor/condor_config.local
config Macros = 72, Sorted = 72, StringBytes = 2054, TablesBytes = 2656
CLASSAD_CACHING is OFF
Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
SharedPortEndpoint: waiting for connections to named socket master_284083_e330
SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
DaemonCore: private command socket at <{IP address of submitter node}:0?alias={hostname of submitter node}&sock=master_284083_e330>
Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1753498863)
Starting shared port with port: 9618
Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 284125
Waiting for /var/lock/condor/shared_port_ad to appear.
Found /var/lock/condor/shared_port_ad.
Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 284126
Daemons::StartAllDaemons all daemons were started
condor_read(): Socket closed abnormally when trying to read 5 bytes from collector {FQDN of central manager node} in non-blocking mode, errno=104 Connection reset by peer
SECMAN: no classad from server, failing
ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Failed to start non-blocking update to <{IP address of central manager node}:9618>.
Preen pid is 284259
Preen (pid 284259) exited with status 0
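
To make the failing lines in a log like the one above easier to spot, here is a minimal sketch in Python that filters log text for trouble keywords. The function name and the keyword list are my own illustrative choices, not anything HTCondor defines:

```python
import re

# Hypothetical keyword list; adjust it to the messages you care about.
# The word boundaries keep the "D_ERROR" debug-level name from matching.
SUSPECT = re.compile(r"\b(ERROR|failed|failing|abnormally)\b", re.IGNORECASE)

def suspect_lines(log_text):
    """Return the lines of log_text that contain a suspect keyword."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]
```

Run over the log text above, this should leave only the SharedPortEndpoint, condor_read, SECMAN, and "Failed to start non-blocking update" lines.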


> > condor_status works, but condor_q
> > fails on all 3 nodes, but with different outputs:
>
>     `condor_q` isn't expected to work on anywhere other than the
> submit node without further configuration.
>
>     Since the submit node is definitionally where the schedd is
> running, it's more than a little alarming that `condor_q` is trying to
> find a schedd running on the central manager instead.

On machines running condor 9.12.0-1.1, which work fine, running condor_q on a submit node shows:

-- Schedd: {FQDN of central manager} : <{IP of central manager}:9618?... @ 08/14/25 15:19:09

> What does
> `condor_status -any` report?

`MyType        TargetType  Name

Collector     None        My Pool - {FQDN of central manager}@hostname of central manager}.a
DaemonMaster  None        {FQDN of central manager}
Negotiator    None        {FQDN of central manager}
StartD        None        {hostname of execute node}
DaemonMaster  None        {hostname of execute node}
Machine       Job         slot1@{hostname of execute node}
Scheduler     None        {hostname of submit node}
DaemonMaster  None        {hostname of submit node}
Accounting    None        <none>`
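
To check quickly that each expected daemon has an ad in the collector, the table can be tallied by its first column. A minimal sketch in Python, assuming the whitespace-separated layout shown above; the function name is hypothetical:

```python
from collections import Counter

def count_ad_types(status_text):
    """Count ads by MyType (first column) in `condor_status -any` style text."""
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "MyType":
            continue
        counts[fields[0]] += 1
    return counts
```

For the table above, I would expect one Scheduler ad and three DaemonMaster ads, which at least suggests the schedd on the submit node is advertising itself.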

This is my first time trying to get 24.0.10-1 working.

The job I am trying to run itself submits condor jobs. A Python script running on the submit node (S) submits a job, which runs on the execute node (E). That job then tries to submit a job of its own, and that submission fails because it cannot talk to the schedd.

Could this be an issue with the tokens? Should the token file have the user that runs the condor daemons or the user that runs the jobs?
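
One way to see which identity a token actually carries is to decode its payload. A minimal sketch in Python, assuming the token is a standard three-part JWT (which, as far as I understand, is the format HTCondor IDTOKENS use). The function name is hypothetical, and it only reads the claims without verifying the signature:

```python
import base64
import json

def token_claims(token):
    """Decode the (unverified) payload of a JWT-style token into a dict."""
    _header_b64, payload_b64, _signature_b64 = token.strip().split(".")
    # base64url payloads drop their padding; restore it before decoding.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

If the assumption holds, the `sub` claim in the result should show the identity the token was issued to.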