Re: [HTCondor-users] quick start not working

Mailing List Archives Authenticated access	UW Madison Computer Sciences Department Computer Systems Lab

On Thu, Aug 14, 2025 at 12:21 PM Larry Martell <larry.martell@xxxxxxxxx> wrote:

On Thu, Aug 14, 2025 at 12:02 PM Todd L Miller via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> > Looks like schedd is not starting.
>
>     Have you checked the master log on the submit node to see what's
> going on?

Here is the master log from the submitter node:

******************************************************
** condor_master (CONDOR_MASTER) STARTING UP
** /usr/sbin/condor_master
** SubsystemInfo: name=MASTER type=MASTER(1) class=DAEMON(1)
** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
** $CondorVersion: 24.0.10 2025-07-26 BuildID: 822770 PackageID: 24.0.10-1+ubu24 GitSHA: fdfac957 $
** $CondorPlatform: X86_64-Ubuntu_24.04 $
** PID = 284083 RealUID = 0
** Log last touched 8/14 13:14:22
******************************************************
Using config source: /etc/condor/condor_config
Using local config sources:
/etc/condor/config.d/00-security
/etc/condor/config.d/01-submit.config
/etc/condor/config.d/10-stash-plugin.conf
/etc/condor/condor_config.local
config Macros = 72, Sorted = 72, StringBytes = 2054, TablesBytes = 2656
CLASSAD_CACHING is OFF
Daemon Log is logging: D_ALWAYS D_ERROR D_STATUS
SharedPortEndpoint: waiting for connections to named socket master_284083_e330
SharedPortEndpoint: failed to open /var/lock/condor/shared_port_ad: No such file or directory
SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
DaemonCore: private command socket at <{IP address of submitter node}:0?alias={hostname of submitter node}&sock=master_284083_e330>
Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1753498863)
Starting shared port with port: 9618
Started DaemonCore process "/usr/libexec/condor/condor_shared_port", pid and pgroup = 284125
Waiting for /var/lock/condor/shared_port_ad to appear.
Found /var/lock/condor/shared_port_ad.
Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 284126
Daemons::StartAllDaemons all daemons were started
condor_read(): Socket closed abnormally when trying to read 5 bytes from collector {FQDN of central manager node} in non-blocking mode, errno=104 Connection reset by peer
SECMAN: no classad from server, failing
ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Failed to start non-blocking update to <{IP address of central manager node}:9618>.
Preen pid is 284259
Preen (pid 284259) exited with status 0
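
A minimal sketch, assuming the log text has been read into a string, of how the failure lines in a master log like the one above could be picked out with Python. The function name `find_problem_lines` and the keyword list are illustrative assumptions, not part of HTCondor; the sample lines are copied from the log above.

```python
import re

# Hypothetical marker list: just the failure words that appear in the log above.
PROBLEM_PATTERN = re.compile(r"ERROR|failed|failing|abnormally|errno=\d+", re.IGNORECASE)


def find_problem_lines(log_text):
    """Return the log lines that contain any of the problem markers."""
    return [line for line in log_text.splitlines() if PROBLEM_PATTERN.search(line)]


sample = """\
Started DaemonCore process "/usr/sbin/condor_schedd", pid and pgroup = 284126
Daemons::StartAllDaemons all daemons were started
SECMAN: no classad from server, failing
ERROR: SECMAN:2011:Connection closed during command authorization. Probably due to an unknown command.
Preen (pid 284259) exited with status 0
"""

for line in find_problem_lines(sample):
    print(line)
```

A filter this crude would also flag the early "SharedPortEndpoint: failed to open" line, which in this log is followed by the file being found a few lines later, so the matches still need to be read in context.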

> > condor_status works, but condor_q
> > fails on all 3 nodes, but with different outputs:
>
>     `condor_q` isn't expected to work anywhere other than the
> submit node without further configuration.
>
>     Since the submit node is definitionally where the schedd is
> running, it's more than a little alarming that `condor_q` is trying to
> find a schedd running on the central manager instead.

On machines running condor 9.12.0-1.1, which work fine, running condor_q on a submit node shows:

-- Schedd: {FQDN of central manager} : <{IP of central manager}:9618?... @ 08/14/25 15:19:09

> What does
> `condor_status -any` report?

`MyType          TargetType      Name

Collector       None            My Pool - {FQDN of central manager}@hostname of central manager}.a
DaemonMaster    None            {FQDN of central manager}
Negotiator      None            {FQDN of central manager}
StartD          None            {hostname of execute node}
DaemonMaster    None            {hostname of execute node}
Machine         Job             slot1@{hostname of execute node}
Scheduler       None            {hostname of submit node}
DaemonMaster    None            {hostname of submit node}
Accounting      None            <none>`
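
As a rough sketch, assuming output in the three-column layout shown above, the `Scheduler` rows (the ads that tell the pool where a schedd is running) could be pulled out with a few lines of Python. The function name `names_by_type` and the `example.org` hostnames are placeholders invented for illustration.

```python
import re


def names_by_type(status_text, my_type):
    """Collect the Name column for rows whose MyType column equals my_type."""
    names = []
    for line in status_text.splitlines():
        # Columns are separated by runs of two or more spaces;
        # the Name column itself may contain single spaces.
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=2)
        if len(parts) == 3 and parts[0] == my_type:
            names.append(parts[2])
    return names


# Placeholder hostnames, not taken from the pool discussed in this thread.
sample = """\
MyType          TargetType      Name

Collector       None            My Pool - cm.example.org
Scheduler       None            submit.example.org
DaemonMaster    None            submit.example.org
"""

print(names_by_type(sample, "Scheduler"))
```

An empty result for `Scheduler` would mean no schedd has advertised itself to the collector; in the listing above, one Scheduler ad is present, named for the submit node.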

This is my first time trying to get 24.0.10-1 working.

The job I am trying to run itself submits condor jobs. A Python script running on S submits a job, which runs on E; that job's script then tries to submit another job, and that fails because it cannot talk to the schedd.

Could this be an issue with the tokens? Should the token file have the user that runs the condor daemons or the user that runs the jobs?
