
[HTCondor-users] failing to start job does not get held



Hi,

while trying to figure out why a user's job was not starting, we came across a whole pile
of 'interesting' messages in the logs. We have no idea how seriously to take these; some of
them may be 'normal' even though the message suggests there is an error.

We're running HTCondor 24.6.0 on Debian 12.

The StartLog shows the following. The job fails almost immediately because
$USER has a singularity image that cannot be read.

I think what should happen is that the job goes into the HELD state, but the attempt
to hold it fails. The SharedPortLog has one line that corresponds to this as well.

Any clues what might be wrong?

/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: New dSlot of type 1 allocated
/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66:    Cpus: 1.00, Memory: 16384, Swap: 0.00%, Disk: 0.00%, GPUs: 1
/var/log/condor/StartLog-05/27/25 12:04:31 slot1: Request accepted.
/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Remote owner is xxxx@xxxxxxxxx
/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: State change: claiming protocol successful
/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Changing state: Owner -> Claimed
/var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Got activate_claim request from shadow (145.107.7.246)
/var/log/condor/StartLog:05/27/25 12:04:31 slot1_66: Remote job ID is 2030611.0
/var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: Got universe "VANILLA" (5) from request classad
/var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: State change: claim-activation protocol successful
/var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: Changing activity: Idle -> Busy
/var/log/condor/StartLog-05/27/25 12:04:33 Failed to write ToE tag to .job.ad file (13): Permission denied
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Called deactivate_claim()
/var/log/condor/StartLog-05/27/25 12:04:33 condor_read(): Socket closed abnormally when trying to read 5 bytes from starter at <145.107.4.166:9618> in non-blocking mode, errno=104 Connection reset by peer
/var/log/condor/StartLog-05/27/25 12:04:33 SECMAN: Failed to read resume session response classad from server.
/var/log/condor/StartLog-05/27/25 12:04:33 Failed to send STARTER_HOLD_JOB to starter at <145.107.4.166:9618>: SECMAN:2007:Failed to read resume session response classad from server.
/var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334), so killing it.
/var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Could not read job ClassAd update from starter, assuming final_update
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: received RELEASE_CLAIM command
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state and activity: Claimed/Busy -> Preempting/Vacating
/var/log/condor/StartLog-05/27/25 12:04:33 condor_read(): Socket closed abnormally when trying to read 5 bytes from starter at <145.107.4.166:9618> in non-blocking mode, errno=104 Connection reset by peer
/var/log/condor/StartLog-05/27/25 12:04:33 SECMAN: Failed to read resume session response classad from server.
/var/log/condor/StartLog-05/27/25 12:04:33 Failed to send STARTER_HOLD_JOB to starter at <145.107.4.166:9618>: SECMAN:2007:Failed to read resume session response classad from server.
/var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334). got final update
/var/log/condor/StartLog-05/27/25 12:04:33 Starter pid 4096334 exited with status 0
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: starter exited : ./src/condor_startd.V6/Resource.cpp(948) Preempting/Vacating
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: No preempting claim, returning to owner
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state and activity: Preempting/Vacating -> Owner/Idle
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: IS_OWNER is false
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state: Owner -> Unclaimed
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state: Unclaimed -> Delete
/var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Slot slot1_66 no longer needed, deleting
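
To see how often this happens, the "Failed to hold job" lines can be pulled out of the
StartLog with a few lines of Python. This is only a rough sketch: the regular expression
is based solely on the two lines quoted above, and the function name find_hold_failures
is my own invention, not anything that ships with HTCondor.

```python
import re

# Pattern derived from the two "Failed to hold job" lines in the excerpt above,
# e.g. "05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334), ..."
# Assumption: other HTCondor versions may format this message differently.
HOLD_FAILURE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<slot>slot\S+?)\[(?P<job>\d+\.\d+)\]: "
    r"Failed to hold job \(starter pid (?P<pid>\d+)\)"
)


def find_hold_failures(lines):
    """Return (timestamp, slot, job_id, starter_pid) for each matching line."""
    failures = []
    for line in lines:
        match = HOLD_FAILURE.search(line)
        if match:
            timestamp = match["date"] + " " + match["time"]
            failures.append(
                (timestamp, match["slot"], match["job"], int(match["pid"]))
            )
    return failures
```

It can be fed an open log file, for example find_hold_failures(open("/var/log/condor/StartLog")),
and it tolerates the grep prefix shown in the excerpt because the pattern is searched for
anywhere in the line.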


05/27/25 12:04:33 SharedPortServer: 	 slot1_66_1017352_4493_8495 as requested by STARTD <145.107.4.166:9618?addrs=145.107.4.166-9618+[2a07-8500-120-e070--a6]-9618&alias=wn-pijl-005.nikhef.nl&noUDP&sock=startd_4777_3401> on <145.107.4.166:40133>: primary (<cookie>/slot1_66_1017352_4493_8495): Connection refused (111); alt (/var/lock/condor/daemon_sock/slot1_66_1017352_4493_8495): Connection refused (111)