Re: [HTCondor-users] failing to start job does not get held
- Date: Tue, 10 Jun 2025 15:39:50 +0000
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] failing to start job does not get held
Some of the errors you're seeing in the StartLog look like a race condition that we've recently been working to improve, between the shadow and starter telling the startd that that job execution attempt is done. The startd is trying to tell the starter to exit, but the starter has already done so. You can ignore the messages, as they don't affect how the job is handled at the AP.
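As a rough illustration of how one might spot this pattern in a StartLog, here is a minimal Python sketch. It is not an HTCondor tool: the function name `find_benign_hold_races` is hypothetical, and the two regular expressions are assumptions copied from the message wording in the log excerpt quoted below, which may differ between HTCondor versions. It reports starter pids for which a hold attempt failed while the starter itself exited with status 0.

```python
import re

# Message wording taken from the StartLog excerpt quoted below (assumption:
# other HTCondor versions may phrase these lines differently).
HOLD_FAILED = re.compile(r"Failed to hold job \(starter pid (\d+)\)")
STARTER_EXIT = re.compile(r"Starter pid (\d+) exited with status (\d+)")


def find_benign_hold_races(lines):
    """Return sorted starter pids whose hold attempt failed but which exited with status 0."""
    hold_failed = set()
    clean_exit = set()
    for line in lines:
        m = HOLD_FAILED.search(line)
        if m:
            hold_failed.add(int(m.group(1)))
            continue
        m = STARTER_EXIT.search(line)
        if m and int(m.group(2)) == 0:
            clean_exit.add(int(m.group(1)))
    # Only pids that show both messages match the pattern described above.
    return sorted(hold_failed & clean_exit)


sample = [
    "05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334), so killing it.",
    "05/27/25 12:04:33 Starter pid 4096334 exited with status 0",
]
print(find_benign_hold_races(sample))
```

A match only suggests that the starter had already exited when the startd tried to contact it. It does not prove that a given message is harmless.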
I know from recent experience that HTCondor doesn't put a job on hold due to a bad container image. I suspect we do that because we don't know whether the issue is with the image or the singularity/apptainer installation.
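Since the job is not held in this case, a user could check that the image is readable before submitting. The following Python sketch is only an illustration under stated assumptions: the helper name `image_is_readable` is hypothetical, the image is assumed to be a local file or sandbox directory, and the check must run as the same user, and on a filesystem view equivalent to the one the job will see. It says nothing about whether the singularity/apptainer installation itself works.

```python
import os
import tempfile


def image_is_readable(path):
    """Return True if the current user can read the image at `path`.

    Hypothetical pre-submit check; HTCondor does not call this.
    """
    if os.path.isdir(path):
        # Sandbox-style (directory) image: need to list and enter it.
        return os.access(path, os.R_OK | os.X_OK)
    try:
        # Actually opening the file catches permission and missing-file errors.
        with open(path, "rb") as f:
            f.read(1)
        return True
    except OSError:
        return False


# Usage with a throwaway file standing in for a real image.
with tempfile.NamedTemporaryFile(suffix=".sif") as tmp:
    print(image_is_readable(tmp.name))
print(image_is_readable("/nonexistent/image.sif"))
```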
- Jaime
> On May 27, 2025, at 7:10 AM, Dennis van Dok <dennisvd@xxxxxxxxx> wrote:
>
> Hi,
>
> while trying to figure out why a user's job was not starting, we came across a whole pile
> of 'interesting' messages in the logs. No clue as to how seriously to take these, some of
> them may be 'normal' although the message suggests there is an error.
>
> We're running 24.6.0 on Debian 12.
>
> The StartLog shows the following. The job fails almost immediately
> because $USER has a singularity image that cannot be read.
>
> I think what should happen is that the job goes into HELD state but that does not
> succeed. The SharedPortLog has one line that corresponds to this as well.
>
> Any clues what might be wrong?
>
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: New dSlot of type 1 allocated
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Cpus: 1.00, Memory: 16384, Swap: 0.00%, Disk: 0.00%, GPUs: 1
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1: Request accepted.
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Remote owner is xxxx@xxxxxxxxx
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: State change: claiming protocol successful
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Changing state: Owner -> Claimed
> /var/log/condor/StartLog-05/27/25 12:04:31 slot1_66: Got activate_claim request from shadow (145.107.7.246)
> /var/log/condor/StartLog:05/27/25 12:04:31 slot1_66: Remote job ID is 2030611.0
> /var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: Got universe "VANILLA" (5) from request classad
> /var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: State change: claim-activation protocol successful
> /var/log/condor/StartLog-05/27/25 12:04:32 slot1_66: Changing activity: Idle -> Busy
> /var/log/condor/StartLog-05/27/25 12:04:33 Failed to write ToE tag to .job.ad file (13): Permission denied
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Called deactivate_claim()
> /var/log/condor/StartLog-05/27/25 12:04:33 condor_read(): Socket closed abnormally when trying to read 5 bytes from starter at <145.107.4.166:9618> in non-blocking mode, errno=104 Connection reset by peer
> /var/log/condor/StartLog-05/27/25 12:04:33 SECMAN: Failed to read resume session response classad from server.
> /var/log/condor/StartLog-05/27/25 12:04:33 Failed to send STARTER_HOLD_JOB to starter at <145.107.4.166:9618>: SECMAN:2007:Failed to read resume session response classad from server.
> /var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334), so killing it.
> /var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Could not read job ClassAd update from starter, assuming final_update
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: received RELEASE_CLAIM command
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state and activity: Claimed/Busy -> Preempting/Vacating
> /var/log/condor/StartLog-05/27/25 12:04:33 condor_read(): Socket closed abnormally when trying to read 5 bytes from starter at <145.107.4.166:9618> in non-blocking mode, errno=104 Connection reset by peer
> /var/log/condor/StartLog-05/27/25 12:04:33 SECMAN: Failed to read resume session response classad from server.
> /var/log/condor/StartLog-05/27/25 12:04:33 Failed to send STARTER_HOLD_JOB to starter at <145.107.4.166:9618>: SECMAN:2007:Failed to read resume session response classad from server.
> /var/log/condor/StartLog:05/27/25 12:04:33 slot1_66[2030611.0]: Failed to hold job (starter pid 4096334). got final update
> /var/log/condor/StartLog-05/27/25 12:04:33 Starter pid 4096334 exited with status 0
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: starter exited : ./src/condor_startd.V6/Resource.cpp(948) Preempting/Vacating
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: No preempting claim, returning to owner
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state and activity: Preempting/Vacating -> Owner/Idle
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: State change: IS_OWNER is false
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state: Owner -> Unclaimed
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Changing state: Unclaimed -> Delete
> /var/log/condor/StartLog-05/27/25 12:04:33 slot1_66: Slot slot1_66 no longer needed, deleting
>
>
> 05/27/25 12:04:33 SharedPortServer: slot1_66_1017352_4493_8495 as requested by STARTD <145.107.4.166:9618?addrs=145.107.4.166-9618+[2a07-8500-120-e070--a6]-9618&alias=wn-pijl-005.nikhef.nl&noUDP&sock=startd_4777_3401> on <145.107.4.166:40133>: primary (<cookie>/slot1_66_1017352_4493_8495): Connection refused (111); alt (/var/lock/condor/daemon_sock/slot1_66_1017352_4493_8495): Connection refused (111)