Re: [HTCondor-users] Starter Log not getting updated with jobs (nominally) started on the slot
- Date: Tue, 26 Mar 2024 03:30:40 +0000
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Starter Log not getting updated with jobs (nominally) started on the slot
These logs suggest that the AP is successfully claiming the EP but is not able to start any jobs on it. Jobs matched to the EP should still be in the AP's queue.
Look for lines in the StartLog after the "Changing state: Owner -> Claimed". For a normal job start, you would see these lines:
03/25/24 22:26:18 slot1_1: Got activate_claim request from shadow (192.168.4.135)
03/25/24 22:26:18 slot1_1: Remote job ID is 2728.0
03/25/24 22:26:18 slot1_1: Got universe "VANILLA" (5) from request classad
03/25/24 22:26:18 slot1_1: State change: claim-activation protocol successful
03/25/24 22:26:18 slot1_1: Changing activity: Idle -> Busy
A failure to activate the claim (i.e. start a job) should show some different entries.
If that's not informative, then look at the ShadowLog on the AP.
- Jaime
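
As a rough illustration of the check described above, the following sketch counts, per slot, how many "Owner -> Claimed" transitions in a StartLog were later followed by an "Idle -> Busy" activity change. The two message strings are copied from the log excerpts in this thread; the line format in the regular expression is inferred from those same excerpts, and the function name summarize_claims is hypothetical. Other HTCondor versions or log settings may format lines differently, so treat this as a starting point rather than a definitive parser.

```python
import re
from collections import defaultdict

# Assumed line format, based on the excerpts in this thread:
#   "MM/DD/YY HH:MM:SS slotN_M: message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (slot[\w.]+): (.*)$")

# Message texts copied from the StartLog lines quoted in this thread.
CLAIMED = "Changing state: Owner -> Claimed"
ACTIVATED = "Changing activity: Idle -> Busy"


def summarize_claims(lines):
    """Count, per slot, the claims seen and how many were followed by an activation.

    `lines` is any iterable of StartLog lines (a list or an open file).
    """
    stats = defaultdict(lambda: {"claims": 0, "activated": 0})
    pending = set()  # slots that were claimed but have not yet gone Idle -> Busy
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like per-slot entries
        _, slot, msg = m.groups()
        if msg == CLAIMED:
            stats[slot]["claims"] += 1
            pending.add(slot)
        elif msg == ACTIVATED and slot in pending:
            stats[slot]["activated"] += 1
            pending.discard(slot)
    return dict(stats)
```

Passing an open file works, for example summarize_claims(open("/var/log/condor/StartLog")). A slot with many claims and zero activations matches the symptom reported in this thread; the StartLog entries that follow each such claim are then the ones to read by hand.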
> On Mar 21, 2024, at 8:59 AM, Thomas Hartmann <thomas.hartmann@xxxxxxx> wrote:
>
> Hi all,
>
> and another question/observation - we have noticed an odd behaviour on one of our EPs [1]. The node seems to have collapsed into a black hole three weeks ago.
> I.e., all StarterLog.slot* activity stopped around March 1st [2]. However, the startd has been accepting and "starting" jobs all along [3], sending the jobs to their doom.
>
> I have not yet found a smoking gun in the master or startd log (unfortunately, our log replication does not reach back to the beginning of March).
> Has somebody maybe observed something similar?
>
> Cheers,
> Thomas
>
>
> [1]
> condor-9.0.8-1.el7.x86_64
> condor-boinc-7.16.16-1.el7.x86_64
> condor-classads-9.0.8-1.el7.x86_64
> condor-externals-9.0.8-1.el7.x86_64
> condor-procd-9.0.8-1.el7.x86_64
> htcondor-ce-client-5.1.3-1.el7.noarch
> python2-condor-9.0.8-1.el7.x86_64
> python3-condor-9.0.8-1.el7.x86_64
>
>
> [2]
> [root@batch0653 ~]# ls -alltr /var/log/condor/StarterLog* | tail -n 5
> -rw-r--r-- 1 25411 1000 4992974 Mar 1 22:51 /var/log/condor/StarterLog.slot1_6
> -rw-r--r-- 1 25411 1000 1928326 Mar 1 23:36 /var/log/condor/StarterLog.slot1_3
> -rw-r--r-- 1 25411 1000 5323270 Mar 2 04:47 /var/log/condor/StarterLog.slot1_8
> -rw-r--r-- 1 25411 1000 5730429 Mar 2 05:56 /var/log/condor/StarterLog.slot1_7
> -rw-r--r-- 1 25411 1000 3578995 Mar 2 07:28 /var/log/condor/StarterLog.slot1_10
>
> [root@batch0653 condor]# stat StarterLog.slot1_3
> File: ‘StarterLog.slot1_3’
> Size: 1928326 Blocks: 3776 IO Block: 4096 regular file
> Device: 806h/2054d Inode: 524483 Links: 1
> Access: (0644/-rw-r--r--) Uid: (25411/ UNKNOWN) Gid: ( 1000/ UNKNOWN)
> Access: 2024-03-21 14:05:38.397796356 +0100
> Modify: 2024-03-01 23:36:56.630725665 +0100
> Change: 2024-03-01 23:36:56.630725665 +0100
> Birth: -
>
> [3]
> [root@batch0653 condor]# grep "slot1_3" StartLog | grep "Owner -> Claimed" | head -n 3
> 03/21/24 14:36:47 slot1_3: Changing state: Owner -> Claimed
> 03/21/24 14:37:13 slot1_3: Changing state: Owner -> Claimed
> 03/21/24 14:37:39 slot1_3: Changing state: Owner -> Claimed
> [root@batch0653 condor]# grep "slot1_3" StartLog | grep "Owner -> Claimed" | wc -l
> 45
>