Mailing List Archives
Re: [HTCondor-users] Ghost machine list from `condor_status -any`
- Date: Tue, 22 Aug 2023 19:13:38 +0000
- From: Jaime Frey <jfrey@xxxxxxxxxxx>
- Subject: Re: [HTCondor-users] Ghost machine list from `condor_status -any`
I suspect the HTCondor daemons in these SLURM jobs are being killed with insufficient time to inform the central manager that they are going away. With a standard configuration, they should disappear from condor_status after 15 minutes.
If they still appear in condor_status after 15 minutes, that suggests that the daemons are still running on the SLURM nodes (i.e. SLURM failed to kill them).
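One way to give the daemons that time is to have the batch script forward a termination signal to condor_master and wait for it to exit. The following is a minimal sketch under stated assumptions, not a tested recipe for any particular site. The helper name `run_and_forward_term` is made up for illustration. `--signal=B:TERM@300` asks SLURM to send SIGTERM to the batch shell roughly 300 seconds before the time limit, and the 300 is an arbitrary choice. The sketch also assumes that condor_master treats SIGTERM as a request for a graceful shutdown, during which the daemons should tell the collector they are leaving. How signals are delivered on scancel or preemption depends on the SLURM configuration and is not covered here.

```shell
#!/bin/bash
#SBATCH -t 72:00:00
#SBATCH --exclusive
#SBATCH --signal=B:TERM@300   # assumption: 300 s of warning is enough for shutdown

# Hypothetical helper: run a command in the background, forward SIGTERM to it,
# and wait until it has really exited before returning its exit status.
run_and_forward_term() {
    "$@" &
    child_pid=$!
    got_term=0
    # The PID is expanded now, so the handler does not depend on variable scope.
    trap "got_term=1; kill -TERM $child_pid 2>/dev/null" TERM
    wait "$child_pid"      # returns early if the trapped signal arrives
    status=$?
    if [ "$got_term" -eq 1 ]; then
        # The first wait was interrupted; wait again for the real exit status.
        wait "$child_pid"
        rc=$?
        # 127 means the child was already collected; keep the earlier status.
        if [ "$rc" -ne 127 ]; then status=$rc; fi
    fi
    trap - TERM
    return "$status"
}

# Run condor_master in the foreground of the helper (-f keeps it from daemonizing).
if command -v condor_master >/dev/null 2>&1; then
    run_and_forward_term condor_master -f
else
    echo "condor_master not found on PATH" >&2
fi
```

The background-plus-wait form matters because bash runs a trap handler only after the current foreground command returns. With a plain `condor_master -f` as the last line, the handler would not run until the master had already exited.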
- Jaime
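
For ads that are already stale, the condor_advertise man page describes removing a machine ad by sending the collector an invalidation ad with three attributes. Below is a small sketch that generates such an ad for one slot. `make_invalidate_ad` is a hypothetical helper, and the slot name is taken from the quoted listing. Whether `Name` should hold the slot name or the machine name may depend on the HTCondor version and slot configuration, so treat that as an assumption to verify. The DaemonMaster ads would need a separate invalidation, which is not shown.

```shell
# Hypothetical helper: print a minimal invalidation ClassAd for one startd ad.
# $1 is the Name of the ad to remove, e.g. a slot name from condor_status.
make_invalidate_ad() {
    printf 'MyType = "Query"\n'
    printf 'TargetType = "Machine"\n'
    printf 'Name = "%s"\n' "$1"
}

# Print the ad for one of the stale slots; redirect to a file to use it.
make_invalidate_ad "slot1@n0013"

# Then, from a host that is allowed to advertise to the collector:
#   condor_advertise INVALIDATE_STARTD_ADS <file containing the ad above>
```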
> On Aug 22, 2023, at 1:18 PM, Seung-Jin Sul <ssul@xxxxxxx> wrote:
>
> Hi,
>
> We are using HTCondor with a SLURM backend, and recently we've seen deallocated SLURM nodes still showing up in the output of `condor_status -any`, as in the listing below.
>
>
> ```
> $ condor_status -any
> MyType        TargetType  Name
>
> Collector     None        My Pool - ln010@ln010
> Submitter     None        condor_pool@svc
> Scheduler     None        svc@ln010
> DaemonMaster  None        svc@ln010
> Negotiator    None        svc@ln010
> Machine       Job         slot1@n0013
> DaemonMaster  None        svc@n0013
> Machine       Job         slot1@n0004
> DaemonMaster  None        svc@n0004
> Accounting    none        <none>
> Accounting    none        condor_pool@svc
> ```
>
>
> `n0013` and `n0004` were previously allocated and used as HTCondor worker nodes, but they have already been deallocated.
> We know the `n0013` and `n0004` entries will be cleared eventually, but we are wondering whether there is a better way to handle this case, such as cleaning up the list more promptly.
>
> We start an HTCondor worker with a SLURM script like the one below.
>
> ```
> #!/bin/bash
> #SBATCH -t 72:00:00
> #SBATCH --exclusive
>
> # Run condor_master in the foreground (-f) so the SLURM job stays alive
> condor_master -f
> ```
>
> Any comments would be appreciated.
>
>
> Best,
> Seung Sul