Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
- Date: Mon, 28 Jun 2021 12:23:23 +0300 (MSK)
- From: "Dmitry A. Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
- Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
Dear Jaime Frey,
One more thing, in the log I see only four "Got RELEASE_CLAIM from" messages instead of five.
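A quick way to count them (the SchedLog path below is the usual default and only an assumption here; condor_config_val SCHEDD_LOG prints the actual location):
--- SHELL (sketch) ---
# Count the RELEASE_CLAIM commands the schedd logged for this run.
# One per dynamic slot is expected (5 for this job); only 4 appear.
grep -c "Got RELEASE_CLAIM from" /var/log/condor/SchedLog
--- SHELL (sketch) ---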
Dmitry.
----- Original Message -----
From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "Jaime Frey" <jfrey@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>, "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Monday, June 28, 2021 11:14:21 AM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
Dear Jaime Frey,
> Do you see this problem for every job or just some of the time?
Just some of the time, but in my configuration the problem is reproducible almost 100% of the time. I do the following to reproduce the issue:
I restart the HTCondor cluster before each experiment. The cluster's size (CPU/memory), with its two executors, can only run one job (with 5 tasks inside) at a time. I start the first job and wait until all slots are released after it finishes; right after that, I run the same job again. Just to remind you, I use dynamic slots.
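For reference, here is a minimal sketch of that kind of setup; the slot layout, resource requests, and executable are illustrative assumptions, not my exact files:
--- CONFIG (sketch, execute node) ---
# Release every claim after a single job, and carve dynamic slots
# out of one partitionable slot that owns the whole machine.
CLAIM_WORKLIFE = 0
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
--- CONFIG (sketch, execute node) ---
--- SUBMIT (sketch) ---
# A parallel-universe job with 5 tasks (assumes the dedicated scheduler
# is already configured), resubmitted as soon as the previous run's
# dynamic slots have been released.
universe       = parallel
executable     = /bin/sleep
arguments      = 60
machine_count  = 5
request_cpus   = 1
request_memory = 128
queue
--- SUBMIT (sketch) ---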
> It looks like you're running parallel universe jobs. Do you know if the same issue happens with vanilla universe jobs?
I have never tried it, because in my case a job can't be done by one machine. You can ask me to run any experiments if that helps.
> When the startd kills the claim after the job completes, it sends a RELEASE_CLAIM command to the schedd, which should prevent the schedd from attempting to reuse it.
Here is the full log file:
- https://github.com/herclogon/htcondor/files/6704882/one-success-run-goodresearch-softtimeout.log (https://github.com/herclogon/htcondor/issues/2)
And yes, I do see the line you mentioned in the log:
--- LOG ---
2021-06-22T17:12:55.203059519Z condor_shadow[335]: ParallelShadow::shutDown, exitReason: 100
2021-06-22T17:12:55.203062945Z condor_shadow[335]: condor_read(): Socket closed when trying to read 21 bytes from startd at <10.42.0.139:41293>
2021-06-22T17:12:55.203066572Z condor_shadow[335]: IO: EOF reading packet header
2021-06-22T17:12:55.203069804Z condor_shadow[335]: ParallelShadow::shutDown, exitReason: 100
2021-06-22T17:12:55.203073090Z condor_schedd[129]: Got RELEASE_CLAIM from <10.42.0.139:34663>
2021-06-22T17:12:55.203076705Z condor_schedd[129]: Deleted match rec for slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
2021-06-22T17:12:55.203080267Z condor_shadow[335]: Inside RemoteResource::updateFromStarter()
2021-06-22T17:12:55.204729750Z condor_collector[52]: Got INVALIDATE_STARTD_ADS
2021-06-22T17:12:55.204747252Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:55.204753022Z condor_collector[52]: (Invalidated 1 ads)
2021-06-22T17:12:55.204757222Z condor_collector[52]: #011#011**** Removed(1) ad(s): "< slot1_3@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx , 10.42.0.139 >"
2021-06-22T17:12:55.204761256Z condor_collector[52]: (Invalidated 1 ads)
2021-06-22T17:12:55.204771895Z condor_collector[52]: In OfflineCollectorPlugin::update ( 13 )
2021-06-22T17:12:55.204693098Z condor_shadow[335]: Inside RemoteResource::updateFromStarter()
--- LOG ---
Thanks in advance,
Dmitry.
----- Original Message -----
From: "Jaime Frey" <jfrey@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Sent: Friday, June 25, 2021 7:59:44 PM
Subject: Re: [HTCondor-users] HTCondor can't execute the job with error: Error: can't find resource with ClaimId
Do you see this problem for every job or just some of the time?
It looks like you're running parallel universe jobs. Do you know if the same issue happens with vanilla universe jobs?
When the startd kills the claim after the job completes, it sends a RELEASE_CLAIM command to the schedd, which should prevent the schedd from attempting to reuse it. Do you see a message like this in the schedd log:
06/25/21 10:50:36.627 Got RELEASE_CLAIM from <192.168.4.40:56731>
- Jaime
> On Jun 23, 2021, at 2:04 PM, Dmitry A. Golubkov via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
>
> Dear all,
>
> It looks like an issue in HTCondor. If you set CLAIM_WORKLIFE = 0 and use partitionable slots, job execution periodically hangs.
>
>
> So, if you set CLAIM_WORKLIFE = 0 and use partitionable slots, everything goes well at first. HTCondor creates slots, claims them, and starts the job. After execution, the startd shuts down the claim due to CLAIM_WORKLIFE:
...
> As you can see, the next job tries to use the claim "a7c85e11eaae4e09b2f4173a6d293e41bd457ebe", which must already be dead after the first run; as a result we get the error "Error: can't find resource with ClaimId". After this error, the startd changes the state of the job from RUN -> IDLE and postpones the launch until the next attempt. After some time (10-15 minutes) this "wrong" claim disappears from "somewhere" and the job can run successfully. Just to check, I changed the DEACTIVATE_CLAIM handling in the startd source code to do nothing when the shadow sends that command, and all my jobs then ran quickly without any of the problems described above.
>
> I am not expert enough to solve the issue myself. Could I open an issue somewhere? Or maybe someone has a patch? Or any ideas on how to fix this correctly? Any help would be appreciated.
>
>
> Thanks in advance,
> Dmitry
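To confirm the settings that trigger the behaviour described above, one can check an execute node with standard HTCondor tools (the attribute list below is illustrative):
--- SHELL (sketch) ---
# Ask the running startd which CLAIM_WORKLIFE value it is using.
condor_config_val -startd CLAIM_WORKLIFE
# List slots and whether they are partitionable/dynamic.
condor_status -af:h Name PartitionableSlot DynamicSlot State Activity
--- SHELL (sketch) ---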
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/