Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?
- Date: Tue, 10 Aug 2010 01:29:46 -0700 (PDT)
- From: Rob <spamrefuse@xxxxxxxxx>
- Subject: Re: [Condor-users] VMware job "trapped" in a deadlock! What to do?
On Mon, 9 Aug 2010 12:07 Jaime Frey wrote:
>
> On Aug 6, 2010, at 9:55 AM, Rob wrote:
>
>> The problem I encounter is:
>>
>> 1. The job's log file tells me that a VM job has been evicted.
>> 2. However, condor keeps telling me that this VM job is still running.
>> 3. And this condition persists for many, many hours, probably for ever!
>>
>> How can I get out of this apparent deadlock of the job and
>> tell Condor to reschedule the job from the last checkpoint?
>
>
> Here's what I've learned from the logs you emailed to me:
>
> The job was indeed evicted when user log indicates, and returned to
> idle status. 35 minutes later, it was matched to the same machine and
> Condor tried to restart it there. During file transfer, the execute
> machine's SUSPEND expression started evaluating to True. The startd
> failed to send a message to the starter, which was too busy
> transferring the job's files. The starter ended up exiting, but for
> some unknown reason, the shadow still had an open connection to the
> execute machine. That connection should close when the starter exits.
> So the shadow waited for the starter to retry the file transfer. Only
> when the execute machine was rebooted did the shadow notice the
> connection close.
>
> You can reduce the chance of this happening in the future by setting
> STARTD_SENDS_ALIVES=True in your config file.
>
Should I set this on the Master, on the pool PC, or both?
Thanks for your help!
Rob.