Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1
- Date: Tue, 10 Nov 2015 17:03:37 +0000
- From: "Feldt, Andrew N." <afeldt@xxxxxx>
- Subject: Re: [HTCondor-users] Restart from checkpoint failing for HTCondor 8.4.1
> On Nov 5, 2015, at 2:22 PM, Feldt, Andrew N. <afeldt@xxxxxx> wrote:
>
>>
>> On Nov 5, 2015, at 2:07 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>
>> On 11/4/2015 1:15 PM, Feldt, Andrew N. wrote:
>>
>>>>
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Opened "/var/lib/condor/spool/8/0/cluster8.proc0.subproc0" via file stream
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read headers OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[0](DATA) OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read SegMap[1](STACK) OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Found a DATA block, increasing heap from 0x887000 to 0x986000
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: About to overwrite 1789952 bytes starting at 0x7d1000(DATA)
>>>> 11/02/15 11:18:33 (8.0) (2889688):Reaped child status - pid 2889690 exited with status 0
>>>> 11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes uninitialized stack frame ***: condor_exec.8.0 terminated
>>>>
>>
>> I think "longjmp causes uninitialized stack frame" is coming from GCC's fortify source compiler options.
>>
>> So you are running universe=standard jobs on HTCondor v8.4.1 on RHEL 6.7. Some questions -
>>
>> - Is this failure on restart happening at your site for ALL standard universe jobs? Or just consistently for certain jobs? Or only occasionally? If the latter, ~ how many jobs get stuck on restart - 5%, 50%, 90%, or?
>>
>> - Where did you get your HTCondor binaries from? Options include RPM downloaded from htcondor.org, or RPMs from EPEL, self compiled from source, or?
>>
>> - Could you send along the output from condor_version ?
>>
>> thanks
>> Todd
>>
>
> Todd,
>
> Yes, universe=standard jobs on HTCondor v8.4.1 on RHEL 6.7
>
> 1 - This happens for ALL standard universe jobs which get vacated.
> 2 - The HTCondor binaries are from the repo at http://www.cs.wisc.edu/condor/yum/stable/rhel6
> 3 -
> $CondorVersion: 8.4.1 Oct 26 2015 BuildID: 346648 $
> $CondorPlatform: X86_64-RedHat_6.7 $
>
> Note that I have now turned off all configuration for vacating jobs and no longer run the condor_kbdd so that the faculty member running parallel jobs can have them run (they run for 3-4 months).
>
> I can make this happen by submitting a job which I have compiled with the current condor_compile and forcing it to run on an unused system and then vacating it. It dies instead of moving to another system.
>
> Andy
Todd,

We have now reverted to condor-8.2.10-345812 for our production HTCondor pool. This is allowing our jobs to properly vacate as needed. (This is from the htcondor-previous repo.) I will be interested in future updates to the 8.4 series which may address the checkpoint-restart problem.

Andy
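
For anyone checking whether their own pool shows the same symptom, here is a minimal sketch that scans log lines for the "longjmp causes uninitialized stack frame" message quoted earlier in this thread. It assumes lines shaped like that excerpt (date, time, a parenthesised job id that looks like cluster.proc, a parenthesised pid, then the message). The names `LINE_RE`, `MARKER` and `find_failed_restarts` are made up for illustration and are not part of HTCondor.

```python
import re

# Assumed line shape, taken from the excerpt quoted in this thread:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid):message
LINE_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\((?P<job>\d+\.\d+)\) \((?P<pid>\d+)\):(?P<msg>.*)"
)

# The abort message reported in the thread.
MARKER = "longjmp causes uninitialized stack frame"


def find_failed_restarts(lines):
    """Return (job, timestamp) pairs for lines carrying the abort message."""
    hits = []
    for line in lines:
        # Strip mail quote markers (">" and spaces) so quoted excerpts parse too.
        cleaned = line.lstrip("> ").rstrip()
        m = LINE_RE.match(cleaned)
        if m and MARKER in m.group("msg"):
            hits.append((m.group("job"), f"{m.group('date')} {m.group('time')}"))
    return hits


# Two lines copied from the quoted excerpt, quote markers included.
sample = [
    ">>>> 11/02/15 11:18:33 (8.0) (2889688):Read: Read all SegMaps OK",
    ">>>> 11/02/15 11:18:33 (8.0) (2889688):Read: *** longjmp causes "
    "uninitialized stack frame ***: condor_exec.8.0 terminated",
]
print(find_failed_restarts(sample))
```

Feeding it the lines of a real log file instead of `sample` would list each job that hit the abort, which may help answer the "all jobs or only some" question above. Lines that do not fit the assumed shape are skipped silently.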