Re: [HTCondor-users] Globus error 129 with large files
- Date: Fri, 11 Mar 2016 22:33:50 +0100
- From: Emir Imamagic <eimamagi@xxxxxxx>
- Subject: Re: [HTCondor-users] Globus error 129 with large files
On 11.3.2016. 21:39, Brian Bockelman wrote:
Is it possible that the threshold between working and not working is either 2.1GB (about 2^31) or 4.2GB (about 2^32)? That would help narrow down the potential sources of error.
I performed the following tests:
The 2^31 test worked:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=2048
The 2^32 test also worked:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=4096
In both cases the generated file was successfully transferred back to the
Condor-G submit machine.
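For reference, dd with bs=1M writes count * 2^20 bytes, so the tests above land exactly on the power-of-two boundaries Brian asked about; a quick shell sketch of the arithmetic:
#!/bin/sh
# bs=1M count=N writes N * 2^20 bytes, so each test lands exactly on a power of two
echo $((2048 * 1024 * 1024))   # 2147483648 = 2^31
echo $((4096 * 1024 * 1024))   # 4294967296 = 2^32
echo $((8192 * 1024 * 1024))   # 8589934592 = 2^33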
It seems to me that the problems start at 2^33:
#!/bin/sh
dd if=/dev/zero of=./testmonkey bs=1M count=8192
The job ended up successful, but only 719 MB was transferred back.
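For anyone trying to reproduce this, a minimal sketch of the size check on the submit side; the path (./testmonkey in the job's directory) is just an assumption, adjust to wherever the file comes back:
#!/bin/sh
# Compare the size of the returned file against the expected 2^33 bytes.
# The path ./testmonkey on the submit machine is an assumption.
expected=8589934592
actual=$(stat -c %s ./testmonkey)
echo "expected=$expected actual=$actual missing=$((expected - actual)) bytes"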
With 2^34 things get more complicated. The job ended and the transfer back
started, but then gahp_server on the UI side started devouring memory until
the OOM killer killed it:
Mar 11 22:31:25 ui2 kernel: Out of memory: Kill process 1150031
(gahp_server) score 911 or sacrifice child
Mar 11 22:31:25 ui2 kernel: Killed process 1150031, UID 500,
(gahp_server) total-vm:9569828kB, anon-rss:7495604kB, file-rss:544kB
The interesting bit is that the job did not end up in the H state; instead
Condor revived gahp_server and the OOM killer killed it again. This continued
until I deleted the job.
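In case it helps with debugging, a rough sketch for watching gahp_server memory on the submit/UI host while the transfer runs (assumes procps ps and the standard CentOS 6 syslog location):
#!/bin/sh
# Log gahp_server memory usage every 10 seconds while the transfer runs
# (rss/vsz are reported in kB by ps).
while true; do
    ps -C gahp_server -o pid=,rss=,vsz= | \
        awk -v ts="$(date '+%F %T')" '{print ts, "pid=" $1, "rss_kB=" $2, "vsz_kB=" $3}'
    sleep 10
done
Afterwards the kills show up in the syslog: grep -i 'out of memory' /var/log/messages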
I failed to mention that we're running CentOS 6 on both the CE and the submit machine.
Hope this helps.
--
Emir Imamagic
SRCE - University of Zagreb University Computing Centre, www.srce.unizg.hr
Emir.Imamagic@xxxxxxx, tel: +385 1 616 5809, fax: +385 1 616 5559