[HTCondor-users] spontaneous reboots after enabling cgroups
- Date: Wed, 26 Jun 2013 20:31:09 -0400
- From: Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx>
- Subject: [HTCondor-users] spontaneous reboots after enabling cgroups
I have a pool of machines running CentOS 6.4, kernel 2.6.32-358, and
HTCondor 7.9.4.
Today, to stop jobs that underestimate their memory usage from pushing
the machines into heavy swapping and slowing them down, I enabled
cgroups and set
CGROUP_MEMORY_LIMIT_POLICY = soft
RESERVED_MEMORY = 1024
The idea was to make sure there was always at least 1 GB of physical
memory available for system and interactive processes. This worked as
intended, and the thrashing problems went away, but now machines are
rebooting at random, with no error messages in the system logs.
On the one machine where I have kdump enabled, the error below was in
vmcore-dmesg.txt from the crash dump.
<2>kernel BUG at kernel/cgroup_freezer.c:247!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/block/dm-0/uevent
<4>CPU 1
<4>Modules linked in: fuse nfsd exportfs gfs2 nfs lockd fscache
auth_rpcgss nfs_acl bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt
dlm configfs 8021q garp stp llc sunrpc ipt_REJECT nf_conntrack_ipv4
nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6
nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_iser
rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp sg dcdbas
k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp
ext4 mbcache jbd2 sd_mod crc_t10dif ixgbe igb dca ptp pps_core
ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log
dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3
mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
[last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3618, comm: condor_procd Not tainted 2.6.32-358.11.1.el6.x86_64
#1 Dell Inc. PowerEdge C6105 /0MVKG0
<4>RIP: 0010:[<ffffffff810ca64b>] [<ffffffff810ca64b>]
update_if_frozen+0x9b/0xc0
<4>RSP: 0018:ffff880803183d98 EFLAGS: 00010097
<4>RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8800378e3e18
<4>RDX: 0000000000000000 RSI: ffff880803183da8 RDI: ffff88055242d000
<4>RBP: ffff880803183de8 R08: ffff88080527c318 R09: 0000000000000000
<4>R10: 00000000ffffffff R11: 0000000000000246 R12: ffff88055242d000
<4>R13: ffff880803183da8 R14: 0000000000000000 R15: 0000000000000002
<4>FS: 00007f2e19ca0b40(0000) GS:ffff88002c240000(0000)
knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00007f2e1b4c7000 CR3: 0000000819747000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process condor_procd (pid: 3618, threadinfo ffff880803182000, task
ffff8808042c2ae0)
<4>Stack:
<4> 00007f2e1b4c7000 ffff8808197cbb80 0000000000000000 ffff8800378e3e18
<4><d> ffff88055242d000 ffff88055242d000 00000000ffffffed 0000000000000000
<4><d> ffff8808197cbb80 ffff8808197cbba4 ffff880803183e38 ffffffff810ca6fd
<4>Call Trace:
<4> [<ffffffff810ca6fd>] freezer_write+0x8d/0x1a0
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff810ca670>] ? freezer_write+0x0/0x1a0
<4> [<ffffffff810c59df>] cgroup_file_write+0x16f/0x320
<4> [<ffffffff8114a8da>] ? do_mmap_pgoff+0x33a/0x380
<4> [<ffffffff811810d8>] vfs_write+0xb8/0x1a0
<4> [<ffffffff811819d1>] sys_write+0x51/0x90
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 1f 45 85 f6 75 44 4c 89 ee 4c 89 e7 e8 af 9f ff ff 48 83 c4 28
5b 41 5c 41 5d 41 5e 41 5f c9 c3 41 83 ff 01 74 12 41 39 de 74 db <0f>
0b 0f 1f 00 eb fb 66 0f 1f 44 00 00 41 39 de 75 c9 48 8b 45
<1>RIP [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
<4> RSP <ffff880803183d98>
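For anyone who wants to check their own crash dumps for the same signature, here is a minimal Python sketch that pulls the BUG location, the faulting symbol, and the process name out of a dmesg-style log. The function name crash_signature and the regular expressions are my own illustration, written against the lines shown above; they are not part of HTCondor or kdump and may need adjusting for other kernel log formats.

```python
import re

# Patterns written against the oops shown above (assumed log format).
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
# Matches the summary "RIP [<addr>] symbol+0xoff/0xlen" line, not "RIP: 0010:...".
RIP_RE = re.compile(
    r"RIP\s+\[<[0-9a-f]+>\]\s+(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
COMM_RE = re.compile(r"comm: (?P<comm>\S+)")


def crash_signature(dmesg_text):
    """Return the BUG file/line, faulting symbol, and process name.

    Returns None when the text contains no "kernel BUG at" line.
    """
    bug = BUG_RE.search(dmesg_text)
    if bug is None:
        return None
    rip = RIP_RE.search(dmesg_text)
    comm = COMM_RE.search(dmesg_text)
    return {
        "file": bug.group("file"),
        "line": int(bug.group("line")),
        "symbol": rip.group("symbol") if rip else None,
        "comm": comm.group("comm") if comm else None,
    }


if __name__ == "__main__":
    # A three-line excerpt of the log above, used as sample input.
    sample = (
        "<2>kernel BUG at kernel/cgroup_freezer.c:247!\n"
        "<4>Pid: 3618, comm: condor_procd Not tainted 2.6.32-358.11.1.el6.x86_64\n"
        "<1>RIP [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0\n"
    )
    print(crash_signature(sample))
```

To use it on a real dump, read the contents of the vmcore-dmesg.txt file into a string and pass that to crash_signature; a match on cgroup_freezer.c and update_if_frozen would suggest the same crash.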
Has anyone seen this before? Does anyone know of a solution? Is anyone
successfully using cgroups with HTCondor under CentOS 6.4?
Thanks
- Jason