Mailing List Archives, UW Madison Computer Sciences Department Computer Systems Lab


[HTCondor-users] spontaneous reboots after enabling cgroups

Date: Wed, 26 Jun 2013 20:31:09 -0400
From: Jason Ferrara <jason.ferrara@xxxxxxxxxxxxx>
Subject: [HTCondor-users] spontaneous reboots after enabling cgroups

I have a pool of machines running CentOS 6.4, kernel 2.6.32-358, and HTCondor 7.9.4.

Today, to stop jobs that underestimate their memory usage from making the machines swap heavily and slow down, I enabled cgroups and set:


CGROUP_MEMORY_LIMIT_POLICY = soft
RESERVED_MEMORY = 1024

The idea was to make sure there was always at least 1G of physical memory available for system and interactive processes. This worked as intended, and the thrashing problems went away, but now I'm seeing machines randomly reboot, without any error messages in the system logs.
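To make the arithmetic behind RESERVED_MEMORY concrete: HTCondor subtracts the reserved amount (in megabytes) from the memory it would otherwise advertise for jobs. The following is only a sketch of that subtraction. The function name and the 32768 MB machine size are hypothetical and not taken from the pool described here.

```python
def advertised_memory_mb(detected_mb: int, reserved_mb: int) -> int:
    """Memory (MB) left for jobs after holding back reserved_mb.

    Mirrors the documented effect of RESERVED_MEMORY: the reserved
    amount is subtracted from what is advertised as available.
    """
    return detected_mb - reserved_mb


# Hypothetical machine with 32768 MB detected and RESERVED_MEMORY = 1024.
jobs_mb = advertised_memory_mb(32768, 1024)
print(jobs_mb)
```

With these assumed numbers, 31744 MB would be advertised for jobs and 1024 MB held back for the system.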

On the one machine where I have kdump enabled, the error below was in vmcore-dmesg.txt from the crash dump.



<2>kernel BUG at kernel/cgroup_freezer.c:247!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/block/dm-0/uevent
<4>CPU 1

<4>Modules linked in: fuse nfsd exportfs gfs2 nfs lockd fscache auth_rpcgss nfs_acl bnx2fc fcoe libfcoe libfc scsi_transport_fc scsi_tgt dlm configfs 8021q garp stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp sg dcdbas k10temp amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif ixgbe igb dca ptp pps_core ata_generic pata_acpi pata_atiixp ahci dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio ipv6 cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi [last unloaded: scsi_wait_scan]
<4>

<4>Pid: 3618, comm: condor_procd Not tainted 2.6.32-358.11.1.el6.x86_64 #1 Dell Inc. PowerEdge C6105       /0MVKG0

<4>RIP: 0010:[<ffffffff810ca64b>] [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0

<4>RSP: 0018:ffff880803183d98  EFLAGS: 00010097
<4>RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff8800378e3e18
<4>RDX: 0000000000000000 RSI: ffff880803183da8 RDI: ffff88055242d000
<4>RBP: ffff880803183de8 R08: ffff88080527c318 R09: 0000000000000000
<4>R10: 00000000ffffffff R11: 0000000000000246 R12: ffff88055242d000
<4>R13: ffff880803183da8 R14: 0000000000000000 R15: 0000000000000002

<4>FS: 00007f2e19ca0b40(0000) GS:ffff88002c240000(0000) knlGS:0000000000000000

<4>CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>CR2: 00007f2e1b4c7000 CR3: 0000000819747000 CR4: 00000000000007e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

<4>Process condor_procd (pid: 3618, threadinfo ffff880803182000, task ffff8808042c2ae0)

<4>Stack:
<4> 00007f2e1b4c7000 ffff8808197cbb80 0000000000000000 ffff8800378e3e18
<4><d> ffff88055242d000 ffff88055242d000 00000000ffffffed 0000000000000000
<4><d> ffff8808197cbb80 ffff8808197cbba4 ffff880803183e38 ffffffff810ca6fd
<4>Call Trace:
<4> [<ffffffff810ca6fd>] freezer_write+0x8d/0x1a0
<4> [<ffffffff8104757c>] ? __do_page_fault+0x1ec/0x480
<4> [<ffffffff810ca670>] ? freezer_write+0x0/0x1a0
<4> [<ffffffff810c59df>] cgroup_file_write+0x16f/0x320
<4> [<ffffffff8114a8da>] ? do_mmap_pgoff+0x33a/0x380
<4> [<ffffffff811810d8>] vfs_write+0xb8/0x1a0
<4> [<ffffffff811819d1>] sys_write+0x51/0x90
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

<4>Code: 1f 45 85 f6 75 44 4c 89 ee 4c 89 e7 e8 af 9f ff ff 48 83 c4 28 5b 41 5c 41 5d 41 5e 41 5f c9 c3 41 83 ff 01 74 12 41 39 de 74 db <0f> 0b 0f 1f 00 eb fb 66 0f 1f 44 00 00 41 39 de 75 c9 48 8b 45
<1>RIP  [<ffffffff810ca64b>] update_if_frozen+0x9b/0xc0
<4> RSP <ffff880803183d98>
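
To check whether other machines in a pool hit the same signature, a dump like the one above can be reduced to its BUG location and the reliable call-trace frames. The following is a minimal sketch, assuming the dmesg text uses the same printk-prefixed layout shown here. The function name `summarize_oops` and the returned dictionary shape are illustrative, not part of any HTCondor or kdump tool.

```python
import re

# "kernel BUG at <file>:<line>!"
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
# Call-trace entry: "[<addr>] symbol+0xoff/0xlen"; a leading "?" marks an
# unreliable frame, which this sketch skips.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)
# Leading printk level markers such as "<4>" or "<4><d>".
PREFIX_RE = re.compile(r"^(?:<\w+>)+")


def summarize_oops(text):
    """Return {'bug': (file, line) or None, 'trace': [symbol, ...]}."""
    bug = None
    trace = []
    in_trace = False
    for raw in text.splitlines():
        line = PREFIX_RE.sub("", raw)
        m = BUG_RE.search(line)
        if m:
            bug = (m.group(1), int(m.group(2)))
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace and line.strip():
            m = FRAME_RE.search(line)
            if m is None:
                in_trace = False  # first non-frame line ends the trace
            elif not m.group(1):
                trace.append(m.group(2))
    return {"bug": bug, "trace": trace}
```

Passing the contents of a crash dump's dmesg file to `summarize_oops` gives a compact signature that can be compared across machines. Frames marked with `?` are dropped because the kernel itself flags them as unreliable.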

Has anyone seen this before? Does anyone know of a solution? Is anyone successfully using cgroups with HTCondor under CentOS 6.4?


Thanks

- Jason

Follow-Ups:
- Re: [HTCondor-users] spontaneous reboots after enabling cgroups
  - From: Brian Bockelman
